It's quite common to search through text for a given sequence of characters (such as a word or phrase), called a string, or even for a pattern describing a set of such strings; this chapter contains recipes for doing these kinds of things.
The primary command used for searching through text is the rather
froglike-sounding tool called grep
(the origin of its name is
explained in Regular Expressions -- Matching Text Patterns, where its advanced usage is discussed). It outputs lines of
its input that contain a given string or pattern.
To search for a word, give that word as the first argument. By default,
grep
searches standard input; give the name of a file to search
as the second argument.
$ grep CD catalog [RET]
To search for a phrase, specify it in quotes.
$ grep 'Compact Disc' catalog [RET]
The preceding example outputs all lines in the file `catalog' that contain the exact string `Compact Disc'; it will not match, however, lines containing `compact disc' or any other variation on the case of letters in the search pattern. Use the `-i' option to specify that matches are to be made regardless of case.
$ grep -i 'compact disc' catalog [RET]
This command outputs lines in the file `catalog' containing any variation of the pattern `compact disc', including `Compact Disc', `COMPACT DISC', and `comPact dIsC'.
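The difference between the two searches can be tried on a throwaway file. This is only a sketch: the catalog entries are invented sample data, and the `-c' option (which counts matching lines instead of printing them) is used just to make the result easy to check.

```shell
# Hypothetical sample data standing in for the `catalog' file.
dir=$(mktemp -d)
printf '%s\n' 'Compact Disc: Kind of Blue' \
              'compact disc: Blue Train' \
              'LP: Giant Steps' > "$dir/catalog"

# Case-sensitive: only the line with the exact string is output.
exact=$(grep 'Compact Disc' "$dir/catalog")
echo "$exact"

# With -i, any variation in case matches; -c counts matching lines.
anycase_count=$(grep -ic 'compact disc' "$dir/catalog")
echo "$anycase_count"

rm -rf "$dir"
```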
One thing to keep in mind is that grep
only matches patterns that
appear on a single line, so in the preceding example, if one line in
`catalog' ends with the word `compact' and the next begins
with `disc', grep
will not match either line. There is a way
around this with grep
(see Finding Phrases Regardless of Spacing), or you can search the text in Emacs
(see Searching for a Phrase in Emacs).
You can specify more than one file to search. When you specify multiple
files, each match that grep outputs is preceded by the name of
the file it's in; you can suppress this with the `-h' option.
$ grep CD * [RET]
$ grep -h CD ~/doc/*.txt [RET]
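Use the `-r' option to search a given directory recursively, searching all subdirectories it contains. Give the directory itself as the argument; a file pattern such as `~/doc/*.txt' is expanded by the shell to the matching files in that one directory, so nothing below it would be searched.
$ grep -r CD ~/doc [RET]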
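A minimal sketch of the three behaviors follows. The directory layout and file contents are made up for illustration, and the output of the recursive search is sorted only so that its order is predictable.

```shell
# Hypothetical directory tree with one match at each level.
dir=$(mktemp -d)
mkdir -p "$dir/doc/old"
printf 'CD player manual\n' > "$dir/doc/a.txt"
printf 'cassette deck\n'    > "$dir/doc/c.txt"
printf 'CD changer notes\n' > "$dir/doc/old/b.txt"

# Several files: each match is preceded by the name of its file.
with_names=$(grep CD "$dir/doc/a.txt" "$dir/doc/c.txt")
echo "$with_names"

# -h suppresses the file name.
without_names=$(grep -h CD "$dir/doc/a.txt" "$dir/doc/c.txt")
echo "$without_names"

# -r descends into subdirectories, so the match in old/ is found too.
recursive=$(grep -rh CD "$dir/doc" | sort)
echo "$recursive"

rm -rf "$dir"
```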
NOTE: There are more complex things you can search for than simple strings, as will be explained in the next section.
In addition to word and phrase searches, you can use grep to
search for complex text patterns called regular expressions. A
regular expression -- or "regexp" -- is a text string of special
characters that specifies a set of patterns to match.
Technically speaking, the word or phrase patterns described in the previous section are regular expressions -- just very simple ones. In a regular expression, most characters -- including letters and numbers -- represent themselves. For example, the regexp pattern 1 matches the string `1', and the pattern bee matches the string `bee'.
There are a number of reserved characters called metacharacters that don't represent themselves in a regular expression, but have a special meaning that is used to build complex patterns. These metacharacters are as follows: ., *, [, ], ^, $, and \.
To specify one of these literal characters in a regular expression, precede the character with a `\'.
$ grep '\$' catalog [RET]
$ grep '\$1\.99' catalog [RET]
$ grep '\\' catalog [RET]
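To see why the escapes matter, the sketch below runs a partly escaped and a fully escaped pattern against the same invented sample lines. It also shows the `-F' option, which tells grep to treat the whole pattern as a fixed string rather than a regexp; that option is not described in this chapter and is offered here only as an alternative.

```shell
# Hypothetical sample data; the prices and path are made up.
dir=$(mktemp -d)
printf '%s\n' 'Blue Train $1.99' \
              'Giant Steps $1299' \
              'path C:\music' > "$dir/catalog"

# The unescaped `.' matches any character, so `$1299' matches too.
loose=$(grep -c '\$1.99' "$dir/catalog")

# Escaping the period restricts the match to the literal `$1.99'.
strict=$(grep -c '\$1\.99' "$dir/catalog")

# -F takes the pattern literally, so no escaping is needed.
fixed=$(grep -cF '$1.99' "$dir/catalog")

# A literal backslash is written as `\\' in the regexp.
backslash=$(grep -c '\\' "$dir/catalog")

echo "$loose $strict $fixed $backslash"
rm -rf "$dir"
```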
The following table describes the special meanings of the metacharacters and gives examples of their usage.
METACHARACTER | MEANING |
. | Matches any one character, with the exception of the newline character. For example, . matches `a', `1', `?', `.' (a literal period character), and so forth. |
* | Matches the preceding regexp zero or more times. For example, -* matches `-', `--', `---', `---------', and so forth. Now imagine a line of text with a million `-' characters somewhere in it, all marching off across the horizon, up into the blue sky, and through the clouds. A million `-' characters in a row. This pattern would match it. Now think of the same long parade, but it's a million and one `-' characters -- it matches that, too. |
[ ] | Encloses a character set, and matches any member of the set -- for example, [abc] matches either `a', `b', or `c'. In addition, the hyphen (`-') and caret (`^') characters have special meanings when used inside brackets: |
- (inside brackets) | The hyphen specifies a range of characters, ordered according to their ASCII value (see Viewing a Character Chart). For example, [0-9] is synonymous with [0123456789]; [A-Za-z] matches one uppercase or lowercase letter. To include a literal `-' in a list, specify it as the last character in the list: so [0-9-] matches either a single digit character or a `-'. |
^ (inside brackets) | As the first character of a list, the caret means that any character except those in the list should be matched. For example, [^a] matches any character except `a', and [^0-9] matches any character except a numeric digit. |
^ | Matches the beginning of the line. So ^a matches `a' only when it is the first character on a line. |
$ | Matches the end of the line. So a$ matches `a' only when it is the last character on a line. |
\ | Use \ before a metacharacter when you want to specify that literal character. So \$ matches a dollar sign character (`$'), and \\ matches a single backslash character (`\'). In addition, use \ to build new metacharacters, by using it before a number of other characters: |
\| | Called the `alternation operator'; it matches either regexp it is between -- use it to join two separate regexps to match either of them. For example, a\|b matches either `a' or `b'. |
\+ | Matches the preceding regexp as many times as possible, but at least once. So a\+ matches one or more adjacent `a' characters, such as `aaa', `aa', and `a'. |
\? | Matches the regexp preceding it either zero or one times. So a\? matches `a' or an empty string -- which matches every line. |
\{number\} | Matches the previous regexp (the one specified to the left of this construction) that number of times -- so a\{4\} matches `aaaa'. Use \{number,\} to match the preceding regexp number or more times, \{,number\} to match the preceding regexp zero to number times, and \{number1,number2\} to match the preceding regexp from number1 to number2 times. |
\(regexp\) | Groups regexp together so that it is treated as a unit; useful for combination regexps. For example, while moo\? matches `mo' or `moo', \(moo\)\? matches `moo' or the empty string. |
The name `grep' comes from the ed line editor tool -- the ed
command for searching globally through a file for a regular expression
and then printing those lines was g/re/p, where re
was the regular expression you'd use. Eventually, the grep
command was written to do this search on a file when not using ed.
The following sections describe some regexp recipes for commonly
searched-for patterns.
Use `^' in a regexp to denote the beginning of a line.
$ grep '^pre' /usr/dict/words [RET]
$ grep -i '^in the beginning' book [RET]
NOTE: These regexps were quoted with ' characters; this is because some shells otherwise treat the `^' character as a special "metacharacter" (see Passing Special Characters to Commands).
Use `$' as the last character of quoted text to match that text only at the end of a line.
$ grep '!$' sayings [RET]
NOTE: To use `$' in a regexp to find words that rhyme with a given word, see Listing Words that Match a Pattern.
To match lines of a particular length, use that number of `.' characters between `^' and `$' -- for example, to match all lines that are two characters (or columns) wide, use `^..$' as the regexp to search for.
$ grep '^..$' /usr/dict/words [RET]
For longer lines, it is more useful to use a different construct: `^.\{number\}$', where number is the number of characters each matching line must contain. Use `,' to specify a range of numbers.
$ grep '^.\{17\}$' /usr/dict/words [RET]
$ grep '^.\{25,\}$' /usr/dict/words [RET]
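The length-matching constructs above can be checked against a tiny word list. The list here is invented for the example; with a real dictionary file the same patterns apply unchanged.

```shell
# Hypothetical word list with lines of 2, 3, 8, and 12 characters.
dir=$(mktemp -d)
printf '%s\n' ox cat to elephant hippopotamus > "$dir/words"

# Exactly two characters wide.
two=$(grep '^..$' "$dir/words")

# Exactly eight characters wide.
eight=$(grep '^.\{8\}$' "$dir/words")

# Eight or more characters wide.
eight_plus=$(grep '^.\{8,\}$' "$dir/words")

printf '%s\n' "$two" "$eight" "$eight_plus"
rm -rf "$dir"
```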
To match lines that contain any of a number of regexps, specify each of the regexps to search for between alternation operators (`\|') as the regexp to search for. Lines containing any of the given regexps will be output.
$ grep 'the sea\|cake' playlist [RET]
This command outputs any lines in `playlist' that match the patterns `the sea' or `cake', including lines matching both patterns.
To output lines that match all of a number of regexps, use
grep
to output lines containing the first regexp you want to
match, and pipe the output to a grep
with the second regexp as an
argument. Continue adding pipes to grep
searches for all the
regexps you want to search for.
$ grep -i 'the sea' playlist | grep -i cake [RET]
NOTE: To match lines containing some regexps in a particular order, see Regexps for Common Situations.
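The two recipes (any of the regexps versus all of them) behave differently on the same input, as the following sketch shows. The playlist entries are invented, and the form with repeated `-e' options is included only as an alternative way of giving several patterns; it is not part of the recipes above.

```shell
# Hypothetical playlist.
dir=$(mktemp -d)
printf '%s\n' 'By the Sea' 'Cake Walk' 'Cake by the Sea' 'Moon River' \
  > "$dir/playlist"

# ANY of the regexps: alternation, counted with -c.
either=$(grep -ic 'the sea\|cake' "$dir/playlist")

# The same search written with one -e option per regexp.
either_e=$(grep -ic -e 'the sea' -e cake "$dir/playlist")

# ALL of the regexps: pipe one grep into the next.
both=$(grep -i 'the sea' "$dir/playlist" | grep -i cake)

echo "$either $either_e"
echo "$both"
rm -rf "$dir"
```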
To output all lines in a text that don't contain a given pattern,
use grep with the `-v' option -- this option inverts the
sense of matching, selecting all non-matching lines.
$ grep -v '^...$' [RET]
$ grep -v http access_log [RET]
To match lines that only contain certain characters, use the regexp `^[characters]*$', where characters are the ones to match.
$ grep -i '^[aeiou]*$' /usr/dict/words [RET]
The `-i' option matches characters regardless of case; so, in this example, all vowel characters are matched regardless of case.
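One detail of the `^[characters]*$' form is worth trying out: because `*' also means "zero times", the pattern matches empty lines as well. The sketch below uses an invented word list containing one empty line, and counts matches with `-c'; repeating the bracket expression once before the `*' is one way to require at least one character.

```shell
# Hypothetical word list; the fourth line is empty.
dir=$(mktemp -d)
printf '%s\n' 'aie' 'tree' 'Oui' '' 'sky' > "$dir/words"

# Lines made only of vowels; the empty line counts as a match too.
vowels_or_empty=$(grep -ic '^[aeiou]*$' "$dir/words")

# Requiring one vowel before the `*' excludes the empty line.
vowels_only=$(grep -ic '^[aeiou][aeiou]*$' "$dir/words")

# -v inverts the match: lines that are NOT exactly three characters wide.
not_three=$(grep -vc '^...$' "$dir/words")

echo "$vowels_or_empty $vowels_only $not_three"
rm -rf "$dir"
```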
One way to search for a phrase that might occur with extra spaces
between words, or across a line or page break, is to remove all
linefeeds and extra spaces from the input, and then grep that.
To do this, pipe the input to tr with
`\r\n:\>\|-' (in quotes) as an argument to the `-d' option (removing all
line breaks, as well as the `:', `>', `|', and `-' characters, from the
input); pipe that to the fmt filter with the
`-u' option (outputting the text with uniform spacing); and pipe
that to grep with the pattern to search for.
$ cat notes | tr -d '\r\n:\>\|-' | fmt -u | grep 'at the same time as' [RET]
NOTE: The Emacs editor has its own special search for doing this -- see Searching for a Phrase in Emacs.
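A variation on this idea, shown here only as a sketch and not as the recipe above, is to turn every run of whitespace (spaces, tabs, and line breaks alike) into a single space, so that the whole input becomes one long line for grep to search. It assumes a tr that supports the `[:space:]' character class and the `-s' (squeeze) option; the two-line sample text is invented.

```shell
# Hypothetical notes file with the phrase split across two lines
# and padded with extra spaces.
dir=$(mktemp -d)
printf '%s\n' 'It happened at the same' 'time   as the eclipse.' \
  > "$dir/notes"

# A plain line-by-line search finds nothing.
plain=$(grep -c 'at the same time as' "$dir/notes")

# Replace each run of whitespace with one space, then search the
# resulting single line.
joined=$(tr -s '[:space:]' ' ' < "$dir/notes" | grep -c 'at the same time as')

echo "$plain $joined"
rm -rf "$dir"
```

Since the output is one line, this tells you that the phrase occurs but not where; it suits a yes-or-no check better than browsing matches.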
To search for a pattern that only occurs in a particular context,
grep
for the context in which it should occur, and pipe the
output to another grep
to search for the actual pattern.
For example, this can be useful to search for a given pattern only when it is quoted with a `>' character in an email message.
$ grep '^>' email-archive | grep narrative [RET]
You can also reverse the order and use the `-v' option to output all lines containing a given pattern that are not in a given context.
$ grep narrative email-archive | grep -v '^>' [RET]
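Both directions of the context search can be compared on a small sample. The three message lines below are invented for illustration.

```shell
# Hypothetical mail archive: two quoted lines and one unquoted line.
dir=$(mktemp -d)
printf '%s\n' '> I liked the narrative arc.' \
              'The narrative was slow.' \
              '> See you Friday.' > "$dir/email-archive"

# The pattern, but only on quoted lines.
quoted=$(grep '^>' "$dir/email-archive" | grep narrative)

# The pattern, but only on lines that are not quoted.
unquoted=$(grep narrative "$dir/email-archive" | grep -v '^>')

echo "$quoted"
echo "$unquoted"
rm -rf "$dir"
```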
You can keep a list of regexps in a file, and use grep
to search
text for any of the patterns in the file. To do this, specify the name
of the file containing the regexps to search for as an argument to the
`-f' option.
This can be useful, for example, if you need to search a given text for a number of words -- keep each word on its own line in the regexp file.
$ grep -f forbidden-words /usr/dict/words [RET]
$ grep -v -i -f forbidden-words /usr/dict/words [RET]
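The following sketch builds a small regexp file and uses it both ways: to list lines matching any of the patterns, and (with `-v') to list lines matching none of them. The file names and contents are made up for the example. One thing to watch for when trying this yourself: an empty line in the regexp file is an empty pattern, which matches every line.

```shell
# Hypothetical regexp file (one pattern per line) and text to search.
dir=$(mktemp -d)
printf '%s\n' darn heck > "$dir/forbidden-words"
printf '%s\n' 'Darn it all' 'what the heck' 'good morning' > "$dir/text"

# Count the lines that contain any of the listed patterns.
flagged=$(grep -ic -f "$dir/forbidden-words" "$dir/text")

# Output the lines that contain none of them.
clean=$(grep -v -i -f "$dir/forbidden-words" "$dir/text")

echo "$flagged"
echo "$clean"
rm -rf "$dir"
```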
The following table lists sample regexps and describes what they match. You can use these regexps as boilerplate when building your own regular expressions for searching text. Remember to enclose regexps in quotes.
TO MATCH LINES THAT ... | USE THIS REGEXP |
contain nine zeroes in a row | 0\{9\} |
are exactly four characters long | ^....$ or ^.\{4\}$ |
are exactly seventy characters long | ^.\{70\}$ |
begin with an asterisk character | ^\* |
begin with `tow' and end with `ing' | ^tow.*ing$ |
contain a number | [0-9] |
do not contain a number | ^[^0-9]*$ |
contain a year from 1991 through 1995 | 199[1-5] |
contain a year from 1957 through 1969 | \(195[7-9]\)\|\(196[0-9]\) |
contain either `.txt' or `.text' | \.te\?xt |
contain `cat' then `gory' in the same word | cat[A-Za-z]*gory |
contain `cat' then `gory' in the same line | cat.*gory |
contain a `q' not followed by a `u' | q[^u] |
contain any `ftp', `gopher', or `http' URLs | \(ftp\|gopher\|http\)://.*\..* |
contain `N', `T', and `K', with zero or more characters between each | N.*T.*K |
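A quick way to gain confidence in a regexp before using it on real data is to run it against a few lines where you already know the answer. The sketch below does this for four patterns of the kind listed in the table; the sample lines are invented, and the alternation and `\?' forms assume GNU grep.

```shell
# Hypothetical test lines with known answers.
dir=$(mktemp -d)
printf '%s\n' 'Released in 1957' \
              'Reissued in 1970' \
              'notes.text' \
              'Iraqi' \
              'quiet' \
              'see http://example.com/x' > "$dir/sample"

# A year from 1957 through 1969: matches 1957 but not 1970.
years=$(grep -c '\(195[7-9]\)\|\(196[0-9]\)' "$dir/sample")

# Either `.txt' or `.text'.
txt=$(grep -c '\.te\?xt' "$dir/sample")

# A `q' followed by some character other than `u'.
q_not_u=$(grep 'q[^u]' "$dir/sample")

# An ftp, gopher, or http URL.
urls=$(grep -c '\(ftp\|gopher\|http\)://.*\..*' "$dir/sample")

echo "$years $txt $urls"
echo "$q_not_u"
rm -rf "$dir"
```

Note that `q[^u]' needs some character after the `q', so it would not match a line that ends in `q'.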
The following recipes are for searching data other than in plain text files.
Use zgrep
to search through text in files that are
compressed. These files usually have a `.gz' file name
extension, and can't be searched or otherwise read by other tools
without uncompressing the file first (for more about compressed files,
see Compressed Files).
The zgrep tool works just like grep, except it searches
through the text of compressed files. It outputs matches to the given
pattern as if you'd searched through normal, uncompressed files. It
leaves the files compressed when it exits.
$ zgrep Linux README.gz [RET]
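As a small illustration, the sketch below compresses an invented file with gzip and searches it with zgrep. For comparison it also shows an equivalent pipeline that uncompresses to standard output and pipes the result to grep; this assumes the gzip and zgrep tools are both installed.

```shell
# Hypothetical README, compressed in place to README.gz.
dir=$(mktemp -d)
printf '%s\n' 'Linux is a kernel.' 'GNU is an operating system.' \
  > "$dir/README"
gzip "$dir/README"

# Search the compressed file directly.
found=$(zgrep Linux "$dir/README.gz")

# Equivalent pipeline: uncompress to standard output, then grep.
piped=$(gzip -dc "$dir/README.gz" | grep Linux)

echo "$found"
echo "$piped"
rm -rf "$dir"
```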
You can grep a Web page or other URL by giving the URL to
lynx with the `-dump' option, and piping the output to grep.
$ lynx -dump http://example.com/ | grep 'gonzo\|hunter' [RET]
It is sometimes useful to see a matched line in its context in the file -- that is, to see some of the lines that surround it.
Use the `-C' option with grep, followed by a number, to output results in
context -- it outputs each matched line with that number of lines of
"context" both before and after it (current versions of GNU grep require
the number). You can also give the number by itself as an option
instead of `-C'.
$ grep -C 2 tsch /usr/dict/words [RET]
$ grep -6 tsch /usr/dict/words [RET]
To output matches and the lines before them, use `-B'; to output matches and the lines after them, use `-A'. Give a number with either of these options to specify that number of context lines.
$ grep -B2 tsch /usr/dict/words [RET]
$ grep -A6 tsch /usr/dict/words [RET]
$ grep -B10 -A3 tsch /usr/dict/words [RET]
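The effect of each context option is easiest to see on a file whose lines are easy to tell apart. In this sketch the file is just the words `one' through `seven', one per line, and the search is for `four'.

```shell
# Hypothetical seven-line file.
dir=$(mktemp -d)
printf '%s\n' one two three four five six seven > "$dir/lines"

# One line of context on each side of the match.
around=$(grep -C 1 four "$dir/lines")

# Two lines before the match.
before=$(grep -B 2 four "$dir/lines")

# One line after the match.
after=$(grep -A 1 four "$dir/lines")

printf '%s\n' "$around" "$before" "$after"
rm -rf "$dir"
```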
A quick way to search and replace some text in a file is to use the
following one-line perl
command:
$ perl -pi -e "s/oldstring/newstring/g;" filespec [RET]
In this example, oldstring is the string to search, newstring is the string to replace it with, and filespec is the name of the file or files to work on. You can use this for more than one file.
$ perl -pi -e "s/helpless/helpful/g;" * [RET]
You can also search and replace text in an Emacs buffer; to do this, use
the replace-regexp
function and give both the expression to
search for and the expression to replace it with.
M-x replace-regexp [RET] helpless [RET] helpful [RET]
NOTE: You can also search and replace text in most text editors, including Emacs; see Searching and Replacing in Emacs.
The following sections show ways of searching for text in Emacs -- incrementally, for a word or phrase, or for a pattern -- and for searching and then replacing text.
Type C-s to use the Emacs incremental search function. It takes text as input in the minibuffer and it searches for that text from point toward the end of the current buffer. Type C-s again to search for the next occurrence of the text you're searching for; this works until no more matches occur. Then, Emacs reports `Failing I-search' in the minibuffer; type C-s again to wrap to the beginning of the buffer and continue the search from there.
It gets its name "incremental" because it begins searching immediately when you start to type text, and so it builds a search string in increments -- for example, if you want to search for the word `sunflower' in the current buffer, you start to type
C-s s
At that point Emacs searches forward through the buffer to the first `s' character, and highlights it. Then, as you type u, it searches forward to the first `su' in the buffer and highlights that (if a `u' appears immediately after the `s' it first stopped at, it stays where it is, and highlights the `s' and the `u'). It continues to do this as long as you type and as long as there is a match in the current buffer. As soon as what you type does not appear in the buffer, Emacs beeps and a message appears in the minibuffer stating that the search has failed.
To search for the next instance of the last string you gave, type C-s again; if you keep [CTRL] held down, every time you press the [S] key, Emacs will advance to the next match in the buffer.
This is generally the fastest and most common type of search you will use in Emacs.
You can do an incremental search through the buffer in
reverse -- that is, from point to the beginning of the
buffer -- with the isearch-backward function, C-r.
C-r moon
Like grep, the Emacs incremental search only works on lines of
text, so it only finds phrases on a single line. If you search for
`hello, world' with the incremental search and the text
`hello,' appears at the end of a line and the text `world'
appears at the beginning of the next line, it won't find it.
To find a multi-word phrase across line breaks, use the
word-search-forward
function. It searches for a phrase or words
regardless of punctuation or spacing.
M-x word-search-forward [RET] join me [RET]
NOTE: The word-search-backward function does the same as
word-search-forward, except it searches backward through
the buffer, from point to the beginning of the buffer.
Use the search-forward-regexp
function to search for a regular
expression from point to the end of the current buffer.
M-x search-forward-regexp [RET] @.*\.org [RET]
The keyboard accelerator for this command is M-C-s -- on most keyboards, you press and release [ESC] and then hold down [CTRL] while you type s. To repeat the last regexp search you made, type M-C-s C-s; then, as long as you have [CTRL] held down, you can keep typing s to advance to the next match, just as you would with an incremental search.
NOTE: There is a search-backward-regexp
function that is
identical but searches backward, from point to the top of the buffer.
To search for and replace text in Emacs, use the replace-regexp
function. When you run this function, Emacs will ask for both the text
or regexp to search for and the text to replace it with.
M-x replace-regexp [RET] day [RET] night [RET]
This function is especially useful for replacing control characters with
text, or for replacing text with control characters, which you can
specify with C-q, the quoted-insert
function (see Inserting Special Characters in Emacs).
M-x replace-regexp [RET] C-q C-m [RET] C-q 012 [RET] [RET]
There are two useful commands in less
for searching through text:
/ and ?. To search forward through the text, type
/ followed by a regexp to search for; to search backward
through the text, use ?.
When you do a search, the word or other regexp you search for appears highlighted throughout the text.
/cat [RET]
To search backward through the text you are perusing for the regexp `[ch]at', type:
?[ch]at [RET]