[<--] [Cover] [Table of Contents] [Concept Index] [Program Index] [-->]


Searching Text

It's quite common to search through text for a given sequence of characters (such as a word or phrase), called a string, or even for a pattern describing a set of such strings; this chapter contains recipes for doing these kinds of things.

Word Search: Finding a word or phrase.
Regexps: How to specify and find patterns.
Nontext Search: Searching in other than text files.
Context Search: Searching in a certain context.
Search and Replace: Searching and replacing text.
Emacs Search: Searching in Emacs.
Less Search: Searching in less.

Searching for a Word or Phrase

The primary command used for searching through text is the rather froglike-sounding tool called grep (the origin of its name is explained in Regular Expressions -- Matching Text Patterns, where its advanced usage is discussed). It outputs lines of its input that contain a given string or pattern.

To search for a word, give that word as the first argument. By default, grep searches standard input; give the name of a file to search as the second argument.

To output lines in the file `catalog' containing the word `CD', type:
```
$ grep CD catalog [RET]
```

To search for a phrase, specify it in quotes.

To output lines in the file `catalog' containing the phrase `Compact Disc', type:
```
$ grep 'Compact Disc' catalog [RET]
```

The preceding example outputs all lines in the file `catalog' that contain the exact string `Compact Disc'; it will not match, however, lines containing `compact disc' or any other variation on the case of letters in the search pattern. Use the `-i' option to specify that matches are to be made regardless of case.

To output lines in the file `catalog' containing the string `compact disc' regardless of the case of its letters, type:
```
$ grep -i 'compact disc' catalog [RET]
```

This command outputs lines in the file `catalog' containing any variation of the pattern `compact disc', including `Compact Disc', `COMPACT DISC', and `comPact dIsC'.
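The difference between the two searches can be tried out on a throwaway file. The following is only a sketch: the temporary `catalog' file and its four lines are made up for the illustration.

```shell
# Hypothetical sample data: a temporary `catalog' file invented for this sketch.
workdir=$(mktemp -d)
printf '%s\n' 'Compact Disc player' 'COMPACT DISC case' \
    'comPact dIsC' 'cassette tape' > "$workdir/catalog"

# Case-sensitive search: only the line with the exact string matches.
exact=$(grep 'Compact Disc' "$workdir/catalog")

# Case-insensitive search: every capitalization variant matches.
anycase=$(grep -i 'compact disc' "$workdir/catalog")

printf 'exact match:\n%s\n\nignoring case:\n%s\n' "$exact" "$anycase"
rm -r "$workdir"
```

The first search should print one line and the second three; the `cassette tape' line is never output.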

One thing to keep in mind is that grep only matches patterns that appear on a single line, so in the preceding example, if one line in `catalog' ends with the word `compact' and the next begins with `disc', grep will not match either line. There is a way around this with grep (see Finding Phrases Regardless of Spacing), or you can search the text in Emacs (see Searching for a Phrase in Emacs).

You can specify more than one file to search. When you specify multiple files, each match that grep outputs is preceded by the name of the file it's in; you can suppress this with the `-h' option.

To output lines in all of the files in the current directory containing the word `CD', type:
```
$ grep CD * [RET]
```
To output lines in all of the `.txt' files in the `~/doc' directory containing the word `CD', suppressing the listing of file names in the output, type:
```
$ grep -h CD ~/doc/*.txt [RET]
```

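Use the `-r' option to search a given directory recursively, searching all subdirectories it contains. To limit a recursive search to files whose names match a pattern, give the pattern with the `--include' option and name the directory itself as the argument; a shell pattern such as `~/doc/*.txt' is expanded before grep runs, so it only names files in `~/doc' itself.

To output lines containing the word `CD' in all of the `.txt' files in the `~/doc' directory and in all of its subdirectories, type:
```
$ grep -r --include='*.txt' CD ~/doc [RET]
```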
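One way to see how a shell file name pattern and a recursive search differ is to build a small directory tree and search it both ways. This is a sketch under assumptions: the directory layout and file contents are invented, and `--include' is an option of GNU grep (also found in some other versions) rather than of every grep.

```shell
# Hypothetical directory tree, created only for this illustration.
workdir=$(mktemp -d)
mkdir -p "$workdir/doc/music"
printf 'CD rack\n'  > "$workdir/doc/top.txt"
printf 'CD list\n'  > "$workdir/doc/music/albums.txt"
printf 'CD notes\n' > "$workdir/doc/music/notes.log"

# The shell expands *.txt in the top-level directory only.
toponly=$(grep -h CD "$workdir"/doc/*.txt)

# -r descends into subdirectories; --include keeps only the `.txt' files.
recursive=$(grep -rh --include='*.txt' CD "$workdir/doc" | sort)

printf 'glob only:\n%s\n\nrecursive:\n%s\n' "$toponly" "$recursive"
rm -r "$workdir"
```

The `.log' file is skipped in both cases, while the `.txt' file in the subdirectory is found only by the recursive search.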

NOTE: There are more complex things you can search for than simple strings, as will be explained in the next section.

Regular Expressions -- Matching Text Patterns

In addition to word and phrase searches, you can use grep to search for complex text patterns called regular expressions. A regular expression -- or "regexp" -- is a text string of special characters that specifies a set of patterns to match.

Technically speaking, the word or phrase patterns described in the previous section are regular expressions -- just very simple ones. In a regular expression, most characters -- including letters and numbers -- represent themselves. For example, the regexp pattern 1 matches the string `1', and the pattern bee matches the string `bee'.

There are a number of reserved characters called metacharacters that don't represent themselves in a regular expression, but have a special meaning that is used to build complex patterns. These metacharacters are as follows: ., *, [, ], ^, $, and \.

To specify one of these literal characters in a regular expression, precede the character with a `\'.

To output lines in the file `catalog' that contain a `$' character, type:
```
$ grep '\$' catalog [RET]
```
To output lines in the file `catalog' that contain the string `$1.99', type:
```
$ grep '\$1\.99' catalog [RET]
```
To output lines in the file `catalog' that contain a `\' character, type:
```
$ grep '\\' catalog [RET]
```
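To see why the period needs escaping as well as the dollar sign, compare an escaped and an unescaped search. The sample lines below are made up for this sketch.

```shell
# Hypothetical sample data; note the literal `$' and `\' characters.
workdir=$(mktemp -d)
printf '%s\n' 'Tape $1.99' 'Disc $1599' 'Path C:\music' 'Free sample' \
    > "$workdir/catalog"

# Unescaped, the period matches any character, so `$1599' matches too.
loose=$(grep '\$1.99' "$workdir/catalog")

# Escaping the period restricts the match to a literal period.
strict=$(grep '\$1\.99' "$workdir/catalog")

# A literal backslash is written as two backslashes in the regexp.
backslash=$(grep '\\' "$workdir/catalog")

printf 'loose:\n%s\n\nstrict:\n%s\n\nbackslash:\n%s\n' \
    "$loose" "$strict" "$backslash"
rm -r "$workdir"
```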

The following table describes the special meanings of the metacharacters and gives examples of their usage.

METACHARACTER   MEANING

.               Matches any one character, with the exception of the newline character. For example, . matches `a', `1', `?', `.' (a literal period character), and so forth.
*               Matches the preceding regexp zero or more times. For example, -* matches `-', `--', `---', `---------', and so forth. Now imagine a line of text with a million `-' characters somewhere in it, all marching off across the horizon, up into the blue sky, and through the clouds. A million `-' characters in a row. This pattern would match it. Now think of the same long parade, but it's a million and one `-' characters -- it matches that, too.
[ ]             Encloses a character set, and matches any member of the set -- for example, [abc] matches either `a', `b', or `c'. In addition, the hyphen (`-') and caret (`^') characters have special meanings when used inside brackets:
    -           The hyphen specifies a range of characters, ordered according to their ASCII value (see Viewing a Character Chart). For example, [0-9] is synonymous with [0123456789]; [A-Za-z] matches one uppercase or lowercase letter. To include a literal `-' in a list, specify it as the last character in a list: so [0-9-] matches either a single digit character or a `-'.
    ^           As the first character of a list, the caret means that any character except those in the list should be matched. For example, [^a] matches any character except `a', and [^0-9] matches any character except a numeric digit.
^               Matches the beginning of the line. So ^a matches `a' only when it is the first character on a line.
$               Matches the end of the line. So a$ matches `a' only when it is the last character on a line.
\               Use \ before a metacharacter when you want to specify that literal character. So \$ matches a dollar sign character (`$'), and \\ matches a single backslash character (`\').
                In addition, use \ to build new metacharacters, by using it before a number of other characters:
    \|          Called the `alternation operator'; it matches either regexp it is between -- use it to join two separate regexps to match either of them. For example, a\|b matches either `a' or `b'.
    \+          Matches the preceding regexp as many times as possible, but at least once. So a\+ matches one or more `a' adjacent characters, such as `aaa', `aa', and `a'.
    \?          Matches the regexp preceding it either zero or one times. So a\? matches `a' or an empty string -- which matches every line.
    \{number\}  Matches the previous regexp (one specified to the left of this construction) that number of times -- so a\{4\} matches `aaaa'. Use \{number,\} to match the preceding regexp number or more times, \{,number\} to match the preceding regexp zero to number times, and \{number1,number2\} to match the preceding regexp from number1 to number2 times.
    \(regexp\)  Group regexp together for an alternative; useful for combination regexps. For example, while moo\? matches `mo' or `moo', \(moo\)\? matches `moo' or the empty set.
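
A few of these metacharacters can be tried against a small list of made-up lines. This is a sketch that assumes GNU grep, where `\|', `\+' and `\?' are available in ordinary regexps; the `sample' file is invented for the illustration.

```shell
# Hypothetical sample lines, one short string per line.
workdir=$(mktemp -d)
printf '%s\n' 'a' 'aa' 'aaaa' 'b' 'ab' 'mo' 'moo' '9 lives' > "$workdir/sample"

# Alternation: lines containing a `b' or a digit.
either=$(grep 'b\|[0-9]' "$workdir/sample")

# \+ : lines consisting of one or more `a' characters and nothing else.
plus=$(grep '^a\+$' "$workdir/sample")

# \{4\} : lines consisting of exactly four `a' characters.
four=$(grep '^a\{4\}$' "$workdir/sample")

# \? : the final `o' is optional, so both `mo' and `moo' match.
optional=$(grep '^moo\?$' "$workdir/sample")

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$either" "$plus" "$four" "$optional"
rm -r "$workdir"
```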

NOTE: The name `grep' derives from a command in the now-obsolete Unix ed line editor tool -- the ed command for searching globally through a file for a regular expression and then printing those lines was g/re/p, where re was the regular expression you'd use. Eventually, the grep command was written to do this search on a file when not using ed.

The following sections describe some regexp recipes for commonly searched-for patterns.

Beginning Match: Matching text at the beginning of a line.
End Match: Matching text at the end of a line.
Length Match: Matching lines of a certain length.
Any Regexps: Matching lines containing any of some regexps.
All Regexps: Matching lines containing all of some regexps.
Revert Match: Finding lines that don't match.
Certain Match: Matching lines that only contain certain characters.
Spacing Search: Matching phrases regardless of spacing.
Context Pattern: Matching patterns within a context.
List Match: Matching a list of patterns.
More Regexps: Table of sample regular expressions.

Matching Lines Beginning with Certain Text

Use `^' in a regexp to denote the beginning of a line.

To output all lines in `/usr/dict/words' beginning with `pre', type:
```
$ grep '^pre' /usr/dict/words [RET]
```
To output all lines in the file `book' that begin with the text `in the beginning', regardless of case, type:
```
$ grep -i '^in the beginning' book [RET]
```

NOTE: These regexps were quoted with ' characters; this is because some shells otherwise treat the `^' character as a special "metacharacter" (see Passing Special Characters to Commands).

Matching Lines Ending with Certain Text

Use `$' as the last character of quoted text to match that text only at the end of a line.

To output lines in the file `sayings' ending with an exclamation point, type:
```
$ grep '!$' sayings [RET]
```

NOTE: To use `$' in a regexp to find words that rhyme with a given word, see Listing Words that Match a Pattern.

Matching Lines of a Certain Length

To match lines of a particular length, use that number of `.' characters between `^' and `$' -- for example, to match all lines that are two characters (or columns) wide, use `^..$' as the regexp to search for.

To output all lines in `/usr/dict/words' that are exactly two characters wide, type:
```
$ grep '^..$' /usr/dict/words [RET]
```

For longer lines, it is more useful to use a different construct: `^.\{number\}$', where number is the number of characters to match. Use `,' to specify a range of numbers.

To output all lines in `/usr/dict/words' that are exactly seventeen characters wide, type:
```
$ grep '^.\{17\}$' /usr/dict/words [RET]
```
To output all lines in `/usr/dict/words' that are twenty-five or more characters wide, type:
```
$ grep '^.\{25,\}$' /usr/dict/words [RET]
```
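
Since `/usr/dict/words' is not present on every system, the same constructs can be tried on a short made-up word list. This is only a sketch; the four words are arbitrary.

```shell
# Hypothetical word list standing in for /usr/dict/words.
workdir=$(mktemp -d)
printf '%s\n' 'ox' 'cat' 'zebra' 'hippopotamus' > "$workdir/words"

# Exactly two characters wide.
two=$(grep '^..$' "$workdir/words")

# Five or more characters wide.
five_plus=$(grep '^.\{5,\}$' "$workdir/words")

# From three to five characters wide.
three_to_five=$(grep '^.\{3,5\}$' "$workdir/words")

printf '%s\n---\n%s\n---\n%s\n' "$two" "$five_plus" "$three_to_five"
rm -r "$workdir"
```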

Matching Lines That Contain Any of Some Regexps

To match lines that contain any of a number of regexps, join the regexps with alternation operators (`\|') and use the result as the regexp to search for. Lines containing any of the given regexps will be output.

To output all lines in `playlist' that contain either of the patterns `the sea' or `cake', type:
```
$ grep 'the sea\|cake' playlist [RET]
```

This command outputs any lines in `playlist' that match the patterns `the sea' or `cake', including lines matching both patterns.

Matching Lines That Contain All of Some Regexps

To output lines that match all of a number of regexps, use grep to output lines containing the first regexp you want to match, and pipe the output to a grep with the second regexp as an argument. Continue adding pipes to grep searches for all the regexps you want to search for.

To output all lines in `playlist' that contain both patterns `the sea' and `cake', regardless of case, type:
```
$ grep -i 'the sea' playlist | grep -i cake [RET]
```
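
The difference between matching any and matching all of the regexps shows up clearly on a small file. The `playlist' contents below are invented for this sketch.

```shell
# Hypothetical playlist, made up for the illustration.
workdir=$(mktemp -d)
printf '%s\n' 'By the Sea' 'Cake Walk' 'Cake by the sea' 'Moon River' \
    > "$workdir/playlist"

# Any: one grep with the alternation operator.
any=$(grep -i 'the sea\|cake' "$workdir/playlist")

# All: one grep per regexp, chained with pipes.
all=$(grep -i 'the sea' "$workdir/playlist" | grep -i cake)

printf 'any:\n%s\n\nall:\n%s\n' "$any" "$all"
rm -r "$workdir"
```

Three lines match at least one of the patterns, but only `Cake by the sea' matches both.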

NOTE: To match lines containing some regexps in a particular order, see Regexps for Common Situations.

Matching Lines That Don't Contain a Regexp

To output all lines in a text that don't contain a given pattern, use grep with the `-v' option -- this option inverts the sense of matching, selecting all non-matching lines.

To output all lines in `/usr/dict/words' that are not three characters wide, type:
```
$ grep -v '^...$' /usr/dict/words [RET]
```
To output all lines in `access_log' that do not contain the string `http', type:
```
$ grep -v http access_log [RET]
```
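
A quick way to convince yourself that `-v' selects exactly the lines the plain search leaves out is to count both. This sketch uses a made-up three-line `access_log'; the `-c' option makes grep output a count of lines instead of the lines themselves.

```shell
# Hypothetical log file, invented for this sketch.
workdir=$(mktemp -d)
printf '%s\n' 'GET http://example.com/' 'POST /form' 'GET /index.html' \
    > "$workdir/access_log"

# Lines that do not contain `http'.
without=$(grep -v http "$workdir/access_log")

# Counts of matching and non-matching lines, and of all lines.
matched=$(grep -c http "$workdir/access_log")
unmatched=$(grep -vc http "$workdir/access_log")
total=$(grep -c '' "$workdir/access_log")

printf '%s\n' "$without"
printf '%s matching + %s non-matching = %s lines\n' "$matched" "$unmatched" "$total"
rm -r "$workdir"
```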

Matching Lines That Only Contain Certain Characters

To match lines that only contain certain characters, use the regexp `^[characters]*$', where characters are the ones to match.

To output lines in `/usr/dict/words' that only contain vowels, type:
```
$ grep -i '^[aeiou]*$' /usr/dict/words [RET]
```

The `-i' option matches characters regardless of case; so, in this example, all vowel characters are matched regardless of case.
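
Because `*' also matches zero times, this regexp matches empty lines as well. If that is not wanted, one option is to require at least one character with `\{1,\}'. The following sketch counts matches both ways on a made-up list that includes one empty line.

```shell
# Hypothetical word list; the third line is deliberately empty.
workdir=$(mktemp -d)
printf '%s\n' 'a' 'Io' '' 'tea' > "$workdir/words"

# Zero or more vowels: also counts the empty line.
star_count=$(grep -ic '^[aeiou]*$' "$workdir/words")

# One or more vowels: the empty line no longer matches.
plus_count=$(grep -ic '^[aeiou]\{1,\}$' "$workdir/words")

printf 'with *: %s lines; with \\{1,\\}: %s lines\n' "$star_count" "$plus_count"
rm -r "$workdir"
```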

Finding Phrases Regardless of Spacing

One way to search for a phrase that might occur with extra spaces between words, or across a line or page break, is to remove all linefeeds and extra spaces from the input, and then grep that.

To do this, pipe the input to tr with the `-s' option, giving `'\r\n'' as the characters to translate and two spaces as their replacements (turning every line break into a space and squeezing each run of spaces into one); then pipe that to grep with the pattern to search for. Because the joined text is one long line, use the `-o' option with grep to output only the matching text.

To search across line breaks for the string `at the same time as' in the file `notes', type:
```
$ cat notes | tr -s '\r\n' '  ' | grep -o 'at the same time as' [RET]
```
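
One variation on this idea translates line breaks to spaces rather than deleting them, so that the words on either side of a break stay separated. The sketch below assumes a made-up two-line `notes' file and uses the `-o' option of grep to output only the matched text.

```shell
# Hypothetical notes file: the phrase is split across two lines and
# also has extra spaces inside it.
workdir=$(mktemp -d)
printf '%s\n' 'We arrived at the same   time' 'as the others did.' \
    > "$workdir/notes"

# A plain grep finds nothing, because no single line holds the phrase.
direct=$(grep -c 'at the same time as' "$workdir/notes")

# Turn line breaks into spaces, squeeze runs of spaces, then search.
joined=$(tr -s '\r\n' '  ' < "$workdir/notes" | grep -o 'at the same time as')

printf 'plain grep: %s matching lines\nafter tr: %s\n' "$direct" "$joined"
rm -r "$workdir"
```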

NOTE: The Emacs editor has its own special search for doing this -- see Searching for a Phrase in Emacs.

Finding Patterns in Certain Contexts

To search for a pattern that only occurs in a particular context, grep for the context in which it should occur, and pipe the output to another grep to search for the actual pattern.

For example, this can be useful to search for a given pattern only when it is quoted with a `>' character in an email message.

To list lines from the file `email-archive' that contain the word `narrative' only when it is quoted, type:
```
$ grep '^>' email-archive | grep narrative [RET]
```

You can also reverse the order and use the `-v' option to output all lines containing a given pattern that are not in a given context.

To list lines from the file `email-archive' that contain the word `narrative', but not when it is quoted, type:
```
$ grep narrative email-archive | grep -v '^>' [RET]
```
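
Both directions of the context search can be tried on a tiny made-up mail file. This is a sketch; the three lines of `email-archive' are invented.

```shell
# Hypothetical mail archive: quoted lines begin with `>'.
workdir=$(mktemp -d)
printf '%s\n' '> I liked the narrative.' 'The narrative drags.' \
    '> See you soon.' > "$workdir/email-archive"

# The word only where it appears in quoted text.
quoted=$(grep '^>' "$workdir/email-archive" | grep narrative)

# The word only where it appears outside quoted text.
unquoted=$(grep narrative "$workdir/email-archive" | grep -v '^>')

printf 'quoted: %s\nunquoted: %s\n' "$quoted" "$unquoted"
rm -r "$workdir"
```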

Using a List of Regexps to Match From

You can keep a list of regexps in a file, and use grep to search text for any of the patterns in the file. To do this, specify the name of the file containing the regexps to search for as an argument to the `-f' option.

This can be useful, for example, if you need to search a given text for a number of words -- keep each word on its own line in the regexp file.

To output all lines in `/usr/dict/words' containing any of the words listed in the file `forbidden-words', type:
```
$ grep -f forbidden-words /usr/dict/words [RET]
```
To output all lines in `/usr/dict/words' that do not contain any of the words listed in `forbidden-words', regardless of case, type:
```
$ grep -v -i -f forbidden-words /usr/dict/words [RET]
```
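
As a sketch of how a regexp file behaves, the following builds a two-line `forbidden-words' file and searches a short made-up text with it, first for matching lines and then for the lines left over.

```shell
# Hypothetical pattern file and text, invented for this illustration.
workdir=$(mktemp -d)
printf '%s\n' 'darn' 'heck' > "$workdir/forbidden-words"
printf '%s\n' 'Darn it' 'what the heck' 'all fine' > "$workdir/text"

# Lines containing any of the listed words, regardless of case.
flagged=$(grep -i -f "$workdir/forbidden-words" "$workdir/text")

# Lines containing none of them.
clean=$(grep -v -i -f "$workdir/forbidden-words" "$workdir/text")

printf 'flagged:\n%s\n\nclean:\n%s\n' "$flagged" "$clean"
rm -r "$workdir"
```

Each line of the pattern file is treated as a separate regexp, so an empty line in that file would match every line of the text.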

Regexps for Common Situations

The following table lists sample regexps and describes what they match. You can use these regexps as boilerplate when building your own regular expressions for searching text. Remember to enclose regexps in quotes.

TO MATCH LINES THAT ...                                                 USE THIS REGEXP

contain nine zeroes in a row                                            0\{9\}
are exactly four characters long                                        ^....$ or ^.\{4\}$
are exactly seventy characters long                                     ^.\{70\}$
begin with an asterisk character                                        ^\*
begin with `tow' and end with `ing'                                     ^tow.*ing$
contain a number                                                        [0-9]
do not contain a number                                                 ^[^0-9]*$
contain a year from 1991 through 1995                                   199[1-5]
contain a year from 1957 through 1969                                   \(195[7-9]\)\|\(196[0-9]\)
contain either `.txt' or `.text'                                        \.te\?xt
contain `cat' then `gory' in the same word                              cat[A-Za-z]*gory
contain `cat' then `gory' in the same line                              cat.*gory
contain a `q' not followed by a `u'                                     q[^u]
contain any `ftp', `gopher', or `http' URLs                             \(ftp\|gopher\|http\)://.*\..*
contain `N', `T', and `K', with zero or more characters between each    N.*T.*K
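
Before relying on one of these regexps, it can help to test it against a line that should match and one that should not. The sketch below does this with two small helper functions, `matches' and `show', which are made up for the illustration, as are the test strings; it assumes GNU grep for `\|' and `\?'.

```shell
# Hypothetical helper: succeed if the line in $2 matches the regexp in $1.
matches() {
    printf '%s\n' "$2" | grep -q -- "$1"
}

# Hypothetical helper: print the regexp, the verdict, and the test line.
show() {
    if matches "$1" "$2"; then verdict='matches'; else verdict='does not match'; fi
    printf '%-28s %s: %s\n' "$1" "$verdict" "$2"
}

show '^tow.*ing$' 'towering'
show '^tow.*ing$' 'towel'
show 'q[^u]' 'Iraqi'
show 'q[^u]' 'queen'
show '\(195[7-9]\)\|\(196[0-9]\)' 'born in 1963'
show '\(195[7-9]\)\|\(196[0-9]\)' 'born in 1970'
show '\.te\?xt' 'notes.text'
show '\.te\?xt' 'context'
```

Note that `q[^u]' needs some character after the `q', so a line ending in `q' does not match it.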

Searching More than Plain Text Files

The following recipes are for searching data other than in plain text files.

Compressed Search: Matching lines in compressed files.
URL Search: Matching lines in Web pages.

Matching Lines in Compressed Files

Use zgrep to search through text in files that are compressed. These files usually have a `.gz' file name extension, and can't be searched or otherwise read by other tools without uncompressing the file first (for more about compressed files, see Compressed Files).

The zgrep tool works just like grep, except it searches through the text of compressed files. It outputs matches to the given pattern as if you'd searched through normal, uncompressed files. It leaves the files compressed when it exits.

To search through the compressed file `README.gz' for the text `Linux', type:
```
$ zgrep Linux README.gz [RET]
```

Matching Lines in Web Pages

You can grep a Web page or other URL by giving the URL to lynx with the `-dump' option, and piping the output to grep.

To search the contents of the URL http://example.com/ for lines containing the text `gonzo' or `hunter', type:

$ lynx -dump http://example.com/ | grep 'gonzo\|hunter' [RET]

Outputting the Context of a Search

It is sometimes useful to see a matched line in its context in the file -- that is, to see some of the lines that surround it.

Use the `-C' option with grep to output results in context -- give it the number of lines of "context" to output both before and after each match. (Early versions of GNU grep assumed two lines when `-C' was given alone; current versions require the number.) As a shorthand, you can use the number itself as an option instead of `-C'.

To search `/usr/dict/words' for lines matching `tsch' and output two lines of context before and after each line of output, type:
```
$ grep -C 2 tsch /usr/dict/words [RET]
```
To search `/usr/dict/words' for lines matching `tsch' and output six lines of context before and after each line of output, type:
```
$ grep -6 tsch /usr/dict/words [RET]
```

To output matches and the lines before them, use `-B'; to output matches and the lines after them, use `-A'. Give a number as an argument to either of these options to specify that number of context lines.

To search `/usr/dict/words' for lines matching `tsch' and output two lines of context before each line of output, type:
```
$ grep -B 2 tsch /usr/dict/words [RET]
```
To search `/usr/dict/words' for lines matching `tsch' and output six lines of context after each line of output, type:
```
$ grep -A6 tsch /usr/dict/words [RET]
```
To search `/usr/dict/words' for lines matching `tsch' and output ten lines of context before and three lines of context after each line of output, type:
```
$ grep -B10 -A3 tsch /usr/dict/words [RET]
```
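
The effect of the three context options is easy to see on a file whose lines are simply numbered in words. This is a sketch with a made-up seven-line file; `-A', `-B' and `-C' are found in GNU grep and most other current versions, but are not required of every grep.

```shell
# Hypothetical file: seven lines, each holding one number word.
workdir=$(mktemp -d)
printf '%s\n' one two three four five six seven > "$workdir/lines"

# One line of context on both sides of the match.
around=$(grep -C 1 four "$workdir/lines")

# Two lines of context before the match.
before=$(grep -B 2 four "$workdir/lines")

# One line of context after the match.
after=$(grep -A 1 four "$workdir/lines")

printf '%s\n---\n%s\n---\n%s\n' "$around" "$before" "$after"
rm -r "$workdir"
```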

Searching and Replacing Text

A quick way to search and replace some text in a file is to use the following one-line perl command:

$ perl -pi -e "s/oldstring/newstring/g;" filespec [RET]

In this example, oldstring is the string to search, newstring is the string to replace it with, and filespec is the name of the file or files to work on. You can use this for more than one file.

To replace the string `helpless' with the string `helpful' in all files in the current directory, type:
```
$ perl -pi -e "s/helpless/helpful/g;" * [RET]
```

You can also search and replace text in an Emacs buffer; to do this, use the replace-regexp function and give both the expression to search for and the expression to replace it with.

To replace the text `helpless' with the text `helpful' in the current buffer, type:
```
M-x replace-regexp [RET] helpless [RET] helpful [RET]
```

NOTE: You can also search and replace text in most text editors, including Emacs; see Searching and Replacing in Emacs.

Searching Text in Emacs

The following sections show ways of searching for text in Emacs -- incrementally, for a word or phrase, or for a pattern -- and for searching and then replacing text.

Emacs Incremental Search: Searching incrementally.
Emacs Word Search: Searching for a word or phrase.
Emacs Regexp Search: Searching for a pattern.
Emacs Replace: Replacing text in Emacs.

Searching Incrementally in Emacs

Type C-s to use the Emacs incremental search function. It takes text as input in the minibuffer and it searches for that text from point toward the end of the current buffer. Type C-s again to search for the next occurrence of the text you're searching for; this works until no more matches occur. Then, Emacs reports `Failing I-search' in the minibuffer; type C-s again to wrap to the beginning of the buffer and continue the search from there.

It gets its name "incremental" because it begins searching immediately when you start to type text, and so it builds a search string in increments -- for example, if you want to search for the word `sunflower' in the current buffer, you start to type

C-s s

At that point Emacs searches forward through the buffer to the first `s' character, and highlights it. Then, as you type u, it searches forward to the first `su' in the buffer and highlights that (if a `u' appears immediately after the `s' it first stopped at, it stays where it is, and highlights the `s' and the `u'). It continues to do this as long as you type and as long as there is a match in the current buffer. As soon as what you type does not appear in the buffer, Emacs beeps and a message appears in the minibuffer stating that the search has failed.

To search for the next instance of the last string you gave, type C-s again; if you keep [CTRL] held down, every time you press the [S] key, Emacs will advance to the next match in the buffer.

This is generally the fastest and most common type of search you will use in Emacs.

You can do an incremental search through the buffer in reverse -- that is, from point to the beginning of the buffer -- with the isearch-backward function, C-r.

To search for the text `moon' in the current buffer from point in reverse to the beginning of the buffer, type:
```
C-r moon
```

Searching for a Phrase in Emacs

Like grep, the Emacs incremental search only works on lines of text, so it only finds phrases on a single line. If you search for `hello, world' with the incremental search and the text `hello,' appears at the end of a line and the text `world' appears at the beginning of the next line, it won't find it.

To find a multi-word phrase across line breaks, use the word-search-forward function. It searches for a phrase or words regardless of punctuation or spacing.

To search forward through the current buffer for the phrase `join me', type:
```
M-x word-search-forward [RET] join me [RET]
```

NOTE: The word-search-backward function does the same as word-search-forward, except it searches backward through the buffer, from point to the beginning of the buffer.

Searching for a Regexp in Emacs

Use the search-forward-regexp function to search for a regular expression from point to the end of the current buffer.

To search forward through the current buffer for the regexp `@.*\.org', type:
```
M-x search-forward-regexp [RET] @.*\.org [RET]
```

The keyboard accelerator for this command is M-C-s -- on most keyboards, you press and release [ESC] and then hold down [CTRL] while you type s. To repeat the last regexp search you made, type M-C-s C-s; then, as long as you have [CTRL] held down, you can keep typing s to advance to the next match, just as you would with an incremental search.

NOTE: There is a search-backward-regexp function that is identical but searches backward, from point to the top of the buffer.

Searching and Replacing in Emacs

To search for and replace text in Emacs, use the replace-regexp function. When you run this function, Emacs will ask for both the text or regexp to search for and the text to replace it with.

To replace the text `day' with the text `night' in the current buffer, type:
```
M-x replace-regexp [RET] day [RET] night [RET]
```

This function is especially useful for replacing control characters with text, or for replacing text with control characters, which you can specify with C-q, the quoted-insert function (see Inserting Special Characters in Emacs).

To replace all the `^M' characters in the current buffer with regular linefeeds, type:
```
M-x replace-regexp [RET] C-q C-m [RET] C-q 012 [RET] [RET]
```

Searching Text in Less

There are two useful commands in less for searching through text: / and ?. To search forward through the text, type / followed by a regexp to search for; to search backward through the text, use ?.

When you do a search, the word or other regexp you search for appears highlighted throughout the text.

To search forward through the text you are perusing for the word `cat', type:
```
/cat [RET]
```
To search backward through the text you are perusing for the regexp `[ch]at', type:
```
?[ch]at [RET]
```
