Searching Text

It's quite common to search through text for a given sequence of characters (such as a word or phrase), called a string, or even for a pattern describing a set of such strings; this chapter contains recipes for doing these kinds of things.

Searching for a Word or Phrase

The primary command used for searching through text is the rather froglike-sounding tool called grep (the origin of its name is explained in Regular Expressions -- Matching Text Patterns, where its advanced usage is discussed). It outputs lines of its input that contain a given string or pattern.

To search for a word, give that word as the first argument. By default, grep searches standard input; give the name of a file to search as the second argument.

To search for a phrase, specify it in quotes.

For example, searching the file `catalog' for the quoted phrase `Compact Disc' outputs all lines in the file that contain that exact string; it will not match, however, lines containing `compact disc' or any other variation on the case of letters in the search pattern. Use the `-i' option to specify that matches are to be made regardless of case.

With the `-i' option, a search of the file `catalog' for `compact disc' outputs lines containing any variation of the pattern, including `Compact Disc', `COMPACT DISC', and `comPact dIsC'.
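The word, phrase, and case-insensitive searches described above can be sketched as follows. The contents of `catalog' are hypothetical sample data made up for illustration; only the file name and the search strings come from the text.

```shell
# Hypothetical sample data; only the file name `catalog' comes from the text.
tmp=$(mktemp -d)
printf '%s\n' 'Compact Disc player' 'compact disc cleaner' 'CD rack' > "$tmp/catalog"

# A single word is the first argument; the file to search is the second.
word=$(grep CD "$tmp/catalog")

# A phrase is quoted so that the shell passes it to grep as one argument.
phrase=$(grep 'Compact Disc' "$tmp/catalog")

# -i ignores case, so the first two lines both match.
anycase=$(grep -i 'compact disc' "$tmp/catalog")

printf '%s\n' "$word" "$phrase" "$anycase"
rm -r "$tmp"
```

The results are captured in variables only so they can be printed together at the end; at the command line you would normally run each grep by itself.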

One thing to keep in mind is that grep only matches patterns that appear on a single line, so if one line in `catalog' ends with the word `compact' and the next begins with `disc', grep will not match either line. There is a way around this with grep (see Finding Phrases Regardless of Spacing), or you can search the text in Emacs (see Searching for a Phrase in Emacs).

You can specify more than one file to search. When you specify multiple files, each match that grep outputs is preceded by the name of the file it's in (you can suppress this with the `-h' option).

Use the `-r' option to search a given directory recursively, searching all subdirectories it contains.
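A minimal sketch of searching several files and a directory tree follows. The directory layout and file contents are hypothetical, invented for this illustration.

```shell
# Hypothetical directory tree with two small files.
tmp=$(mktemp -d)
mkdir -p "$tmp/music/old"
printf 'Compact Disc\n' > "$tmp/music/catalog"
printf 'Compact Disc\n' > "$tmp/music/old/catalog"

# Two files given: each match is preceded by the name of its file.
named=$(grep 'Compact Disc' "$tmp/music/catalog" "$tmp/music/old/catalog")

# -h suppresses the file name prefix.
plain=$(grep -h 'Compact Disc' "$tmp/music/catalog" "$tmp/music/old/catalog")

# -r searches the directory and all the subdirectories it contains.
recursive=$(grep -r 'Compact Disc' "$tmp/music")

printf '%s\n' "$named" "$plain" "$recursive"
rm -r "$tmp"
```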

NOTE: There are more complex things you can search for than simple strings, as will be explained in the next section.

Regular Expressions -- Matching Text Patterns

In addition to word and phrase searches, you can use grep to search for complex text patterns called regular expressions. A regular expression -- or "regexp" -- is a text string of ordinary and special characters that specifies a set of strings to match.

Technically speaking, the word or phrase patterns described in the previous section are regular expressions -- just very simple ones. In a regular expression, most characters -- including letters and numbers -- represent themselves. For example, the regexp pattern 1 matches the string `1', and the pattern bee matches the string `bee'.

There are a number of reserved characters called metacharacters that don't represent themselves in a regular expression, but have a special meaning that is used to build complex patterns. These metacharacters are as follows: ., *, [, ], ^, $, and \.

To specify one of these literal characters in a regular expression, precede the character with a `\'.

The following table describes the special meanings of the metacharacters and gives examples of their usage.
METACHARACTER MEANING
. Matches any one character, with the exception of the newline character. For example, . matches `a', `1', `?', `.' (a literal period character), and so forth.
* Matches the preceding regexp zero or more times. For example, -* matches `-', `--', `---', `---------', and so forth. Now imagine a line of text with a million `-' characters somewhere in it, all marching off across the horizon, up into the blue sky, and through the clouds. A million `-' characters in a row. This pattern would match it. Now think of the same long parade, but it's a million and one `-' characters -- it matches that, too.
[ ] Encloses a character set, and matches any member of the set -- for example, [abc] matches either `a', `b', or `c'. In addition, the hyphen (`-') and caret (`^') characters have special meanings when used inside brackets:
- The hyphen specifies a range of characters, ordered according to their ASCII value (see Viewing a Character Chart). For example, [0-9] is synonymous with [0123456789]; [A-Za-z] matches one uppercase or lowercase letter. To include a literal `-' in a list, specify it as the last character in the list: so [0-9-] matches either a single digit character or a `-'.
^ As the first character of a list, the caret means that any character except those in the list should be matched. For example, [^a] matches any character except `a', and [^0-9] matches any character except a numeric digit.
^ Matches the beginning of the line. So ^a matches `a' only when it is the first character on a line.
$ Matches the end of the line. So a$ matches `a' only when it is the last character on a line.
\ Use \ before a metacharacter when you want to specify that literal character. So \$ matches a dollar sign character (`$'), and \\ matches a single backslash character (`\').
In addition, use \ to build new metacharacters, by using it before a number of other characters:
\| Called the `alternation operator'; it matches either regexp it is between -- use it to join two separate regexps to match either of them. For example, a\|b matches either `a' or `b'.
\+ Matches the preceding regexp as many times as possible, but at least once. So a\+ matches one or more adjacent `a' characters, such as `aaa', `aa', and `a'.
\? Matches the regexp preceding it either zero or one times. So a\? matches `a' or an empty string -- which matches every line.
\{number\} Matches the previous regexp (the one specified to the left of this construction) that number of times -- so a\{4\} matches `aaaa'. Use \{number,\} to match the preceding regexp number or more times, \{,number\} to match the preceding regexp zero to number times, and \{number1,number2\} to match the preceding regexp from number1 to number2 times.
\(regexp\) Groups regexp together for an alternative; useful for combination regexps. For example, while moo\? matches `mo' or `moo', \(moo\)\? matches `moo' or the empty string.
NOTE: The name `grep' derives from a command in the now-obsolete Unix ed line editor tool -- the ed command for searching globally through a file for a regular expression and then printing those lines was g/re/p, where re was the regular expression you'd use. Eventually, the grep command was written to do this search on a file when not using ed.

The following sections describe some regexp recipes for commonly searched-for patterns.
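Before turning to the recipes, here is a small sketch that tries a few of the metacharacters from the table. The sample lines are hypothetical, and the sketch assumes GNU grep, since constructs such as `\+' are extensions that other versions of grep may not support.

```shell
# Hypothetical sample lines, one string per line.
sample=$(printf '%s\n' 'a' 'b' 'aaaa' 'moo' 'price: $5' 'x')

# [ab] matches any line containing an `a' or a `b'.
set_match=$(printf '%s\n' "$sample" | grep '[ab]')

# a\{4\} requires four adjacent `a' characters.
four=$(printf '%s\n' "$sample" | grep 'a\{4\}')

# \$ matches a literal dollar sign rather than the end of the line.
dollar=$(printf '%s\n' "$sample" | grep '\$')

# \(moo\)\+ groups `moo' and matches it one or more times.
group=$(printf '%s\n' "$sample" | grep '\(moo\)\+')

printf '%s\n' "$set_match" "$four" "$dollar" "$group"
```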

Matching Lines Beginning with Certain Text

Use `^' in a regexp to denote the beginning of a line.
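As a minimal sketch, using hypothetical sample lines, the following compares an unanchored search with one anchored to the beginning of the line:

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'in the beginning' 'begin in' 'within')

# Without `^', every line containing `in' matches.
anywhere=$(printf '%s\n' "$sample" | grep 'in')

# With `^', only lines that begin with `in' match.
starts=$(printf '%s\n' "$sample" | grep '^in')

printf '%s\n' "$starts"
```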

NOTE: Quote regexps like these with ' characters; this is because some shells otherwise treat the `^' character as a special "metacharacter" (see Passing Special Characters to Commands).

Matching Lines Ending with Certain Text

Use `$' as the last character of quoted text to match that text only at the end of a line.
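A short sketch with hypothetical sample lines; the single quotes keep the shell from treating `$' specially:

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'Is that so?' 'so what' 'and so')

# Only the line whose last characters are `so' matches.
ends=$(printf '%s\n' "$sample" | grep 'so$')

printf '%s\n' "$ends"
```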

NOTE: To use `$' in a regexp to find words that rhyme with a given word, see Listing Words that Match a Pattern.

Matching Lines of a Certain Length

To match lines of a particular length, use that number of `.' characters between `^' and `$' -- for example, to match all lines that are two characters (or columns) wide, use `^..$' as the regexp to search for.

For longer lines, it is more useful to use a different construct: `^.\{number\}$', where number is the number of characters each line must contain. Use `,' to specify a range of numbers; for example, `^.\{60,70\}$' matches lines from 60 to 70 characters long.
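Both constructs can be tried on a few hypothetical lines of known length:

```shell
# Hypothetical sample lines of 2, 3, 8, and 12 characters.
sample=$(printf '%s\n' 'ab' 'abc' 'abcdefgh' 'abcdefghijkl')

# Exactly two characters wide.
two=$(printf '%s\n' "$sample" | grep '^..$')

# From three to eight characters wide.
range=$(printf '%s\n' "$sample" | grep '^.\{3,8\}$')

printf '%s\n' "$two" "$range"
```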

Matching Lines That Contain Any of Some Regexps

To match lines that contain any of a number of regexps, specify each of the regexps to search for between alternation operators (`\|') as the regexp to search for. Lines containing any of the given regexps will be output.

For example, using `the sea\|cake' as the regexp when searching the file `playlist' outputs any lines that match the patterns `the sea' or `cake', including lines matching both patterns.
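A sketch of such a search, assuming GNU grep and a hypothetical `playlist' file whose song titles are invented for illustration:

```shell
# Hypothetical contents for the file `playlist'.
tmp=$(mktemp -d)
printf '%s\n' 'By the sea' 'Mountain song' 'cake by the sea' 'Let them eat cake' > "$tmp/playlist"

# Lines containing `the sea' or `cake' (or both) are output.
either=$(grep 'the sea\|cake' "$tmp/playlist")

printf '%s\n' "$either"
rm -r "$tmp"
```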

Matching Lines That Contain All of Some Regexps

To output lines that match all of a number of regexps, use grep to output lines containing the first regexp you want to match, and pipe the output to a grep with the second regexp as an argument. Continue adding pipes to grep searches for all the regexps you want to search for.
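The piping described above might look like this, again with a hypothetical `playlist' file:

```shell
# Hypothetical contents for the file `playlist'.
tmp=$(mktemp -d)
printf '%s\n' 'By the sea' 'cake by the sea' 'Let them eat cake' > "$tmp/playlist"

# The first grep keeps lines with `the sea'; the second keeps
# only those of them that also contain `cake'.
both=$(grep 'the sea' "$tmp/playlist" | grep 'cake')

printf '%s\n' "$both"
rm -r "$tmp"
```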

NOTE: To match lines containing some regexps in a particular order, see Regexps for Common Situations.

Matching Lines That Don't Contain a Regexp

To output all lines in a text that don't contain a given pattern, use grep with the `-v' option -- this option inverts the sense of matching, selecting all non-matching lines.
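A minimal sketch with hypothetical sample lines:

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'By the sea' 'cake by the sea' 'Mountain song')

# -v selects the lines that do NOT contain `sea'.
without=$(printf '%s\n' "$sample" | grep -v 'sea')

printf '%s\n' "$without"
```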

Matching Lines That Only Contain Certain Characters

To match lines that only contain certain characters, use the regexp `^[characters]*$', where characters are the ones to match.

The `-i' option matches characters regardless of case; so, when used with `-i', the regexp `^[aeiou]*$' matches lines made up only of vowel characters, whether uppercase or lowercase.
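As a sketch, here is a vowels-only search run on a few hypothetical lines. Because `*' also matches zero characters, the same regexp would match empty lines too.

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'aeiou' 'OUI' 'audio')

# Lines consisting only of vowels, in either case; `audio' fails
# because it contains a `d'.
vowels=$(printf '%s\n' "$sample" | grep -i '^[aeiou]*$')

printf '%s\n' "$vowels"
```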

Finding Phrases Regardless of Spacing

One way to search for a phrase that might occur with extra spaces between words, or across a line or page break, is to remove all linefeeds and extra spaces from the input, and then grep that.

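To do this, pipe the input to tr with `'\r\n:\>\|-'' as an argument to the `-d' option (deleting all linebreaks from the input, along with any `:', `>', `|', and `-' characters); pipe that to the fmt filter with the `-u' option (outputting the text with uniform spacing); and pipe that to grep with the pattern to search for. Because tr deletes the linebreaks rather than replacing them with spaces, words separated only by a linebreak are joined together unless the line ends with a space.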
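The sketch below is a variation on this idea, not the exact pipeline given in the text. It assumes that replacing each run of spaces, tabs, and linebreaks with a single space (using the `-s' option of tr) is acceptable, so that words split across lines stay separate. The sample text is hypothetical.

```shell
# Hypothetical text: the phrase is split across a line break and
# contains extra spaces.
text=$(printf '%s\n' 'I bought a compact' 'disc    player today.')

# tr -s translates every whitespace character to a space and squeezes
# each run of spaces into one, leaving a single long line for grep.
found=$(printf '%s\n' "$text" | tr -s '[:space:]' ' ' | grep 'compact disc player')

printf '%s\n' "$found"
```

Since all of the input ends up on one line, grep outputs the whole text when the phrase is found; this tells you that the phrase occurs, not where.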

NOTE: The Emacs editor has its own special search for doing this -- see Searching for a Phrase in Emacs.

Finding Patterns in Certain Contexts

To search for a pattern that only occurs in a particular context, grep for the context in which it should occur, and pipe the output to another grep to search for the actual pattern.

For example, this can be useful to search for a given pattern only when it is quoted with a `>' character in an email message.

You can also reverse the order and use the `-v' option to output all lines containing a given pattern that are not in a given context.
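Both directions can be sketched with a hypothetical email message in which quoted lines begin with `>':

```shell
# Hypothetical message text.
message=$(printf '%s\n' '> the deadline is Friday' 'the deadline moved' '> see you then')

# Context first, then the pattern: `deadline' only on quoted lines.
quoted=$(printf '%s\n' "$message" | grep '^>' | grep 'deadline')

# Pattern first, then -v on the context: `deadline' only on lines
# that are not quoted.
unquoted=$(printf '%s\n' "$message" | grep 'deadline' | grep -v '^>')

printf '%s\n' "$quoted" "$unquoted"
```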

Using a List of Regexps to Match From

You can keep a list of regexps in a file, and use grep to search text for any of the patterns in the file. To do this, specify the name of the file containing the regexps to search for as an argument to the `-f' option.

This can be useful, for example, if you need to search a given text for a number of words -- keep each word on its own line in the regexp file.
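A minimal sketch follows; the file names `patterns' and `playlist' and their contents are hypothetical.

```shell
# Hypothetical regexp file (one pattern per line) and text to search.
tmp=$(mktemp -d)
printf '%s\n' 'cake' 'the sea' > "$tmp/patterns"
printf '%s\n' 'By the sea' 'Mountain song' 'Let them eat cake' > "$tmp/playlist"

# -f reads the regexps from the file; a line matching any of them is output.
listed=$(grep -f "$tmp/patterns" "$tmp/playlist")

printf '%s\n' "$listed"
rm -r "$tmp"
```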

Regexps for Common Situations

The following table lists sample regexps and describes what they match. You can use these regexps as boilerplate when building your own regular expressions for searching text. Remember to enclose regexps in quotes.
TO MATCH LINES THAT ...                                                 USE THIS REGEXP
contain nine zeroes in a row                                            0\{9\}
are exactly four characters long                                        ^....$ or ^.\{4\}$
are exactly seventy characters long                                     ^.\{70\}$
begin with an asterisk character                                        ^\*
begin with `tow' and end with `ing'                                     ^tow.*ing$
contain a number                                                        [0-9]
do not contain a number                                                 ^[^0-9]*$
contain a year from 1991 through 1995                                   199[1-5]
contain a year from 1957 through 1969                                   \(195[7-9]\)\|\(196[0-9]\)
contain either `.txt' or `.text'                                        \.te\?xt
contain `cat' then `gory' in the same word                              cat[^ ]*gory
contain `cat' then `gory' in the same line                              cat.*gory
contain a `q' not followed by a `u'                                     q[^u]
contain any `ftp', `gopher', or `http' URLs                             \(ftp\|gopher\|http\)://.*\..*
contain `N', `T', and `K', with zero or more characters between each    N.*T.*K

Searching More than Plain Text Files

The following recipes are for searching data other than in plain text files.

Matching Lines in Compressed Files

Use zgrep to search through text in files that are compressed. These files usually have a `.gz' file name extension, and can't be searched or otherwise read by other tools without uncompressing the file first (for more about compressed files, see Compressed Files).

The zgrep tool works just like grep, except it searches through the text of compressed files. It outputs matches to the given pattern as if you'd searched through normal, uncompressed files. It leaves the files compressed when it exits.
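A short sketch, assuming the gzip and zgrep tools are installed; the file name and contents are hypothetical.

```shell
# Hypothetical file, compressed with gzip to make `catalog.gz'.
tmp=$(mktemp -d)
printf '%s\n' 'Compact Disc player' 'Cassette deck' > "$tmp/catalog"
gzip "$tmp/catalog"

# zgrep searches the compressed text directly.
match=$(zgrep 'Compact Disc' "$tmp/catalog.gz")

# The directory still holds only the compressed file afterward.
still=$(ls "$tmp")

printf '%s\n' "$match" "$still"
rm -r "$tmp"
```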

Matching Lines in Web Pages

You can grep a Web page or other URL by giving the URL to lynx with the `-dump' option, and piping the output to grep.

$ lynx -dump http://example.com/ | grep 'gonzo\|hunter' [RET]

Outputting the Context of a Search

It is sometimes useful to see a matched line in its context in the file -- that is, to see some of the lines that surround it.

Use the `-C' option with grep to output results in context -- it outputs matched lines with lines of "context" both before and after each match. Older versions of grep output two lines of context when `-C' was given alone; current versions of GNU grep expect a number after `-C' giving how many context lines to output on each side. You can also give that number by itself as an option (for example, `-4') instead of `-C'.

To output matches and the lines before them, use `-B'; to output matches and the lines after them, use `-A'. Follow either of these options with a number to specify that number of context lines.
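The three context options can be compared on a hypothetical five-line input. The sketch assumes GNU grep and gives each option an explicit number.

```shell
# Hypothetical input: five short lines.
sample=$(printf '%s\n' 'one' 'two' 'three' 'four' 'five')

# One line of context on each side of the match.
around=$(printf '%s\n' "$sample" | grep -C 1 'three')

# One line of context before the match only.
before=$(printf '%s\n' "$sample" | grep -B 1 'three')

# One line of context after the match only.
after=$(printf '%s\n' "$sample" | grep -A 1 'three')

printf '%s\n' "$around" "$before" "$after"
```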

Searching and Replacing Text

A quick way to search and replace some text in a file is to use the following one-line perl command:

$ perl -pi -e "s/oldstring/newstring/g;" filespec [RET]

In this example, oldstring is the string to search for (perl treats it as a regular expression), newstring is the string to replace it with, and filespec is the name of the file or files to work on. You can use this for more than one file.

You can also search and replace text in an Emacs buffer; to do this, use the replace-regexp function and give both the expression to search for and the expression to replace it with.

NOTE: You can also search and replace text in most text editors, including Emacs; see Searching and Replacing in Emacs.

Searching Text in Emacs

The following sections show ways of searching for text in Emacs -- incrementally, for a word or phrase, or for a pattern -- and for searching and then replacing text.

Searching Incrementally in Emacs

Type C-s to use the Emacs incremental search function. It takes text as input in the minibuffer and it searches for that text from point toward the end of the current buffer. Type C-s again to search for the next occurrence of the text you're searching for; this works until no more matches occur. Then, Emacs reports `Failing I-search' in the minibuffer; type C-s again to wrap to the beginning of the buffer and continue the search from there.

It gets its name "incremental" because it begins searching immediately when you start to type text, and so it builds a search string in increments -- for example, if you want to search for the word `sunflower' in the current buffer, you start to type

C-s s

At that point Emacs searches forward through the buffer to the first `s' character, and highlights it. Then, as you type u, it searches forward to the first `su' in the buffer and highlights that (if a `u' appears immediately after the `s' it first stopped at, it stays where it's at, and highlights the `s' and the `u'). It continues to do this as long as you type and as long as there is a match in the current buffer. As soon as what you type does not appear in the buffer, Emacs beeps and a message appears in the minibuffer stating that the search has failed.

To search for the next instance of the last string you gave, type C-s again; if you keep [CTRL] held down, every time you press the [S] key, Emacs will advance to the next match in the buffer.

This is generally the fastest and most common type of search you will use in Emacs.

You can do an incremental search through the buffer in reverse -- that is, from point to the beginning of the buffer -- with the isearch-backward function, C-r.

Searching for a Phrase in Emacs

Like grep, the Emacs incremental search only works on lines of text, so it only finds phrases on a single line. If you search for `hello, world' with the incremental search and the text `hello,' appears at the end of a line and the text `world' appears at the beginning of the next line, it won't find it.

To find a multi-word phrase across line breaks, use the word-search-forward function. It searches for a phrase or words regardless of punctuation or spacing.

NOTE: The word-search-backward function does the same as word-search-forward, except it searches backward through the buffer, from point to the beginning of the buffer.

Searching for a Regexp in Emacs

Use the search-forward-regexp function to search for a regular expression from point to the end of the current buffer.

The keyboard accelerator for this command is M-C-s -- on most keyboards, you press and release [ESC] and then hold down [CTRL] while you type s. To repeat the last regexp search you made, type M-C-s C-s; then, as long as you have [CTRL] held down, you can keep typing s to advance to the next match, just as you would with an incremental search.

NOTE: There is a search-backward-regexp function that is identical but searches backward, from point to the top of the buffer.

Searching and Replacing in Emacs

To search for and replace text in Emacs, use the replace-regexp function. When you run this function, Emacs will ask for both the text or regexp to search for and the text to replace it with.

This function is especially useful for replacing control characters with text, or for replacing text with control characters, which you can specify with C-q, the quoted-insert function (see Inserting Special Characters in Emacs).

Searching Text in Less

There are two useful commands in less for searching through text: / and ?. To search forward through the text, type / followed by a regexp to search for; to search backward through the text, use ?.

When you do a search, the word or other regexp you search for appears highlighted throughout the text.

