Formatting Text

Methods and tools for changing the arrangement or presentation of text are often useful for preparing text for printing. This chapter discusses ways of changing the spacing of text and setting up pages, of underlining and sorting and reversing text, and of numbering lines of text.

Spacing Text

These recipes are for changing the spacing of text -- the whitespace that exists between words, lines, and paragraphs.

The filters described in this section send output to standard output by default; to save their output to a file, use shell redirection (see Redirecting Output to a File).

Eliminating Extra Spaces in Text

To eliminate extra whitespace within lines of text, use the fmt filter; to eliminate extra whitespace between lines of text, use cat.

Use fmt with the `-u' option to output text with "uniform spacing," where the space between words is reduced to one space character and the space between sentences is reduced to two space characters.

Use cat with the `-s' option to "squeeze" multiple adjacent blank lines into one.

You can combine both of these commands to output text with multiple adjacent blank lines squeezed into one and with uniform spacing between words. The following example shows how the output of the combined commands is sent to less so that it can be perused on the screen.
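A minimal sketch of that pipeline follows. Only the file name `term-paper' comes from the text; the sample contents and the scratch directory are made up for illustration. The result is captured in a variable so the sketch runs unattended, and the interactive form that ends in less is shown as a comment.

```shell
# Work in a scratch directory; the sample text is hypothetical.
dir=$(mktemp -d)
printf 'Too   many    spaces  here.\n\n\n\nSecond   paragraph.\n' > "$dir/term-paper"

# Interactive form, as described in the text:
#   cat -s term-paper | fmt -u | less

# Same pipeline without the pager: squeeze blank lines, then make spacing uniform.
cleaned=$(cat -s "$dir/term-paper" | fmt -u)
printf '%s\n' "$cleaned"

rm -rf "$dir"
```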

Notice that in this example, both fmt and less worked on their standard input instead of on a file -- the standard output of cat (the contents of `term-paper' with extra blank lines squeezed out) was passed to the standard input of fmt, and its standard output (the space-squeezed `term-paper', now with uniform spacing) was sent to the standard input of less, which displayed it on the screen.

Single-Spacing Text

There are many methods for single-spacing text. To remove all empty lines from text output, use grep with the regular expression `.', which matches any character, and therefore matches any line that isn't empty (see Regular Expressions -- Matching Text Patterns). You can then redirect this output to a file, or pipe it to other commands; the original file is not altered.

This command outputs all lines that are not empty -- so lines containing only whitespace characters, such as spaces and tabs, will still be output.

To remove from the output all empty lines, and all lines that consist of only space characters, use `[^ ]' as the regexp to search for; it matches any line containing at least one character that is not a space. But this regexp will still output lines that contain only tab characters; to remove from the output all empty lines and lines that contain only a combination of tab or space characters, use `[^[:space:]]' as the regexp to search for. It uses the special predefined `[:space:]' regexp class, which matches any kind of space character at all, including tabs.
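The three patterns can be compared side by side on a small made-up file (the file name `sample' and its contents are assumptions). This sketch uses the bracket expressions on their own, since a trailing `.' would additionally require a second character on the line. Counting with `-c' keeps the result easy to check.

```shell
dir=$(mktemp -d)
# Five lines: text, empty, spaces only, a single tab, text.
printf 'one\n\n   \n\t\ntwo\n' > "$dir/sample"

not_empty=$(grep -c '.' "$dir/sample")             # drops only the empty line
not_spaces=$(grep -c '[^ ]' "$dir/sample")         # also drops the spaces-only line
not_blank=$(grep -c '[^[:space:]]' "$dir/sample")  # also drops the tab-only line
printf '%s %s %s\n' "$not_empty" "$not_spaces" "$not_blank"

rm -rf "$dir"
```

To keep the surviving lines rather than count them, drop `-c' and redirect the output to a new file.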

If a file is already double-spaced, where all even lines are blank, you can remove those lines from the output by using sed with the `n;d' expression.
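As a small sketch of the `n;d' expression (the input lines are invented): `n' prints the current line and reads the next one, and `d' deletes that next line, so every second line disappears.

```shell
# Hypothetical double-spaced input: every even line is blank.
single=$(printf 'first\n\nsecond\n\nthird\n\n' | sed 'n;d')
printf '%s\n' "$single"
```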

Double-Spacing Text

To double-space text, where one blank line is inserted between each line in the original text, use the pr tool with the `-d' option. By default, pr paginates text and puts a header at the top of each page with the current date, time, and page number; give the `-t' option to omit this header.

To send the output directly to the printer for printing, you would pipe the output to lpr:

$ pr -d -t term-paper | lpr [RET]

NOTE: The pr ("print") tool is a text pre-formatter, often used to paginate and otherwise prepare text files for printing; there is more discussion on the use of this tool in Paginating Text.

Triple-Spacing Text

To triple-space text, where two blank lines are inserted between each line of the original text, use sed with the `'G;G'' expression.

The `G' expression appends one blank line to each line of sed's output; by separating several `G' commands with `;' you can append more than one blank line (but you must quote this command, because the semicolon (`;') has meaning to the shell -- see Passing Special Characters to Commands). Each additional `G' adds one more blank line, so you can space text more widely than double or triple spacing.
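A brief sketch with two invented input lines shows the effect; each `G' adds one blank line after every line of input.

```shell
# Two blank lines after each line of input (triple spacing).
triple=$(printf 'a\nb\n' | sed 'G;G')
printf '%s\n' "$triple"

# Three blank lines after each line, by adding one more G.
quadruple=$(printf 'a\nb\n' | sed 'G;G;G')
```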

The usage of sed is described in Editing Streams of Text.

Adding Line Breaks to Text

Sometimes a file will not have line breaks at the end of each line (this commonly happens during file conversions between operating systems). To add line breaks to a file that does not have them, use the text formatter fmt. It outputs text with lines arranged up to a specified width; if no width is specified, it formats text up to a width of 75 characters per line.

Use the `-w' option to specify the maximum line width.
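For instance, one long line can be broken into lines of at most 30 characters. The sentence and the width of 30 are arbitrary choices for this sketch.

```shell
long='The quick brown fox jumps over the lazy dog and keeps running through the field.'

# Break the single long line into lines no wider than 30 characters.
wrapped=$(printf '%s\n' "$long" | fmt -w 30)
printf '%s\n' "$wrapped"

# Length of the longest output line, for checking against the limit.
widest=$(printf '%s\n' "$wrapped" | awk '{ if (length($0) > m) m = length($0) } END { print m }')
```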

Adding Margins to Text

Giving text an extra left margin is especially good when you want to print a copy and punch holes in it for use with a three-ring binder.

To output a text file with a larger left margin, use pr with the file name as an argument; give the `-t' option (to disable headers and footers), and, as an argument to the `-o' option, give the number of spaces to offset the text. Add the number of spaces to the page width (whose default is 72) and specify this new width as an argument to the `-w' option.

This command is almost always used for printing, so the output is usually just piped to lpr instead of saved to a file. Many text documents have a width of 80 and not 72 columns; if you are printing such a document and need to keep the 80 columns across the page, specify a new width of 85. If your printer can only print 80 columns of text, specify a width of 80; the text will be reformatted to 75 columns after the 5-column margin.
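A sketch of the 5-space margin case follows: 5 is added to the default width of 72, giving 77. The file name `owners-manual' and its contents are hypothetical, and the output is captured rather than piped to lpr so the sketch runs without a printer.

```shell
dir=$(mktemp -d)
printf 'hello\nworld\n' > "$dir/owners-manual"

# Printing form:
#   pr -t -o 5 -w 77 owners-manual | lpr

# Same command, with the output captured for inspection.
shifted=$(pr -t -o 5 -w 77 "$dir/owners-manual")
printf '%s\n' "$shifted"

rm -rf "$dir"
```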

Swapping Tab and Space Characters

Use the expand and unexpand tools to swap tab characters for space characters, and to swap space characters with tabs, respectively.

Both tools take a file name as an argument and write changes to the standard output; if no files are specified, they work on the standard input.

To convert tab characters to spaces, use expand. To convert only the initial or leading tabs on each line, give the `-i' option; the default action is to convert all tabs.

To convert multiple space characters to tabs, use unexpand. By default, it only converts leading spaces into tabs, counting eight space characters for each tab. Use the `-a' option to specify that all instances of eight space characters be converted to tabs.

To specify the number of spaces to convert to a tab, give that number as an argument to the `-t' option.
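The options above can be tried on a single made-up line that has one leading tab and one tab between words. This is only a sketch; the sample text is an assumption.

```shell
tab_line=$(printf '\tfoo\tbar')

all_spaces=$(printf '%s\n' "$tab_line" | expand)       # every tab becomes spaces
leading_only=$(printf '%s\n' "$tab_line" | expand -i)  # only the leading tab is converted
four_wide=$(printf '%s\n' "$tab_line" | expand -t 4)   # tab stops every 4 columns

# Eight leading spaces become one tab again.
back_to_tab=$(printf '        foo\n' | unexpand)
```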

Paginating Text

The formfeed character, ASCII C-l or octal code 014, is the delimiter used to paginate text. When you send text with a formfeed character to the printer, the current page being printed is ejected and a new page begins -- thus, you can paginate a text file by inserting formfeed characters at a place where you want a page break to occur.

To insert formfeed characters in a text file, use the pr filter.

Give the `-f' option to omit the footer and separate pages of output with the formfeed character, and use `-h ""' to output a blank header (otherwise, the current date and time, file name, and current page number are output at the top of each page).

By default, pr outputs pages of 66 lines each. You can specify the page length as an argument to the `-l' option.
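As a rough sketch of these options together, the following paginates 40 numbered lines into 20-line pages with a blank header and counts the formfeed characters that pr inserts. The file name `listing', the input, and the page length of 20 are assumptions chosen to force several pages.

```shell
dir=$(mktemp -d)
seq 1 40 > "$dir/listing"    # 40 lines of hypothetical input

# Formfeed-separated pages, blank header, 20 lines per page.
paged=$(pr -f -h "" -l 20 "$dir/listing")

# Count the formfeed characters in the paginated output.
breaks=$(printf '%s' "$paged" | tr -cd '\f' | wc -c)
printf 'formfeeds: %s\n' "$breaks"

rm -rf "$dir"
```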

NOTE: If a page has more lines than a printer can fit on a physical sheet of paper, the printer will automatically break the text at that line as well as at the places in the text where there are formfeed characters.

You can paginate text in Emacs by manually inserting formfeed characters where you want them -- see Inserting Special Characters in Emacs.

Placing Headers on Each Page

The pr tool is a general-purpose page formatter and print-preparation utility. By default, pr outputs text in pages of 66 lines each, with headers at the top of each page containing the date and time, file name, and page number, and footers containing five blank lines.
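One way to set the header text is the `-h' option, which is listed in the table of pr options later in this chapter. In this sketch the file name `notes' and the header text are made up; the header replaces the file name in the header line, while the date and page number remain.

```shell
dir=$(mktemp -d)
printf 'line one\nline two\n' > "$dir/notes"

# Paginate with a custom header in place of the file name.
with_header=$(pr -h "Draft Copy" "$dir/notes")

# Pull out the header line to see what pr put around the custom text.
header_line=$(printf '%s\n' "$with_header" | grep 'Draft Copy')
printf '%s\n' "$header_line"

rm -rf "$dir"
```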

Placing Text in Columns

You can also use pr to put text in columns -- give the number of columns to output as an argument. Use the `-t' option to omit the printing of the default headers and footers.
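A minimal sketch with four invented lines and two columns follows. With GNU pr the columns are filled downward, so the first output row is expected to hold the first and third input lines; the file name `news.update' is hypothetical.

```shell
dir=$(mktemp -d)
printf 'one\ntwo\nthree\nfour\n' > "$dir/news.update"

# Two columns, no headers or footers.
columns=$(pr -2 -t "$dir/news.update")
printf '%s\n' "$columns"

first_row=$(printf '%s\n' "$columns" | head -n 1)

rm -rf "$dir"
```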

Options Available When Paginating Text

The following table describes some of pr's options; see the pr info for a complete description of its capabilities (see Using the GNU Info System).
+first:last Specify the first and last page to process; the last page can be omitted, so +7 begins processing with the seventh page and continues until the end of the file is reached.
-column Specify the number of columns to output text in, making all columns fit the page width.
-a Print columns across instead of down.
-c Output control characters in hat notation and print all other unprintable characters in "octal backslash" notation.
-d Specify double-spaced output.
-f Separate pages of output with a formfeed character instead of a footer of blank lines (63 lines of text per 66-line page instead of 56).
-h header Specify the header to use instead of the default; specify -h "" for a blank header.
-l length Specify the page length to be length lines (default 66). If page length is less than 11, headers and footers are omitted and existing form feeds are ignored.
-m Use when specifying multiple files; this option merges and outputs them in parallel, one per column.
-o spaces Set the number of spaces to use in the left margin (default 0).
-t Omit the header and footer on each page, but retain existing formfeeds.
-T Omit the header and footer on each page, as well as existing formfeeds.
-v Output non-printing characters in "octal backslash" notation.
-w width Specify the page width to use, in characters (default 72).
NOTE: It's also common to use pr to change the spacing of text (see Spacing Text).

Underlining Text

In the days of typewriters, text that was meant to be set in an italicized font was denoted by underlining the text with underscore characters; now, it's common practice to denote an italicized word in plain text by typing an underscore character, `_', just before and after a word in a text file, like `_this_'.

Some text markup languages use different methods for denoting italics; for example, in TeX files, italicized text is often denoted with braces and the `\it' command, like `{\it this}'. (LaTeX files use the same format, but `\emph' is often used in place of `\it'.)

You can convert one form to the other by using the Emacs replace-regular-expression function and specifying the text to be replaced as a regexp (see Regular Expressions -- Matching Text Patterns).

Both examples above used the special regexp symbol `\1', which matches the same text matched by the first `\( ... \)' construct in the previous regexp. See Info file `emacs-e20.info', node `Regexps' for more information on regexp syntax in Emacs.
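The same `\( ... \)' and `\1' idea can also be sketched outside Emacs with sed, which is a substitute technique here and not the Emacs function described above. The sample sentence is invented, and the patterns assume that an italicized word contains no underscore or closing brace.

```shell
# Plain-text italics to TeX: _word_ becomes {\it word}.
to_tex=$(printf 'Read _this_ and _that_.\n' | sed 's/_\([^_]*\)_/{\\it \1}/g')
printf '%s\n' "$to_tex"

# TeX italics back to plain text: {\it word} becomes _word_.
to_plain=$(printf '%s\n' "$to_tex" | sed 's/{\\it \([^}]*\)}/_\1_/g')
printf '%s\n' "$to_plain"
```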

To put a literal underline under text, you need to use a text editor to insert a C-h character followed by an underscore (`_') immediately after each character you want to underline; you can insert the C-h in Emacs with the C-q function (see Inserting Special Characters in Emacs).

When a text file contains these literal underlines, use the ul tool to output the file so that it is viewable by the terminal you are using; this is also useful for printing (pipe the output of ul to lpr).

To output such text without the backspace character, C-h, in the output, use col with the `-u' option.

Sorting Text

You can sort a list in a text file with sort. By default, it outputs text in ascending alphabetical order; use the `-r' option to reverse the sort and output text in descending alphabetical order.

For example, suppose a file `provinces' contains the following:


The following table describes some of sort's options.
-b Ignore leading blanks on each line when sorting.
-d Sort in "phone directory" order, considering only letters, digits, and blanks when sorting.
-f When sorting, fold lowercase letters into their uppercase equivalent, so that differences in case are ignored.
-i Ignore non-printing characters when sorting.
-n Sort numerically instead of by character value.
-o file Write output to file instead of standard output.
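A short sketch of the default sort, the `-r' option, and the `-n' option follows. The contents given to `provinces' here are a made-up stand-in, not the original listing, and LC_ALL=C is set only to make the ordering predictable across systems.

```shell
dir=$(mktemp -d)
# Hypothetical contents for the file `provinces'.
printf 'Ontario\nAlberta\nQuebec\nManitoba\n' > "$dir/provinces"

ascending=$(LC_ALL=C sort "$dir/provinces")       # A to Z
descending=$(LC_ALL=C sort -r "$dir/provinces")   # Z to A

# Character order puts 100 before 9; -n compares the values as numbers.
by_char=$(printf '10\n9\n100\n' | LC_ALL=C sort)
by_number=$(printf '10\n9\n100\n' | LC_ALL=C sort -n)

printf '%s\n' "$ascending"
rm -rf "$dir"
```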

Numbering Lines of Text

There are several ways to number lines of text.

One way to do it is to use the nl ("number lines") tool. Its default action is to write its input (either the file names given as an argument, or the standard input) to the standard output, with an indentation and all non-empty lines preceded with line numbers.

You can set the numbering style with the `-b' option followed by an argument. The following table lists the possible arguments and describes the numbering style they select.
a Number all lines.
t Number only non-blank lines. This is the default.
n Do not number lines.
pregexp Only number lines that contain a match for the regular expression regexp, which is written directly after the `p' with no space between them (see Regular Expressions -- Matching Text Patterns).
The default is for line numbers to start with one, and increment by one. Set the initial line number by giving an argument to the `-v' option, and set the increment by giving an argument to the `-i' option.
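These options can be tried on a three-line made-up file whose middle line is empty (the file name `report' is an assumption). The sketch counts or extracts the line numbers instead of showing nl's exact spacing.

```shell
dir=$(mktemp -d)
printf 'alpha\n\nbeta\n' > "$dir/report"

# Default style: only the two non-empty lines receive a number.
default_count=$(nl "$dir/report" | grep -c '[0-9]')

# Number all lines, starting at 10 and counting by 5.
all_numbers=$(nl -b a -v 10 -i 5 "$dir/report" | awk '{ print $1 }')

# Number only lines matching the regexp ^b (written right after the p).
matched=$(nl -b 'p^b' "$dir/report" | grep -c '[0-9]')

printf '%s\n' "$all_numbers"
rm -rf "$dir"
```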

Another way to number lines is to use cat with one of the following two options: the `-n' option numbers each line of its input text, while the `-b' option only numbers non-blank lines. Neither option alters the original file; the numbered text goes to standard output, where it can be piped to less for perusal. To take an input file, number its lines, and then write the line-numbered version to a new file, redirect the standard output of the cat command to the new file.
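A small sketch of both options, including the redirection to a new file, follows. The names `report' and `report.numbered' and the sample contents are hypothetical.

```shell
dir=$(mktemp -d)
printf 'alpha\n\nbeta\n' > "$dir/report"

# Number every line and write the result to a new file.
cat -n "$dir/report" > "$dir/report.numbered"
every_line=$(cat "$dir/report.numbered")

# Number only the non-blank lines.
nonblank_only=$(cat -b "$dir/report")

printf '%s\n' "$every_line"
rm -rf "$dir"
```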

Reversing Text

The tac command is similar to cat, but it outputs text in reverse order. There is another difference: tac works on records, sections of text with separator strings, instead of lines of text. Its default separator string is the linebreak character, so by default tac outputs files in line-for-line reverse order.

Specify a different separator with the `-s' option. This is often useful when specifying non-printing characters such as formfeeds. To specify such a character, use the ANSI-C method of quoting (see Passing Special Characters to Commands).
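As a sketch, the following reverses a made-up three-page file named `prizes', first line by line and then page by page. In Bash the separator can be written with ANSI-C quoting as shown in the comment; the runnable lines build the formfeed with printf so they also work in a plain POSIX shell.

```shell
dir=$(mktemp -d)
# Three hypothetical one-line pages separated by formfeed characters.
printf 'page one\n\fpage two\n\fpage three\n' > "$dir/prizes"

# Default separator: line-for-line reversal.
by_line=$(printf 'a\nb\nc\n' | tac)

# Bash form with ANSI-C quoting:
#   tac -s $'\f' prizes
ff=$(printf '\f')
by_page=$(tac -s "$ff" "$dir/prizes")

# The last page should now come first.
first_line=$(printf '%s\n' "$by_page" | head -n 1)
printf '%s\n' "$first_line"

rm -rf "$dir"
```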

The preceding example uses the formfeed, or page break, character as the delimiter, and so it outputs the file `prizes' in page-for-page reverse order, with the last page output first.

Use the `-r' option to use a regular expression for the separator string (see Regular Expressions -- Matching Text Patterns). You can build regular expressions to output text in word-for-word and character-for-character reverse order:
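Two possible separator regexps are sketched here on invented input. The character-level pattern `x\|[^x]' matches any single character and is the one given in the GNU tac documentation; the word-level pattern, which treats any non-alphanumeric character as a separator, is an assumption of this sketch. Because each separator stays attached to the end of its record, the word-level output keeps the original spaces and line breaks after each word.

```shell
# Character-for-character reversal: every single character is a separator.
chars=$(printf 'abc' | tac -r -s 'x\|[^x]')
printf '%s\n' "$chars"

# Word-for-word reversal: any non-alphanumeric character ends a record.
words=$(printf 'one two three\n' | tac -r -s '[^a-zA-Z0-9]')
first_word_line=$(printf '%s\n' "$words" | head -n 1)
printf '%s\n' "$first_word_line"
```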

To reverse the characters on each line, use rev.
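A quick sketch with two invented lines shows that rev reverses each line in place while leaving the order of the lines alone.

```shell
reversed=$(printf 'abc\nhello\n' | rev)
printf '%s\n' "$reversed"
```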
