
Managing Files

File management tools include those for splitting, comparing, and compressing files, making backup archives, and tracking file revisions. Other management tools exist for determining the contents of a file, and for changing its timestamp.

Determining File Type and Format

When we speak of a file's type, we are referring to the kind of data it contains, which may include text, executable commands, or some other data; this data is organized in a particular way in the file, and this organization is called its format. For example, an image file might contain data in the JPEG image format, or a text file might contain unformatted text in the English language or text formatted in the TeX markup language.

The file tool analyzes files and indicates their type and -- if known -- the format of the data they contain. Supply the name of a file as an argument to file and it outputs the name of the file, followed by a description of its format and type.

$ file /usr/doc/HOWTO/README.gz [RET]
/usr/doc/HOWTO/README.gz: gzip compressed data, deflated, original filename, last modified: Sun Apr 26 02:51:48 1998, os: Unix
$

This command reports that the file `/usr/doc/HOWTO/README.gz' contains data that has been compressed with the gzip tool.

To determine the original format of the data in a compressed file, use the `-z' option.

$ file -z /usr/doc/HOWTO/README.gz [RET]
/usr/doc/HOWTO/README.gz: English text (gzip compressed data, deflated, original filename, last modified: Sun Apr 26 02:51:48 1998, os: Unix)
$

This command reports that the data in `/usr/doc/HOWTO/README.gz', a compressed file, is English text.

NOTE: Currently, file differentiates among more than 100 different data formats, including several human languages, many sound and graphics formats, and executable files for many different operating systems.

Changing File Modification Time

Use touch to change a file's timestamp without modifying its contents. Give the name of the file to be changed as an argument. The default action is to change the timestamp to the current time.

To specify a timestamp other than the current system time, use the `-d' option, followed by the date and time that should be used enclosed in quote characters. You can specify just the date, just the time, or both.

NOTE: When only the date is given, the time is set to `0:00'; when no year is given, the current year is used.
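As a minimal sketch of these forms, assuming GNU touch and GNU date (whose `-r' option reports a file's modification time): the file names `pizzicato' and `staccato' are made up for illustration, and everything happens in a scratch directory.

```shell
# Work in a throwaway directory so no real files are affected.
cd "$(mktemp -d)"

# With no options, touch creates the file if needed and stamps it
# with the current time.
touch pizzicato

# Give both a date and a time, enclosed in quote characters.
touch -d '2001-05-14 17:30' pizzicato

# Give only a date; the time of day is then set to 0:00.
touch -d '2001-05-14' staccato

# Show the resulting modification times.
date -r pizzicato '+%Y-%m-%d %H:%M'
date -r staccato '+%Y-%m-%d %H:%M'
```

The contents of both files are left alone; only their timestamps change.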

See Info file `fileutils.info', node `Date input formats', for more information on date input formats.

Splitting a File into Smaller Ones

It's sometimes necessary to split one file into a number of smaller ones. For example, suppose you have a very large sound file in the near-CD-quality MPEG audio layer 3 ("MP3") format. Your file, `large.mp3', is 4,394,422 bytes in size, and you want to transfer it from your desktop to your laptop, but your laptop and desktop are not connected on a network -- the only way to transfer files between them is by floppy disk. Because this file is much too large to fit on one floppy, you use split.

The split tool copies a file, chopping up the copy into separate files of a specified size. It takes as optional arguments the name of the input file (using standard input if none is given) and the file name prefix to use when writing the output files (using `x' if none is given). The output files' names will consist of the file prefix followed by a group of letters: `aa', `ab', `ac', and so on -- the default output file names would be `xaa', `xab', and so on.

Specify the number of lines to put in each output file with the `-l' option, or use the `-b' option to specify the number of bytes to put in each output file. To specify the output files' sizes in kilobytes or megabytes, use the `-b' option and append `k' or `m', respectively, to the value you supply. If neither `-l' nor `-b' is used, split defaults to using 1,000 lines per output file.
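A sketch of how the `large.mp3' example might be split into one-megabyte pieces. A real sound file isn't available here, so a stand-in of the same size is made from zero bytes; that stand-in, and the scratch directory, are assumptions of this sketch.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Stand-in for the sound file: 4,394,422 zero bytes (not a real MP3).
head -c 4394422 /dev/zero > large.mp3

# Write one-megabyte pieces whose names begin with `large.mp3.'.
split -b 1m large.mp3 large.mp3.

# List the pieces and their sizes.
ls -l large.mp3.*
```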

This command creates five new files whose names begin with `large.mp3.'. The first four files are one megabyte in size, while the last file is 200,118 bytes -- the remaining portion of the original file. No alteration is made to `large.mp3'.

You could then copy these five files onto four floppies (the last file fits on a floppy with one of the larger files), copy them all to your laptop, and then reconstruct the original file with cat (see Concatenating Text).
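The reconstruction step might look like the following sketch. The name `original.mp3' and its contents (numbered lines from seq, not real sound data) are stand-ins invented here so that the rebuilt file can be checked against something.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Stand-in for the file on the desktop: about 4.4 MB of numbered lines.
seq 1 650000 > original.mp3

# Split it into one-megabyte pieces, as they would arrive on the floppies.
split -b 1m original.mp3 large.mp3.

# Concatenate the pieces; the shell expands the pattern in alphabetical
# order (aa, ab, ac, ...), which is the order split wrote them in.
cat large.mp3.* > large.mp3

# Delete the pieces only if the rebuilt file matches the original.
cmp original.mp3 large.mp3 && rm large.mp3.*
```

Comparing with cmp before removing the pieces is an extra precaution added in this sketch; it isn't required by the recipe.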

In this example, the rm tool is used to delete all of the split files after the original file has been reconstructed.

Comparing Files

There are a number of tools for comparing the contents of files in different ways; these recipes show how to use some of them. These tools are especially useful for comparing passages of text in files, but that's not the only way you can use them.

Determining Whether Two Files Differ

Use cmp to determine whether or not two files differ. It takes the names of two files as arguments, and if the files contain the same data, cmp outputs nothing. If, however, the files differ, cmp outputs the byte position and line number in the files where the first difference occurs.
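A small sketch of both outcomes, using three made-up files (`master', `backup', and `draft') in a scratch directory:

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Two identical files and one that differs by a single character.
printf 'one\ntwo\nthree\n' > master
printf 'one\ntwo\nthree\n' > backup
printf 'one\ntwo\nthrEe\n' > draft

# Identical files: cmp outputs nothing and exits with status 0.
cmp master backup && echo "master and backup are the same"

# Differing files: cmp reports where the first difference occurs
# and exits with a nonzero status.
cmp master draft || echo "master and draft differ"
```

Because cmp sets its exit status, it is handy in scripts; the `-s' option suppresses the report when only the status matters.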

Finding the Differences between Files

Use diff to compare two files and output a difference report (sometimes called a "diff") containing the text that differs between two files. The difference report is formatted so that other tools (namely, patch -- see Patching a File with a Difference Report) can use it to make a file identical to the one it was compared with.

To compare two files and output a difference report, give their names as arguments to diff.

The difference report is output to standard output; to save it to a file, redirect the output to the file to save to:

$ diff manuscript.old manuscript.new > manuscript.diff [RET]

In the preceding example, the difference report is saved to a file called `manuscript.diff'.

The difference report is meant to be used with commands such as patch, in order to apply the differences to a file. See Info file `diff.info', node `Top', for more information on diff and the format of its output.

To better see the difference between two files, use sdiff instead of diff; instead of giving a difference report, it outputs the files in two columns, side by side, separated by spaces. Lines that differ in the files are separated by `|'; lines that appear only in the first file end with a `<', and lines that appear only in the second file are preceded with a `>'.

To output the difference between three separate files, use diff3.

Patching a File with a Difference Report

To apply the differences in a difference report to the original file compared in the report, use patch. It takes as arguments the name of the file to be patched and the name of the difference report file (or "patchfile"). It then applies the changes specified in the patchfile to the original file. This is especially useful for distributing different versions of a file -- small patchfiles can be sent across networks more easily than large source files.
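A sketch of the whole round trip, reusing the `manuscript' file names from the diff recipe with invented contents. The patch tool is not installed on every system, so this sketch checks for it first.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Two versions of a short, made-up manuscript.
printf 'Chapter one.\nIt was a dark night.\nThe end.\n' > manuscript.old
printf 'Chapter one.\nIt was a dark and stormy night.\nThe end.\n' > manuscript.new

# diff exits with status 1 when the files differ; that is not an error here.
diff manuscript.old manuscript.new > manuscript.diff || true

if command -v patch >/dev/null 2>&1; then
    # Name the file to be patched first, then the patchfile.
    patch manuscript.old manuscript.diff
    cmp manuscript.old manuscript.new && echo "manuscript.old now matches manuscript.new"
else
    echo "patch is not installed; skipping"
fi
```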

Compressed Files

File compression is useful for storing or transferring large files. When you compress a file, you shrink it and save disk space. File compression uses an algorithm to change the data in the file; to use the data in a compressed file, you must first uncompress it to restore the original data (and original file size).

The following recipes explain how to compress and uncompress files.

Compressing a File

Use the gzip ("GNU zip") tool to compress files. It takes as an argument the name of the file or files to be compressed; it writes a compressed version of the specified files, appends a `.gz' extension to their file names, and then deletes the original files.
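For instance, with a file named `war-and-peace' (filled here with numbered lines as a stand-in for the real text), compression might go like this sketch:

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Stand-in text; any file will do.
seq 1 20000 > war-and-peace
ls -l war-and-peace

# Compress the file.
gzip war-and-peace

# Only the compressed file remains.
ls -l
```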

This command compresses the file `war-and-peace', putting it in a new file named `war-and-peace.gz'; gzip then deletes the original file, `war-and-peace'.

Decompressing a File

To access the contents of a compressed file, use gunzip to decompress (or "uncompress") it.

Like gzip, gunzip takes as an argument the name of the file or files to work on. It expands the specified files, writing the output to new files without the `.gz' extensions, and then deletes the compressed files.
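A sketch of the reverse trip. The file `reference-copy' is an extra copy, made only for this illustration, so that the expanded file can be compared with the original data.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Stand-in text, plus a copy to compare against later.
seq 1 20000 > war-and-peace
cp war-and-peace reference-copy
gzip war-and-peace

# Expand the compressed file.
gunzip war-and-peace.gz

# The expanded file is identical to the data that was compressed.
cmp war-and-peace reference-copy && echo "contents restored"
```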

This command expands the file `war-and-peace.gz' and puts it in a new file called `war-and-peace'; gunzip then deletes the compressed file, `war-and-peace.gz'.

NOTE: You can view a compressed text file without uncompressing it by using zless. This is useful when you want to view a compressed file but do not want to write changes to it. (For more information about zless, see Perusing Text.)

File Archives

An archive is a single file that contains a collection of other files, and often directories. Archives are usually used to transfer or make a backup copy of a collection of files and directories -- this way, you can work with only one file instead of many. This single file can be easily compressed as explained in the previous section, and the files in the archive retain the structure and permissions of the original files.

Use the tar tool to create, list, and extract files from archives. Archives made with tar are sometimes called "tar files," "tar archives," or -- because all the archived files are rolled into one -- "tarballs."

The following recipes show how to use tar to create an archive, list the contents of an archive, and extract the files from an archive. Two common options used with all three of these operations are `-f' and `-v': to specify the name of the archive file, use `-f' followed by the file name; use the `-v' ("verbose") option to have tar output the names of files as they are processed. While the `-v' option is not necessary, it lets you observe the progress of your tar operation.

NOTE: The name of this tool comes from "tape archive," because it was originally made to write the archives directly to a magnetic tape device. It is still used for this purpose, but today, archives are almost always saved to a file on disk.

See Info file `tar.info', node `Top', for more information about managing archives with tar.

Creating a File Archive

To create an archive with tar, use the `-c' ("create") option, and specify the name of the archive file to create with the `-f' option. It's common practice to use a name with a `.tar' extension, such as `my-backup.tar'.

Give as arguments the names of the files to be archived; to create an archive of a directory and all of the files and subdirectories it contains, give the directory's name as an argument.
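As a sketch, assuming a small `project' directory whose contents are invented here, an archive of it could be made like so:

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# A small made-up directory tree to archive.
mkdir -p project/notes
echo 'draft' > project/README
echo 'todo' > project/notes/todo

# Create (-c) the archive file (-f) project.tar, naming each file as
# it is added (-v).
tar -cvf project.tar project
```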

This command creates an archive file called `project.tar' containing the `project' directory and all of its contents. The original `project' directory remains unchanged.

Use the `-z' option to compress the archive as it is being written. This yields the same output as creating an uncompressed archive and then using gzip to compress it, but it eliminates the extra step.
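The same sketch with compression added; the `project' directory and its contents are again invented for illustration.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# A small made-up directory tree to archive.
mkdir -p project/notes
echo 'draft' > project/README
echo 'todo' > project/notes/todo

# Create a gzip-compressed archive in one step.
tar -czvf project.tar.gz project

# gzip can confirm that the result is valid compressed data.
gzip -t project.tar.gz && echo "project.tar.gz is gzip data"
```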

This command creates a compressed archive file, `project.tar.gz', containing the `project' directory and all of its contents. The original `project' directory remains unchanged.

NOTE: When you use the `-z' option, you should specify the archive name with a `.tar.gz' extension and not a `.tar' extension, so the file name shows that the archive is compressed. This is not a requirement, but it serves as a reminder and is the standard practice.

Listing the Contents of an Archive

To list the contents of a tar archive without extracting them, use tar with the `-t' option.
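A sketch of listing an archive, built first from a made-up `project' directory so that the example stands on its own:

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Build a small archive to inspect.
mkdir -p project/notes
echo 'draft' > project/README
echo 'todo' > project/notes/todo
tar -cf project.tar project

# List (-t) the contents of the archive file (-f) verbosely (-v).
tar -tvf project.tar
```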

This command lists the contents of the `project.tar' archive. Using the `-v' option along with the `-t' option causes tar to output the permissions and modification time of each file, along with its file name -- the same format used by the ls command with the `-l' option (see Listing File Attributes).

Include the `-z' option to list the contents of a compressed archive.

Extracting Files from an Archive

To extract (or unpack) the contents of a tar archive, use tar with the `-x' ("extract") option.
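A sketch of extraction. The made-up `project' directory is archived and then removed, so that the extracted copy is clearly the one that came from the archive.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Build a small archive, then remove the original directory.
mkdir project
echo 'draft' > project/README
tar -cf project.tar project
rm -r project

# Extract (-x) from the archive file (-f), naming each file (-v).
tar -xvf project.tar

# The directory and its file are back.
cat project/README
```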

This command extracts the contents of the `project.tar' archive into the current directory.

If an archive is compressed, which usually means it will have a `.tar.gz' or `.tgz' extension, include the `-z' option.
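For a compressed archive, listing and extracting might go like this sketch; the archive is built first from an invented `project' directory.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Build a small compressed archive, then remove the original directory.
mkdir project
echo 'draft' > project/README
tar -czf project.tar.gz project
rm -r project

# List the contents of the compressed archive.
tar -tzvf project.tar.gz

# Extract the contents of the compressed archive.
tar -xzvf project.tar.gz
```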

NOTE: If there are files or subdirectories in the current directory with the same name as any of those in the archive, those files will be overwritten when the archive is extracted. If you don't know what files are included in an archive, consider listing the contents of the archive first (see Listing the Contents of an Archive).

Another reason to list the contents of an archive before extracting them is to determine whether the files in the archive are contained in a directory. If not, and the current directory contains many unrelated files, you might confuse them with the files extracted from the archive.

To extract the files into a directory of their own, make a new directory, move the archive to that directory, and change to that directory, where you can then extract the files from the archive.
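These steps can be sketched as follows. The archive `loose.tar', its two files, and the directory name `unpacked' are all hypothetical.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# An archive whose files are not inside a directory of their own.
echo 'one' > chapter1
echo 'two' > chapter2
tar -cf loose.tar chapter1 chapter2
rm chapter1 chapter2

# Make a new directory, move the archive there, and change to it.
mkdir unpacked
mv loose.tar unpacked
cd unpacked

# The extracted files now land in `unpacked' and nowhere else.
tar -xvf loose.tar
ls
```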

Tracking Revisions to a File

The Revision Control System (RCS) is a set of tools for managing multiple revisions of a single file.

To store a revision of a file so that RCS can keep track of it, you check in the file with RCS. This deposits the revision of the file in an RCS repository -- a file that RCS uses to store all changes to that file. RCS makes a repository file with the same file name as the file you are checking in, but with a `,v' extension appended to the name. For example, checking in the file `foo.text' with RCS creates a repository file called `foo.text,v'.

Each time you want RCS to remember a revision of a file, you check in the file, and RCS writes to that file's RCS repository the differences between the file and the last revision on record in the repository.

To access a revision of a file, you check out the revision from RCS. The revision is obtained from the file's repository and is written to the current directory.

Although RCS is most often used with text files, you can also use it to keep track of revisions made to other kinds of files, such as image files and sound files.

Another revision control system, Concurrent Versions System (CVS), is used for tracking collections of multiple files whose revisions are made concurrently by multiple authors. While more complex than RCS, it is very popular for managing free software projects on the Internet. See Info file `cvs.info', node `Top', for information on using CVS.

Checking In a File Revision

When you have a version of a file that you want to keep track of, use ci to check in that file with RCS.

Type ci followed by the name of a file to deposit that file into the RCS repository. If the file has never before been checked in, ci prompts for a description to use for that file; each subsequent time the file is checked in, ci prompts for text to include in the file's revision log (see Viewing a File's Revision Log). Log messages may contain more than one line of text; type a period (`.') on a line by itself to end the entry.

For example, suppose the file `novel' contains this text:

This is a tale about many things, including a long voyage across
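A sketch of a first check-in. Typing `ci novel' alone would prompt for the description; here the `-t-' and `-m' options of ci supply the description and log message on the command line, so the sketch runs without prompting. The one-line text is a placeholder, and because RCS is not installed on every system, the sketch checks for ci first.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Placeholder text for the file to be tracked.
echo 'This is a tale about many things.' > novel

if command -v ci >/dev/null 2>&1; then
    # -t-TEXT gives the file's description; -mTEXT gives the log message.
    ci -t-'The Great American Novel.' -m'Initial revision' novel
    # Only the repository file is left in the directory.
    ls
else
    echo "RCS is not installed; skipping"
fi
```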

This command deposits the file in an RCS repository file called `novel,v', and the original file, `novel', is removed. To edit or access the file again, you must check out a revision of the file from RCS with which to work (see Checking Out a File Revision).

Whenever you have a new revision that you want to save, use ci as before to check in the file. This begins the process all over again.

For example, suppose you have checked out the first revision of `novel' and changed the file so that it now looks like this:

This is a very long tale about a great many things, including my long
voyage across America, and back home again. 

If you create a subdirectory called `RCS' (in all uppercase letters) in the current directory, RCS recognizes this specially named directory instead of the current directory as the place to store the `,v' revision files. This helps reduce clutter in the directory you are working in.

If the file you are depositing is a text file, you can have RCS insert a line of text, every time the file is checked out, containing the name of the file, the revision number, the date and time in the UTC (Coordinated Universal Time) time zone, and the user ID of the author. To do this, put the text `$Id$' at a place in the file where you want this text to be written. You only need to do this once; each time you check the file out, RCS replaces this string in the file with the header text.

For example, this chapter was written to a file, `managing-files.texinfo', whose revisions were tracked with RCS; the `$Id$' string in this file currently reads:

$Id: managing-files.texinfo,v 1.32 2001/05/16 16:57:58 m Exp m $

Checking Out a File Revision

Use co to check out a revision of a file from an RCS repository.

To check out the latest revision of a file that you intend to edit (and to check in later as a new revision), use the `-l' ("lock") option. Locking a revision in this fashion prevents overlapping changes from being made to the file should another revision be accidentally checked out before this revision is checked in.
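A sketch of the edit cycle: check in, check out with a lock, change the file, and check in again. The file contents are placeholders, the `-q' option merely quiets the tools, and the log messages are given with `-m' so that nothing prompts. The sketch checks that RCS is installed before using it.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Placeholder first draft.
echo 'first draft' > novel

if command -v ci >/dev/null 2>&1; then
    # Deposit the first revision; the working file is removed.
    ci -q -t-'The Great American Novel.' -m'Initial revision' novel

    # Check out the latest revision, locked for editing.
    co -l novel

    # Change the file, then check it in as a new revision.
    echo 'a second line' >> novel
    ci -q -m'Second draft.' novel

    # The revision log now lists two revisions.
    rlog novel
else
    echo "RCS is not installed; skipping"
fi
```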

This command checks out the latest revision of file `novel' from the `novel,v' repository, writing it to a file called `novel' in the current directory. (If a file with that name already exists in the current directory, co asks whether or not to overwrite the file.) You can make changes to this file and then check it in as a new revision (see Checking In a File Revision).

You can also check out a version of a file as read only, where changes cannot be written to it. Do this to check out a version to view only and not to edit.

To check out the current version of a file for examination, type co followed by the name of the file.

This command checks out the latest revision of the file `novel' from the RCS repository `novel,v' (either from the current directory or in a subdirectory named `RCS').

To check out a version other than the most recent version, specify the version number to check out with the `-r' option. Again, use the `-l' option to allow the revision to be edited.
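A sketch of retrieving an older revision. It first builds a repository with two revisions of placeholder text (RCS numbers them 1.1 and 1.2 by default), then checks out the first one read only. As before, the sketch checks that RCS is installed.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Placeholder first draft.
echo 'first draft' > novel

if command -v ci >/dev/null 2>&1; then
    # Build a repository holding two revisions.
    ci -q -t-'The Great American Novel.' -m'Initial revision' novel
    co -q -l novel
    echo 'second draft' > novel
    ci -q -m'Second draft.' novel

    # The latest changes are checked in, so it is safe to check out
    # revision 1.1 (read only, since -l is not given).
    co -r1.1 novel
    cat novel
else
    echo "RCS is not installed; skipping"
fi
```

To edit the older revision rather than just examine it, add the `-l' option, as in `co -l -r1.1 novel'.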

NOTE: Before checking out an old revision of a file, remember to check in the latest changes first, or they may be lost.

Viewing a File's Revision Log

Use rlog to view the RCS revision log for a file -- type rlog followed by the name of a file to list all of the revisions of that file.

$ rlog novel [RET]
RCS file: novel,v
Working file: novel
head: 1.2
branch:
locks: strict
access list:
symbolic names:
keyword substitution: kv
total revisions: 2; selected revisions: 2
description:
The Great American Novel.
----------------------------
revision 1.2
date: 1991/06/21 19:03:58; author: leo; state: Exp; lines: +2 -2
Second draft.
----------------------------
revision 1.1
date: 1991/06/20 15:31:44; author: leo; state: Exp;
Initial revision
====================================================================
$

This command outputs the revision log for the file `novel'; it lists information about the RCS repository, including its name (`novel,v') and the name of the actual file (`novel'). It also shows that there are two revisions -- the first, which was checked in to RCS on 20 June 1991, and the second, which was checked in to RCS the next day, on 21 June 1991.
