Code Capsules

File Processing

Chuck Allison


File systems differ greatly from one environment to another. There is no universal approach to issues such as directory structure; the components, length, and permitted characters of filenames; file modes (e.g., block vs. stream, text vs. binary); file versioning; or file locking. This sometimes makes it difficult to write a portable program that uses files. This capsule illustrates some of the most commonly used file I/O functions from the Standard C library. The examples below should work on most operating systems. (You may have to substitute certain keystrokes for the ones mentioned here.)

Text Filters

The following short program copies a line at a time from standard input to standard output:

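/* copy1.c: copy standard input to standard output a line at a time */

#include <stdio.h>

int main(void)
{
    char s[BUFSIZ];

    /* gets does no bounds checking and was removed in C11;
       this assumes no input line exceeds BUFSIZ - 1 characters */
    while (gets(s))
        puts(s);
    return 0;
}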
Unless you instruct otherwise, the standard I/O functions perform buffered I/O, that is, they collect data and then transfer it a buffer at a time for efficiency. BUFSIZ, defined in stdio.h (as 512 by my compiler), is the size of the internal buffers used by the standard I/O functions. BUFSIZ is a good choice for the size of your buffers if you don't have a good reason to choose differently.

On operating systems like MS-DOS and UNIX, this program is more interesting than it appears. In conjunction with redirection, it can be used to actually create a file:

C:> copy1 >file1
After entering the lines of text, enter Ctrl-Z in MS-DOS (Ctrl-D in UNIX) on a line by itself to signal the end of input. To make a copy of an existing file, enter:

C:> copy1 <file1 >file2
Such a program that reads only from standard input and writes only to standard output is called a filter. It is also possible to redirect I/O from within the program itself via the freopen library function, which disconnects an open file pointer from its file and connects it to a new one.
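As a minimal sketch of redirection from within a program, the fragment below reconnects stdin to a named file with freopen and then reads it through the ordinary standard-input functions. The helper names redirect_stdin and count_stdin_lines are hypothetical and chosen only for illustration; this is not the code from the listings.

```c
#include <stdio.h>

/* Reconnect stdin to the named file.
   Returns 0 on success, -1 if the file cannot be opened.
   (redirect_stdin is a hypothetical helper, not a library function.) */
int redirect_stdin(const char *path)
{
    return freopen(path, "r", stdin) ? 0 : -1;
}

/* Count the newline characters arriving on stdin,
   wherever stdin happens to be connected at the moment. */
long count_stdin_lines(void)
{
    int c;
    long n = 0;

    while ((c = getchar()) != EOF)
        if (c == '\n')
            ++n;
    return n;
}
```

After a successful call to redirect_stdin, every function that reads standard input (getchar, scanf, and so on) reads from the file instead of the keyboard, which is why the user can no longer be prompted through stdin.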

The program in Listing 1 disconnects stdin and/or stdout from the console and connects them to the files entered on the command line. You can invoke this program without explicit redirection:

C:> copy2 file1 file2
The names of the files become arguments to main (see the box called "Command-Line Arguments"). Using freopen is convenient because it avoids explicit opening and closing of files, but it disallows any interaction with the user (since standard I/O has been redirected to files). Most interactive applications, however, require both file and console I/O. The version in Listing 2 shows how to open files explicitly — no redirection is performed, and both filenames are required.
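Opening the files explicitly might look something like the following sketch. It is an illustration written for this discussion, not a copy of Listing 2; the function name copy_lines and its return convention are assumptions.

```c
#include <stdio.h>

/* Copy a text file a line at a time using explicitly opened files.
   Returns 0 on success, -1 on any failure.
   (copy_lines is a hypothetical name used for illustration.) */
int copy_lines(const char *inname, const char *outname)
{
    char buf[BUFSIZ];
    FILE *inf, *outf;
    int status = 0;

    if ((inf = fopen(inname, "r")) == NULL)
        return -1;
    if ((outf = fopen(outname, "w")) == NULL) {
        fclose(inf);
        return -1;
    }

    while (fgets(buf, sizeof buf, inf))
        if (fputs(buf, outf) == EOF) {
            status = -1;
            break;
        }

    if (ferror(inf))
        status = -1;
    fclose(inf);
    if (fclose(outf) == EOF)    /* fclose flushes, so late write errors show up here */
        status = -1;
    return status;
}
```

Because stdin and stdout are left alone, a program built around a function like this remains free to prompt the user and report progress on the console.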

The function fgets needs to know the size of your buffer, and it places the newline character into the buffer if there is room for it. (gets discards the newline.) fputs, therefore, doesn't append a newline to the string it writes (puts does). Using fgets with puts produces double-spaced output, while mixing gets with fputs prints everything on one line.
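If you want the bounded read of fgets but the newline-free result of gets, one possible approach is to strip the newline yourself. The sketch below assumes a hypothetical helper named read_line; it is not part of the Standard C library.

```c
#include <stdio.h>
#include <string.h>

/* Read one line into buf (capacity size) and drop the trailing newline,
   so the buffer holds what gets would have stored, but with a bound.
   Returns buf, or NULL at end-of-file or on error.
   (read_line is a hypothetical helper.) */
char *read_line(char *buf, int size, FILE *f)
{
    if (fgets(buf, size, f) == NULL)
        return NULL;
    buf[strcspn(buf, "\n")] = '\0';   /* overwrite the newline, if one was read */
    return buf;
}
```

A string read this way pairs naturally with puts, which supplies the newline again on output. A line too long for the buffer arrives in pieces over successive calls.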

Both fgets and gets return NULL upon end-of-file or error; any additional error-checking is not usually required. You should check for output errors, however, especially on a PC system where running out of disk space is not uncommon. You do this with a call to ferror. For example, you should replace the while loop in Listing 2 with the following:

 /* copy4.c */
...
while (fgets (buf,BUFSIZ,inf))
{
   fputs (buf,outf);
   fflush (outf);
   if (ferror(outf))
      return EXIT_FAILURE;
}
...
Since file I/O is buffered, you should flush the output buffer before checking in case there is a disk overflow error. Once the error state of a file is set, it remains unchanged until you reset it by calling clearerr or rewind.

Binary Files

The examples so far work only with text files (i.e., files of lines delimited by \n). In order to be able to copy any file (e.g., an executable program) on most non-UNIX systems, it is necessary to open the file in binary mode. In text mode under MS-DOS (the default file mode there), each newline character in memory is replaced with a \r\n pair (CR/LF) on the output device, while the process is reversed during input. (This is a carry-over from CP/M days.) In addition, a Ctrl-Z is interpreted as end-of-file, so it is impossible to read past a Ctrl-Z in text mode. Binary mode means that no such translations are made — the data in memory and on disk are the same. (NOTE: MS-DOS text and binary modes are not analogous to the cooked and raw modes of UNIX; binary mode has no effect in UNIX). The program in Listing 3 can copy any type of file.

A b appended to the normal open mode indicates binary mode. The functions fread and fwrite read and write blocks of data. They both return the number of blocks (not bytes) successfully processed. In the example in Listing 3, the items just happen to be bytes, i.e., blocks of length 1. When fwrite returns a number less than the number of items requested, you know that a write error has occurred (so an explicit call to ferror is not necessary). fwrite stores numeric data in binary on the output device — so you cannot read it with normal text utilities.
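A byte-for-byte copy along these lines could be sketched as follows. This is an illustration under stated assumptions rather than the text of Listing 3; the name copy_binary and the 0/-1 return convention are hypothetical.

```c
#include <stdio.h>

/* Copy any file byte for byte. Returns 0 on success, -1 on failure.
   (copy_binary is a hypothetical name used for illustration.) */
int copy_binary(const char *inname, const char *outname)
{
    char buf[BUFSIZ];
    size_t n;
    FILE *inf, *outf;
    int status = 0;

    if ((inf = fopen(inname, "rb")) == NULL)
        return -1;
    if ((outf = fopen(outname, "wb")) == NULL) {
        fclose(inf);
        return -1;
    }

    /* Each item is one byte, so the counts returned are byte counts. */
    while ((n = fread(buf, 1, sizeof buf, inf)) > 0)
        if (fwrite(buf, 1, n, outf) < n) {   /* short count: write error */
            status = -1;
            break;
        }

    if (ferror(inf))
        status = -1;
    fclose(inf);
    if (fclose(outf) == EOF)
        status = -1;
    return status;
}
```

Note that the final fread usually returns fewer items than requested. That is not an error by itself, which is why the sketch checks ferror on the input file after the loop.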

Record Processing

The functions fread and fwrite are suitable for processing files of fixed-length records. The program in Listing 4 populates a file from keyboard input (terminated by Ctrl-Z) and then randomly accesses certain records. I use stderr for printing prompts since it is always attached to the console and is unbuffered on most systems. Figure 1 contains the results of a sample execution.

A + (plus sign) in the open mode request indicates update mode, which means that both input and output are allowed on the file. You must separate input and output operations, however, with a call to fflush or to some file positioning command, such as fseek or rewind (fflush suffices only when switching from writing to reading). fseek positions the read/write cursor a given number of bytes from the beginning of the file (SEEK_SET), the end of the file (SEEK_END), or from the current position (SEEK_CUR). rewind(f) is equivalent to fseek(f,0L,SEEK_SET), except that rewind also clears the file's error state. Arbitrary byte positions passed to fseek only make sense in binary mode, since in text mode there may be embedded characters you know nothing about. The function ftell returns the current position in a file, and that value can be passed to fseek to return to that position (this synchronized use of fseek and ftell works even in text mode). Since fseek and ftell represent the file position as a long integer, they are limited in the size of file they can correctly traverse. If your system supports larger file position values, then use fgetpos and fsetpos instead. fsetpos is only guaranteed to work correctly for values returned by fgetpos, not for arbitrary integral values.
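To make the positioning rules concrete, here is a small sketch of random access to fixed-length records in a file opened in binary update mode. The record layout and the function names are hypothetical and are not taken from Listing 4.

```c
#include <stdio.h>

struct record {             /* hypothetical fixed-length record */
    char name[16];
    int  age;
};

/* Read record number n (counting from 0). Returns 0 on success, -1 otherwise. */
int read_record(FILE *f, long n, struct record *r)
{
    if (fseek(f, n * (long)sizeof *r, SEEK_SET) != 0)
        return -1;
    return fread(r, sizeof *r, 1, f) == 1 ? 0 : -1;
}

/* Write record number n. Returns 0 on success, -1 otherwise. */
int write_record(FILE *f, long n, const struct record *r)
{
    if (fseek(f, n * (long)sizeof *r, SEEK_SET) != 0)
        return -1;
    return fwrite(r, sizeof *r, 1, f) == 1 ? 0 : -1;
}

/* Number of whole records in the file, found with fseek and ftell.
   Returns -1 if the position cannot be determined. */
long count_records(FILE *f)
{
    long size;

    if (fseek(f, 0L, SEEK_END) != 0 || (size = ftell(f)) < 0)
        return -1;
    return size / (long)sizeof(struct record);
}
```

Since each function begins with a call to fseek, reads and writes can be interleaved freely on the same file pointer without violating the rule about separating input from output.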

The program in Listing 5 puts fgetpos and fsetpos to good use in a simple 4-way scrolling browser for large files. It only keeps one screenful of text in memory. If you want to scroll up or down through the file, it reads (or re-reads) the adjacent text and displays it. When scrolling down (i.e., forward) through the file, the file position of the data on the screen is pushed on a stack, and the program reads the next screenful from the current file position. To scroll up, it retrieves the file position of the previous screen from the stack (see Listing 6). Although this is the crudest of algorithms for viewing text, it can view a file of any size (if you make the stack large enough), and performance is acceptable on systems that cache disk operations (really!). The display mechanism is also crude but fairly portable: it uses ANSI terminal escape sequences for clearing the screen and positioning the cursor (see Listing 7; if you're using MS-DOS, you must load ANSI.SYS from your CONFIG.SYS file). You could customize this program into an efficient tool by adding buffering and fast screen writes.
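The stack-of-positions idea can be sketched in a few lines. The version below is a simplified illustration, not the code of Listings 5 and 6: the names page_down, page_up, and MAXPAGES are hypothetical, it handles vertical scrolling only, and it writes plain lines to stdout with no screen control.

```c
#include <stdio.h>

#define MAXPAGES 100                 /* hypothetical stack depth */

static fpos_t page_stack[MAXPAGES];  /* start position of each screenful shown */
static int    page_top = 0;

/* Show up to nlines lines from the current position and remember where
   this screenful started. Returns the number of lines shown. */
int page_down(FILE *f, int nlines)
{
    char buf[BUFSIZ];
    int shown = 0;

    if (page_top == MAXPAGES || fgetpos(f, &page_stack[page_top]) != 0)
        return 0;
    while (shown < nlines && fgets(buf, sizeof buf, f)) {
        fputs(buf, stdout);
        ++shown;
    }
    if (shown > 0)
        ++page_top;                  /* keep the saved position */
    return shown;
}

/* Go back one screenful. Returns the number of lines shown,
   or 0 if the first screenful is already on display. */
int page_up(FILE *f, int nlines)
{
    if (page_top < 2)
        return 0;
    page_top -= 2;                   /* discard current page, step to previous */
    if (fsetpos(f, &page_stack[page_top]) != 0)
        return 0;
    return page_down(f, nlines);     /* re-read it and push it again */
}
```

Only values obtained from fgetpos are ever handed to fsetpos, so the sketch stays within what the library guarantees, even for a file opened in text mode.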

Temporary Files

When your program requires a scratch file for temporary processing, you need a unique name for that file. If the name isn't important to you, let tmpnam do the work of creating the filename for you:

char fname[L_tmpnam];
tmpnam(fname);
f = fopen(fname ....
tmpnam will supply at least TMP_MAX unique names before it starts repeating. The macros L_tmpnam and TMP_MAX are defined in stdio.h. Don't forget to delete the file before the program terminates:

remove (fname);
If you don't need to know the name of the file, but just want access to it, a better approach most of the time is to let tmpfile give you a file pointer to a temporary file. It returns a pointer to a file opened with mode wb+ (this is usually adequate for scratch files). The best part is that the file is deleted automatically when the program terminates normally (i.e., abort isn't called).
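A minimal sketch of the tmpfile approach follows. The function sum_via_scratch is a contrived, hypothetical example whose only purpose is to show a write to a scratch file followed by a read.

```c
#include <stdio.h>

/* Write n integers to a scratch file, then read them back and add them up.
   Returns the sum, or -1 on failure (the sketch assumes the real sum is
   never -1). sum_via_scratch is a hypothetical name. */
long sum_via_scratch(const int *vals, size_t n)
{
    FILE *tmp = tmpfile();      /* opened in mode wb+, no name needed */
    long sum = 0;
    int v;

    if (tmp == NULL)
        return -1;
    if (fwrite(vals, sizeof *vals, n, tmp) < n) {
        fclose(tmp);
        return -1;
    }
    rewind(tmp);                /* required between writing and reading */
    while (fread(&v, sizeof v, 1, tmp) == 1)
        sum += v;
    fclose(tmp);                /* closing the file also removes it */
    return sum;
}
```

There is no call to remove anywhere: the library disposes of the file when it is closed or when the program terminates normally.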

Listing 8 contains the program esort.c, an external sort program (i.e., a program that can sort files larger than can fit in memory). When the input is longer than 1000 lines, or too large to fit in available memory, esort breaks it into subfiles. Each subfile is sorted internally with qsort, and then the sorted subfiles are merged to standard output (see last month's capsule for a discussion of qsort). The subfiles disappear automatically when the program halts. If for some reason the program aborts (e.g., one of the assertions fails because of a write error), the subfiles are not deleted and you can examine them for clues to what went wrong.
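The merge phase of an external sort can be illustrated with a two-way merge of sorted text files. This sketch is an assumption about one way to do it and is not taken from esort.c, which may merge many subfiles at once; merge_sorted and MAXLINE are hypothetical names, and lines are simply compared with strcmp.

```c
#include <stdio.h>
#include <string.h>

#define MAXLINE 256     /* hypothetical limit on line length */

/* Merge two sorted text files a and b into out.
   Returns 0 on success, -1 on a write error. */
int merge_sorted(FILE *a, FILE *b, FILE *out)
{
    char la[MAXLINE], lb[MAXLINE];
    int have_a = fgets(la, sizeof la, a) != NULL;
    int have_b = fgets(lb, sizeof lb, b) != NULL;

    while (have_a || have_b) {
        /* Take from a when b is exhausted or a's line sorts first. */
        if (have_a && (!have_b || strcmp(la, lb) <= 0)) {
            if (fputs(la, out) == EOF)
                return -1;
            have_a = fgets(la, sizeof la, a) != NULL;
        } else {
            if (fputs(lb, out) == EOF)
                return -1;
            have_b = fgets(lb, sizeof lb, b) != NULL;
        }
    }
    return 0;
}
```

The inputs could be scratch files obtained from tmpfile and rewound after writing, in which case they vanish without any cleanup code once the merge is finished.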

Although file systems vary widely, many common operations can be done portably. The examples in this article (keyboard signals excepted) should work unchanged in any ANSI-C environment. Next month I'll examine useful features found in UNIX-compatible environments.