Code Capsules

File Processing, Part 2

Chuck Allison


Chuck Allison is a software architect for the Family History Department of the Church of Jesus Christ of Latter Day Saints Church Headquarters in Salt Lake City. He has a B.S. and M.S. in mathematics, has been programming since 1975, and has been teaching and developing in C since 1984. His current interest is object-oriented technology and education. He is a member of X3J16, the ANSI C++ Standards Committee. Chuck can be reached on the Internet at allison@decus.org, or at (801)240-4510.

Portability with POSIX

In the early 1980s an organization called /usr/group (the UNIX Users' association, now called UniForum) began an effort to standardize both C and the C programming environment. Much of what they defined applied to all environments and became the basis for the Standard C library. Another part of their work resulted in a set of functions to access UNIX system services (UNIX was the most common C programming environment at the time). These functions constitute the C language bindings of what now is called POSIX (Portable Operating System Interface). Fortunately, many environments (including MS-DOS and, of course, the many flavors of UNIX) provide most or all of these functions. POSIX compliance, therefore, can be important when moving applications from one platform to another. Here is a simple recipe for maximizing portability:

1. Program in Standard C.

2. If step 1 is too restrictive, use only POSIX-compliant functions.

3. If steps 1 and 2 aren't possible, isolate system-dependent code in separate modules. This will minimize how much code you'll have to rewrite when porting to another system. There are cross-platform tools available that do some of this work for you.

There is of course much more to POSIX than C language bindings (bindings for other languages and a specification for a command shell, for example). This month's article illustrates most of the POSIX functions that pertain to file processing.

I tested Listing 1, Listing 2, and Listing 3 with Borland C++ 3.1, Zortech C 3.0, Microsoft C 6.00A, and Mix Software's POWERC 2.20. (If you use Microsoft C/C++ 7.0, you must link with OLDNAMES.LIB.) Of these four compilers, only Borland's supports the directory I/O functions required for Listing 4, Listing 5, Listing 6, and Listing 7. All seven listings should run on any UNIX platform (with some obvious modifications as described in the comments).

POSIX File I/O

The definition of the FILE structure in your stdio.h include file probably contains an integer that represents a file handle (or file descriptor). (Look for a member named something like fd, _file, or fildes.) A file handle, a unique, non-negative integer assigned by the file system, identifies an access path into a file. POSIX file-access functions use file handles instead of file pointers to perform basic operations comparable to those offered in the Standard C library, but typically with less overhead. Table 1 contains a comparison of POSIX and Standard C functions. Other POSIX file-access functions are listed in Table 2.

Copying Files

cat.c (Listing 1) copies the files indicated on the command line to standard output. For example, the command

   cat file1 file2 >file3
combines file1 and file2 into a new file, file3. This program uses Standard C functions for reading and writing. The only POSIX functions are in the line

   FILE *std_out = fdopen (fileno(stdout), "wb");
This line enables writing to standard output in binary mode, in case one of the user files is not a text file. (See last month's "Code Capsule" for a discussion of binary mode.) It associates a new file pointer with standard output, without creating a new handle. (In other words, the file pointers stdout and std_out share the same file handle.)
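As a rough illustration of that line (a sketch, not the code from Listing 1), the fragment below wraps the same fdopen/fileno pairing in a small helper. The name second_stream is hypothetical, and the _XOPEN_SOURCE definition is an assumption about modern compilers, which may hide the POSIX declarations in strict ISO modes.

```c
#define _XOPEN_SOURCE 700   /* assumption: ask modern headers for POSIX names */
#include <assert.h>
#include <stdio.h>          /* FILE, fdopen, fileno */

/* Hypothetical helper: return a new file pointer that shares the
   handle behind fp.  No new handle is created.  Returns NULL on failure. */
FILE *second_stream(FILE *fp, const char *mode)
{
    assert(fp != NULL && mode != NULL);
    return fdopen(fileno(fp), mode);
}
```

A call such as second_stream(stdout, "wb") corresponds to the line above. Because both file pointers share one handle, closing either stream closes the handle for both, so close only one of them.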

The function filecopy (Listing 2) opens an input and an output file in binary mode. The open system call returns -1 if the open fails. (In fact, most POSIX functions return -1 upon failure.) fcntl.h defines the flags used to define INPUT_MODE and OUTPUT_MODE. The third argument in the open of the output file specifies that if the file doesn't already exist, the newly created file should not be write-protected. (sys/stat.h defines S_IWRITE.) Any include file prefixed with sys/ is a POSIX include file (although many, such as io.h and fcntl.h, have no prefix). On most systems, you need to include sys/types.h before sys/stat.h. The read and write functions both return the number of bytes transferred.
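The following is a minimal sketch of a function in the spirit of filecopy, not a reproduction of Listing 2. The definitions of INPUT_MODE and OUTPUT_MODE, the buffer size, and the fallback definition of O_BINARY are my assumptions. It also uses the POSIX permission names S_IRUSR and S_IWUSR where the listing uses S_IWRITE.

```c
#define _XOPEN_SOURCE 700   /* assumption: ask modern headers for POSIX names */
#include <assert.h>
#include <fcntl.h>          /* open, O_RDONLY, O_WRONLY, O_CREAT, O_TRUNC */
#include <sys/types.h>
#include <sys/stat.h>       /* S_IRUSR, S_IWUSR */
#include <unistd.h>         /* read, write, close */

/* MS-DOS compilers define O_BINARY; UNIX has no text mode, so use 0 there. */
#ifndef O_BINARY
#define O_BINARY 0
#endif

#define INPUT_MODE  (O_RDONLY | O_BINARY)
#define OUTPUT_MODE (O_WRONLY | O_CREAT | O_TRUNC | O_BINARY)

/* Copy the file named from to the file named to.
   Returns 0 on success, -1 on failure. */
int filecopy(const char *from, const char *to)
{
    char buf[4096];
    ssize_t nread;
    int in, out;
    int status = 0;

    assert(from != NULL && to != NULL);

    if ((in = open(from, INPUT_MODE)) == -1)
        return -1;
    if ((out = open(to, OUTPUT_MODE, S_IRUSR | S_IWUSR)) == -1) {
        close(in);
        return -1;
    }

    while (status == 0 && (nread = read(in, buf, sizeof buf)) > 0) {
        ssize_t done = 0;

        /* write may transfer fewer bytes than requested, so loop */
        while (done < nread) {
            ssize_t n = write(out, buf + done, (size_t)(nread - done));
            if (n == -1) {
                status = -1;
                break;
            }
            done += n;
        }
    }
    if (status == 0 && nread == -1)     /* read itself failed */
        status = -1;

    close(in);
    if (close(out) == -1)
        status = -1;
    return status;
}
```

The inner loop relies on the fact that write returns the number of bytes actually transferred, which can be less than the number requested.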

The program cp.c in Listing 3 uses filecopy to copy one or more files to a given directory. The stat function fills a structure with basic file information, including these members:

   st_mode     file mode (directory indicator, file permissions)
   st_size     file size in bytes
   st_mtime    time of last data modification
Testing st_mode with the mask S_IFDIR (from sys/stat.h) determines whether the file is a directory. The forward slash character is the directory separator for pathnames in all POSIX systems. (Note the sprintf statement in function cp.) Only the command line of the MS-DOS shell (COMMAND.COM) requires a backslash as the separator character; the two slash characters are interchangeable within MS-DOS programs. For maximum portability, the names of files and directories should use only characters from the portable filename character set: alphanumerics, the period, the underscore, and the hyphen.

Reading Directory Entries

The most widely used operating systems that support C development today have a hierarchical directory structure. POSIX defines functions to create, delete, and navigate among directories, as well as functions to read the entries in a directory (see Table 3). The program list.c in Listing 4 prints a listing of the current directory to standard output. To read a directory, you must first get a pointer to a DIR structure with the opendir function. Successive calls to readdir return a pointer to a struct dirent structure (which contains the entry name) or NULL when all entries have been read. These structures and functions are declared in dirent.h.
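The opendir/readdir loop can be sketched as follows. This is a bare-bones illustration that prints only entry names, not the fuller listing produced by list.c, and the function name list_names is hypothetical.

```c
#define _XOPEN_SOURCE 700   /* assumption: ask modern headers for POSIX names */
#include <assert.h>
#include <dirent.h>         /* DIR, struct dirent, opendir, readdir, closedir */
#include <stdio.h>

/* Hypothetical helper: print the name of each entry in dir to out.
   Returns the number of entries read, or -1 if dir can't be opened. */
int list_names(const char *dir, FILE *out)
{
    DIR *dp;
    struct dirent *entry;
    int count = 0;

    assert(dir != NULL && out != NULL);
    if ((dp = opendir(dir)) == NULL)
        return -1;
    while ((entry = readdir(dp)) != NULL) {   /* NULL: no more entries */
        fprintf(out, "%s\n", entry->d_name);
        ++count;
    }
    closedir(dp);
    return count;
}
```

Calling list_names(".", stdout) lists the current directory. To show permissions, size, and modification time as well, each name could be passed to stat.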

As shown in Figure 1, list displays the name, permissions, size, and time of last modification of each file. The characters in the permissions column mean

The file system creates the first two entries: "." refers to the current directory and ".." to its parent. You can't alter them directly (hence no w permission). The token / in POSIX functions (and \ in their MS-DOS counterparts) refers to the root directory. Since the modification time is a standard time value (time_t), I use the Standard C function ctime to display it. (See the "Code Capsule" in the January 1993 issue for a discussion of time and date functions.)

The program findfile.c in Listing 5 searches a directory tree for all occurrences of a specified entry. On my MS-DOS system, the command

   findfile himem.sys \
searches the entire disk for the file himem.sys, and prints

   \dos\himem.sys
   \windows\himem.sys
To restrict the search to a specific directory tree, change the second argument, for example, to

   findfile himem.sys \dos
To make a UNIX version of this utility, replace the defined constants with

   #define SEPSTR "/"
   #define OFFSET 0
OFFSET skips over the disk identifier and colon (e.g., C:) that precede the full pathname returned by getcwd under MS-DOS. The function call getcwd(NULL,FILENAME_MAX) returns a pointer to a dynamically-allocated string that represents the current working directory (FILENAME_MAX is defined in stdio.h). If access(file,0) returns 0, then file exists. (file can also be a directory name.) chdir(dir) makes dir the current working directory. visit_dirs, a recursive function that visits all subdirectories in a directory tree, restores the original working directory when it returns.
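To show how getcwd, chdir, and access fit together, here is a small sketch (not taken from Listing 5) that checks for an entry in another directory and then restores the original working directory. The name exists_in is hypothetical. The sketch passes its own buffer to getcwd instead of NULL, and uses F_OK, the POSIX name for the existence test that the article writes as 0.

```c
#define _XOPEN_SOURCE 700   /* assumption: ask modern headers for POSIX names */
#include <assert.h>
#include <stdio.h>          /* FILENAME_MAX */
#include <unistd.h>         /* access, chdir, getcwd, F_OK */

/* Hypothetical helper: return 1 if name exists in directory dir,
   0 if it does not, and -1 on error.  The current working directory
   is the same on return as it was on entry. */
int exists_in(const char *dir, const char *name)
{
    char home[FILENAME_MAX];
    int found;

    assert(dir != NULL && name != NULL);
    if (getcwd(home, sizeof home) == NULL)   /* remember where we are */
        return -1;
    if (chdir(dir) == -1)
        return -1;
    found = (access(name, F_OK) == 0);       /* works for directories too */
    if (chdir(home) == -1)                   /* go back */
        return -1;
    return found;
}
```

A recursive search such as visit_dirs follows the same save, descend, and restore pattern at each level of the tree.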

Redirecting Standard Error

When you enter a command such as

   cat file1 file2 >file3
the command shell disconnects the internal file handle for standard output from the console and connects it to file3 before it loads the program cat. When cat terminates, the shell reconnects the handle to the console.

Listing 6 shows how to do the same thing with standard error. The function redir_stderr gets a handle to the new destination by calling open. Then it creates a new handle to the original destination with dup (this is for restoring it later). Finally, it redirects the output by disconnecting it from the original destination and connecting it to the new one with dup2.

When you don't need the redirection anymore, call restore_stderr, which redirects standard error back to its original destination and discards the duplicate handle.
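The open, dup, dup2 sequence might be sketched as follows. The function names come from the text, but the signatures, return values, and error handling here are my assumptions and may differ from Listing 6.

```c
#define _XOPEN_SOURCE 700   /* assumption: ask modern headers for POSIX names */
#include <assert.h>
#include <fcntl.h>          /* open and its flags */
#include <stdio.h>          /* fflush, stderr */
#include <sys/types.h>
#include <sys/stat.h>       /* S_IRUSR, S_IWUSR */
#include <unistd.h>         /* dup, dup2, close, STDERR_FILENO */

/* Connect standard error to the file named by path.  Returns a handle
   to the original destination (for restore_stderr), or -1 on failure. */
int redir_stderr(const char *path)
{
    int newfd, saved;

    assert(path != NULL);
    fflush(stderr);
    newfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (newfd == -1)
        return -1;
    if ((saved = dup(STDERR_FILENO)) == -1) {   /* keep the old destination */
        close(newfd);
        return -1;
    }
    if (dup2(newfd, STDERR_FILENO) == -1) {     /* handle 2 now means path */
        close(newfd);
        close(saved);
        return -1;
    }
    close(newfd);       /* handle 2 keeps the file open */
    return saved;
}

/* Reconnect standard error to the destination saved by redir_stderr
   and discard the duplicate handle.  Returns 0 on success, -1 on failure. */
int restore_stderr(int saved)
{
    int status;

    assert(saved >= 0);
    fflush(stderr);
    status = dup2(saved, STDERR_FILENO);
    close(saved);
    return status == -1 ? -1 : 0;
}
```

Anything written to stderr between the two calls, including output from a program started with system, goes to the named file.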

A handle created with dup is "synchronized" with the original, so that if you change file position with lseek on one handle, the position is updated for the other handle also.
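A short experiment makes the shared file position visible. In this sketch the function name and the scratch-file approach are my own: it seeks on a duplicate handle and then asks the original handle where it is.

```c
#define _XOPEN_SOURCE 700   /* assumption: ask modern headers for POSIX names */
#include <assert.h>
#include <fcntl.h>          /* open and its flags */
#include <sys/types.h>      /* off_t */
#include <sys/stat.h>       /* S_IRUSR, S_IWUSR */
#include <unistd.h>         /* dup, lseek, close, SEEK_SET, SEEK_CUR */

/* Hypothetical demonstration: create (or truncate) the file named by
   path, seek to offset through a duplicate handle, and return the
   position reported by the original handle, or -1 on failure. */
off_t position_after_dup_seek(const char *path, off_t offset)
{
    int fd, copy;
    off_t pos = (off_t)-1;

    assert(path != NULL && offset >= 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return (off_t)-1;
    if ((copy = dup(fd)) != -1) {
        if (lseek(copy, offset, SEEK_SET) != (off_t)-1)
            pos = lseek(fd, 0, SEEK_CUR);   /* query the original handle */
        close(copy);
    }
    close(fd);
    return pos;
}
```

The value returned equals the offset requested through the duplicate, which is what "synchronized" means here.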

You may be wondering why you can't just use a call to freopen to redirect standard error. This works fine within a single program, but freopen has no effect if you initiate another program from within your program (using system, say), because only the local file pointer changes. To have such changes persist across subprocesses you must use dup2.

The program ddir.c in Listing 7 illustrates most of the concepts mentioned in this article. It deletes an entire directory tree by following these steps:

1. Make the root of the tree the current working directory.

2. Delete all files within that directory with a shell command (in this case the MS-DOS del command).

3. Any entry left in the directory is either a protected file or a subdirectory.

For a protected file, lower the protection with chmod and delete the file explicitly with unlink. (You can use remove in MS-DOS, but you shouldn't in UNIX; unlink works for both.)

For a subdirectory, recursively repeat the whole process (starting with step 1) on the subdirectory.

4. Ascend to the parent directory and delete the directory in question with rmdir.
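The steps above can be sketched for a UNIX system as follows. This is not Listing 7: in place of the shell command in step 2 it unlinks each file directly, it performs no redirection, and the names empty_cwd and delete_tree are hypothetical. It uses lstat so that a symbolic link is removed rather than followed. Since it deletes files irrevocably, try it only on a scratch directory.

```c
#define _XOPEN_SOURCE 700   /* assumption: ask modern headers for POSIX names */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>          /* FILENAME_MAX */
#include <string.h>         /* strcmp */
#include <sys/types.h>
#include <sys/stat.h>       /* lstat, chmod, S_ISDIR, S_ISREG */
#include <unistd.h>         /* chdir, getcwd, rmdir, unlink */

/* Remove every entry in the current working directory (steps 2 and 3).
   Returns 0 on success, -1 on failure. */
static int empty_cwd(void)
{
    DIR *dp;
    struct dirent *entry;
    int status = 0;

    if ((dp = opendir(".")) == NULL)
        return -1;
    while (status == 0 && (entry = readdir(dp)) != NULL) {
        const char *name = entry->d_name;
        struct stat info;

        if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
            continue;
        if (lstat(name, &info) == -1) {
            status = -1;
        } else if (S_ISDIR(info.st_mode)) {
            /* subdirectory: descend, empty it, ascend, remove it */
            if (chdir(name) == -1) {
                status = -1;
            } else {
                status = empty_cwd();
                if (chdir("..") == -1)
                    status = -1;
                if (status == 0 && rmdir(name) == -1)
                    status = -1;
            }
        } else {
            /* file: lower the protection, then delete explicitly */
            if (S_ISREG(info.st_mode))
                chmod(name, S_IRUSR | S_IWUSR);
            if (unlink(name) == -1)
                status = -1;
        }
    }
    closedir(dp);
    return status;
}

/* Delete the directory tree rooted at root (steps 1 and 4).
   Returns 0 on success, -1 on failure. */
int delete_tree(const char *root)
{
    char home[FILENAME_MAX];
    int status;

    assert(root != NULL);
    if (getcwd(home, sizeof home) == NULL)
        return -1;
    if (chdir(root) == -1)                  /* step 1 */
        return -1;
    status = empty_cwd();
    if (chdir(home) == -1)
        return -1;
    if (status == 0 && rmdir(root) == -1)   /* step 4 */
        status = -1;
    return status;
}
```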

All three of standard input, standard output, and standard error are redirected by the time the shell command passed to system is executed. When you issue a shell command to delete all files in a directory (del *.*), MS-DOS prompts you for a confirmation (Y or N). To bypass this prompt, I redirect standard output to the null device (the "bit bucket") so the prompt doesn't appear. I also create a file containing the letter Y and redirect standard input to come from that file, so there is no pause for console input. If any files can't be deleted (or if there are none to delete), MS-DOS sends a message to standard error. I redirect standard error to the null device before calling system so these messages also will not appear.

The functions in this article give you the control over your environment that is critical for robust applications. While these techniques aren't universally portable, they apply to the many platforms that are POSIX compliant (or soon will be, e.g., OpenVMS and Windows NT).

Sidebar: "Defensive Programming with assert"