March 1994/Code Capsules

Code Capsules

The Preprocessor

Chuck Allison

To use C effectively you really have to master two languages: the C language proper and the preprocessor. Before a compiler begins the usual chores of syntax checking and instruction translation, it submits your program to a preliminary phase called preprocessing, which alters the very text of the program according to your instructions. The altered text that the compiler sees is called a translation unit. In particular, the preprocessor performs the following three functions for you:
1) header/source file inclusion
2) macro expansion
3) conditional compilation
In this article I will illustrate these features of the preprocessor.

The Include Directive
One of the first source lines any C programmer sees or composes is this:

#include <stdio.h>
Take a moment right now and jot down everything you know about this statement.
Let's see how you did. stdio.h is of course a standard library header, so called because such include directives usually appear near the beginning of a source file so that their definitions will be in force throughout the rest of the compilation. We commonly think of it as a header file, but there is no requirement that the definitions and declarations pertaining to standard input and output reside in a file. The C Standard only requires that these definitions and declarations replace the include directive in the text of the program before translation. They could reside in tables internal to the preprocessor. Most compiler implementations do supply header files for the standard library, however. MS-DOS compilers install header files in a suitable subdirectory. Here is a sampling:

\BC4\INCLUDE     /* Borland C++ 4.0 */
\MSVC\INCLUDE    /* Microsoft Visual C++ */
\WATCOM\H        /* Watcom C/C++ */
On UNIX systems you will find header files in /usr/include.
Since an implementation is not even obliged to supply headers in the form of physical files, it's no surprise that those implementations providing files don't always give them the same name as the include directive. After all, how could a compiler supply a file named stdio.h on a platform whose file system didn't allow periods in a file name? On MS-DOS systems there can be no file that exactly matches the C++ header <strstream.h>, because the file system only allows up to eight characters before the period.
Most MS-DOS implementations map header names into file names by truncating the base part (the portion before the period) to eight characters, and the extension to three (so the definitions for <strstream.h> reside in the file STRSTREA.H). A standard-conforming implementation must supply a mapping to the local file system for user-defined header names having at least six characters before the period and one character after.
Conforming compilers also support include directives with string-like arguments, as in:

#include "mydefs.h"
The string must represent a name recognized by the local file system. The file must be a valid C/C++ source file and, like the standard headers, usually contains function prototypes, macro definitions, and other declarations. An implementation must specify the mechanism it uses to locate the requested source file. On platforms with hierarchical file systems, the compiler usually searches the current directory first. If that fails, it then searches the subdirectory reserved for the standard headers. Because standard header names are special preprocessing tokens and not strings, any backslashes in a header name are not interpreted as escape characters. In the following directive, a double backslash is not needed.

#include <sys\stat.h>    /* \, not \\ */
#include "\project\include\mydefs.h"
Included files may themselves contain other include directives, and a conforming implementation must support nesting at least eight levels deep. Since some definitions (like typedefs) may appear only once during a compilation, you must guard against the possibility of a file being included more than once. The customary technique defines a symbol associated with the file and excludes the text of the file from the compilation if the compiler has already seen that symbol, as in the following:

/* mydefs.h */
#ifndef MYDEFS_H
#define MYDEFS_H

<declarations/definitions go here>

#endif

Macros
As you can see, there's more to the #include directive than meets the eye. C provides eleven other preprocessor directives you can use to alter your source text in meaningful ways (see Table 1). (All begin with the '#' character, which must be the first non-space character on its source line.) In this section I elaborate on one of the other directives, the #define directive, to introduce a very useful construct called a macro.
The #define directive creates macro definitions. A macro is a name for a sequence of zero or more preprocessing tokens. (Valid preprocessing tokens include valid C language tokens such as identifiers, strings, numbers and operators; and any single character). For example, the line
#define MAXLINES 500
associates the text 500 with the symbol MAXLINES. The preprocessor keeps a table of all symbols created by the #define directive, along with the corresponding replacement text. Whenever the preprocessor encounters the token MAXLINES outside of a quoted string or comment, it replaces MAXLINES with the token 500. In later phases of compilation it appears as if you actually typed 500 instead of MAXLINES. It is important to remember that this operation consists of mere text replacement. No semantic analysis occurs during preprocessing.
A macro without parameters, such as MAXLINES, is sometimes called an object-like macro because it defines a program constant that looks like an object. Because object-like macros are often constants, it is customary to type them in upper case as a hint to the reader. You can also define function-like macros with zero or more parameters, as in the following code fragment:
#define beep()   putc('\a', stderr)
#define abs(x)   ((x) >= 0 ? (x) : (-(x)))
#define max(x,y) (((x) > (y)) ? (x) : (y))
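In the definition, there must be no whitespace between the macro name and the left parenthesis that begins the parameter list; otherwise the preprocessor treats the line as an object-like macro whose replacement text starts with the parenthesis. The expression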
abs(-4)
expands to
((-4) >= 0 ? (-4) : (-(-4)))
You should always parenthesize macro parameters (such as x) in the replacement text. This practice prevents surprises from unexpected precedence in complex expressions. For example, if you had used the following naive mathematical definition for absolute value:
x >= 0 ? x : -x
then the expression abs(a - 1) would expand to
a - 1 >= 0 ? a - 1 : -a - 1
which is incorrect when a - 1 < 0 (it should be -(a - 1)).
Even if you put parentheses around all arguments, you should usually parenthesize the entire replacement expression as well to avoid surprises with respect to the surrounding text. To see this, define abs() without enclosing parens, as in:
#define abs(x) (x) >= 0 ? (x) : -(x)
Then abs(a) - 1 expands to
(a) >= 0 ? (a) : -(a) - 1
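which is incorrect for non-negative values of a: the conditional operator has lower precedence than subtraction, so the - 1 attaches only to the third operand, and the result is a rather than a - 1.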
It is also dangerous to use expressions with side effects as macro arguments. For example, the macro call abs(i++) expands to
((i++) >= 0 ? (i++) : (-(i++)))
No matter what the value of i happens to be, it gets incremented twice, not once, which probably isn't what you had in mind.

Pre-defined Macros
Conforming implementations supply the five built-in object-like macros shown in Table 2. The last three macros remain constant during the compilation of a source file. Any other pre-defined macros that a compiler provides must begin with a leading underscore followed by either an uppercase letter or another underscore.
You may not redefine any of these five macros with the #define directive, nor remove them with the #undef directive. Most compilers support multiple modes, some of which are not standard-conforming. (To guarantee that the sample program in Listing 1 will run correctly under Borland C, for example, you need to run in "ANSI mode" via the "-A" command-line option.)
Conforming compilers also provide a function-like macro, assert, which you can use to put diagnostics in programs. If its argument evaluates to zero, assert prints the argument along with the source file name and line number (using __FILE__ and __LINE__) to the standard error device and aborts the program (see Listing 2). For more information on using the assert macro, see the Code Capsule "File Processing, Part 2" in the June 1993 issue of CUJ.
A compiler is allowed to provide macro versions for any functions in the standard library (getc and putc usually come as macros for efficiency). With the exception of a handful of required function-like macros (assert, setjmp, va_arg, va_end, and va_start), an implementation must also supply true function versions for all functions in the standard library. A macro version of a library function in effect hides its prototype from the compiler, so its arguments are not type-checked during translation. To force the true function to be called, remove the macro definition with the #undef directive, as in
#undef getc
Alternatively, you can surround the function name in parentheses when you call it, as in:
c = (getc)(stdin);
There's no danger of this expression matching the macro definition since a left parenthesis does not immediately follow the function name.

Conditional Compilation
You can selectively include or exclude segments of code with conditional directives. For example, you can embed the following excerpt in your code to accommodate different syntaxes of the delete operator in earlier versions of C++:

#if VERSION < 3
    delete [strlen(p) + 1] p;
#else
    delete [] p;
#endif
Your compiler probably supplies a macro similar to VERSION (Borland C++ defines __BCPLUSPLUS__, Microsoft _MSC_VER). The argument of an #if directive must evaluate to an integer constant, and it obeys the usual C rule that non-zero means true and zero means false. You cannot use casts or the sizeof operator in such expressions.
C++ implementations also pre-define the macro __cplusplus, which you can use to customize your code for mixed C/C++ environments. For example, if you want to link with existing C code in a C++ environment, you need to use the extern "C" linkage specification (which of course is not valid in a C environment). The following excerpt will do the right thing in either environment:

#ifdef __cplusplus
extern "C" {
#endif

<put C declarations here>

#ifdef __cplusplus
}
#endif
The #if directive is handy when you want to comment out long passages of code. You can't just wrap such sections in a single, enclosing comment because there are likely to be comments in the code itself (right?), causing the outer comment to end prematurely. It is better to enclose the code in question in the body of an #if directive that always evaluates to zero:

#if 0
<put code to be ignored here>
#endif

Preprocessor Operators
Sometimes you just want to know if a macro is defined, without using its value. For example, if you only support two compilers, you might have something like the following in your code:

#if defined _MSC_VER
<put Microsoft-specific statements here>
#elif defined __BCPLUSPLUS__
<put Borland-specific statements here>
#else
#error Compiler not supported.
#endif
defined is one of three preprocessor operators (see Table 3). The defined operator evaluates to 1 if its argument is present in the symbol table, meaning that the macro was either defined by a previous #define directive or the compiler provided it as a built-in macro. The #error directive prints its argument as a diagnostic and halts the translator.
It isn't necessary to assign a value to a macro. For example, to insert debug trace code into your program, you can do the following:
#if defined DEBUG
fprintf(stderr,"x = %d\n",x);
#endif
To define the DEBUG macro, just insert the following statement before the first use of the macro:
#define DEBUG
The following equivalences are recognized by the preprocessor:
#if defined X    <==>      #ifdef X
#if !defined X   <==>      #ifndef X
Using the defined operator is more flexible than the equivalent directives on the right because you can combine multiple tests as a single expression, as in:
#if defined __cplusplus && !defined DEBUG
The operator #, the "stringizing" operator, effectively encloses a macro argument in quotes. As the program in Listing 3 illustrates, stringizing can be useful for debugging. The trace() macro encloses its arguments in quotes so they become part of a printf format string. For example, the expression trace(i,d) becomes
printf("i" " = %" "d" "\n",i);
and after the compiler concatenates adjacent string literals, it sees this:
printf("i = %d\n",i);
There is no way to build quoted strings like this without the stringizing operator because the preprocessor ignores macros inside quoted strings.
The token-pasting operator, ##, concatenates two tokens together to form a single token. The call trace2(1) in Listing 4 is translated into
trace(x1,d)
Any space surrounding these two operators is ignored.

Implementing assert()
Implementing assert reveals an important fact about using macros. Since the action of assert depends on the result of a test, you might first try an if statement, as in:

#define assert(cond) \
    if (!(cond)) __assert(#cond,__FILE__,__LINE__)
where the function __assert prints the message and halts the program. This implementation causes a problem, however, when assert finds itself within an if statement, as in:

if (x > 0)
    assert(x != y);
else
    /* whatever */
because the preceding code expands into

if (x > 0)
    if (!(x != y)) __assert("x != y","file.c",7);
else
    /* whatever */
The indentation that results from expanding assert in place is misleading because it's actually the second if that intercepts the else. Rewriting the expanded code to represent the actual flow of control produces:

if (x > 0)
    if (!(x != y))
        __assert("x != y","file.c",7);
    else    /* OOPS! New control flow! */
        /* whatever */
The usual fix for nested if problems such as this is to use braces, as in:

#define assert(cond) \
    {if (!(cond)) __assert(#cond,__FILE__,__LINE__);}
but this code expands into

if (x > 0)
    {if (!(x != y)) __assert("x != y","file.c",7);};
else
    /* whatever */
and the combination }; in the second line is the culprit: the closing brace already completes the outer if, so the semicolon after it is a separate null statement, and the else that follows has no if to attach to, which is a syntax error. Listing 5 shows a correct way to define assert (this simple version does not recognize the macro NDEBUG), and Listing 6 shows the implementation of the support function __assert(). In general, when a macro must make a choice, it is good practice to write it as an expression and not as a statement.

Macro Magic
It's important to understand precisely what steps the preprocessor follows to expand macros; otherwise you can be in for some mysterious surprises. For example, if you insert the following line near the beginning of Listing 4:
#define x1 SURPRISE!
then trace2(1) expands into
trace(x ## 1,d)
which in turn becomes
trace(x1,d)
But the preprocessor doesn't stop there. It rescans the line to see if any other macros need expanding. The final state of the program text seen by the compiler is shown in Listing 7.
To further illustrate, consider the text in Listing 8. Listing 8 is not a complete program, by the way, but is for preprocessing only — don't try to compile it all the way. (If you have Borland C use the CPP command.) The output from the preprocessor appears in Listing 9. The str() macro just puts quotes around its argument. It might appear that xstr() is redundant, but there is an important difference between xstr() and str(). The output of the statement str(VERSION) is of course
"VERSION"
but xstr(VERSION) expands to
str(2)
because arguments not connected with a # or ## are fully expanded before they replace their respective parameters. The preprocessor then rescans the statement, producing "2". So in effect, xstr() is a version of str() that expands its argument before quoting it.
The same relationship exists between glue() and xglue(). The statement glue(VERSION,3) concatenates its arguments into the token VERSION3, but xglue(VERSION,3) first expands VERSION, producing
glue(2,3)
which in turn rescans into the token 23.
The next two statements are a little trickier:
       glue(VERS,ION)
       == VERS ## ION
       == VERSION
       == 2
and
       xglue(VERS,ION)
       == glue(VERS,ATILE)
       == VERS ## ATILE
       == VERSATILE
Of course, if VERSATILE were a defined macro it would be further expanded.
The last four statements in Listing 8 expand as follows:
       ID(VERSION)
       == "This is version "xstr(2)
       == "This is version "str(2)
       == "This is version ""2"
       
       INCFILE(VERSION)
       == xstr(glue(version,2)) ".h"
       == xstr(version2) ".h"
       == "version2" ".h"
       
       str(INCFILE(VERSION))
       == #INCFILE(VERSION)
       == "INCFILE(VERSION)"
       
       xstr(INCFILE(VERSION))
       == str("version2" ".h")
       == #"version2" ".h"
       == "\"version2\" \".h\""
For obvious reasons, the # operator effectively inserts escape characters before all embedded quotes and backslashes.
The macro replacement facilities of the preprocessor clearly offer you an incredible amount of flexibility (too much, some would say). There are two limitations to keep in mind:
1) If at any time the preprocessor encounters the current macro in its own replacement text, no matter how deeply nested in the process, the preprocessor does not expand it but leaves it as-is (otherwise the process would never terminate!). For example, given the definitions
   #define F(f) f(args)
   #define args a,b
F(g) expands to g(a,b), but what does F(F) expand to? (Answer: F(a,b)).
2) If a fully-expanded statement resembles a preprocessor directive (e.g., if expansion results in an #include directive), the directive is not invoked, but is left verbatim in the program text. (Thank goodness!)

Character Sets and Trigraphs
The character set you use to compose your program doesn't have to be the same as the one in which the program executes. These two character sets often differ in non-English applications. A C translator only understands the source character set — English alphanumerics, the graphics characters used for operators and punctuators (there are 29 of them), and a few control characters (newline, horizontal tab, vertical tab, and form-feed). Any other characters presented to the translator may appear only in quoted strings, character constants, header names or comments. The execution character set is the set of characters that the program uses in its literals, and to input and output data. This set is implementation-defined, but must at least contain characters representing alert ('\a'), backspace ('\b'), carriage return ('\r'), newline ('\n'), form feed ('\f'), vertical tab ('\v'), and horizontal tab ('\t').
Many non-U.S. environments use different graphics for some of the elements of the source character set, making it impossible to write readable C programs. To overcome this obstacle, standard C defines a number of trigraphs, which are triplets of characters from the Invariant Code Set (ISO 646-1983) found in virtually every environment in the world. Each trigraph corresponds to a character in the source character set that is not in the invariant set (see Table 4). For example, whenever the preprocessor encounters the sequence ??= anywhere in your source text (even in strings), it replaces the sequence with the '#' character code from the source character set. The program in Listing 11 shows how to write the "Hello, world!" program from Listing 10 using trigraphs. (Borland users: you have a separate executable, trigraph.exe, for processing trigraphs.)
In an effort to enable more readable programs world-wide, the C++ draft standard defines a set of digraphs and new keywords for developers whose character sets lack some ASCII graphics (see Table 5). Listing 12 shows what "Hello, world" looks like using these new tokens. Perhaps you will agree that the symmetric look of the bracketing operators is easier on the eye.

Phases Of Translation
The C standard defines eight distinct phases of translation. An implementation doesn't make eight separate passes through the code, of course, but the result of translation must behave as if it had. The eight phases are:
1. Physical source characters are mapped into the source character set. This includes trigraph replacement and things like mapping a carriage return/line feed to a single newline character in MS-DOS.
2. All lines that end in a backslash are merged with their continuation line, and the backslash is deleted.
3. The source is parsed into preprocessing tokens and comments are replaced with a single space character. The C++ digraphs are recognized as tokens.
4. Preprocessing directives are invoked and macros are expanded. Phases 1 through 4 are repeated for any included files.
5. Escape sequences in character constants and string literals that represent characters in the execution set are converted (e.g., '\a' would be converted to a byte value of 7 in an ASCII environment).
6. Adjacent string literals are concatenated.
7. Traditional compilation occurs: lexical and semantic analysis, and translation to assembly or machine code.
8. Linking occurs: external references are resolved and a program image is made ready for execution.
The preprocessor performs phases 1 through 4.

C++ And The Preprocessor
C++ preprocessing formally differs from that of C only in the tokens it recognizes. A C++ preprocessor must recognize the tokens in Table 5 as well as .*, ->*, and ::. It must also recognize //-style comments and replace them with a single space. Though C++'s preprocessor isn't much different from C's, you may want to use it quite differently. For example, as far as I can tell, there is no good reason to define object-like macros anymore. You should use const variable definitions instead. The statement
const int MAXLINES = 500;
has a couple of advantages over
#define MAXLINES 500
Since the compiler knows the semantics of the object, you get stronger compile-time type checking. You can also reference const objects like any other with a symbolic debugger. Global const objects have internal linkage unless you explicitly declare them extern, so you can safely replace all your object-like macros with const definitions.
Function-like macros are almost unnecessary in C++. You can replace most function-like macros with inline functions. For example, replace the max macro as shown previously with
inline int max(int x, int y)
{
    return x >= y ? x : y;
}
Note that you don't have to worry about parenthesizing to avoid precedence surprises, because this code defines a real function, with scope and type checking. You also don't have to worry about side effects like you do with macros, such as in the call
max(x++,y++)
The macro version may seem superior to the inline function because it accepts arguments of any type. No problem. Define max as a template, as in the following code; now the inline function will accept arguments of any type:
template<class T>
inline const T& max(const T& x, const T& y)
{
    return x > y ? x : y;
}
Do keep in mind, however, that inline is only a hint to the compiler. Not all functions are amenable to inlining, especially those with loops and complicated control structures. Your compiler may tell you that it can't inline a function. Still, in many cases it is better to define a function out-of-line than to define it as a macro and lose the type safety that a real function affords.
There is still room in C++ for function-like macros that use the stringizing or token-pasting operators. The program in Listing 13 uses stringizing and an inline function to test the new string class available with Borland C++ 4.0.

Conclusion
The preprocessor doesn't know C or C++. It is a language all its own. Many library vendors have used the preprocessor intelligently to simplify the installation and use of their products. I encourage you to use it, but to use it prudently. It has some dark corners, which I've purposely avoided. It is good practice, especially with C++, to do as much as you can in the programming language, and use the preprocessor only when you need to.