Code Capsules

Text Processing I: The Finer Points of scanf

Chuck Allison


Welcome to Code Capsules! You've had some exposure to C, perhaps a little training (whether formal or self-taught), and now you're ready to put the world's most popular programming language to work. All you need is a little "culture". This column will increase your mastery of the idioms, conventions, and philosophy which make programming in C (and C++) so productive and enjoyable.

Style and usage are best communicated by example, so each "capsule" emphasizes code: short programs or program excerpts that illustrate techniques professionals use but that are rarely taught in a formal setting.

An important prerequisite to the mastery of C is a thorough understanding of the standard C library. Often, novices duplicate in their own code much of what has already been developed and standardized for them. Therefore, these capsules make extensive use of the standard C library.

scanf

Most functions in the standard C library are engineered to do one simple task well. An exception to this rule is scanf. It attempts to handle most of what needs to be done with the input of text, which is no simple task. Although it is usually one of the first functions a C programmer learns to use, it is among the last to be mastered. This capsule illustrates some of the finer points of scanf.

How scanf Works

Since most input items are separated by whitespace, scanf treats each space in its format string as a request to consume any amount of whitespace in the input, including none. (There is never a need to have two consecutive space characters in a scan format.) For example, the statements

    int n;
    char c;
    scanf("%d %c",&n,&c);

read an integer, followed by any amount of whitespace (including none at all), and finally a single, non-whitespace character. (The scan is not completed until a non-whitespace character is found.) This means that if the input stream contains either

    123a

or

    123

    a

the result is the same: n == 123 and c == 'a'.

The program in Listing 1 allows us to separate input items by commas instead. Non-whitespace characters in a scan format, such as our comma here, must appear correspondingly in the input stream. If they do not, as is the case with the last example input line, scanf stops at the mismatch and returns the number of items assigned up to that point, which is fewer than the number requested. (It normally returns the number of items successfully assigned; it returns EOF only when end-of-file or a read error occurs before the first conversion.) Note the whitespace around the comma in the format string. Without this, only the first input line would have succeeded.

The occurrence of %*c is an example of assignment suppression, meaning that the corresponding input is consumed but not stored. Its purpose here is to consume the newline character (it is assumed that the character a is followed immediately by a newline). Note that suppressed assignments do not contribute to the count returned by scanf. When such programs are executed interactively, it is easy to get out of sync with the user. For example, if the user enters

    123,abc

by mistake, the program will fail when it encounters the c (the b was consumed by the assignment suppression). With interactive programs, it is better to read an entire line, and then scan that line with sscanf. The program in Listing 2 ignores extraneous input, reports incorrect input, and then continues execution until the user signals an end-of-file.

It is also possible to limit the number of characters consumed for an item by adding a field width to the format descriptor (see Listing 3).

Scansets

It is often convenient to control which characters are read into certain variables. This is done with a scanset, which is a format descriptor consisting of the set of acceptable characters enclosed in brackets. The example in Listing 4 uses this technique to read a string of binary digits.

Scansets differ from other format descriptors in that they do not skip initial whitespace. This technique can now be employed in a useful function (let's call it fgetb) that skips initial whitespace and returns a binary number from a given input stream (see Listing 5). Using the same input as in the previous example, the output is

    The number was 22

Scansets are also useful when input occurs in a fixed format, as in database applications, for example. The program in Listing 6 expects a line with four input items: a string, followed by two integers, followed by another string, all separated by commas.

When a circumflex (^) occurs as the first element of a scanset, it reverses its meaning, i.e., the scan should collect characters that are not in the scanset. In plain English, the scan format in Listing 6 says: "Skip any initial whitespace, then build a string consisting of all characters up to the next comma, then ignore that comma, read two integers, ignoring the intervening comma, then skip any whitespace, and finally, collect all the remaining characters in the line as a string." The %i format descriptor reads integers according to their base prefixes (i.e., 0x for hexadecimal, 0 for octal, decimal otherwise).

A final example illustrates a use for the %n descriptor, which stores the total number of input characters consumed so far in the int pointed to by the corresponding argument. The program extracts tokens (in this case, strings separated by whitespace) from standard input, one line at a time (see Listing 7).