June 1994/Code Capsules

Code Capsules

Control Structures

Chuck Allison

It has been 26 years since Edsger Dijkstra wrote his famous letter entitled "GOTO Statement Considered Harmful," which, as the title suggests, made a strong case for never using a branching construct in a program [1]. In another letter he wrote, "... for some time I knew that, as a programmer, I could live quite happily without any form of the goto statement, but ... in the meantime my considered opinion is that I cannot live happily with the goto statement." [2] There followed a notorious movement to eliminate the goto construct altogether. Professors refused to accept students' programming assignments if they contained a goto. Languages were designed without a goto construct. The upside of all this upheaval was that as an industry we learned to think "structured," that is, to organize our software in a more logical and maintainable fashion. The downside is that we abandoned a potentially useful construct — sometimes a goto is just what the doctor ordered. (By the way, almost all current programming languages in widespread use support some form of goto).
In this article I will discuss the mechanisms available to you in C to structure the logic of your programs. These mechanisms include not only the goto-less constructs of structured programming, but also branching techniques that range from the simplistic to the sophisticated: break, goto, non-local jumps (with setjmp/longjmp), and signals.

Structured Programming
In 1966 Bohm and Jacopini proved mathematically that it is possible to express any algorithm in terms of only three constructs, along with an arbitrary number of boolean flags [3]. The three constructs are:
1. sequences of statements
2. alternation (e.g., if-else, switch)
3. repetition (e.g., while, for, do)
We usually call programs that use only these mechanisms structured programs. Unfortunately, when structured-design gurus began their hard-sell to the industry, most programming professionals were making a living with COBOL, FORTRAN IV, and assembly language, while BASIC was gaining popularity in educational settings. Of these languages, only COBOL had the else construct. Telling a FORTRAN programmer (such as me) that he shouldn't use a goto usually resulted in him telling you where to go. Besides, we were proud of our expert ability to cram complex logic into as few lines as possible — who cared if the lame-brains couldn't decipher code of such "elegance?" We couldn't have done this Great Work without the goto P.J. Plauger has said, speaking of the structured programming revolution, "Us converts waved this interesting bit of news under the noses of the unreconstructed assembly-language programmers who kept trotting forth twisty bits of logic and saying, 'I betcha can't structure this.' Neither the proof by Bohm and Jacopini nor our repeated successes at writing structured code brought them around one day sooner than they were ready to convince themselves." [4]
Listing 1 shows a vintage BASIC program that plays the game of "Hi-lo." You think of a number between 1 and 100, and just tell the computer whether its guesses are high, low, or right on. As long you give honest responses, the binary search algorithm will find your number in seven guesses or less (because ceil(log2(100)) == 7). Compare this program to the C version in Listing 2. The BASIC version is one-third the size of the C version, but is it more readable? Is it as maintainable? Is the structure of the logic evident from the shape of the program? No, no, no!
As a consequence of Bohm and Jacopini's work, it became natural to use alternation and repetition as the basic tools of thought when designing a logical process. It soon was quite popular to describe procedures in structured prose, or pseudocode, a terse, verbal description governed by loops and conditions. With pseudocode, you describe conditionals much the same as in any of the languages that support the if-else construct:
if <condition>
   do this...
else
   do that
To choose from a group of related values, you can use the case statement:
case x of
   1: do this...
   2: do that...
   3: do the other thing...
Structured programming allows separate controls for the various flavors of loops. Most of the time you need a control that tests its condition before entering the loop body:
while <condition>
   do this...
A for loop is good for processes that must execute a specific number of times, as in
for i = 1 to n
   do something dependent on i
or for processing a sequence of items:
for each student on the roster
   calculate something...
On rare occasion you might need a loop that tests its controlling condition after executing the loop body:
repeat
   get user input
until input is in the range [1..5]
Listing 3 shows a pseudocode description of a process to merge two sorted text files. This code reads a line from each file and prints to standard output the line that lexicographically precedes the other. The process repeats until it empties one of the files, at which point it just outputs the rest of the other file.

Careful Conditioning
It is important that you relate a loop's condition clause to your code's design in a direct and understandable way. To convince yourself that you have correctly coded the condition clause, insert a comment at the beginning of the loop body that states what the condition of the loop means in terms of the task to be completed or problem to be solved. If the comment clearly expresses the solution you had in mind, you've crafted an effective loop. Do the same thing at the end of the loop.
The implementation of the merge procedure in Listing 5 shows an example of this practice. (The input data is in Listing 4. ) You can easily extend the merge algorithm to process a larger number of files. At first glance you might think that the loop control statement
while (neither file has been exhausted)
should become
while (no file has been exhausted)
But if you have n files, when that loop finishes you will still have n-1 active files to continue merging. You need to stay inside the merge loop until only one file is still active:
while (number of active files > 1)
   etc.
This condition clause requires that you keep a list of active files, from which you will delete an entry when its file runs out of input lines. The design now looks like this:
while (number of active files > 1)
{
   determine which line comes next
   print that line
   if (the corresponding file
      is now empty)
      delete it from the list
      of active files
   else
      get a fresh line from the file
}
empty the remaining file
Since you have introduced a mechanism to track the active files, you might as well stay in the loop until the last file is empty, and there is nothing else to do. The pseudocode in Listing 6 reflects this change, as do the invariants in the implementation in Listing 7.

Branching
Even if it is possible to program without gotos, it is often convenient to branch out of the middle of a loop (a special case of a goto). Using C's break statement can help you avoid littering your code with extraneous loop control variables. For example, the variables done and found in Listing 2 exist only to terminate their loops. Most C programmers do without them (see Listing 8, where I use a break instead).
Suppose you decide to rewrite the if-else statements under the comment /* Narrow the search range */ as a switch statement:
/* Narrow the search range */
switch(response)
{
   case 'L':
      lo = guess + 1;
      break;
   case 'H':
      hi = guess - 1;
      break;
   case 'Y':
      puts("What? Try again...");
      break;
   default:
      break;
     /* This doesn't work! */
}
Unfortunately the break under the default case exits the switch, not the loop. In this case you can use a goto or use the if-else construct as before.
Exiting deeply nested loops with a goto can be much clearer than the boolean flag approach the purists advocate. Consider this excerpt that searches a 3-dimensional array for a certain value:
for (i = 0; i < n; ++i)
   for (j = 0; j < m; ++j)
      for (k = 0; k < s; ++k)
         if (a[i][j][k] == x)
            goto found;
found:
/* use i, j, and k here */
A naive attempt at a non-branching version gives:
for (done = 0, i = 0; !done && i < n; ++i)
   for (j = 0; !done && j < m; ++j)
      for (k = 0; !done && k < s; ++k)
         if (a[i][j][k] == x)
            done = 1;

/* Fix indexes (yuck!) */
--i, --j, --k;
Even if you don't mind the way this code looks, it's poorly designed because it increments the loop variables beyond the correct values, since it passes control to the enclosing loops all the way to the top level, instead of exiting immediately.
Using the goto as a multi-level loop exit can help restore things to a stable state after a serious error, as in:
main()
{
/* Some one-time initialization here */

recover:

   for (;;)
   {
/* Some more initialization here  */

/* Things get deeply nested...  */

        if (<things go crazy>)
           goto recover;
   }
}
This is a common technique in hardware control programs.

Non-Local Branching
What if you encounter an exceptional condition deep within a set of nested function calls and your recovery point is back in a function higher up in the call chain (like main)? What you need is a "super goto" that branches across functions. That's what the setjmp/longjmp mechanism is for. You record a point to return to with setjmp, and branch to it with longjmp. Here's the syntax:

#include <setjmp.h> jmp_buf recover; main() { volatile int i = 0; for(;;) { if (setjmp(recover) != 0) { /* Recover from error in f() */ } /* Get deeply nested... */ } return 0; } ... void f() { /* Do some risky stuff */ if (<things go crazy>) longjmp(recover,1); /* else carry on */ }
A jmp_buf is an array that holds the system information necessary to restore execution at the setjmp point. (Obviously, a jump buffer must be global or passed as a pointer argument). When you call setjmp, the system stores the calling environment (such as the contents of the stack and instruction pointer registers, etc.) in recover. setjmp always returns zero when called directly. A call to longjmp restores the calling environment, so execution continues back at the point of the setjmp call. At this point, the program appears to return yet again from setjmp, with one difference from before: it now appears as if setjmp has returned the second argument from the longjmp call (in this case a 1). Note however that if you give longjmp a second argument of 0, setjmp still returns a 1 because C forbids setjmp to return a 0 upon return from a longjmp. In this one case the compiler drops in a "hard-wired" return value. Since a longjmp performs an alternate return from a function, it interrupts the normal flow of a program, so you should use it only to handle unusual conditions.
You need to take a few precautions when using setjmp/longjmp. For one thing, the function containing the setjmp target must still be active (i.e., must not yet have returned to its caller), otherwise the calling environment will no longer be valid. And of course, it is a bad idea to have more than one setjmp target with the same jmp_buf variable (now that would be confusing!). Since calling environments typically involve registers, and since a compiler is free to store automatic variables in registers, you have no idea if an automatic object will have the correct value when you return from a longjmp. You should declare automatics in any function containing a setjmp with the volatile qualifier, which guarantees that the inner workings of longjmp will leave them undisturbed.
There is another consequence of not being able to trust automatic storage during a longjmp: the setjmp call itself must not appear in an expression that may generate a temporary object, such as an intermediate computation. For this reason a call to setjmp may only appear as the sole controlling expression in an if or switch statement, or as an operand in an equality expression. You cannot reliably assign the result of setjmp to another variable. The program in Listing 9, which deletes subdirectory trees, calls setjmp from within a switch statement. (If you're interested in the details of this algorithm, see "Code Capsules: File Processing, Part 2," CUJ, June 1993.).

Signals
A signal occurs when an unusual event interrupts the normal execution of a program, such as a divide-by-zero error or when the user presses the attention key (e.g., Control-C or DEL). The header <signal.h> defines the following "standard" signals:

SIGABRT abnormal termination (raised by abort()) SIGFPE computational exception (e.g., overflow) SIGILL invalid function image (e.g., illegal instruction) SIGINT interactive attention (e.g., Control-C) SIGSEGV attempt to access protected memory SIGTERM termination request
These signals originated on the PDP=11 architecture under UNIX, and may not all apply to your environment. An implementation may also define other signals, or may ignore signals altogether, so signal handling is by nature non-portable.
Signals come in two flavors: synchronous and asynchronous. A synchronous signal is one that your program raises, such as by dividing by zero, overflowing a floating-point operation, or issuing a call to the abort function. You can raise a signal explicitly with a call to raise, for example:

raise(SIGABRT);
An asynchronous signal occurs due to forces external to your program, such as a user pressing the attention key.
The default response to most signals is usually to abort the program, but you can either arrange for signals to be ignored or provide your own custom signal handlers. The following statement from Listing 10
signal(SIGINT,SIG_IGN);
tells the environment to ignore any keyboard interrupt requests (Control-C on my machine). The keyboard input process still echoes the ^C token on the display, but it does not pass control to the default signal-handling logic that terminates the program. The statement
signal (SIGINT, SIG_DFL)
restores the default behavior, so you can halt the program from the keyboard.
The program in Listing 11 invokes the library function signal to install a function (abort_handler) to respond to SIGABRT. A signal handler must take an integer argument and return void. The first call to abort transfers control to abort_handler, passes SIGABRT as the integer argument (which I don't happen to use in this case), and prints the message
Abort signal intercepted
When execution enters a signal handler, the environment considers the associated signal "handled" before the very first statement executes. This breaks any subsequent association between your handler and the signal, and restores the default handling. This means that if your handler returns, it will not get control back the next time that signal is raised. That is why when control resumes where it was interrupted in main, and it calls abort the second time, the program actually terminates. In practice an abort handler should not return but should terminate the program with a call to exit instead (in this example you certainly shouldn't call abort!). Likewise, if you return from a SIGFPE handler the results are undefined.
If you want to handle a signal every time it occurs, your handler should reinstall itself every time it executes. As the SIGINT handler (ctrlc) in Listing 12 illustrates, you should do this at the beginning of an asynchronous signal handler. If you put it off until later in the function, the same signal can occur before you've reinstalled it, causing you to lose the signal to the system default handler. In fact, there is very little an asynchronous signal can do safely. The signal you're processing could have occurred during the execution of a library function. Since the Standard C library is not reentrant, it would be a bad idea to call that library function from your handler. The rule of thumb is: Don't call library functions from within a function that handles an asynchronous signal. Nor should you call any function that might raise another signal. As the ctrlc function shows, there are only two safe things you can do before returning from an asynchronous handler:
1. Call signal to reinstall the handler, and
2. Assign a value to a static object of type volatile sig_atomic_t.
sig_atomic_t is an integral type that is guaranteed not to be interrupted by a signal while it is being accessed. You can use such a value as a flag or counter when normal execution resumes. Although redundant, it is a good practice to place an explicit return statement at the end of a handler that returns, to distinguish it from one that halts the progam.
In spite of what the preceding paragraph says about using library functions, it is nevertheless routine practice in many environments to longjmp out of an asynchronous signal handler. It works fine under DOS and most of the time under UNIX. As a common example, Listing 13 shows the outline of a command-line interpreter, a program that reads a sequence of text commands and performs the indicated functions (similar to sh in UNIX or DOS's COMMAND.COM). If you press the attention key from deep within the logic of an internal command, you want to return to the command loop, not terminate the program. longjmp allows you to do this.

Exit Handlers
This is as good a time as any to talk about exit handlers. With the atexit function you can register up to 32 functions which will be called when a program exits normally (i.e., not via abort). The functions execute in the reverse order that you registered them. The program in Listing 15 (pseudocode shown in Listing 14) is an external sort, a program that sorts files larger than can fit in memory. It does this by breaking the large file into smaller subfiles that do fit in memory. After sorting all subfiles the program merges them to standard output using the algorithm from Listing 6. If the program halts prematurely due to some external error, it may fail to delete some of the temporary subfiles from memory. To clean up the mess, the exit handler, remove_temps, removes any left over subfiles from memory (atexit appears near the beginning of sort in the listing). Note the goto in the loop that processes the command-line options. This program allows you to mix options in a single argument as in:
vsort -ir file (same as vsort -i -r file)
but expects any numeric characters to appear last in a token, as in:
vsort -n8 file
The program ignores any options that follow a numeric string; for example, it ignores v in:
vsort -n8v file
The goto serves as a multi-level break and continue at the same time, since it exits the loop that traverses a single argument to cycle on the loop that visits each argument. This program also illustrates the following concepts from previous Code Capsule articles:
November 1992:
Variable-length formatting
In-core formatting
April 1993:
Sorting with qsort
May 1993:
Command-line arguments
Temporary file names
I/O redirection
Deleting files
August 1993:
Pointer arithmetic

Summary
The C language supports structured programming as follows:

Alternation: if-else, switch, conditional (ternary) expression

Repetition: while, for, do
Clarity and economy sometimes require you to use the branching constructs:

Branching: break, continue, return, goto
And once in a great while, you need the non-traditional control mechanisms that the C library provides:

Non-local branching: setjmp, longjmp

Signals: signal, raise
The further you progress down the list, the more careful you have to be. Next month I'll explore C++'s contribution to flow control: exceptions.

References
1. E. Dijkstra, "goto Statement Considered Harmful," Communications of the ACM 11:3, p. 147, March 1968.
2. E. Dijkstra, "Stepwise Program Construction," printed in Selected Writings on Computing: A Personal Perspective, Springer-Verlag, 1982, p. 2.
3. C. Bohm & G. Jacopini, "Flow Diagrams, Turing Machines, and Languages with Only Two Formation Rules," Communications of the ACM 9:5, p. 266, May 1966.
4. P. J. Plauger, Programming on Purpose, Essays on Software Design, Prentice-Hall, 1993, p. 25.