Code Capsules

Sorting with qsort

Chuck Allison


Chuck Allison is a software architect for the Family History Department of the Church of Jesus Christ of Latter Day Saints Church Headquarters in Salt Lake City. He has a B.S. and M.S. in mathematics, has been programming since 1975, and has been teaching and developing in C since 1984. His current interest is object-oriented technology and education. He is a member of X3J16, the ANSI C++ Standards Committee. Chuck can be reached on the Internet at allison@decus.org, or at (801)240-4510.

If you've been programming for very long you've probably noticed that many of your programs spend a noticeable amount of time ordering data elements. In his classic work "Sorting and Searching" (The Art of Computer Programming, Vol. III, Addison-Wesley, 1973), Don Knuth reports that the average program spends over 25% of its time sorting. Sorting is useful for grouping related items, eliminating duplicates, comparing or merging related sets, and helping searches. You'll certainly want to use an efficient algorithm for such a frequent activity.

This article describes the use of qsort, the sort function supplied with the standard C library. In most compilers, qsort is an implementation of the Quicksoft algorithm, which is, on average, the most efficient of all general-purpose sorts. (However, an implementation is not required to use the Quicksort algorithm — it just needs to sort arrays according to the specifications illustrated below) .qsort is general purpose because you can use it to sort an array of elements of any type, with simple or compound sort keys. It only requires that any two elements can be compared and that the array is in memory .qsort modifies the array by swapping elements so that they come out in ascending order.

The program in Listing 1 shows how to use qsort to sort an array of integers. qsort (declared in <stdlib.h>) takes four arguments: the array to sort, the number of elements in the array, the size of each element in bytes, and a user-supplied compare function. A compare function, f, must have the prototype int f(const void *p1, const void *p2). (Using void *allows pointers to any type). qsort repeatedly calls the compare function, passing it pointers to two array elements. The compare function must behave like strcmp, i.e., it must return a negative number if the value that p1 represents precedes that of p2, 0 if the values are equal, and a positive number otherwise. In this example icomp does this by returning the difference of the two integers being compared. Note that the incoming pointers must be cast to the appropriate type (int * in this case) in order to reference the values to compare.

Listing 2 shows how to sort an array of strings (i.e., an array of pointers to char) in descending order. You simply supply qsort with a compare function that reverses the order of the sort by negating its return value. Since some_strings[] is an array of char *, qsort passes pointers to char * (i.e., char **) to the compare function scomp. scomp dereferences once and passes the pointers to char to strcmp. Don't overlook the fact that it is the pointers that qsort reorders, and not the strings themselves.

The program in Listing 3 sorts an array of structures in two different ways: first by name (last, then first), and then by age. Most of the time it is faster to swap pointers than structures. (Structures tend to be bigger than pointers.) When this is the case, it will be more efficient to pass qsort an array of pointers to structures instead of an array of actual structures. This of course requires another level of indirection in the compare functions (see Listing 4) .

How Generic Sorting Works

qsort is able to sort any array because you supply it with all it needs to know to do the job: where the elements are, how many, how big each one is, and how to compare them. To illustrate the technique, I'll take the simplest of comparison-sort algorithms, selection sort, and make it generic. Selection sort makes n--1 passes on an array of n elements. In the first pass, it compares the first element to the rest and swaps it with any that precede it in value (so after the first pass, the smallest element is now the first). It repeats the process with the second element, and so on until the array is in ascending order (see Listing 5) .

To make this process generic, I'll parameterize the four items mentioned above, just like qsort (I'll call it ssort — see Listing 6) . The statements

char *p1 = (char *) array + i*size;
char *p2 = (char *) array + j*size;
determine pointers to the ith and jth array elements. These addresses are then passed to the user-supplied compare function. If the elements are out of order, ssort puts them in order by swapping their associated blocks of storage. Listing 7 shows that usage of ssort is identical to qsort.

Indexes

Sometimes you want to access data in a certain order without actually disturbing its contents. This is especially important if you need multiple orderings repeatedly in the same program — you don't want to have to sort more than once for each ordering. This is the motivation for keeping indexes. If you don't mind having the data array global, indexes are easy with qsort. Listing 8 shows how to construct an index for an array of integers. idx is the index array (sometimes called a permutation vector) for some_ints. After initializing it to the "identity" permutation (i.e., idx[i] = i), call qsort to sort idx, not some_ints. You use idx as an indirection device to access some_ints in ascending order, although some_ints itself hasn't changed. Note that some_ints must be global because the compare function needs access to it.

Indexes have many uses. Listing 9 shows how to extract only unique elements from an array non-destructively. This program can easily be modified to print a listing of the unique tokens in a text file. The program in Listing 10 populates an array with the alphanumeric tokens read from standard input. It then prints only unique tokens in ascending order, while the original array still contains all the tokens in their original order (see Figure 1) . This is quite a feat for such a short program. Although this lazy technique is not worthy of a serious parser (since it picks up tokens that start with a numeral, or that appear in strings or comments) you may find a use for it. I also cheated a little by using ranges in the scansets. The format descriptor %*[^A-Za-z0-9] instructs scanf to ignore any contiguous sequence of non-alphanumeric characters, while %[A-Za-z0-9] matches any such sequence (for an explanation of scansets, see "Code Capsules: Text Processing I", CUJ, October 1992). But scanset ranges are not guaranteed to be portable in ANSI C, although many popular compilers support them. If yours does not, replace A-Za-z0-9 with all 62 alphanumeric characters spelled out.

There is of course a lot more to sorting. I have only considered internal sorting (i.e., when you are sorting an array confined to memory). Some other sorting techniques perform more efficiently than qsort by using the characteristics of the data to optimize the process (e.g., radix sort, which only works on integral types). Like Quicksort, qsort is not a stable sort. This means that elements with equal keys may lose their relative ordering after sorting (e.g., if you sort the data in Listing 3 by last name only, Edsel Ford may precede or follow Henry Ford — their relative ordering is indeterminate). If you are interested in external sorting or want to explore the innards of many different sorting algorithms, see Knuth's book cited in the beginning of this article. (Next month's column will also discuss external sorting.) In this column I have tried to illustrate how much mileage you can squeeze out of a single, well-designed sort function.