Code Capsules

Text Processing III: Substrings

Chuck Allison


Much of text processing concerns itself with substrings, that is, finding or extracting strings embedded within lines of text. The program find1.c (Listing 1) prints all lines from a text file that contain a given string.

Applying find1 to its own source file with "search" as the search string (argv[1]) gives the output

char *search_str;
search_str= argv[1];
if (strstr(line,search_str))

find1 calls the function strstr to determine if one string is a substring of another. strstr(s1,s2) returns a pointer to the first occurrence of s2 in s1, if it exists, or NULL if it doesn't. Only exact matches succeed (so the line with the comment containing Search didn't print).

To ignore case in the search, convert copies of the strings to the same case, as the program find2.c (Listing 2) illustrates. Processing find1.c with find2 now gives all occurrences of the search string, regardless of case:

char *search_str;
return 1;  /* Search string required */
search_str= argv[1];
if (strstr(line,search_str))

Some compilers provide strlwr, a function not in the standard library. (It originated with UNIX.) However, you can easily write it yourself with tolower as demonstrated in Listing 3.

Tokens

Many programs are command driven. That is, they sequentially process lines of text representing user instructions. (This is how command interpreters like the MS-DOS and UNIX shells work, of course, and how the line-oriented text editors of yore worked, remember?) A program will parse each line into its components, usually called tokens. The library function strtok recognizes tokens as substrings scattered among separators (sometimes called break characters). It skips any leading separators, and then collects characters as a substring until another separator is encountered. The program in Listing 4 extracts tokens by ignoring space and punctuation characters. Figure 1 contains sample input and Figure 2 contains output from token1.c.

The program token1 first calls strtok with a pointer to the beginning of the line to be parsed. strtok inserts a null character directly into the string to delimit the first token (overwriting the space after the s in This). Then it sets its internal pointer to the character after that null character (the i in is), and returns a pointer to the beginning of the first token (the T in This), as illustrated in Figure 3.

When we call strtok with a NULL first argument, it picks up where it left off (the i in is). When it can no longer find any tokens, strtok itself returns NULL. Note that you can change the break set with each call to strtok. That makes this parsing scheme somewhat more flexible than using sscanf, although inserting null bytes into the string can be awkward in some instances. (See the Code Capsule "Text Processing I" in CUJ, October 1992, for sscanf examples.)
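The calling pattern just described can be sketched as follows. This is not Listing 4; split_tokens and its parameters are names assumed here for illustration.

```c
#include <string.h>

/* Hypothetical helper: split `line` in place on the separators in
   `breaks`, storing up to `max` token pointers in `tokens`.
   Returns the number of tokens stored. */
size_t split_tokens(char *line, const char *breaks,
                    char *tokens[], size_t max)
{
    size_t n = 0;
    char *tok = strtok(line, breaks);   /* first call: pass the string */

    while (tok != NULL && n < max) {
        tokens[n++] = tok;
        tok = strtok(NULL, breaks);     /* later calls: pass NULL */
    }
    return n;
}
```

Since strtok writes null characters into line, the argument must be a modifiable array, not a string literal. The stored pointers all point into the original buffer.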

To ignore digits as well as space and punctuation, you merely add them to the break-set string. It doesn't take long, however, to realize that break sets can become quite unwieldy. It is often easier to specify the characters that comprise tokens rather than those that separate them. Listing 5 introduces such a function, strtokf, similar to strtok, except that it recognizes tokens via a user-supplied function that identifies acceptable characters. The program in Listing 6 uses strtokf to extract alphabetic tokens from the same input as in the previous example. This time the output is:

This
is
just
a
test
Good
bye
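Listing 5 is not reproduced here, so the signature and behavior below are my assumptions, not the published strtokf. This sketch (named strtokf_sketch to keep the distinction clear) takes a ctype-style predicate that says which characters belong in a token:

```c
#include <ctype.h>
#include <stddef.h>

/* Sketch only: the signature is assumed, not taken from Listing 5.
   Like strtok, but a token is a run of characters for which
   is_token_char returns nonzero. */
char *strtokf_sketch(char *s, int (*is_token_char)(int))
{
    static char *next = NULL;   /* where the next scan resumes */
    char *start;

    if (s != NULL)
        next = s;
    if (next == NULL)
        return NULL;

    /* Skip characters that cannot be part of a token. */
    while (*next != '\0' && !is_token_char((unsigned char) *next))
        ++next;
    if (*next == '\0') {
        next = NULL;
        return NULL;
    }

    /* Collect characters that belong to the token. */
    start = next;
    while (*next != '\0' && is_token_char((unsigned char) *next))
        ++next;
    if (*next != '\0')
        *next++ = '\0';     /* terminate the token and step past it */
    return start;
}
```

Passing isalpha as the predicate yields alphabetic tokens only, so digits and punctuation act as separators without having to be listed. Like strtok, this version keeps its position in a static variable, so it cannot parse two strings at once.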

Parsing Delimited Input

Another common parsing practice is to locate specific delimiting characters in a string. (This is especially useful for parsing filenames.) The standard library provides this capability via the two functions

char *strchr(const char *s, int c);
char *strrchr(const char *s, int c);

strchr returns a pointer to the first occurrence of c in s and strrchr returns a pointer to the last occurrence (the extra r in its name signals that it searches from the rightmost or rearmost). Both functions return NULL if the character is not found. The program in Listing 7 uses strchr to extract fields separated by commas (see Figure 4 and Figure 5 for the input and output, respectively).
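A rough sketch of field extraction with strchr follows. It is not Listing 7; split_fields is a hypothetical helper that cuts the line at each delimiter and keeps empty fields.

```c
#include <string.h>

/* Hypothetical helper: split `line` in place at each `delim`,
   keeping empty fields. Stores up to `max` field pointers in
   `fields` and returns how many were stored.
   Assumes delim is not '\0'. */
size_t split_fields(char *line, char delim, char *fields[], size_t max)
{
    size_t n = 0;
    char *start = line;
    char *p;

    while (n < max) {
        fields[n++] = start;
        p = strchr(start, delim);
        if (p == NULL)
            break;          /* no more delimiters: that was the last field */
        *p = '\0';          /* terminate the current field */
        start = p + 1;      /* the next field begins after the delimiter */
    }
    return n;
}
```

For the line a,,c this stores three fields, the middle one being an empty string.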

strchr is particularly useful in cases like this where the delimiters can occur adjacent to one another. For example, the input line

,,,

would be passed over by strtok as a stream of separators.
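To make the difference concrete, here is a small sketch with two hypothetical counting helpers (both names are mine), one built on strtok and one on strchr:

```c
#include <string.h>

/* Count the tokens strtok finds. Modifies `line`. */
size_t count_tokens(char *line, const char *breaks)
{
    size_t n = 0;
    char *tok;

    for (tok = strtok(line, breaks); tok != NULL;
         tok = strtok(NULL, breaks))
        ++n;
    return n;
}

/* Count delimiter-separated fields, including empty ones.
   Assumes delim is not '\0'. */
size_t count_fields(const char *line, char delim)
{
    size_t n = 1;       /* a line with no delimiter is one field */
    const char *p;

    for (p = strchr(line, delim); p != NULL;
         p = strchr(p + 1, delim))
        ++n;
    return n;
}
```

Given the line of three commas, count_fields reports four (empty) fields, while count_tokens reports none, because strtok treats the whole line as separators.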