Gregory Hildstrom Projects Publications Resume Links Contact About Google+ Facebook Youtube Donate




Regular Expression strstr() and strcasestr()

Introduction

I had some parser code that was fast and worked great. The down side was reliance on strstr() to find the beginning and end of sections that needed to be parsed, modified, and replaced. It worked fine on well-formed minimal input, but it was not tolerant of things like extra spaces or different case. I thought a more flexible replacement for strstr() would be a faster and simpler fix than totally rewriting the parser. A bunch of searching led me to a few differrent implementations of a strstr()-like function that used
POSIX regular expressions. One performed multiple regex searches and discarded all but the first result, which is inefficient. Another used strndup() internally, which didn't return a pointer to needle in haystack like strstr(), but a pointer to a copy of haystack starting at needle. The returned text is the same, but that makes it more difficult to calculate offsets and it wastes memory.

Both strstr() and strcasestr() return a pointer to needle in haystack or NULL. Memory is not allocated in the process and the returned pointer does not need to be freed.
char *strstr(const char *haystack, const char *needle);
char *strcasestr(const char *haystack, const char *needle);
My implementation tries to reproduce the behavior of these two functions as closely as possible using regular expressions.

regex_strstr.h

The interface is slightly different. With strstr() and a simple string needle, you can easily use strlen(needle), but that is not very useful if needle is a regular expression. These two functions return a pointer to the start of needle in haystack or NULL. They also pass by reference a pointer to the end of needle in haystack or NULL. The end pointer could be eliminated if it is not needed for your application and the interface must be identical to strstr() and strcasestr(). However, my application needs that additional information about the end and size of the matched text.
extern char *regex_strstr(const char *haystack, const char *needle, char **end);
extern char *regex_strcasestr(const char *haystack, const char *needle, char **end);

regex_strstr.c

#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>
#include "regex_strstr.h"

static char *regex_search(const char *haystack, const char *needle,
char **end, int cflags) 
{
    char *start = NULL;
    *end = NULL;
    regex_t preg;
    regmatch_t pmatch;
    if (!haystack)
        return NULL;
    if (regcomp(&preg, needle, cflags)) {
        regfree(&preg);
        return NULL;
    }
    if (!regexec(&preg, haystack, 1, &pmatch, 0))
        if ((haystack + pmatch.rm_so) != NULL) {
            start = (char*)haystack + pmatch.rm_so;
            *end = (char*)haystack + pmatch.rm_eo;
        }
    regfree(&preg);
    return start;
}

char *regex_strstr(const char *haystack, const char *needle, char **end) 
{
    int cflags = REG_EXTENDED;
    return regex_search(haystack, needle, end, cflags);
}

char *regex_strcasestr(const char *haystack, const char *needle, char **end) 
{
    int cflags = REG_ICASE | REG_EXTENDED;
    return regex_search(haystack, needle, end, cflags);
}

Usage Example

Consider the following haystack:
hay
<needle>
hay
<    neEdle    >
hay
Now consider the following needle regular expression:
<[ \t]*needle[ \t]*>
The following code in
test.c will find both instances:
#include <stdio.h>
#include <stdlib.h>
#include "regex_strstr.h"

int main()
{   
    char *haystack = "hay\n<needle>\nhay\n<    neEdle    >\nhay";
    char *start = haystack;
    char *end = NULL;
    char *needle = "<[ \t]*needle[ \t]*>";
    while(start = regex_strcasestr(start, needle, &end)) {
        printf("needle %.*s with %d characters found at offset %d",
            end-start, start, end-start, start-haystack);
        printf(" with start[0]:%c and end[0]:%c\n",
            start[0], end[0]);
        start = end;
    }
    return 0;
}
The code will produce the following output:
needle <needle> with 8 characters found at offset 4 with start[0]:< and end[0]:\n

needle <    neEdle    > with 16 characters found at offset 17 with start[0]:< and end[0]:\n