How to Use Regex in C?

What is a Regular Expression in C?

Regular expressions (regexes) are strings of characters that describe a search pattern, and they are used to find that pattern inside a target string. Each character in a regular expression is either a literal, such as a letter from a to z or a digit from 0 to 9, or a metacharacter that has a special meaning.

For example - The regular expression b[a-z] matches ba, bb, bc, and so on up to bz. Here b is a literal character, and [a-z] is a bracket expression: the square brackets enclose a set of characters, and exactly one character from that set must appear at that position. In this case the character after b can be any lowercase letter from a to z, both included.
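The b[a-z] example can be checked with the POSIX functions that the rest of this article covers. The following is a minimal sketch; the helper name contains_match is hypothetical and only serves this illustration.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if `text` contains a match for `pattern`. */
bool contains_match(const char *pattern, const char *text)
{
    regex_t rx;

    /* cflags = 0 compiles the pattern as a POSIX basic regular expression. */
    if (regcomp(&rx, pattern, 0) != 0)
        return false;                 /* the pattern could not be compiled */

    /* nmatch = 0 and pmatch = NULL: we only want a yes/no answer. */
    int rc = regexec(&rx, text, 0, NULL, 0);

    regfree(&rx);                     /* release the compiled pattern */
    return rc == 0;
}
```

With this helper, contains_match("b[a-z]", "ba") and contains_match("b[a-z]", "bz") return true, while contains_match("b[a-z]", "b1") returns false because 1 is not in the set a to z.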

Almost every programming language supports regex. Many text processing applications, such as lexers, advanced text editors, and rich markdown editors, also rely on regex for string matching.

Regex is used extensively for string matching, similar workflow applications, and input validation. As an example of input validation, suppose we are writing a program that should accept only digits from the user. We can put that condition inside the scanf() call, for instance scanf("%[0-9]", data); where data is a char array. Strictly speaking, %[0-9] is a scanset, a regex-like feature of scanf() rather than a full regular expression. With it, scanf() keeps reading characters only while they belong to the set 0-9 and stops at the first character outside that set.

If this seems like a lot at the start, don't worry: we will write simple, to-the-point programs for each case covered in this article. Let's move forward and look at the patterns available in the POSIX library.

Patterns in POSIX Library

Regex is not tied to any particular language; it is a general technique in which a sequence of characters describes a pattern to be matched in a target string. In C, regular expressions are most commonly used through the POSIX regex library, whose types and functions are declared in the regex.h header file.

The table below lists the POSIX character classes, an equivalent bracket expression for each (in the ASCII/C locale), and what each class matches. A class name is only recognized inside a bracket expression, so in a pattern it is written as, for example, [[:digit:]].

POSIX Class   Equivalent to                           Matches
[:upper:]     [A-Z]                                   Uppercase letters
[:lower:]     [a-z]                                   Lowercase letters
[:digit:]     [0-9]                                   Digits
[:alpha:]     [A-Za-z]                                Uppercase and lowercase letters
[:alnum:]     [A-Za-z0-9]                             Uppercase letters, lowercase letters, and digits
[:blank:]     [ \t]                                   Space and tab characters only
[:xdigit:]    [0-9A-Fa-f]                             Hexadecimal digits
[:word:]      [A-Za-z0-9_]                            Word characters
[:ascii:]     [\x00-\x7F]                             All ASCII characters
[:cntrl:]     [\x00-\x1F\x7F]                         Control characters (described below)
[:punct:]     [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]      Punctuation characters
[:space:]     [ \t\n\r\f\v]                           Whitespace characters such as space, tab, and newline
[:graph:]     [\x21-\x7E]                             Graphic characters, i.e. printable characters other than space
[:print:]     [\x20-\x7E]                             Graphic characters and space

Note - [:word:] and [:ascii:] are extensions offered by some other regex engines and are not part of the POSIX standard, so regcomp() may reject them. [[:alnum:]_] is a portable way to match word characters.

Description of some POSIX classes -

  • [:cntrl:] - Matches control characters in the target string. Control characters do not represent a written symbol; they trigger an action instead of appearing on the screen. Familiar examples include Backspace, Delete, and Escape.
  • [:print:] - Matches every printable character in the target string, such as letters, digits, and punctuation, and also matches the space character.
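To see the classes in action, here is a small sketch that counts how many characters of a string fall into a given class. The helper name count_in_class is hypothetical, and the class is passed in its bracketed form (for example [[:digit:]]) because that is the form regcomp() recognizes.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts the characters of `text` that match a
   bracket expression such as "[[:digit:]]". Returns -1 if the
   expression does not compile. */
int count_in_class(const char *bracket_expr, const char *text)
{
    regex_t rx;
    if (regcomp(&rx, bracket_expr, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int count = 0;
    char one[2] = { '\0', '\0' };     /* one character plus terminator */
    for (const char *p = text; *p != '\0'; ++p) {
        one[0] = *p;
        if (regexec(&rx, one, 0, NULL, 0) == 0)
            ++count;
    }

    regfree(&rx);
    return count;
}
```

For the string "Room 101, Floor 2", count_in_class("[[:digit:]]", ...) gives 4, count_in_class("[[:upper:]]", ...) gives 2, and count_in_class("[[:punct:]]", ...) gives 1 (the comma).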

Syntax of Regex

The general syntax used for creating and compiling a regular expression in C is regcomp(&regex, expression, flag);

It takes three compulsory arguments:

  • regex - pointer to a regex_t variable in which the compiled pattern is stored. After regcomp() runs successfully, the compiled form of the pattern given in the expression argument is held at the memory location pointed to by regex.
  • expression - pointer to a null-terminated string containing the pattern.
  • flag - controls how the pattern is compiled. It is either 0, which selects basic regular expression syntax, or the bitwise OR of one or more of REG_EXTENDED, REG_ICASE, REG_NOSUB, and REG_NEWLINE. We generally pass this value as 0.
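The effect of the flag argument is easiest to see by compiling the same pattern with different values. The sketch below assumes a hypothetical helper named matches_with_flags that forwards its cflags argument to regcomp().

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` with the given regcomp() flags
   and reports whether `text` contains a match. */
bool matches_with_flags(const char *pattern, const char *text, int cflags)
{
    regex_t rx;

    /* REG_NOSUB is added because match positions are not needed here. */
    if (regcomp(&rx, pattern, cflags | REG_NOSUB) != 0)
        return false;

    int rc = regexec(&rx, text, 0, NULL, 0);
    regfree(&rx);
    return rc == 0;
}
```

With flag 0 (basic syntax) the + in "ab+" is an ordinary character, so the pattern does not match "abbb"; with REG_EXTENDED it means "one or more" and the match succeeds. Similarly, "hello" matches "HELLO" only when REG_ICASE is passed.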

Return Value

The function returns 0 if the regex compiles successfully and a nonzero error code if compilation fails. The error code can be converted into a readable message with the regerror() function.

Example

Output

Explanation

  • Starting in the main() function, we first declare a variable of type regex_t to hold the compiled regex, and then an int variable to store the outcome of the compilation, that is, whether it succeeded.
  • Next, we call the regcomp() function with the expression [:word:]. The expression is compiled and the result is stored in the variable rx.
  • Finally, we check the return value of regcomp(). If it is 0, the compilation succeeded and we print a success message; otherwise we print an error message.
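The steps above can be sketched as follows. This is an illustration under a few assumptions rather than a definitive listing: the helper name compile_and_report and the printed messages are hypothetical, REG_EXTENDED is used instead of 0, and the pattern passed in the usage note is [[:alnum:]_]+, a portable spelling of "word characters", since [:word:] is not a standard POSIX class.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compiles `expression`, prints the outcome, and
   returns the value produced by regcomp(). */
int compile_and_report(const char *expression)
{
    regex_t rx;                       /* holds the compiled regex */
    int value = regcomp(&rx, expression, REG_EXTENDED);

    if (value == 0) {
        printf("Regular expression compiled successfully.\n");
        regfree(&rx);                 /* free only after a successful compile */
    } else {
        char message[128];
        regerror(value, &rx, message, sizeof message);
        printf("Compilation failed: %s\n", message);
    }
    return value;
}
```

Calling compile_and_report("[[:alnum:]_]+") from main() takes the success branch, while compile_and_report("[a-z") takes the error branch because the bracket is never closed.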

Note - regex.h ships as part of the C library development package on Linux and macOS. On Windows, however, Microsoft does not provide the POSIX regex routines by default. The possible solutions are:

  • Manually install a regex library that works with C, or
  • Move to C++ and use its regular expression routines, which have been part of the C++ standard since C++11.

Matching a Pattern with C Regex

The regexec() function declared in regex.h compares a given string against a pattern. The pattern must first be compiled with regcomp() into a form that regexec() can use to perform the comparison.
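One practical consequence of this two-step design is that a pattern can be compiled once and then reused for many strings. A minimal sketch, assuming a hypothetical helper named count_matching:

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts how many of the `n` strings in `inputs`
   match `pattern`. Returns -1 if the pattern does not compile. */
int count_matching(const char *pattern, const char *const inputs[], size_t n)
{
    regex_t rx;

    /* Compile once ... */
    if (regcomp(&rx, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* ... then reuse the compiled form for every input string. */
    int count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (regexec(&rx, inputs[i], 0, NULL, 0) == 0)
            ++count;
    }

    regfree(&rx);
    return count;
}
```

For example, with the pattern "^[0-9]+$" (a whole string made only of digits) and the inputs "123", "12a", "7", and "", the function returns 2.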

The syntax of the regexec() function is regexec(&regex, expression, nmatch, pmatch, flag);

The function accepts 5 compulsory arguments:

  • regex - pointer to a pattern that has already been compiled by the regcomp() function, as stated above.
  • expression - the target string in which we want to look for the pattern.
  • nmatch - the number of elements in the pmatch array, that is, the maximum number of matches (the whole match plus any parenthesized subexpressions) whose positions regexec() should report. We generally pass this value as 0.
  • pmatch - an array of regmatch_t structures in which regexec() stores the start and end offsets of the matched substrings. The array must have room for at least nmatch elements. If nmatch is 0, which is generally the case, regexec() ignores the pmatch argument, which is why we pass NULL for pmatch along with 0 for nmatch.
  • flag - customizes the behavior of regexec(). It is either 0 or the bitwise OR of REG_NOTBOL and REG_NOTEOL, which control whether the start and end of the string are treated as the start and end of a line. Flags such as REG_EXTENDED, REG_ICASE, and REG_NOSUB belong to regcomp(), not regexec(). We generally pass this value as 0.

It returns 0 if a match is found; otherwise it returns the value REG_NOMATCH.
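When the position of the match matters, nmatch and pmatch are passed with real values instead of 0 and NULL. The sketch below assumes a hypothetical helper named first_match that returns the offsets of the first match.

```c
#include <regex.h>

/* Hypothetical helper: returns the offsets of the first match of
   `pattern` in `text`. Both offsets are -1 if the pattern does not
   compile or does not match. */
regmatch_t first_match(const char *pattern, const char *text)
{
    regmatch_t match;
    match.rm_so = -1;                 /* start offset of the match */
    match.rm_eo = -1;                 /* offset one past the end of the match */

    regex_t rx;
    if (regcomp(&rx, pattern, REG_EXTENDED) != 0)
        return match;

    /* nmatch = 1: found[0] receives the span of the whole match. */
    regmatch_t found[1];
    if (regexec(&rx, text, 1, found, 0) == 0)
        match = found[0];

    regfree(&rx);
    return match;
}
```

For the pattern "[0-9]+" and the text "order 4521 shipped", rm_so is 6 and rm_eo is 10, so the matched substring 4521 occupies indexes 6 to 9.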

Let's see an example to understand the actual use case of the above function,

Output

Explanation

  • Starting in the main() function, we first declare a variable of type regex_t to hold the compiled regex, and then three int variables to store the outcomes of the string matching.
  • We then create the regex with the regcomp() function. Here the string "Hello World" is the expression; it is compiled and the result is stored in the rx variable.
  • Next, we call regexec() to match the target string passed as the second argument against the compiled regular expression in rx, and store the returned value in the d1 variable. We repeat this two more times and store the results in d2 and d3 respectively.
  • Finally, we call the displayPattern() function for each result to print whether the pattern was found.
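The steps above can be sketched as follows. The names rx, d1, d2, d3, and displayPattern() come from the explanation; the three target strings, the printed messages, and the wrapper function runHelloWorldDemo() are assumptions made for this illustration.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Prints a message describing the value returned by regexec(). */
void displayPattern(int value)
{
    if (value == 0)
        printf("Pattern found.\n");
    else if (value == REG_NOMATCH)
        printf("Pattern not found.\n");
    else
        printf("An error occurred.\n");
}

/* Hypothetical wrapper: compiles "Hello World" once, matches it against
   three sample strings, and returns how many of them matched
   (-1 if the pattern does not compile). */
int runHelloWorldDemo(void)
{
    regex_t rx;
    if (regcomp(&rx, "Hello World", 0) != 0)
        return -1;

    int d1 = regexec(&rx, "Hello World", 0, NULL, 0);
    int d2 = regexec(&rx, "Say Hello World twice", 0, NULL, 0);
    int d3 = regexec(&rx, "hello world", 0, NULL, 0);   /* differs in case */
    regfree(&rx);

    displayPattern(d1);
    displayPattern(d2);
    displayPattern(d3);
    return (d1 == 0) + (d2 == 0) + (d3 == 0);
}
```

Called from main(), this reports the pattern as found for the first two strings and not found for the third, because matching is case-sensitive unless REG_ICASE is passed to regcomp().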

Use of C Regex Expressions

C Regex with scanf()

We can also use regex-like patterns to restrict the user input our program accepts, for instance when we want only letters or only digits. scanf() does not support full regular expressions, but its %[...] conversion, called a scanset, uses the same bracket notation. To apply the restriction, we place the set of allowed characters inside the scanf() format string.

For example - Here in the below code, we want to restrict the program to only accept uppercase and lowercase alphabetical inputs.

Let's see the program to understand it better

Input

Output

Input

Output

Explanation

  • In this example, we want the program to accept only uppercase and lowercase letters by using the scanf() function with a scanset. Starting in the main() function, we first declare a char array of size 20 to store the user input.
  • We then ask the user to enter their name. While receiving the input, we restrict which characters are accepted; in this case the program accepts only letters. After receiving the input, we print it.
  • In the sample inputs, we can see that for the input "HelloWorld24", scanf() stored only the leading alphabetical part, HelloWorld, and stopped at the first digit.
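The behavior described above can be sketched as follows. To keep the example independent of keyboard input, it uses sscanf(), which applies the same format string to a string instead of standard input; this substitution and the helper name read_letters are assumptions made for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper: copies the leading run of letters from `input`
   into `name`, a buffer of at least 20 chars. Returns 1 if at least one
   letter was stored, 0 if the input starts with another character, and
   EOF if the input is empty. */
int read_letters(const char *input, char name[20])
{
    /* %19[A-Za-z] stores at most 19 letters plus the terminating '\0'
       and stops at the first character outside the set. */
    return sscanf(input, "%19[A-Za-z]", name);
}
```

For the input "HelloWorld24", the buffer receives HelloWorld and the function returns 1; for "24Hello" nothing is stored and the function returns 0. The width 19 keeps the input from overflowing the 20-element array.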

FAQs

Q. The regex.h header file is not found in my system, what should I do?

A. As mentioned above, regex.h ships with Linux and macOS as part of the C library development package. On Windows it is not available by default, so users either have to install a regex library manually or switch to C++, where regex has been part of the official standard since C++11.

Conclusion

  • Regular expressions (regex) are strings of characters used to match patterns in a target string.
  • Regex is not tied to any particular language; it is a general technique in which a sequence of characters describes a pattern to match.
  • The POSIX regex library, declared in regex.h, is the most common way to use regular expressions in C.
  • Syntax for regex creation and compilation: regcomp(&regex, expression, flag); It takes three arguments: a pointer to the regex_t variable where the compiled pattern is stored, a pointer to the pattern string, and a flag that controls how the pattern is compiled.
  • It returns 0 if the compilation succeeds and a nonzero error code otherwise.
  • The function regexec() compares a target string with the compiled pattern. It returns 0 if a match is found and REG_NOMATCH otherwise.
  • scanf() scansets such as %[A-Za-z] offer a regex-like way to restrict the input we accept from the user.