Regular Expression Characters and Practices

This section provides an overview of regular expression characters and recommended practices.

Match Characters

Notation	Characters Matched	Example
\d	Any digit from 0 to 9	`\d\d\d` matches `101` but not `10a`
\D	Any character that is not a numeric digit (0 to 9)	`\D\D\D` matches `abc` but not `101`
\w	Any word character, for example, a-z, A-Z, 0-9, and the underscore character _ (also matches Unicode-based word characters from non-Latin alphabets and scripts)	`\w\w\w` matches `abc` but not `&@#`
\W	Any non-word character	`\W\W\W` matches `$#!` but not `abc`
\s	Matches any whitespace character	`\s\s\s` matches (three spaces) but not `abc`
\S	Matches any non-whitespace character	`\S\S\S` matches `a1_` but not (three spaces)
.	Matches any character except line breaks	`.` matches `a`, `1`, or `#` but not a newline
[ ]	Any one of the characters between the square brackets	`[abc]` matches `a` or `b` or `c` but not other characters
[^ ]	Matches any character except the characters appearing after the ^ and before the ]	`[^abc]` matches `d`, `e`, or `f` but not `a`, `b`, or `c`
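
The rows above can be checked with a short script. This is a minimal sketch that assumes Python's built-in `re` module, which is not the engine this section documents but behaves the same way for these basic classes; the `check` helper and the `cases` list are illustrative names only.

```python
import re

# Each entry: (pattern, text that should match in full, text that should not).
cases = [
    (r"\d\d\d", "101", "10a"),
    (r"\D\D\D", "abc", "101"),
    (r"\w\w\w", "abc", "&@#"),
    (r"\W\W\W", "$#!", "abc"),
    (r"\s\s\s", "   ", "abc"),
    (r"\S\S\S", "a1_", "   "),
    (r"[abc]", "a", "d"),
    (r"[^abc]", "d", "a"),
]


def check(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# For every row: does the positive example match, does the negative one match?
results = [(p, check(p, good), check(p, bad)) for p, good, bad in cases]
for pattern, good_ok, bad_ok in results:
    print(f"{pattern:10} matches example: {good_ok}, rejects counterexample: {not bad_ok}")

# The period matches any single character except a line break by default.
dot_matches_letter = check(r".", "x")
dot_matches_newline = check(r".", "\n")
```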

Repetition Characters

Notation	Characters Matched	Example
{n}	Matches n of the previous item	`\w{4}` matches `AAAA` but not `A`
{n,}	Matches n or more of the previous item	`\w{4,}` matches `AAAAA` but not `A`
{n,m}	Matches at least n and at most m of the previous item. If n is 0, the item is optional (`{0,9}`)	`A{2,3}` matches `AA` and `AAA` but not `A` or `AAAA`
?	Matches the previous item 0 or 1 times	`A?` matches `A` or nothing but not `AA`
+	Matches the previous item 1 or more times	`A+` matches `A`, `AA`, `AAA` but not nothing
*	Matches the previous item 0 or more times	`A*` matches nothing, `A`, or any number of `A` characters
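
As a quick illustration of the repetition characters, the sketch below tests each table row against whole strings. It assumes Python's `re` module as a stand-in engine; the `full` helper is a hypothetical name used only here.

```python
import re


def full(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


exactly_four = [full(r"\w{4}", s) for s in ("AAAA", "A")]
four_or_more = [full(r"\w{4,}", s) for s in ("AAAAA", "A")]
two_to_three = [full(r"A{2,3}", s) for s in ("A", "AA", "AAA", "AAAA")]
optional = [full(r"A?", s) for s in ("", "A", "AA")]
one_or_more = [full(r"A+", s) for s in ("", "A", "AAA")]
zero_or_more = [full(r"A*", s) for s in ("", "A", "AAA")]

print(exactly_four, four_or_more, two_to_three)
print(optional, one_or_more, zero_or_more)
```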

Position Characters

Notation	Description
^	The following pattern must be at the start of the string, or for a multi-line string, at the beginning of a line. For multi-line text (a string containing line breaks), the multi-line flag option must be set.
$	The preceding pattern must be at the end of the string, or for a multi-line string, at the end of a line.
\A	The following pattern must be at the start of the string; the multi-line flag is ignored.
\Z	The preceding pattern must be at the end of the string; the multi-line flag is ignored.
\b	Matches a word boundary, essentially the point between a word character (a-z, A-Z, 0-9, _) and a non-word character (the start or end of a word).
\B	Matches a position that is not a word boundary (not the start or end of a word).
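
The effect of the multi-line flag and of word boundaries is easier to see on a small input. The following is a sketch using Python's `re` module, where the flag is spelled `re.MULTILINE`; other engines name the option differently, and their handling of `\Z` next to a trailing line break can differ, so that case is left out here.

```python
import re

text = "first line\nsecond line"

# Without the multi-line flag, ^ anchors only to the start of the string.
starts_default = re.findall(r"^\w+", text)

# With the flag, ^ and $ also anchor at each line break.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# \A ignores the flag and still anchors only to the start of the string.
starts_absolute = re.findall(r"\A\w+", text, re.MULTILINE)

# \b requires a word boundary on each side, so only standalone words match.
whole_word = re.findall(r"\buser\b", "user username superuser user")

# \B requires the opposite: "user" must sit inside a longer word.
inside_word = re.findall(r"\Buser\B", "user username superusers")

print(starts_default, starts_multiline, ends_multiline, starts_absolute)
print(whole_word, inside_word)
```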

Grouping

Notation	Characters Matched	Example
()?	Matches the pattern inside the brackets 0 or 1 times	`(Error)?` matches `Error` or nothing
()+	Matches the pattern inside the brackets 1 or more times	`(\w+\s)+` matches `AA AA`
()*	Matches the pattern inside the brackets 0 or more times	`(\w+\s)*` matches nothing or `AA AA`
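
A short sketch of grouped repetition follows, again assuming Python's `re` module as a stand-in. It also shows a detail that is easy to miss: because each repetition of `(\w+\s)` must end in whitespace, the match stops after the last word that is followed by a space.

```python
import re

# An optional group matches either the word or the empty string.
optional_empty = re.fullmatch(r"(Error)?", "") is not None
optional_word = re.fullmatch(r"(Error)?", "Error") is not None

# One or more "word + whitespace" units; "BB" has no trailing space,
# so it is not part of the match.
one_or_more = re.search(r"(\w+\s)+", "AA AA BB").group(0)

# Zero or more units: the empty string is also accepted.
zero_units = re.fullmatch(r"(\w+\s)*", "") is not None
zero_or_more = re.search(r"(\w+\s)*", "AA AA BB").group(0)

print(repr(one_or_more), repr(zero_or_more))
```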

The Non-Greedy Qualifier (?)

The non-greedy qualifier is a question mark (?) placed immediately after a repetition character (*, +, or ?). It tells the regex engine to stop extending the current match as soon as the next match criterion is met, which turns the repetition into a non-greedy match. The non-greedy qualifier improves performance when you want to match any text up to a specific text value, provided that the specific text value can be uniquely specified within the regex.

For example, suppose your regex needs to match the following log:

02/28/2007 16:55:22 MsgID=1590 : Failed authentication for user john.doe user account locked out

If you use the following regex, the wrong value is parsed into the login field because “user” occurs twice in the log message. This regex causes “account” to be parsed into the login field.

MsgID=1590.*user (?<login>\w+\.?\w*)

This is because “.*” matches everything to the end of the log message. When the regex engine reaches the end of the log message, it backtracks through the message looking for the next match. As soon as it finds the last occurrence of “user”, it matches that portion of the log message. Since the specified regex for “login” matches “account”, the engine uses that match and continues.

To make the regex take the first occurrence of the next match you use the non-greedy qualifier. The following regex will parse the correct value into the login field because it will stop the previous match (.*) as soon as “user” is encountered.

MsgID=1590.*?user (?<login>\w+\.?\w*)
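
The difference between the two patterns can be reproduced with a few lines of code. This is a sketch in Python, assumed here only as a convenient test engine; note that Python's `re` module spells named groups as `(?P<login>...)` rather than `(?<login>...)`, so the patterns are adapted accordingly.

```python
import re

log = ("02/28/2007 16:55:22 MsgID=1590 : Failed authentication for user "
       "john.doe user account locked out")

# Greedy: .* runs to the end of the log, then backtracks to the LAST "user ".
greedy = re.search(r"MsgID=1590.*user (?P<login>\w+\.?\w*)", log)

# Non-greedy: .*? stops at the FIRST "user " it can reach.
lazy = re.search(r"MsgID=1590.*?user (?P<login>\w+\.?\w*)", log)

greedy_login = greedy.group("login")
lazy_login = lazy.group("login")

print("greedy:", greedy_login)
print("non-greedy:", lazy_login)
```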

Reserved Characters

The regex engine used by LogRhythm has 13 reserved characters that have special meaning. To use any of these characters as a literal character, escape it with the backslash (\), otherwise known as the escape character. The reserved characters are:

The opening square bracket [
The opening round bracket (
The closing round bracket )
The backslash \
The caret ^
The dollar sign $
The period .
The vertical bar or pipe symbol |
The question mark ?
The asterisk or Kleene star *
The plus sign +
The opening curly bracket {
The closing curly bracket }

The following regex, which is meant to match any IPv4 address (a.b.c.d), is a simple example of how to escape reserved characters:

\d+\.\d+\.\d+\.\d+

As you can see, each period in the IP address is escaped, so the regex engine looks for the actual period (.) character in the string instead of any character. Without the escape character, the period matches any character, which would radically change the meaning of the expression.
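
To see how much the escape changes the meaning, the sketch below runs the escaped pattern and an unescaped variant against the same inputs. It assumes Python's `re` module; `re.escape` is a Python helper that adds the backslashes for you and is shown only as a convenience, not as part of the engine described here.

```python
import re

escaped = r"\d+\.\d+\.\d+\.\d+"
unescaped = r"\d+.\d+.\d+.\d+"

# The escaped pattern accepts only literal periods between the numbers.
escaped_ip = re.fullmatch(escaped, "10.1.2.3") is not None
escaped_junk = re.fullmatch(escaped, "10a1b2c3") is not None

# The unescaped pattern also accepts any other character in place of a period.
unescaped_junk = re.fullmatch(unescaped, "10a1b2c3") is not None

# re.escape builds a pattern that matches the given text literally.
literal_pattern = re.escape("192.168.0.1")

print(escaped_ip, escaped_junk, unescaped_junk, literal_pattern)
```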

Other Special Characters

Special Character	Description
\n	Matches newline
\r	Matches carriage return
\t	Matches tab
\nnn	Matches the ASCII character specified by octal number nnn. For example, `\103` matches `C`
\xnn	Matches the ASCII character specified by hexadecimal number nn. For example, `\x43` matches `C`
\unnnn	Matches the Unicode character specified by the four hexadecimal digits nnnn
\cD	Matches a control character. For example, `\cD` matches end of transmission
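
The numeric escapes can be verified against the letter `C` (octal 103, hexadecimal 43). This sketch assumes Python's `re` module, which supports the octal, hexadecimal, and Unicode forms but does not support the `\cD` control-character notation, so that row is not exercised here.

```python
import re

# Octal, hexadecimal, and four-digit Unicode escapes for the same character.
octal_c = re.fullmatch(r"\103", "C") is not None
hex_c = re.fullmatch(r"\x43", "C") is not None
unicode_c = re.fullmatch(r"\u0043", "C") is not None

# Whitespace escapes: tab, carriage return, and newline.
has_tab = re.search(r"\t", "a\tb") is not None
crlf = re.search(r"\r\n", "line one\r\nline two") is not None

print(octal_c, hex_c, unicode_c, has_tab, crlf)
```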

Regex Recommended Practices

The following are some recommended practices for regex development. All regex examples use the following log.

02/28/2007 16:55:22 MsgID=1590 : Failed authentication for user "any.user" user account locked out

Name	Recommended Pattern	Description
Negative Character Class	"[^"]*" for double quote delimiters, ,[^,]*, for comma delimiters, or \|[^|]*\| for pipe delimiters. Can be used for any type of predictable delimiter.	Use negative character classes in log messages with clear delimiters, such as quotation marks, commas, or pipes. The class matches any character that is not the delimiter, which can greatly improve parsing performance compared with a more generic match such as .*?. Example: MsgID=1590.*?user\s"(?<account>[^"]*)"
Non-greedy Match	.*?	If you need to match any characters until a specific set of characters appears, use this pattern. Example: MsgID=1590.*?user\s"(?<account>[^"]*)"
Overloading Map Tags	(?<[map tag]>[regex])	Map tags should almost always be overloaded. The default regex for map tags is .*, which matches everything to the end of the log. Example: MsgID=1590.*?user\s"(?<account>[^"]*)"\s(?<tag1>.*?)$
Preceding and trailing values	Not applicable	Always match as much constant text as possible. The more information the regex has to evaluate, the faster it identifies non-matching logs. For any parsed field, it is best to search for a constant value before and after the value being parsed. Example: MsgID=1590.*?user\s"(?<account>[^"]*)"\s(?<tag1>.*?)$
Look Aheads	(?=[regex])[regex] (?![regex])[regex]	Positive and negative look aheads perform an initial check to see whether a condition is satisfied in the log message. They are useful for finding a value later in a log to reduce extraneous processing for non-matching logs. Do not use them for values that appear very early in a log message, such as just past a Syslog header. A look ahead costs more than a plain pattern when the match is always found early. Example: (?=.*?match contains this phrase)\s<sip>\s
Multiline character match pattern	[\r\n]	A character class containing both \r (return) and \n (newline) matches either line-break character. Repeating the class, for example [\r\n]+, also matches both characters in either order. Some log messages vary in the order of these characters and others contain only a newline.
Narrow Character Classes	[a-z0-9_]+	The shorthand character class \w matches Latin alphabet characters, Hindu-Arabic numerals (0-9), and underscores, as well as characters from other scripts supported in Unicode. Narrowing the match to only the relevant character set yields better performance.
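
Several of the practices in the table can be tried together on the sample log. The following is a sketch under a few assumptions: it uses Python's `re` module rather than the engine described here, writes map tags as Python named groups (`(?P<account>...)`), and uses straight double quotes in the log; the look-ahead phrases and the extra test strings are made up for illustration.

```python
import re

log = ('02/28/2007 16:55:22 MsgID=1590 : Failed authentication for user '
       '"any.user" user account locked out')

# Negative character class, non-greedy match, and constant text on both sides.
pattern = re.compile(
    r'MsgID=1590.*?user\s"(?P<account>[^"]*)"\s(?P<tag1>.*?)$'
)
match = pattern.search(log)
account = match.group("account")
tag1 = match.group("tag1")

# Positive look ahead: check for a phrase later in the log before parsing.
with_phrase = re.search(r'^(?=.*?locked out).*?user\s"(?P<account>[^"]*)"', log)
without_phrase = re.search(r'^(?=.*?password expired).*?user\s"(?P<account>[^"]*)"', log)

# [\r\n]+ handles \r\n, \n, or \r between lines.
lines = re.split(r"[\r\n]+", "line one\r\nline two\nline three")

# A narrow class rejects characters that \w would accept.
narrow_rejects = re.fullmatch(r"[a-z0-9_]+", "üser") is None
shorthand_accepts = re.fullmatch(r"\w+", "üser") is not None

print(account, "|", tag1)
print(with_phrase is not None, without_phrase is not None, lines)
```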