Occasionally, I use this technology and worship tech column to share some tips on the technical side of things. Today is such a day. This is another Regular Expression segment that might help someone.
The goal of much software is to find things. A way to find stuff in computer languages is a sub-language called "Regular Expressions". Most regular expressions deal with finding specific instances of data inside of a larger string. To find those instances, we write "patterns" that describe what we are looking for, and the regex engine compares them against the data we are searching. Those patterns often say directly what we want: ".*Kim.*" (without the quotes) is a regex pattern that would look for my first name inside of any string. Any string that contained my name would match that pattern.
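To make that concrete, here is a minimal PHP sketch of the positive pattern. The helper name contains_kim and the sample strings are made up for this illustration; preg_match is PHP's standard regex function.

```php
<?php
// Hypothetical helper: true when $text matches the pattern .*Kim.*
function contains_kim(string $text): bool
{
    // preg_match returns 1 when the pattern matches and 0 when it does not
    return preg_match('/.*Kim.*/', $text) === 1;
}

// Made-up sample strings
$samples = ['My name is Kim Gentes', 'Nobody by that name here'];
foreach ($samples as $sample) {
    echo $sample, ' => ', contains_kim($sample) ? 'match' : 'no match', PHP_EOL;
}
```

Note that this sketch is case sensitive, so "kim" in lowercase would not count as a match.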
But in real life, we don't always know what we are looking for in a positive fashion. Sometimes we are looking for things simply because they AREN'T something else. Let's go back to my name, Kim. If I want to create a regex pattern that would match every string that did NOT contain my name, regex has a way to do that as well. It is called "negative lookaround". There are two types of negative lookaround: "negative lookahead" and "negative lookbehind". One looks forward in the string from the current position, the other looks backward at what came just before the current position. For simplicity's sake, let's only look forward, since that will be the most obvious case.
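As a rough sketch of the difference between the two, here is a small PHP example. The "berly" and "Mr. " pieces are invented purely for illustration; the (?!...) and (?<!...) syntax is the standard lookahead and lookbehind notation that PHP's preg functions accept.

```php
<?php
// Negative lookahead: match "Kim" only when it is NOT followed by "berly".
$lookahead = '/Kim(?!berly)/';
// Negative lookbehind: match "Kim" only when it is NOT preceded by "Mr. ".
$lookbehind = '/(?<!Mr\. )Kim/';

var_dump(preg_match($lookahead, 'Kim Gentes'));  // int(1)
var_dump(preg_match($lookahead, 'Kimberly'));    // int(0)
var_dump(preg_match($lookbehind, 'Hello Kim'));  // int(1)
var_dump(preg_match($lookbehind, 'Mr. Kim'));    // int(0)
```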
So let me clarify: what we want to do is write a pattern that will match every string that does NOT contain my name, Kim. Ok, here you go:
^(?:(?!Kim).)*$
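The core of the "negativeness" of this expression is (?!Kim), a negative lookahead. It consumes nothing itself; it only checks that the text starting at the current position is not "Kim". Paired with the dot, (?:(?!Kim).) matches one character only when "Kim" does not begin at that spot, and the * repeats that check for every character from ^ (the start) to $ (the end). The match therefore succeeds only if "Kim" never appears in the string. And if all you are doing is trying to make sure that you match a string that doesn't contain a specific pattern, then you are good.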
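Here is a minimal PHP sketch of that pattern in use. The helper name lacks_kim and the sample strings are assumptions made up for this example, and it only looks at single-line strings.

```php
<?php
// Hypothetical helper: true when the single-line string $text has no "Kim" in it
function lacks_kim(string $text): bool
{
    // Every character from start to end must pass the (?!Kim) check
    return preg_match('/^(?:(?!Kim).)*$/', $text) === 1;
}

var_dump(lacks_kim('Happy coding'));  // bool(true)
var_dump(lacks_kim('Hello Kim'));     // bool(false)
var_dump(lacks_kim('kimono'));        // bool(true), since this sketch is case sensitive
```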
However, sometimes what you actually want is the parts of a string that lie around the unwanted pattern (a pattern built to avoid a specific pattern like this is called a negative pattern). In other words, you want to look through any string and extract all of its data except the text matched by the pattern you are avoiding. This is a little more complicated, but here is one option:
^(((?:(?!Kim).)*)|((.*)Kim(.*)))$
This pattern will match lines of data that have nothing to do with Kim, and it will also match a line that does contain Kim, capturing the data on either side of it (so your program can ignore Kim itself). But in order for this to work, you must use what are called capture groups. Regex programmers will know these as the chunks of matched data that correspond to groups in their expression. A capturing group is formed each time you use a plain pair of parentheses; groups that open with (?: or (?! are not captured and get no number. Groups are numbered by the order of their opening parentheses, so here group 2 is a whole line without Kim, group 4 is the text before Kim, and group 5 is the text after it. In the above case, you will need a little user code to get the right data out. So, in PHP, you would have the following code using the above pattern:
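if (preg_match('/^(((?:(?!Kim).)*)|((.*)Kim(.*)))$/im', $rawstring, $regexps)) {
    // Group 2: the whole line, when it has no "Kim" in it
    $clean_line = $regexps[2] ?? '';
    // Groups 4 and 5: the text before and after "Kim". They can be missing
    // from $regexps when the line has no "Kim", so default to '' (?? needs PHP 7+)
    $clean_before_patt = $regexps[4] ?? '';
    $clean_after_patt = $regexps[5] ?? '';
} else {
    // failure
}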
What you end up with is 3 variables as you parse through your strings. The variable "$clean_line" will contain the matched string when it has no "Kim" in it at all. The variable "$clean_before_patt" will contain the portion of a string which precedes the word "Kim". The variable "$clean_after_patt" will contain the portion of a string which follows the word "Kim". If "Kim" appears more than once on a line, the greedy (.*) in front means the split happens at the last occurrence. Simply evaluate the values of those variables to determine what you want to use as you search through your strings.
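To see what those three values look like on real input, here is a small sketch that runs the pattern over a couple of made-up lines. The helper name split_around_kim and the sample text are assumptions for illustration only, not part of any library.

```php
<?php
// Hypothetical helper wrapping the pattern from this article.
// Returns the three pieces for one line, or null if preg_match did not match.
function split_around_kim(string $line): ?array
{
    $pattern = '/^(((?:(?!Kim).)*)|((.*)Kim(.*)))$/im';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // Groups that took no part in the match may be absent, so default to ''
    return [
        'clean_line'        => $m[2] ?? '',
        'clean_before_patt' => $m[4] ?? '',
        'clean_after_patt'  => $m[5] ?? '',
    ];
}

// Made-up sample input: one line without the name, one line with it
$raw = "Happy coding\nHello Kim Gentes";
foreach (explode("\n", $raw) as $line) {
    print_r(split_around_kim($line));
}
```

For the first line only clean_line is filled in ("Happy coding"); for the second, clean_before_patt is "Hello " and clean_after_patt is " Gentes", with clean_line left empty.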
Of course, you would replace "Kim" with whatever pattern you DON'T want to find in your strings.
Also, my examples make ^ and $ match at line breaks and search case insensitively (via the flags on the PHP preg_match pattern). If you want to search case sensitively, simply remove the "i" flag from the preg_match pattern. Similarly, if you don't want ^ and $ to match at line breaks, just remove the "m" flag from the same pattern (your regex engine may have its own way of setting both of these flags).
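If it helps, here is a quick sketch of what each flag changes, using the simpler "does not contain Kim" pattern. The variable names and sample strings are made up for this illustration.

```php
<?php
$no_flags = '/^(?:(?!Kim).)*$/';
$with_i   = '/^(?:(?!Kim).)*$/i';
$with_m   = '/^(?:(?!Kim).)*$/m';

// "i" flag: without it, lowercase "kim" is not treated as the name to avoid
var_dump(preg_match($no_flags, 'hello kim'));  // int(1)
var_dump(preg_match($with_i, 'hello kim'));    // int(0)

// "m" flag: without it, ^ and $ only anchor to the whole string
$two_lines = "Hello Kim\nHappy coding";
var_dump(preg_match($no_flags, $two_lines));        // int(0)
var_dump(preg_match($with_m, $two_lines, $found));  // int(1)
echo $found[0], PHP_EOL;                            // Happy coding
```

With the "m" flag the second line is matched on its own, which is why the clean line comes back even though the first line contains the name.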
God bless, and happy coding
Kim Gentes
*YOU ARE FREE to use this algorithm in any application (commercial or personal or whatever). It comes with no warranties. If you DO end up using this REGEX pattern, I ask (but don't require) that you please do so with the following considerations:
When using the regex, some important things to know:
Options (turned on in your language/utility): ^ and $ match at line breaks