Worship Tech/Web Tools Blog - Kim Gentes - worship leader and writer

Worship Tech Web Tools Blog

This is an ongoing blog of web tools and technology related to worship, music and church. The idea is to give you good web points and resources that you can go to. Some of it is just me cruising the net, others are favorites of friends.

Enjoy what you see here. If you find an interesting, useful, technology-related site or resource that helps worship leaders or musicians in general, please send us a note and we will check it out. Perhaps we can feature it here.

Thanks!

Enjoy! - Kim Gentes

Entries in regex (2)

Regular Expressions Except a Given String - Negative Patterns (Kim Gentes Worship & Tech Blog)

Saturday, December 20, 2008 at 12:46AM

Occasionally, I use this column in technology and worship tech to put some tips out there on the technical side of things. Today is such a day. This is another Regular Expression segment that might help someone.

The goal of much software is to find things. A way to find stuff in computer languages is a sub-language called "Regular Expressions". Most regular expressions deal with finding specific instances of data inside of a larger string. When looking for those instances of data, we often use "patterns" to match what we intend to look for with the data we are looking through to find our desired information. Those patterns often indicate what we are looking for, as in ".*Kim.*" (without the quotes) is a regex pattern that would look for my first name inside of any string. Any string that contained my name would match that pattern.
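As a minimal sketch of that positive pattern, here it is in Python (the sample strings are made up for illustration). Note that the match is case sensitive unless you ask otherwise.

```python
import re

# Pattern from the text: matches any string that contains "Kim".
CONTAINS_KIM = re.compile(r".*Kim.*")

# Hypothetical sample data, not from the original post.
samples = ["Kim Gentes", "Written by Kim", "worship leader", "kimono"]

# fullmatch requires the pattern to cover the whole string.
matches = [s for s in samples if CONTAINS_KIM.fullmatch(s)]
print(matches)  # ['Kim Gentes', 'Written by Kim']
```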

But in real life, we don't always know what we are looking for in a positive fashion. Sometimes we are looking for things simply because they AREN'T something else. Let's go back to my name, Kim. If I want to create a regex pattern that would match every string that did NOT contain my name, regex has a way to do that as well. It is called "negative lookaround". There are two types of negative lookaround: "negative lookahead" and "negative lookbehind". One checks the text that follows the current position in a string; the other checks the text that comes before it. For simplicity's sake, let's just look forward, since that is the most obvious case.
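To make the two directions concrete, here is a small Python sketch. The sample strings and the " Gentes" / "Kim " conditions are assumptions chosen only for illustration.

```python
import re

# Hypothetical sample strings.
text_a = "Kim Gentes and Kim Smith"
text_b = "Kim Gentes and Bob Gentes"

# Negative lookahead: "Kim" only where it is NOT followed by " Gentes".
ahead = [m.start() for m in re.finditer(r"Kim(?! Gentes)", text_a)]

# Negative lookbehind: "Gentes" only where it is NOT preceded by "Kim ".
behind = [m.start() for m in re.finditer(r"(?<!Kim )Gentes", text_b)]

print(ahead)   # [15]
print(behind)  # [19]
```

In both cases the lookaround only tests the neighboring text; it is never part of the matched text itself.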

So let me clarify- what we want to do is write a pattern that will find every string that does NOT contain my name, Kim. Ok, here you go:

^(?:(?!Kim).)*$

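The core of the "negativeness" of this expression is (?!Kim), which says: succeed only if the text starting at the current position is not exactly "Kim". The lookahead itself consumes no characters; the dot that follows it consumes one character, and the surrounding (?: ... )* repeats that check-then-consume step at every position in the string. The ^ and $ anchors force the match to cover the entire string, from start to end, so the pattern fails as soon as "Kim" appears anywhere. And if all you are doing is trying to make sure that you match a string that doesn't contain a specific pattern, then you are good.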
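Here is a minimal Python sketch of that check. The helper name lacks_kim and the sample strings are hypothetical.

```python
import re

# Pattern from the text: matches only strings that do NOT contain "Kim".
NOT_KIM = re.compile(r"^(?:(?!Kim).)*$")


def lacks_kim(text):
    """Return True when "Kim" does not appear anywhere in text."""
    return NOT_KIM.match(text) is not None


results = {s: lacks_kim(s) for s in ["worship leader", "Hello Kim", "Kimberly", "kim"]}
print(results)
# {'worship leader': True, 'Hello Kim': False, 'Kimberly': False, 'kim': True}
```

The last sample passes because the pattern is case sensitive as written.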

However, sometimes what you actually want is every part of a string except the text matched by the negative pattern (my name for the pattern you are trying to avoid). In other words, you want to look through any string and extract all of its data while skipping the data that matches the negative pattern. This is a little more complicated, but here is one option:

^(((?:(?!Kim).)*)|((.*)Kim(.*)))$

This pattern will find lines of data that have nothing to do with Kim, and it will also capture the data on a line that does contain Kim (so that you can programmatically ignore Kim itself). But in order for this to work, you must use what are called capturing groups. Regex programmers will recognize these as the chunks of data matched by the groups in their expression. A capturing group is formed each time you use a pair of parentheses, except for pairs that open with (?: or (?!, which do not capture. Groups are numbered by the order of their opening parentheses. Using numbered groups, you can get just the information you intended. In the above case, you will need a little user code to get the right data out. So, in PHP, you would have the following code using the above pattern:

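// $rawstring is the text to search; $regexps receives the captured groups.
if (preg_match('/^(((?:(?!Kim).)*)|((.*)Kim(.*)))$/im', $rawstring, $regexps)) {
  // preg_match leaves out trailing groups that took no part in the match,
  // so fall back to an empty string instead of reading a missing index.
  $clean_line = isset($regexps[2]) ? $regexps[2] : '';
  $clean_before_patt = isset($regexps[4]) ? $regexps[4] : '';
  $clean_after_patt = isset($regexps[5]) ? $regexps[5] : '';
} else {
  // failure: no match, or a regex error
}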

What you end up with is 3 variables as you parse through your strings. The variable "$clean_line" will contain the whole line when the line has no "Kim" in it at all. The variable "$clean_before_patt" will contain the portion of the line which precedes the word "Kim". The variable "$clean_after_patt" will contain the portion of the line which follows the word "Kim". For any given line only one side of the alternation matches, so the variables from the other side come back empty. If "Kim" appears more than once on a line, the greedy (.*) splits at the last occurrence. Simply evaluate the values of those variables to determine what you want to use as you search through your strings.
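For readers working outside PHP, here is a rough Python equivalent of the same idea. The function name split_around_kim and the sample lines are hypothetical. In Python a group that did not take part in the match comes back as None rather than as an empty string.

```python
import re

# Same pattern as above; IGNORECASE and MULTILINE mirror the PHP "i" and "m" flags.
KIM_SPLIT = re.compile(r"^(((?:(?!Kim).)*)|((.*)Kim(.*)))$", re.IGNORECASE | re.MULTILINE)


def split_around_kim(line):
    """Return (clean_line, before, after); groups that did not match are None."""
    m = KIM_SPLIT.search(line)
    if m is None:
        return None
    return m.group(2), m.group(4), m.group(5)


# Hypothetical sample lines.
no_kim = split_around_kim("worship notes")
with_kim = split_around_kim("Hello Kim Gentes")
print(no_kim)    # ('worship notes', None, None)
print(with_kim)  # (None, 'Hello ', ' Gentes')
```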

Of course, you would replace "Kim" with whatever pattern you DON'T want to find in your strings.

Also, my examples use ^ and $ matching at line breaks and search case-insensitively (via the flags on the PHP preg_match pattern). If you want to search case-sensitively, simply remove the "i" flag from the preg_match pattern. Similarly, if you don't want ^ and $ to match at line breaks, just remove the "m" flag from the same pattern (your regex engine may have its own syntax for both of these flags).
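The effect of the two flags can be seen in a short Python sketch, where re.IGNORECASE and re.MULTILINE play the roles of "i" and "m". The three-line sample text is made up for illustration.

```python
import re

RAW = r"^(?:(?!Kim).)*$"

# Hypothetical three-line input; only the middle line mentions "kim" (lowercase).
text = "first line\nsecond line mentions kim\nthird line"

# "m" only: case sensitive, so lowercase "kim" is not excluded.
multiline_only = re.findall(RAW, text, re.MULTILINE)

# "i" and "m" together, as in the PHP example: the middle line is dropped.
both_flags = re.findall(RAW, text, re.MULTILINE | re.IGNORECASE)

# "i" only: ^ and $ no longer match at the inner line breaks, so nothing matches.
ignorecase_only = re.findall(RAW, text, re.IGNORECASE)

print(multiline_only)   # ['first line', 'second line mentions kim', 'third line']
print(both_flags)       # ['first line', 'third line']
print(ignorecase_only)  # []
```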

God bless, and happy coding

Kim Gentes

*YOU ARE FREE to use this algorithm in any application (commercial or personal or whatever). It comes with no warranties. If you DO end up using this REGEX pattern, I ask (but don't require) that you please do so with the following considerations:

Please make this notation in your source code: ©2008 Kim Anthony Gentes - FREE TO USE ANYWHERE.
Please post a response on this blog entry below (you do that by clicking on the "Comments" link at the bottom of this entry), saying you found this and are using it. I'd just like to know if it's helping people and how people are using it.

When using the regex, some important things to know:

Options (turned on in your language/utility): ^ and $ match at line breaks

Kim Gentes | 7 Comments | tagged exception strings, kim gentes, regex, regular expressions in Programming, Regex, Software

Regex Pattern for Parsing CSV files with Embedded commas, double quotes and line breaks

Tuesday, October 14, 2008 at 5:32AM

While you have stumbled on KimGentes.com, you might be coming for a few different reasons. Some of you are interested in articles and resources on Christianity, music, worship and such. Others of you are interested in technology information related to church worship settings. Some other folks are programmers who are looking for helpful information on technical challenges. This particular post is a bleed over from some of my technical work in programming. Specifically, this is a post to present a solution to parsing CSV files.

Programmers understand that CSV files are simply text data files that have information stored in value fields in the file. Each of the fields is separated by commas to delimit when one value/field ends and the next begins. This is why they are called "Comma Separated Values" files (CSV for short). Anyone who is new to this concept or programming might think that writing a program to extract data from files wherein the commas separate the data fields, should be an easy task. And if that was the total sum of it, it would be quick and simple in virtually any language you could choose to do it in. But that is not the end of it. CSV files are written by a host of popular applications and read by thousands of programs as well, including almost every spreadsheet program in existence, including Microsoft Excel. When the first CSV file user started outputting values to fields and reading them in another destination, they quickly realized a limitation- if you wanted to include the literal character of a comma (,) inside of a field value itself, this could not be done, since it would be interpreted as a field separator and its value wouldn't be understood (as well as the field in which it appeared being literally chopped in half).
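The limitation is easy to reproduce. Here is a minimal Python sketch with made-up data: joining and splitting on bare commas turns a two-field row into three fields.

```python
# Hypothetical row whose first value contains a literal comma.
row = ["Gentes, Kim", "worship leader"]

# Naive writer: just join the values with commas.
line = ",".join(row)

# Naive reader: just split on commas. The first field gets chopped in half.
naive = line.split(",")

print(line)   # Gentes, Kim,worship leader
print(naive)  # ['Gentes', ' Kim', 'worship leader']
```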

To overcome this problem, it's assumed that some Neanderthal software developers (back in the Jurassic era of programming) came up with an idea to allow programs to insert and read commas inside of comma separated fields. They would allow fields to be encased in double quotes as a signal that the value inside this field should be read literally (including commas) from the first double quote to the ending double quote. This worked fine and commas could now be embedded in CSV field values. But, as you can guess, this caused a further problem for programs: the commas of the world now had safe haven inside of comma separated values, but double quotes could no longer be included inside of a double quote encased field value. Programmers quickly realized that they couldn't keep adding special characters to allow the current special characters to be escaped (which is a way of saying interpreted as literal data, with no functional consequence in the interpretation of the data).

So, to avoid using other characters to escape current special meaning characters, CSV file progenitors decided that users could escape double quotes inside of double quote encased CSV fields by placing two double quotes together in the text. This would become the standard way of escaping a double quote character ("): simply placing two double quote characters next to each other, as in "".
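The writing side of these two rules fits in a few lines. Here is a minimal Python sketch; the helper name quote_csv_field and the sample row are hypothetical, and it assumes a field is quoted only when it needs to be.

```python
def quote_csv_field(value):
    """Quote a field if it holds a comma, a double quote, or a line break."""
    if any(ch in value for ch in ',"\r\n'):
        # Double every embedded quote, then wrap the whole value in quotes.
        return '"' + value.replace('"', '""') + '"'
    return value


# Hypothetical row with an embedded comma and embedded double quotes.
row = ["Gentes, Kim", 'He said "hello"', "plain"]
line = ",".join(quote_csv_field(field) for field in row)
print(line)  # "Gentes, Kim","He said ""hello""",plain
```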

All this is fine for the people and programs writing the data; it's simple, straightforward programming to output such information. But reading CSV files that have embedded double quotes and commas, and that can include embedded line breaks, is a complicated task. Such is the life of a programmer :). To meet this challenge, we often use a pattern matching language called Regex (which stands for Regular Expressions).

Regex may be the most popular language in the programming world. It is available in nearly every high level programming language in common use, including Visual Basic, C#, JavaScript, Java, PHP, Perl, Ruby and dozens more. It is included in the search functions of editors such as UltraEdit and AceText. And it is included in most versions of Unix (and other OSes) in command line functions such as grep, in Windows utilities such as PowerGREP, and so forth. Technically speaking, Regex isn't a programming language on its own. It's a pattern matching engine that is often embedded inside of other languages. It became widely popular primarily due to its inclusion in the Unix/Linux command line function grep and in Perl, the early language of the web. Now, most programmers can't conceive of a language that doesn't include some flavor of regex.

That all said, I have chosen to write a regex pattern that can handle parsing the fields of a CSV with all the conditions I mentioned above. There are plenty of other examples of CSV parsers around, but none seem to do the trick I was looking for, which is grandly frustrating when Excel can import and export a CSV with all the listed nuances quickly and easily. So, not finding a good solution, I have written a short CSV parsing pattern. It is below.

CSV-parser (regex pattern below)

^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$

*YOU ARE FREE to use this algorithm in any application (commercial or personal or whatever). It comes with no warranties. If you DO end up using this REGEX pattern, please do so with the following considerations:

Please make this notation in your source code: ©2008 Kim Anthony Gentes - FREE TO USE ANYWHERE. No Warranties are implied or offered. This software is offered "as-is". Usable by anyone (freeware, non-commercial or personal). No support or service is offered or implied by your usage. Use of the software implies your own assumption of maintenance, liability and operability of the same. Only restriction for use: you should include this copyright notice (full text) with the code.
Please post a response on this blog entry below (you do that by clicking on the "Comments" link at the bottom of this entry), saying you found this and are using it. I'd just like to know if it's helping people and how people are using it.

When using the regex, some important things to know:

Options (turned on in your language/utility): ^ and $ match at line breaks
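One possible way to pull the individual fields out of a record is to apply the field portion of the pattern repeatedly. The Python sketch below is an illustration under stated assumptions, not the only way to use the pattern: the function name split_record, the extra capturing groups, and the (?=,|\Z) lookahead are additions made for this example.

```python
import re

# Field sub-pattern from the CSV regex above, with two additions that are
# assumptions of this sketch: capturing groups to tell quoted from unquoted
# fields, and a lookahead requiring a comma or the end of the record.
FIELD = re.compile(r'(?:"((?:[^"]|"")*)"|([^,]*))(?=,|\Z)')


def split_record(record):
    """Split one CSV record into a list of unescaped field values."""
    fields = []
    pos = 0
    while True:
        m = FIELD.match(record, pos)  # always matches; a field may be empty
        if m.group(1) is not None:
            # Quoted field: drop the outer quotes, turn "" back into ".
            fields.append(m.group(1).replace('""', '"'))
        else:
            fields.append(m.group(2))
        pos = m.end()
        if pos >= len(record):
            break
        pos += 1  # step over the comma guaranteed by the lookahead
    return fields


# Hypothetical record with an embedded comma, embedded quotes and an empty field.
parsed = split_record('one,"two, with comma","say ""hi""",,last')
print(parsed)  # ['one', 'two, with comma', 'say "hi"', '', 'last']
```

Because [^"] also matches line break characters, a quoted field that spans lines is returned as a single value, provided the whole record is passed in as one string.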

Description: below is a textual description of the regex pattern that may be helpful to programmers who want to understand what is happening in the regex.

Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match the regular expression below and capture its match into backreference number 1 «(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)»
   Match the regular expression below and capture its match into backreference number 2 «("(?:[^"]|"")*"|[^,]*)»
      Match either the regular expression below (attempting the next alternative only if this one fails) «"(?:[^"]|"")*"»
         Match the character “"” literally «"»
         Match the regular expression below «(?:[^"]|"")*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
            Match either the regular expression below (attempting the next alternative only if this one fails) «[^"]»
               Match any character that is NOT a “"” «[^"]»
            Or match regular expression number 2 below (the entire group fails if this one fails to match) «""»
               Match the characters “""” literally «""»
         Match the character “"” literally «"»
      Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^,]*»
         Match any character that is NOT a “,” «[^,]*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the regular expression below and capture its match into backreference number 3 «(,("(?:[^"]|"")*"|[^,]*))*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
      Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «*»
      Match the character “,” literally «,»
      Match the regular expression below and capture its match into backreference number 4 «("(?:[^"]|"")*"|[^,]*)»
         Match either the regular expression below (attempting the next alternative only if this one fails) «"(?:[^"]|"")*"»
            Match the character “"” literally «"»
            Match the regular expression below «(?:[^"]|"")*»
               Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
               Match either the regular expression below (attempting the next alternative only if this one fails) «[^"]»
                  Match any character that is NOT a “"” «[^"]»
               Or match regular expression number 2 below (the entire group fails if this one fails to match) «""»
                  Match the characters “""” literally «""»
            Match the character “"” literally «"»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^,]*»
            Match any character that is NOT a “,” «[^,]*»
               Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the end of a line (at the end of the string or before a line break character) «$»
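The note above about the repeated capturing group is worth seeing in action. In this Python sketch (the sample record is made up), the whole-record pattern matches, but only the first field and the last field are available through the numbered groups.

```python
import re

# The full CSV pattern from this post, with ^ and $ matching at line breaks.
CSV_RECORD = re.compile(
    r'^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$', re.MULTILINE
)

# Hypothetical record with three fields.
record = 'one,"two, with comma",three'
m = CSV_RECORD.match(record)

whole = m.group(1)           # the entire record
first = m.group(2)           # the first field
last_iteration = m.group(3)  # only the final pass of the repeated group
last = m.group(4)            # the field inside that final pass

print(first)           # one
print(last_iteration)  # ,three
print(last)            # three
```

The middle field is matched but not kept in any group, which is why getting every field takes an extra step, such as matching the field sub-pattern repeatedly.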

Thank you for all the additional information/examples and samples from various languages! Keep posting your ideas that can help others!

thanks

Kim

Kim Gentes | 34 Comments | 1 Reference | tagged CSV, comma, commas, embedded, excel, programming, quotes, regex, regular, regular expressions, separated, values in CSV, Programming, Regex