Regex Pattern for Parsing CSV files with Embedded commas, double quotes and line breaks
While you have stumbled on KimGentes.com, you might be coming for a few different reasons. Some of you are interested in articles and resources on Christianity, music, worship and such. Others of you are interested in technology information related to church worship settings. Some other folks are programmers who are looking for helpful information on technical challenges. This particular post is a bleed over from some of my technical work in programming. Specifically, this is a post to present a solution to parsing CSV files.
Programmers understand that CSV files are simply text data files that have information stored in value fields in the file. Each of the fields is separated by commas to delimit when one value/field ends and the next begins. This is why they are called "Comma Separated Values" files (CSV for short). Anyone who is new to this concept or programming might think that writing a program to extract data from files wherein the commas separate the data fields, should be an easy task. And if that was the total sum of it, it would be quick and simple in virtually any language you could choose to do it in. But that is not the end of it. CSV files are written by a host of popular applications and read by thousands of programs as well, including almost every spreadsheet program in existence, including Microsoft Excel. When the first CSV file user started outputting values to fields and reading them in another destination, they quickly realized a limitation- if you wanted to include the literal character of a comma (,) inside of a field value itself, this could not be done, since it would be interpreted as a field separator and its value wouldn't be understood (as well as the field in which it appeared being literally chopped in half).
To overcome this problem, it's assumed that some Neanderthal software developers (back in the Jurassic era of programming) came up with an idea to allow programs to insert and read commas inside of comma separated fields. They would allow fields to be encased in double quotes as a signature that the value inside this field should be read literally (including commas) from the first double quote to the ending double quote. This worked fine and commas could now be embedded in CSV field values. But, as you can guess, these cause further problems for programs- now, the commas of the world had safe haven usage inside of comma separated values, however, double quotes now could not be included inside of a double quote encased field value. Programmers quickly realized that they couldn't keep adding special characters to allow for current special characters to be escaped (which is a way of saying interpreted as literal data without functional consequence in the interpretation of the data).
So, to avoid using other characters to escape current special meaning characters, CSV file progenitors harkened that users could escape double quotes inside of double quote encased CSV fields by placing two double quotes together in the text. This would the standard way of escaping a double quote character ("), by simply placing to double quote characters next to each other, as in "".
All this is fine for the people and programs writing the data- its simple straightforward programming to output such information. But reading CSV files that have embedded double quotes, commas and can include embedded line breaks is a complicated concept. Such is the life of a programmer :). To meet this challenge, we often use a pattern parsing language called Regex (which stands for Regular Expressions).
Regex maybe the most popular language in the programming world. It is used in literally every high level programming language we know of in the world, including Visual Basic, C#, Javascript, Java, PHP, Perl, Ruby and dozens more. It is included in several utilities such as search functions inside of UltraEdit and Ace Text. And it is included in most revisions of Unix (and other) OSes in command lines functions such as grep, Windows utilities powerGrep and so forth. Technically speaking Regex isn't a programming language on its own. It's a pattern matching engine that is often embedded inside of other languages. It became widely popular due to its inclusion primary in the Unix/Linux command line function of grep and the early web standard language of Perl. Now, most programmers can't conceive of a language that doesn't include some flavor of regex.
That all said, I have chosen to write a regex pattern that can handle parsing the fields of a CSV with all the conditions I mentioned above. There are plenty of other examples of CSV parsers around, but none seem to do the trick I was looking for, which is grandly frustrating when Excel can import and export a CSV with all the listed nuances quickly and easily. So, not finding a good solution, I have written a short CSV parsing pattern. It is below.
CSV-parser (regex pattern below)
^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
*YOU ARE FREE to use this algorithm in any application (commercial or personal or whatever). It comes with no warrantees. If you DO end up using this REGEX pattern, please do so with the following considerations:
- Please make this notation in your source code: ©2008 Kim Anthony Gentes - FREE TO USE ANYWHERE. No Warrantees are implied or offered. This software is offered "as-is". Usable by anyone (freeware, non-commercial or personal). No support or service is offered or implied by your usage. Use of the software implies your own assumption of maintenance, liability and operability of the same. Only restriction for us: you should include this copyright notice (full text) with the code.
- Please post a response on this blog entry below (you do that by clicking on the "Comments" link at the bottom of this entry), saying you found this and are using it. I'd just like to know if its helping people and how people are using it.
When using the regex, some important things to know:
Options (turned on in your language/utility): ^ and $ match at line breaks
Description: below is a textual description of the regex pattern that may be helpful to programmers who want to understand what is happening in the regex.
Assert position at the beginning of a line (at beginning of the string or after a line break character) «^» Match the regular expression below and capture its match into backreference number 1 «(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)» Match the regular expression below and capture its match into backreference number 2 «("(?:[^"]|"")*"|[^,]*)» Match either the regular expression below (attempting the next alternative only if this one fails) «"(?:[^"]|"")*"» Match the character “"” literally «"» Match the regular expression below «(?:[^"]|"")*» Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*» Match either the regular expression below (attempting the next alternative only if this one fails) «[^"]» Match any character that is NOT a “"” «[^"]» Or match regular expression number 2 below (the entire group fails if this one fails to match) «""» Match the characters “""” literally «""» Match the character “"” literally «"» Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^,]*» Match any character that is NOT a “,” «[^,]*» Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*» Match the regular expression below and capture its match into backreference number 3 «(,("(?:[^"]|"")*"|[^,]*))*» Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*» Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*» Match the character “,” literally «,» Match the regular expression below and capture its match into backreference number 4 «("(?:[^"]|"")*"|[^,]*)» Match either the regular expression below (attempting the next alternative only if this one fails) «"(?:[^"]|"")*"» Match the character “"” literally «"» Match the regular expression below «(?:[^"]|"")*» Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*» Match either the regular expression below (attempting the next alternative only if this one fails) «[^"]» Match any character that is NOT a “"” «[^"]» Or match regular expression number 2 below (the entire group fails if this one fails to match) «""» Match the characters “""” literally «""» Match the character “"” literally «"» Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^,]*» Match any character that is NOT a “,” «[^,]*» Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*» Assert position at the end of a line (at the end of the string or before a line break character) «$»
Thank you for all the additional information/examples and samples from various languages! Keep posting your ideas that can help others!
thanks
Kim
References (1)
-
Response: Project management toolProject portfolio tools are essential for effective project management. Celoxis, with its robust features and user-friendly interface, stands out as a top choice for streamlining portfolio management tasks, ensuring efficient project delivery.
Reader Comments (34)
All hail regular expressions! I might have to do actual work without them.
ha! I hear ya Matt... ya.. they are good.. its nice that you can get something like PowerGrep for the PC and use it locally.. btw, if you ever need a good Regex tool (to help test patterns), check out Regex Buddy.. it's great at debugging them.
Couldn't get this to work in .net :(
Jason,
be sure to check whether your regex .net flavor has the Options turned on in your language/utility for ^ and $ to match to line breaks.
also, the core part of the pattern that matches a single field is this
("(?:[^"]|"")*"|[^,]*)
if I had a 3 field pattern all the time, I would do the following
^("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*)$
some patterns are better set as predetermined scope (such as 3 reps like above than using the unknown * iterator if you aren't sure how to unwind the stack once you find the pattern).
Kim
you mentioned the .net implementation... if I used VB.Net, i would write my 3 field iterator as follows:
Try
Dim RegexObj As New Regex("^(""(?:[^""]|"""")*""|[^,]*),(""(?:[^""]|"""")*""|[^,]*),(""(?:[^""]|"""")*""|[^,]*)$", RegexOptions.Multiline)
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
' matched text: MatchResults.Value
' match start: MatchResults.Index
' match length: MatchResults.Length
MatchResults = MatchResults.NextMatch()
End While
Catch ex As ArgumentException
'Syntax error in the regular expression
End Try
I use biterscripting a lot for parsing of CSV files. They have a sample script at http://www.biterscripting.com/Download/SS_CSV.txt . You can parse by string, line, word (separated by commas), character. Further, you can change the delimiter values, even between lines of the same file, even depending on previous lines, by changing variables $wsep and $lsep.
Patrick
Kim--
Using your Regex for some in-house code. Made the following changes for it to work on C#:
string pattern = @"(""(?:[^""]|"""")*""|[^,]*)(?:,(""(?:[^""]|"""")*""|[^,]*))*";
Note removed the start and end line anchors since I'm processing line-by-line. Easy enough to add them back in.
Thanks for posting your Regex. The english-breakdown helped!
James
I found your parser, and was about to use it (Thanks!) until I found this site:
http://www.programmersheaven.com/user/Jonathan/blog/73-Splitting-CSV-with-regex/
whose author, along with Guillaume Roderick, came up with the wonderful:
Ruby: string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
Perl: $fields = preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", $string)
Splitting a string (or line) means putting every piece into an array (or list), which is almost always what you want to do. These lookahead regexps are explained on Jonathan's site. They are fully general.
Thanks for the original idea.
I could not make the repeating group work in C#, so have created this modified one and use Matches to find the values. use with the ExplicitCapture flag.
Groups[1] contains the result for each match.
(((?<x>(?=[,\r\n]+))|"(?<x>([^"]|"")+)"|(?<x>[^,\r\n]+)),?)
Explanation:
Match type 1: a lookahead checks if we have an empty element - if it finds a comma or a newline. this results in an empty match.
or
Match type 2: if 1 fails, check for a quoted string
or
Match type 3: then for a non-quoted string. this is taken from the original regex.
Finally, consume a comma if there is one.
Behavior:
- newline characters inside "" will become part of the value.
- empty lines outside "" result in one empty value
- escaped quotes inside quoted strings ("") need to be replaced afterwards!
note: if the last line is empty AND the previous line had a comma at the end, no extra result is given. I have no fix for this yet.
C# code to create arrays from csv lines: (uses linq)
string[] SplitCsvLine(string line)
{
return (from System.Text.RegularExpressions.Match m in System.Text.RegularExpressions.Regex.Matches(line,
@"(((?<x>(?=[,\r\n]+))|""(?<x>([^""]|"""")+)""|(?<x>[^,\r\n]+)),?)",
System.Text.RegularExpressions.RegexOptions.ExplicitCapture)
select m.Groups[1].Value).ToArray();
}
Good luck.
I had become intrigued by the use of Regular Expressions as a way to parse CSV files. I began a project to test these formulas as soon as I learned of their potential as a solution. I was hoping to find some sample code or even a distributable DLL that could be employed for this purpose. So far the most definitive work I’ve seen was the regular expression formulas shared on this blog.
The formula provided by the original author of this blog for some reason does not work for me. When I use this formula on a single string record, it returns the entire record back to me in the first member of the collection created by the MS Regx process. I’m using the Microsoft vbscript.dll version 5.5 in a VB6 program. I’m referring to the following formula:
^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
I then tried the shorter version offered by Kim and had much better results.
("(?:[^"]|"")*"|[^,]*)
The problem with this second formula is that it creates an extra empty string field between each data column. I could live with this except I wasn’t sure if this would be consistent for any type of CSV format or if there was something about my test data that was causing this. I tried running more exotic string combinations and unfortunately the formula didn’t work. Here is the exotic string to which I was referring:
"000523"",1",",HOLTSVILLE"",""40.81518"",""-73.0455"",""5"",""25"",""103"",""Y"",""V13916"",""U"",""""",,A
(note: The above two lines should be a single line)
The correct and actual field breaks (as can be verified with Excel) create four fields and occur as follows:
000523",1
,HOLTSVILLE","40.81518","-73.0455","5","25","103","Y","V13916","U",""
(empty field)
A
I realize this is a tough string to parse but of course, that was the intention. In any event, so far I’ve yet to find a RegX formula that would pass the test. And yet, Excel has no problem with it at all, which still amazes me.
I finally decided to write my own DLL which I intend to use whenever again faced with trying to read and process a correctly formatted CVS file. I realize XML is the chosen way to go for data exchange these days and in many ways find XML superior to flat text files, but there are still legacy processes around and it is nice to know I can handle the situation should it become necessary. Also, there are some files so large that I find XML to be impractical. I’m referring to flat files that or 10 or 15 gig in size. I’ve never tried to process an XML file this large and wonder whether or not it is possible.
Testing my new DLL, I learned that this process is highly susceptible to attain self-serving results. That is why I’ve tried to develop some very unfriendly data to challenge the logic.
As a way to maintain sanity while testing, I chose a simple approach. My “proof of process” was that if Excel can parse the record correctly then I should be able to parse it in exactly the same way with the same results. So I used Excel to verify every string I’ve subjected to my parsing routine. If I attained the same results, I took it to mean that my process was correct.
I’ve used test media from a number of sources, including some I've created using Excel. These were chosen with the sole purpose of trying to confound the logic with highly improbably data. Nevertheless I've no doubt there are probably perfectly valid files out there that my program might not be able to handle. I certainly have not exercised all the conditions that have been included in the logic. If anyone would care to contribute some complicated CSV format files for me to test, I would appreciate hearing from you. You may email them to me but please limit the size of the file(s) to under 25K. Again, I stress, these data must be able to pass the “Excel” test for me to consider them as a viable challenge.
As time permits, I will return the results and if the DLL proves viable, it can be shared should there still be anyone out there still using VB6. Eventually, I will make this a VB.Net program after I am sure it is working.
Thanks for your notes all.. Especially all those who have chosen to post your updates and share your code here so others could benefit. VBasic, if you feel free to, we'd love to see what you came up with too... Thanks for posting your thoughts and notes.. Feel free to add your own flavors and updates all, as you have them for your specific apps.
Kim
Thanks so much for posting this, and all the helpful comments evrybody's contributed. It's just what I needed and saved me a load of time!
Thanks for the post! I am trying to convert a nasty CSV file into SQL INSERT statements and this helped a great deal.
Thanks for the post! It helped me tremendously in parsing the CSV format produced by the schtasks</> command in Windows.
Thank you so much for making this available. A DEFINITE life saver.
Thanks a lot this is great!
This version doesn't capture spaces outside of quotes, and treats an empty quoted field (..., "foo", "", "bar"...) as an empty string rather than as a single quote.
// Not escaped: \s?((?<x>(?=[,]+))|"(?<x>([^"]|"")+)"|"(?<x>)"|(?<x>[^,]+)),?
var matches = Regex.Matches(line,
@"\s?((?<x>(?=[,]+))|""(?<x>([^""]|"""")+)""|""(?<x>)""|(?<x>[^,]+)),?",
RegexOptions.ExplicitCapture);
var values = (from Match m in matches
select m.Groups["x"].Value.Trim().Replace("\"\"", "\"")).ToArray();
Thanks for getting me started!
Great Pattern; using for classic ASP-based website. Thanks.
CSV's never go away.
Hopefully the insights gained from this page may result in a match.
LOL! CSV humor.. very funny Sal..!
I use the Vim text editor gvim to interactively develop regexps. Vim is helpful because you can tell it to highlight and jump to intermediate results as you enter and edit the regexp on the search command line. I then translate the finished product to the regexp flavor for the language I am programming in.
I translated your ideas into a Vim regex that matches individual CSV fields
"\%(\_[^"]\+\|""\)*"\|[^",]*
The \_ prefix to the [^"] character class allows newlines to be included
\%( ... \) is the Vim idiom for non-capture grouping
Note, the quoted-field handling part must come first!
The only major difference I have is to append \+ after the first character class, and to exclude commas as well as double quotes in the final character class. Not sure if that breaks some special case, but it does test out well using this command...
%s/"\%(\_[^"]\+\|""\)*"\|[^",]*/<&>/g
...on this challenging example from Wikipedia's article on Comma-separated values:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
well done, Jim. Thanks so much for adding your contribution to the discussion. I honestly hadn't heard of VIM, but it sounds intriguing. A number of regex building tools exist now that really take the syntax to more of a work output, and you can use wizards to develop the pattern themes, but it sounds like VIM is an interactive tool for building them real time.. cool concept. Regex Buddy has some of those components and since I fancy myself as a bit of a purest, I like trying to work out the regex in raw form rather than wizards. But it is nice, even when doing this, to see instant testing results, especially when the depth of groupings of captured and uncaptured segments get deeper and deeper.
Keep on regexing, people! Thanks for all the contributions on this post. I hear regularly from people that it continues to help folks to find what they are looking for in this area of CSV files specifically.
Brilliant ! Thanks for posting
I'm so happy I found this, this was going to be a nightmare for me.
Anybody has regex for php for text comma separated file for eg: firstname,lastname,address,city,province,postalcode
What would be the pattern for the following?
Either empty string, or.
csv with a twist that there would be comma before first entry and after last entry as well.
Example 1: Empty string
Example 2: ,abc,
Example 3: ,abc,def,ghi,
The work from H Brouwer gave me success.
I'd like to use this with sed. As a test, I try to put AAAA and ZZZZ as obvious field markers. But
sed 's/("(?:[^"]|"")*"|[^,]*)/AAAA & XXXX/g' inputfile
seems to just output the original file. No AAAA and ZZZZ markers.
it's as if the regex doesn't match anything.
Hints?
Thanks for the internet, and thanks for sharing.
Hi, I would like to break Read CSV file into an array by splitting it on \n. But the problem is that when i split the CSV read string, it also breaks the Multi Line Text into multiple lines which is totally wrong. Can you please give me a VBA version how to escape \n while reading CSV.
Regards
Murtaza
I'm using a modified version of this in my source code.
Hi, thanks for the hits of the regex!
Worked for me. Thank you
Thanks for the fantastic writeup - I've added this into a .net import script component in a SSIS package!