Thursday, February 3, 2011

Split a string ignoring quoted sections

Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this based on commas while ignoring the commas inside the quoted sections?

The output should be an array:

[ "a", "string, with", "various", "values, and some", "quoted" ]

  • What if an odd number of quotes appear in the original string?

    J c : That would imply an improperly quoted input string, so throwing an exception would be an option.
    From dulaneyb
  • Of course using a CSV parser is better but just for the fun of it you could:

    Loop on the string letter by letter.
        If current_letter == quote : 
            toggle inside_quote variable.
        Else if (current_letter ==comma and not inside_quote) : 
            push current_word into array and clear current_word.
        Else 
            append the current_letter to current_word
    When the loop is done push the current_word into array
    
    From Pat
  • The author here dropped in a blob of C# code that handles the scenario you're having a problem with:

    http://stackoverflow.com/questions/1898/csv-file-imports-in-net#6111

    Shouldn't be too difficult to translate.

    From Kev
  • If my language of choice didn't offer of way to do this without thinking then I would initially consider two options as the easy way out:

    1. Pre-parse and replace the commas within the string with another control character then split them, followed by a post-parse on the array to replace the control character used previously with the commas.

    2. Alternatively split them on the commas then post-parse the resulting array into another array checking for leading quotes on each array entry and concatanating the entries until I reached a terminating quote.

    These are hacks however, and if this is a pure 'mental' exercise then I suspect they are unhelpful. If this is a real world problem then it would help to know the language so that we could offer some specific advice.

    From Martin
  • I use this to parse strings, not sure if it helps here; but with some minor modifications perhaps?

    function getstringbetween($string, $start, $end){
    $string = " ".$string;
    $ini = strpos($string,$start);
    if ($ini == 0) return "";
    $ini += strlen($start);
    $len = strpos($string,$end,$ini) - $ini;
    return substr($string,$ini,$len);
    }

    $fullstring = "this is my [tag]dog[/tag]";
    $parsed = getstringbetween($fullstring, "[tag]", "[/tag]");

    echo $parsed; // (result = dog)

    /mp

  • Here's a simple algorithm:

    1. Determine if the string begins with a '"' character
    2. Split the string into an array delimited by the '"' character.
    3. Mark the quoted commas with a placeholder #COMMA#
      • If the input starts with a '"', mark those items in the array where the index % 2 == 0
      • Otherwise mark those items in the array where the index % 2 == 1
    4. Concatenate the items in the array to form a modified input string.
    5. Split the string into an array delimited by the ',' character.
    6. Replace all instances in the array of #COMMA# placeholders with the ',' character.
    7. The array is your output.

    Heres the python implementation:
    (fixed to handle '"a,b",c,"d,e,f,h","i,j,k"')

    def parse_input(input):

    quote_mod = int(not input.startswith('"'))

    input = input.split('"')
    for item in input:
    if item == '':
    input.remove(item)
    for i in range(len(input)):
    if i % 2 == quoted_mod:
    input[i] = input[i].replace(",", "#COMMA#")

    input = "".join(input).split(",")
    for item in input:
    if item == '':
    input.remove(item)
    for i in range(len(input)):
    input[i] = input[i].replace("#COMMA#", ",")
    return input

    # parse_input('a,"string, with",various,"values, and some",quoted')
    # -> ['a,string', ' with,various,values', ' and some,quoted']
    # parse_input('"a,b",c,"d,e,f,h","i,j,k"')
    # -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']
  • This is a standard CSV-style parse. A lot of people try to do this with regular expressions. You can get to about 90% with regexes, but you really need a real CSV parser to do it properly. I found a fast, excellent C# CSV parser on CodeProject a few months ago that I highly recommend!

    MarkJ : There's also one in the .NET framework of course. Even though it's in Microsoft.VisualBasic you could still use it from C#. http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx
    Simon Gillbee : Thanks! As a C# guy, I always forget that there are a bunch of useful VB libraries out there that I can use. Honestly, I think they are poorly named, since they are not really VB. They are just .NET.
  • Justin, that does not handle a string like this:

    input = '"a,b",c,"d,e,f,h","i,j,k"'

  • # Here's one in pseudocode (a.k.a. Python) in one pass :-P

    def parsecsv(instr):
    i = 0
    j = 0

    outstrs = []

    # i is fixed until a match occurs, then it advances
    # up to j. j inches forward each time through:

    while i < len(instr):

    if j < len(instr) and instr[j] == '"':
    # skip the opening quote...
    j += 1
    # then iterate until we find a closing quote.
    while instr[j] != '"':
    j += 1
    if j == len(instr):
    raise Exception("Unmatched double quote at end of input.")

    if j == len(instr) or instr[j] == ',':
    s = instr[i:j] # get the substring we've found
    s = s.strip() # remove extra whitespace

    # remove surrounding quotes if they're there
    if len(s) > 2 and s[0] == '"' and s[-1] == '"':
    s = s[1:-1]

    # add it to the result
    outstrs.append(s)

    # skip over the comma, move i up (to where
    # j will be at the end of the iteration)
    i = j+1

    j = j+1

    return outstrs

    def testcase(instr, expected):
    outstr = parsecsv(instr)
    print outstr
    assert expected == outstr

    # Doesn't handle things like '1, 2, "a, b, c" d, 2' or
    # escaped quotes, but those can be added pretty easily.

    testcase('a, b, "1, 2, 3", c', ['a', 'b', '1, 2, 3', 'c'])
    testcase('a,b,"1, 2, 3" , c', ['a', 'b', '1, 2, 3', 'c'])

    # odd number of quotes gives a "unmatched quote" exception
    #testcase('a,b,"1, 2, 3" , "c', ['a', 'b', '1, 2, 3', 'c'])
  • Looks like you've got some good answers here.

    For those of you looking to handle your own CSV file parsing, heed the advice from the experts and Don't roll your own CSV parser.

    Your first thought is, "I need to handle commas inside of quotes."

    Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

    It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.

    MarkJ : +10 if it would let me :)
    Magnus Smith : a great link on secretgeek - very amusing. but it only answers the question for those using .NET sadly.
    Judah Himango : True; though the advice applies to devs everywhere: don't role your own CSV parser. Ruby has one built-in, and there are libraries out there for Python, C++, most any widely used language.
    MarkJ : And although SecretGeek doesn't seem to know it, there's one built in to VB.NET too. http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx
  • I just couldn't resist to see if I could make it work in a Python one-liner:

    arr = [i.replace("|", ",") for i in re.sub('"([^"]*)\,([^"]*)"',"\g<1>|\g<2>", str_to_test).split(",")]

    Returns ['a', 'string, with', 'various', 'values, and some', 'quoted']

    It works by first replacing the ',' inside quotes to another separator (|), splitting the string on ',' and replacing the | separator again.

    MarkJ : How do you know there aren't any | in the original string? What about escaping quotes inside quoted strings?
    From PabloG
  • Python:

    import csv
    reader = csv.reader(open("some.csv"))
    for row in reader:
        print row
    
    Alex. S. : I consider this as the best answer. Is exactly what I need!
  • What if an odd number of quotes appear in the original string?

    This looks uncannily like CSV parsing, which has some peculiarities to handling quoted fields. The field is only escaped if the field is delimited with double quotations, so:

    field1, "field2, field3", field4, "field5, field6" field7

    becomes

    field1

    field2, field3

    field4

    "field5

    field6" field7

    Notice if it doesn't both start and end with a quotation, then it's not a quoted field and the double quotes are simply treated as double quotes.

    Insedently my code that someone linked to doesn't actually handle this correctly, if I recall correctly.

    From ICR
  • Here's a simple python implementation based on Pat's pseudocode:

    def splitIgnoringSingleQuote(string, split_char, remove_quotes=False):
        string_split = []
        current_word = ""
        inside_quote = False
        for letter in string:
          if letter == "'":
            if not remove_quotes:
               current_word += letter
            if inside_quote:
              inside_quote = False
            else:
              inside_quote = True
          elif letter == split_char and not inside_quote:
            string_split.append(current_word)
            current_word = ""
          else:
            current_word += letter
        string_split.append(current_word)
        return string_split
    
    From Greg Bayer

0 comments:

Post a Comment