Thursday, February 17, 2011

Can I optimize this phone-regex?

Ok, so I have this regex:

( |^|>)(((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7}))|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6}))|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8})))( |$|<)

It matches Dutch and Belgian phone numbers (I only want those, hence the 31 and 32 as country codes).

It's not much fun to decipher, and as you can see it contains a lot of duplication, but as it stands it handles the numbers very accurately.

All of the following European-formatted phone numbers are accepted:

0031201234567
0031223234567
0031612345678
+31(0)20-1234567
+31(0)223-234567
+31(0)6-12345678
020-1234567
0223-234567
06-12345678
0201234567
0223234567
0612345678

and the following incorrectly formatted ones are not:

06-1234567 (a mobile phone number in Holland should have 8 digits after 06)
0223-1234567 (a 4-digit area code followed by 7 digits, which is one digit too many)

as opposed to this one, which is valid:

020-1234567 (a 3-digit area code is followed by 7 digits, whereas a 4-digit area code can only be followed by 6 digits)

As you can see it's the '-' character that makes it a little difficult but I need it in there because it's a part of the formatting usually used by people, and I want to be able to parse them all.

Now my question is: do you see a way to simplify this regex (or even improve it, if you see a fault in it) while keeping the same rules?

You can test it at regextester.com

(The '( |^|>)' checks that the number is at the start of a word, allowing it to be preceded by a space, the start of a line, or a '>'. I search for the phone numbers in HTML pages.)
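
As an alternative to an online tester, the accepted and rejected lists above can be checked with a few lines of Python. This is only a sketch: the regex is the one from the question, split across three string literals for readability, and the names `ORIGINAL`, `ACCEPTED`, `REJECTED` and `is_phone` are made up for this illustration.

```python
import re

# The regex from the question, unchanged; adjacent literals are concatenated.
ORIGINAL = re.compile(
    r"( |^|>)(((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7}))"
    r"|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6}))"
    r"|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8})))( |$|<)"
)

ACCEPTED = [
    "0031201234567", "0031223234567", "0031612345678",
    "+31(0)20-1234567", "+31(0)223-234567", "+31(0)6-12345678",
    "020-1234567", "0223-234567", "06-12345678",
    "0201234567", "0223234567", "0612345678",
]
REJECTED = ["06-1234567", "0223-1234567"]

def is_phone(text):
    """Return True if the question's regex finds a number in text."""
    return ORIGINAL.search(text) is not None

for number in ACCEPTED + REJECTED:
    print(number, is_phone(number))
```

Any simplified version can be dropped into `ORIGINAL` and run against the same two lists to confirm the rules are unchanged.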

From stackoverflow
  • Good Lord Almighty, what a mess! :) If you have high-level semantic or business rules (such as the ones you describe talking about European numbers, numbers in Holland, etc.) you'd probably be better served breaking that single regexp test into several individual regexp tests, one for each of your high level rules.

    if number =~ /...../  # Holland mobiles
      # ...
    elsif number =~ /..../  # Belgian landlines
      # ...
    # etc.
    end
    

    It'll be quite a bit easier to read and maintain and change that way.

    tvanfosson : And order your tests by most likely to match (assuming you know the demographics well enough).
    Pistos : @tvanfosson: Sure; agreed.
    youri : I didn't think of that :P thanks :)
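
A minimal Python sketch of the one-test-per-rule idea, assuming only the Dutch national formats described in the question (mobile, 3-digit area code, 4-digit area code, counting the leading zero). The rule labels and the `RULES` and `classify` names are hypothetical; Belgian rules are left out because the thread does not spell them out.

```python
import re

# Hypothetical rule table, ordered so the most specific rule is tried first.
RULES = [
    ("mobile", re.compile(r"06[- ]?[0-9]{8}")),
    ("3-digit area code", re.compile(r"0[0-9]{2}[- ]?[0-9]{7}")),
    ("4-digit area code", re.compile(r"0[0-9]{3}[- ]?[0-9]{6}")),
]

def classify(number):
    """Return the label of the first rule that matches the whole string, else None."""
    for label, pattern in RULES:
        if pattern.fullmatch(number):
            return label
    return None

print(classify("06-12345678"))
print(classify("0223-1234567"))
```

Without a dash or space, a ten-digit number such as 0223234567 fits both area-code shapes, so the first matching rule wins; the original single regex has the same ambiguity.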
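  • Split it into multiple expressions. For example, in Python:

    import re

    phone_no_patterns = [
        re.compile(r"[0-9]{13}"),                  # 0031201234567
        re.compile(r"\+(31|32)\(0\)\d{2}-\d{7}"),  # +31(0)20-1234567
        # ..etc..
    ]

    def check_number(num):
        # Return the captured groups of the first pattern that matches, else None.
        for pattern in phone_no_patterns:
            match = pattern.fullmatch(num)
            if match:
                return match.groups()
        return None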
    

    Then you just loop over the patterns, checking whether each one matches.

    Splitting the patterns up makes it easy to fix specific numbers that are causing problems (which would be horrible with the single monolithic regex).

  • It's not an optimization, but you use

    (-)?( )?
    

    three times in your regex. This will cause you to match on phone numbers like these

    +31(0)6-12345678
    +31(0)6 12345678
    

    but will also match numbers containing a dash followed by a space, like

    +31(0)6- 12345678
    

    You can replace

    (-)?( )?
    

    with

    (-| )?
    

    to match either a dash or a space.

    Brad Gilbert : better yet `[- ]?`
    Bill the Lizard : That is better. Your solution saves a character. I was saving myself typing. :)
    youri : I didn't notice I did that, thanks
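
The difference between the two separator forms can be seen in a short Python sketch. The patterns below are cut down to the Dutch mobile case purely for illustration; `loose`, `strict`, `samples` and `results` are names invented here.

```python
import re

# Separator as in the question: optional dash, then optional space.
loose = re.compile(r"\+31\(0\)6(-)?( )?[0-9]{8}$")
# Separator as suggested in the comments: at most one dash or space.
strict = re.compile(r"\+31\(0\)6[- ]?[0-9]{8}$")

samples = [
    "+31(0)6-12345678",
    "+31(0)6 12345678",
    "+31(0)612345678",
    "+31(0)6- 12345678",  # dash followed by a space
]

# Map each sample to (matches loose, matches strict).
results = {s: (bool(loose.match(s)), bool(strict.match(s))) for s in samples}
for sample, outcome in results.items():
    print(sample, outcome)
```

Only the last sample is treated differently: the `(-)?( )?` form accepts it and the `[- ]?` form rejects it.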
  • (31|32) looks bad. When matching 32, the regex engine will first try to match 31, fail, and backtrack to the start of the group before matching 32. It's more efficient to first match 3 (one character), try 1 (fail), backtrack one character and match 2, i.e. 3(1|2) or 3[12].

    Of course, your regex fails on 0800- numbers; they're not 10 digits.

    youri : I don't want 0800 numbers, but the other part of your comment was useful, thanks.
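
A small Python check that factoring out the shared 3 does not change what is accepted. It only demonstrates equivalence on a handful of prefixes and makes no timing claim; the names `alternation`, `char_class`, `codes` and `agree` are illustrative.

```python
import re

# Country code written as an alternation, as in the question.
alternation = re.compile(r"(?:\+|00)(31|32)")
# Country code with the common leading 3 factored out.
char_class = re.compile(r"(?:\+|00)3[12]")

codes = ["+31", "+32", "0031", "0032", "+33", "0030", "31"]

# True when both patterns give the same verdict for every sample prefix.
agree = all(
    bool(alternation.fullmatch(code)) == bool(char_class.fullmatch(code))
    for code in codes
)
print(agree)
```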
  • First observation: reading the regex is a nightmare. It cries out for Perl's /x mode.

    Second observation: there are lots, and lots, and lots of capturing parentheses in the expression (42 if I count correctly; and 42 is, of course, "The Answer to Life, the Universe, and Everything" -- see Douglas Adams' "Hitchhiker's Guide to the Galaxy" if you need that explained).

    Bill the Lizard notes that you use '(-)?( )?' several times. There's no obvious advantage to that compared with '-? ?' or possibly '[- ]?', unless you are really intent on capturing the actual punctuation separately (but there are so many capturing parentheses working out which '$n' items to use would be hard).

    So, let's try editing a copy of your one-liner:

    ( |^|>)
    (
        ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7})) |
        ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6})) |
        ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8}))
    )
    ( |$|<)
    

    OK - now we can see the regular structure of your regular expression.

    There's much more analysis possible from here. Yes, there can be vast improvements to the regular expression. The first, obvious, one is to extract the international prefix part, and apply that once (optionally, or require the leading zero) and then apply the national rules.

    ( |^|>)
    (
        (((\+|00)(31|32)( )?(\(0\))?)|0)
        (
            (((([0-9]{2})(-)?( )?)?)([0-9]{7})) |
            (((([0-9]{3})(-)?( )?)?)([0-9]{6})) |
            (((([0-9]{1})(-)?( )?)?)([0-9]{8}))
        )
    )
    ( |$|<)
    

    Then we can simplify the punctuation as noted before, and remove some plausibly redundant parentheses, and improve the country code recognizer:

    ( |^|>)
    (
        (((\+|00)3[12] ?(\(0\))?)|0)
        (
            (((([0-9]{2})-? ?)?)[0-9]{7}) |
            (((([0-9]{3})-? ?)?)[0-9]{6}) |
            (((([0-9]{1})-? ?)?)[0-9]{8})
        )
    )
    ( |$|<)
    

    We can observe that the regex does not enforce the rules on mobile phone codes (so it does not insist that '06' is followed by 8 digits, for example). It also seems to allow the 1, 2 or 3 digit 'exchange' code to be optional, even with an international prefix - probably not what you had in mind, and fixing that removes some more parentheses. We can remove still more parentheses after that, leading to:

    ( |^|>)
    (
        (((\+|00)3[12] ?(\(0\))?)|0)    # International prefix or leading zero
        (
            ([0-9]{2}-? ?[0-9]{7}) |    # xx-xxxxxxx
            ([0-9]{3}-? ?[0-9]{6}) |    # xxx-xxxxxx
            ([0-9]{1}-? ?[0-9]{8})      # x-xxxxxxxx
        )
    )
    ( |$|<)
    

    And you can work out further optimizations from here, I'd hope.

    youri : Thank you. I did break it up for myself to see if I could achieve this, but I must've done something wrong... thanks, this is really helpful
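
For readers without Perl's /x mode, a rough Python equivalent of the last version uses `re.VERBOSE`. Two assumptions in this sketch: literal spaces are written as `[ ]`, because verbose mode ignores bare whitespace in the pattern, and the three national alternatives are wrapped in one group so that the prefix applies to each of them. Non-capturing groups are used everywhere except around the number itself; `PHONE_RE` and `find_phone` are hypothetical names.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?:[ ]|^|>)                               # start of word
    (
        (?:(?:\+|00)3[12][ ]?(?:\(0\))?|0)    # international prefix or leading zero
        (?:
            [0-9]{2}-?[ ]?[0-9]{7} |          # xx-xxxxxxx
            [0-9]{3}-?[ ]?[0-9]{6} |          # xxx-xxxxxx
            [0-9]{1}-?[ ]?[0-9]{8}            # x-xxxxxxxx
        )
    )
    (?:[ ]|$|<)                               # end of word
    """,
    re.VERBOSE,
)

def find_phone(text):
    """Return the first phone number found in text, or None."""
    match = PHONE_RE.search(text)
    return match.group(1) if match else None

print(find_phone("<td>+31(0)20-1234567</td>"))
print(find_phone("06-1234567"))
```

Because only one group captures, the matched number is always `group(1)`, which avoids working out which of the many numbered groups holds it.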
