Tuesday, March 15, 2011

Regular Expression engine that supports raw UTF-8?

Hi, I need a regular expression engine that supports raw UTF-8, meaning the UTF-8 string is stored in a char * with each character taking one or more chars. For example, Ab is the array {0x41, 0x62}. Does anyone know of a regex engine that can accept that format? I can convert to wchar_t first if needed.

From stackoverflow
  • The current implementation of PCRE (release 7.x) corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.

  • This page says that it is possible with Boost.Regex, on the condition that you configure and use the ICU library.

  • The variable-length encoding of UTF-8 makes it very hard to write algorithms (such as regex matching) that work on the bytes directly.

    It's better to convert the UTF-8 string to a Unicode wstring with ICU and then use the wstring variant of boost::regex.
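
To illustrate the last answer, here is a minimal sketch of the decode-then-match approach using only the C++ standard library. It is not the ICU/Boost setup the answer recommends: a small hand-written decoder stands in for ICU, and std::wregex stands in for the wide-character variant of boost::regex. The function names (decode_utf8, is_single_char_bytes, is_single_char_decoded) are made up for this example, and the code assumes wchar_t is at least 32 bits wide, which holds on Linux and macOS but not on Windows.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Decode UTF-8 into one wchar_t per code point.
// Assumption: wchar_t is at least 32 bits wide (not true on Windows).
// Validation is minimal: bad lead bytes, bad continuation bytes and
// truncated sequences are rejected; overlong forms are not.
std::wstring decode_utf8(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return out;
}

// Byte-oriented matching: "." consumes one char, i.e. one byte.
bool is_single_char_bytes(const std::string& utf8) {
    static const std::regex re(".");
    return std::regex_match(utf8, re);
}

// Matching after decoding: "." consumes one wchar_t, i.e. one code point.
bool is_single_char_decoded(const std::string& utf8) {
    static const std::wregex re(L".");
    return std::regex_match(decode_utf8(utf8), re);
}
```

For a two-byte character such as U+00E9 (bytes 0xC3 0xA9), the byte-oriented version does not see a single character, while the decoded version does. For pure ASCII input such as Ab the two behave the same, since every character is one byte.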
