Hi, I need a regular expression engine that supports raw UTF-8, meaning the UTF-8 string is stored in a char * with each character taking one or more chars. For example, "Ab" is the array {0x41, 0x62}. Does anyone know of a regex engine that can accept that format? I can convert to wchar_t first if needed.
-
The current implementation of PCRE (release 7.x) corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.
-
This page says that it is possible with Boost.Regex, on the condition that you configure it for and use the ICU library.
-
Dealing with the variable-length nature of UTF-8 characters makes it very hard to write algorithms (such as regex matching) that work on the bytes directly.
It's better to convert the UTF-8 string to a Unicode wstring with ICU and then use the wstring variant of boost::regex.
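
To make the point concrete, here is a rough sketch of the convert-then-match idea. It is not the ICU/Boost setup itself: a small hand-written decoder (the hypothetical `utf8_to_wide`) stands in for ICU's conversion, and `std::wregex` stands in for the wide variant of boost::regex, so that the example needs only the standard library. It assumes wchar_t is 32 bits wide (typical on Linux, not on Windows), and the decoder does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper (stand-in for an ICU conversion): decodes UTF-8 bytes
// into one wchar_t per code point. Assumes a 32-bit wchar_t. Rejects bad lead
// bytes, truncated sequences and bad continuation bytes, but does not check
// for overlong encodings or surrogate code points.
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Byte-oriented match: '.' consumes exactly one char (one byte).
bool matches_bytes(const std::string& utf8) {
    static const std::regex re("caf.");
    return std::regex_match(utf8, re);
}

// Wide match: '.' consumes one decoded code point.
bool matches_wide(const std::string& utf8) {
    static const std::wregex re(L"caf.");
    const std::wstring wide = utf8_to_wide(utf8);
    return std::regex_match(wide, re);
}
```

With the input "café" stored as the bytes {0x63, 0x61, 0x66, 0xC3, 0xA9}, `matches_bytes` fails because `.` only covers the 0xC3 byte, while `matches_wide` succeeds because the two-byte sequence has been decoded into the single code point U+00E9 first.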