- William Ahern
September 13, 2004, 8:32 pm
iterator over words. The gotcha is that it should support I18N, including
CJK languages. The purpose is tokenization for a Bayesian system.
I'm looking for something similar to IBM's ICU BreakIterator functionality
(available in C, C++ and Java, but of course not Perl :(
I had hoped that /\b/ would "just work", but it doesn't seem to know much
about UTR #29. And String::Multibyte doesn't look promising either. Its
split() method doesn't allow patterns, and thus I suppose doesn't allow a
way to split a string into words (of course, figuring out how to get a
"full" string to split on is another hurdle).
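
For reference, here is a rough sketch of the kind of BreakIterator-style word
iteration described above. It is written in Java and uses the JDK's own
java.text.BreakIterator rather than ICU's class (ICU4J's version has a similar
shape, but that is not shown here). The class name WordTokens, the helper
words(), and the "keep segments containing a letter or digit" filter are
assumptions made for illustration, not part of any library. How well the JDK
iterator segments CJK text is not something this sketch demonstrates.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordTokens {
    // Hypothetical helper: walk the word boundaries of `text` and collect
    // the segments that look like words, dropping whitespace and punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Assumed filter: a segment counts as a word if it contains
            // at least one letter or digit code point.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String w : words("Hello, world! It works.", Locale.ENGLISH)) {
            System.out.println(w);
        }
    }
}
```

The returned list could then be fed to the Bayesian tokenizer one entry at a
time. What is wanted is the same thing from Perl.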