Perl equiv to ICU BreakIterator


Is there a Perl equivalent (in core or a module) that will let me
iterate over words? The catch is that it must support I18N, including
CJK languages. The purpose is tokenization for a Bayesian system.

I'm looking for something similar to IBM's ICU BreakIterator functionality
(available in C, C++ and Java, but of course not Perl :( ).

I had hoped that /\b/ would "just work", but it doesn't seem to know much
about UTR #29. And String::Multibyte doesn't look promising either. Its
split() method doesn't allow patterns, and thus I suppose doesn't offer a
way to split a string into words (of course, figuring out how to get a
"full" string to split on is another hurdle).
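
To make the /\b/ problem concrete, here is a minimal sketch. The helper name words_by_boundary and the sample strings are made up purely for illustration. It splits on /\b/ and keeps only the pieces that contain a word character. That is good enough for space-delimited text, but every character in the Japanese sample matches \w, so /\b/ finds no boundary inside it and the whole phrase comes back as one "word".

```perl
use strict;
use warnings;
use utf8;                              # the string literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: split on \b and drop the pieces that hold
# no word characters (spaces, punctuation).
sub words_by_boundary {
    my ($text) = @_;
    return grep { /\w/ } split /\b/, $text;
}

my @english  = words_by_boundary('Hello, world');
my @japanese = words_by_boundary('日本語のテキスト');

print scalar(@english),  " token(s): @english\n";    # 2 tokens: Hello world
print scalar(@japanese), " token(s): @japanese\n";   # 1 token: the whole string
```

What I'm after is something that would return several tokens for the second string, the way a dictionary- or rule-based word BreakIterator would.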


