Chinese character detection

Hi, I have a website that contains both Chinese and English content,
which is stored in a database. Each record in the DB has an English
and a Chinese field. When a user enters a search string, I have to be
able to detect which characters are Latin-based and which are Chinese.

eg) a user may enter "helloC2=CE=C5=CD=F8 world"

this is because many Chinese search phrases (especially those involved
with technology may include English words or acronyms) eg) I think MP3
in Chinese is MP=C8=FD as MP is an English acronym with the number 3 after
it, which in chinese isFD (i may be wrong, my written Chinese is non-
existent :-) but that's just an example)

To make an effective search on the Chinese field, I cannot just put
Latin characters through the same search process, as that would detract
from the effectiveness of the search.

What I need, from the search string (helloC2=CE=C5=CD=F8 world), is a PHP
function that will give me an array telling me whether each character in
the string is Chinese or not (I do not need to know whether it is
punctuation, a symbol, or any other kind of character, just yes, Chinese,
or no, something else).

All of my DB fields are UTF-8. I looked at finding out the range of
Han characters in UTF-8 encoding, but it seems very complicated. If
anyone can help out, I'd appreciate it.
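
On the range question: it tends to look less complicated when expressed as Unicode code points rather than raw UTF-8 bytes. A minimal sketch, assuming PHP 7.2+ with the mbstring extension (for mb_ord); the function name is_cjk_ideograph is hypothetical, and it only covers the main CJK Unified Ideographs block plus Extension A, not the rarer extension or compatibility blocks:

```php
<?php
// Hypothetical helper: true if $char (a single UTF-8 character) lies in the
// CJK Unified Ideographs block or in Extension A. Other Han blocks
// (Extension B onwards, compatibility ideographs) are deliberately not covered.
function is_cjk_ideograph(string $char): bool
{
    if ($char === '') {
        return false;               // nothing to inspect
    }
    $cp = mb_ord($char, 'UTF-8');   // code point of the first character
    if ($cp === false) {
        return false;               // input was not valid UTF-8
    }
    return ($cp >= 0x4E00 && $cp <= 0x9FFF)    // CJK Unified Ideographs
        || ($cp >= 0x3400 && $cp <= 0x4DBF);   // CJK Extension A
}
```

Whether these two blocks are enough depends on the data; rare characters and some personal names can fall outside them.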



Re: Chinese character detection

Something like this:
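function is_non_ascii($str) {
    // Walk the string one UTF-8 character at a time.
    $length = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $length; ++$i) {
        $char = mb_substr($str, $i, 1, 'UTF-8');
        // The first byte of any non-ASCII character is above 0x7F.
        if (ord($char) > 0x7F) {
            return true;
        }
    }
    return false;
}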
