String Validation With UTF-8 Support



I am looking for a way to check whether a string contains only word
characters and a single space (!= any whitespace char), *regardless of
the current locale*. In other words, any character that is a word
character in any locale should be allowed. This check:

preg_match("/^[\w ]*$/", $_GET['whatever']);

in which the $_GET variable contains a UTF-8 encoded string, only
seems to work with whatever locale is currently defined. Of course, I
could change the locale using setlocale(), but that would still limit
the check to a subset of all possible input values.

I also created this function from information that I found on the web:

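  function is_utf8($_string) {
    // Each alternative matches one well-formed UTF-8 byte sequence.
    return preg_match('/^([\x00-\x7f]|'                 // ASCII
                    . '[\xc2-\xdf][\x80-\xbf]|'         // 2 bytes
                    . '\xe0[\xa0-\xbf][\x80-\xbf]|'     // 3 bytes, no overlongs
                    . '[\xe1-\xec][\x80-\xbf]{2}|'
                    . '\xed[\x80-\x9f][\x80-\xbf]|'     // 3 bytes, no surrogates
                    . '[\xee-\xef][\x80-\xbf]{2}|'
                    . '\xf0[\x90-\xbf][\x80-\xbf]{2}|'  // 4 bytes
                    . '[\xf1-\xf3][\x80-\xbf]{3}|'
                    . '\xf4[\x80-\x8f][\x80-\xbf]{2})*$/',
                      $_string) > 0;
  }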

However, this does not seem to be completely accurate, as it still
allows characters such as this:
(sorry for the external link, I just don't know how to create such
characters here.)

According to the W3C Validator, those characters are still invalid.

I know there must be an answer somewhere on the web already, but I have
not found any reference in Google nor in the archives of this

Any help appreciated.


Re: String Validation With UTF-8 Support


I hope I understood your problem correctly. In the contributed notes of
the PHP Manual there's a very good function to validate (and check)
UTF-8 encoded data.

It works perfectly for me. The function returns false when the given
text contains bytes that are not valid UTF-8, e.g. ISO/ANSI characters
with codes above 127. If your web page has the correct meta tag (charset
UTF-8) or sends the corresponding header (look in php.ini, there's a
default setting!), the browser should then send you UTF-8 encoded data.
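
I don't have that function at hand, so here is only a minimal sketch of
the same idea, not the code from the Manual notes. It relies on the fact
that preg_match() with the /u modifier checks the subject for valid
UTF-8 before matching. The name looks_like_utf8() is made up for this
example.

```php
<?php
// Hypothetical helper name; not the function from the PHP Manual notes.
function looks_like_utf8(string $text): bool
{
    // With the /u modifier PCRE validates the subject as UTF-8 first;
    // preg_match() returns false (not 0) when the bytes are malformed.
    return preg_match('//u', $text) === 1;
}

var_dump(looks_like_utf8("Gr\xC3\xBC\xC3\x9Fe")); // UTF-8 bytes
var_dump(looks_like_utf8("Gr\xFC\xDFe"));         // ISO-8859-1 bytes
```

The first call should report true and the second false, because the
ISO-8859-1 bytes do not form valid UTF-8 sequences.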

By the way, have a look at the mbstring extension. It provides a set of
string functions that can stand in for the existing PHP functions which
don't support multi-byte character strings.
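
For the original question (word characters plus the plain space,
independent of the locale), something along these lines might work. It
is only a sketch: it assumes mbstring is loaded and that PCRE was built
with Unicode property support, and is_word_string() is a name I made up.
Instead of \w it uses \p{L} and \p{N}, which match letters and numbers
of any script when the /u modifier is set.

```php
<?php
// Hypothetical helper; assumes mbstring is loaded and PCRE has Unicode
// property support (needed for \p{...} together with the /u modifier).
function is_word_string(string $text): bool
{
    // Reject byte sequences that are not well-formed UTF-8.
    if (!mb_check_encoding($text, 'UTF-8')) {
        return false;
    }
    // \p{L} = any letter, \p{N} = any number, in any script; plus "_"
    // and the plain space (U+0020). \z anchors at the very end, so a
    // trailing newline is rejected as well.
    return preg_match('/^[\p{L}\p{N}_ ]*\z/u', $text) === 1;
}

var_dump(is_word_string("Gr\xC3\xBC\xC3\x9Fe 2024")); // letters, space, digits
var_dump(is_word_string("tab\there"));                // contains a tab
```

Note that combining marks (\p{M}) are not in the class, so text in
decomposed form would be rejected; add \p{M} to the class if you need
that.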

Hope that helped you a bit.

Benjamin Wilger
