mb_detect_string() & UTF-8

I've got an issue with mb_detect_encoding() and/or a fundamental lack of
character encoding knowledge...

$string = html_entity_decode('&eacute;', ENT_COMPAT, 'ISO-8859-1');  // one byte, chr(233)
$encoding = mb_detect_encoding($string);
echo 'encoding = '.$encoding;  // returns UTF-8

Well, it's not UTF-8, and echoing $string on a page whose charset
is set to UTF-8 shows a gibberish character.

On http://www.php.net/manual/en/function.mb-detect-encoding.php,
"hmdker at gmail dot com" has posted an is_utf8() function:
is_utf8($string); // returns false
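
For anyone following along, here is a minimal sketch of the same check done with stock mbstring functions instead of the user-contributed is_utf8(). It assumes the decoded string is the single byte 0xE9 (an é in ISO-8859-1); the variable names are only for illustration.

```php
<?php
// A lone é in ISO-8859-1 is the single byte 0xE9 (assumed input).
$string = "\xE9";

// Look at the raw bytes rather than trusting how the page renders them.
$hex = bin2hex($string);

// mb_check_encoding() validates against one named encoding; it does not guess.
$isUtf8   = mb_check_encoding($string, 'UTF-8');
$isLatin1 = mb_check_encoding($string, 'ISO-8859-1');

echo "bytes: $hex\n";
echo 'valid UTF-8: ' . var_export($isUtf8, true) . "\n";
echo 'valid ISO-8859-1: ' . var_export($isLatin1, true) . "\n";
```

A validity check answers a narrower question than detection ("could this be UTF-8?" rather than "what is this?"), which is why it can give a firm answer for a single byte.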

I've played with mb_detect_order(), adding ISO-8859-1 before UTF-8,
but then it never properly detects a UTF-8 string...

Re: mb_detect_string() & UTF-8

Detecting the character set is a heuristic process.  It makes a guess,
based on common, uncommon, possible, and impossible sequences.  When you
give it one single character, there's just not enough information for it to
make an accurate determination.  chr(233) is a valid character in virtually
every 8-bit character set, so it picks one.  If you give it more
characters, it can make a more intelligent guess, but it's still not 100%.
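
To illustrate why one byte is not enough to go on, here is a rough sketch that reads chr(233) as three different 8-bit character sets and converts each reading to UTF-8. The choice of the three charsets is arbitrary, just for illustration.

```php
<?php
// The same byte, interpreted under three different 8-bit character sets.
$byte = "\xE9"; // chr(233)

$readings = [];
foreach (['ISO-8859-1', 'ISO-8859-5', 'ISO-8859-7'] as $charset) {
    // Convert to UTF-8 so each reading can be printed on a UTF-8 page.
    $readings[$charset] = mb_convert_encoding($byte, 'UTF-8', $charset);
}

foreach ($readings as $charset => $char) {
    echo "$charset: $char\n";
}
```

On a UTF-8 terminal this should print é (Latin), щ (Cyrillic), and ι (Greek). All three readings are legitimate, so nothing in the byte itself says which one was meant.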

Why do you need to do this kind of detection?
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Re: mb_detect_string() & UTF-8

BKDotCom wrote:

The manual entry for mb_detect_encoding() mentions that, if you don't
provide a list of charsets as second argument, it uses the output of
mb_detect_order(). On my system, it looks like this:

<?php print_r( mb_detect_order() ); ?>

Array
(
    [0] => ASCII
    [1] => UTF-8
)

Since chr(233) is not valid 7-bit ASCII, you get UTF-8 as best guess.
(After all, this is all guessing.)
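
One way to make the guess less arbitrary is to pass your own candidate list and turn on strict mode, so an encoding is only returned when the whole string is valid in it. The sketch below assumes the input can only be UTF-8 or ISO-8859-1, which may not hold for your data; the variable names are made up for the example.

```php
<?php
$latin1 = "\xE9";      // é as one ISO-8859-1 byte
$utf8   = "\xC3\xA9";  // é as two UTF-8 bytes

// Explicit candidates plus strict mode (third argument true).
// UTF-8 is listed first; the lone 0xE9 byte is not valid UTF-8,
// so detection falls through to ISO-8859-1.
$candidates = ['UTF-8', 'ISO-8859-1'];
$guess = mb_detect_encoding($latin1, $candidates, true);

// Every byte string is valid ISO-8859-1, which is why listing it
// first can hide genuine UTF-8 input.
$utf8PassesAsLatin1 = mb_check_encoding($utf8, 'ISO-8859-1');

// Once the source encoding is known, convert before echoing on a UTF-8 page.
$forDisplay = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo "guess: $guess\n";
echo 'display bytes: ' . bin2hex($forDisplay) . "\n";
```

Putting the strictest encoding first matters here: UTF-8 rejects most random 8-bit data, while ISO-8859-1 accepts anything, so it works better as the fallback than as the first candidate.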

-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin-finish humor site: http://www.demogracia.com
