is any work being done to fix/improve PHP's string handling beyond 8 bits?



Last year I asked a bunch of questions about character encoding on this
newsgroup. All the answers came down to using ord() in creative ways to
try to make guesses about multi-byte characters. I was a little amazed
at this and wondered if I'd somehow misunderstood the situation.

I'm pleased to find that Joel Spolsky shared my amazement and offered
some criticism of PHP on these grounds: "When I discovered that the
popular web development tool PHP has almost complete ignorance of
character encoding issues, blithely using 8 bits for characters, making
it darn near impossible to develop good international web applications,
I thought, enough is enough."

But his essay is a year older than even the questions I had last year.
So I'm left wondering, is any work being done to fix the situation? I
just looked at and saw no
new functions for handling multi-byte characters. Is anything being
done on this front?

And why aren't a lot of people asking these questions? Once again I'm
wondering if perhaps I've misunderstood something, somewhere. Isn't
this an issue that affects pretty much all of us using PHP on the web?
How are any of the people reading this post dealing with their own
character encoding issues?

Joel Spolsky's essay is here:

Re: is any work being done to fix/improve PHP's string handling beyond 8 bits?

If there's one person who's qualified to talk about multilingual
programming and PHP, that person would be me. In the last couple years
I have been working on a content management system dealing with
materials in such languages as Korean, Pashto, Georgian, Ethiopic, and
Chechen. And let me tell you, whether the server-side technology you
use can "natively" support Unicode is the least of your problems.

PHP is basically encoding-agnostic. By and large, this is good enough.
Most of the issues you encounter in multilingual application
development are on the display side. For example, how to get the page
layout to look correct when you have to flip it for a right-to-left
language. Only rarely does the server-side application need to
"understand" what it sends or receives.  By default, you can't do much
with the text in a multilingual situation, because the scripts behave
so differently.  In our application, for instance, we have to ask our
users to enter the word count, because for languages like Chinese where
no spaces appear between words, the computer can't do it automatically.

If you ask me, the 8-bit strings in PHP cut both ways. There are
occasions when I wish I could get the Unicode value of a specific
character (quite difficult in standard PHP). Yet there are also times
when I appreciate the fact that PHP isn't fiddling with the text that's

Re: is any work being done to fix/improve PHP's string handling beyond 8 bits?

I don't think the problem is that PHP focuses on 8 bit strings, I think
the problem is the lack of default, built-in functions for dealing with
multibyte strings.

Re: is any work being done to fix/improve PHP's string handling beyond 8 bits?

Yeah, a set of functions that treats a regular string as a UTF-16
string would be quite useful.

Re: is any work being done to fix/improve PHP's string handling beyond 8 bits?



   For example? Just curious... (I guess you aren't referring to UTF-8 to
UTF-32 conversion)

  <?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com    Blog:

Re: is any work being done to fix/improve PHP's string handling beyond 8 bits?

On 23 May 2005 14:06:21 -0700, wrote:


 Well - your questions, if I recall, were less about PHP supporting multibyte
strings than about receiving strings from external sources with no
well-defined encoding, or worse, strings coming in with an encoding different
from that defined by the originating page (the main current browsers handle
this badly), so you were forced to try heuristics to identify the unknown
encoding of a series of bytes.

 Once you know what encoding a string is in, then PHP has wide support for
character set encodings.


 That's because they're all in the Multibyte String section.


 The one key sentence in there is:

"It does not make sense to have a string without knowing what encoding it
uses."


 PHP's "string" datatype is a bit of a misnomer; it's more like a "series of
bytes" datatype. The "plain" string functions, as in C, assume a single byte
encoding, and are pretty dumb about the mapping between that and characters.
Where there's any significance, some functions take a character set encoding
parameter, or default to ISO-8859-1. You have to keep track of what encoding
you're storing in strings.
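
To make the "series of bytes" point concrete, here is a minimal sketch. The sample string and variable names are mine, purely for illustration; hex escapes are used so the example doesn't depend on the encoding of the script file itself.

```php
<?php
// "é" is two bytes in UTF-8 (0xC3 0xA9) but one byte in ISO-8859-1 (0xE9).
$utf8   = "caf\xC3\xA9";   // "café" encoded as UTF-8
$latin1 = "caf\xE9";       // "café" encoded as ISO-8859-1

// strlen() counts bytes, so the same word gives two different lengths.
$utf8Bytes   = strlen($utf8);    // 5
$latin1Bytes = strlen($latin1);  // 4

// Indexing is also per byte: this picks out the second half of the "é".
$lastByteHex = bin2hex($utf8[4]);  // "a9"

// Where the encoding matters, some functions take it as a parameter.
$escaped = htmlentities($utf8, ENT_QUOTES, 'UTF-8');  // "caf&eacute;"
```

Nothing in the string itself says which of the two encodings it holds; that bookkeeping is up to the script.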

 mbstring puts a bit more intelligence into it, since it knows about more
character set encodings, e.g. it can give you counts of characters for
multibyte encoded strings, or convert between encodings. But you still need to
know what encoding each string is in.
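
As a rough sketch of what that looks like in practice (assuming the mbstring extension is loaded; `utf8_code_point()` is a hypothetical helper name of my own, not part of mbstring):

```php
<?php
// Requires the mbstring extension.
$utf8 = "caf\xC3\xA9";   // "café" encoded as UTF-8

// Byte count versus character count. The encoding is passed explicitly,
// because mbstring cannot know it unless you tell it.
$bytes = strlen($utf8);              // 5
$chars = mb_strlen($utf8, 'UTF-8');  // 4

// Convert between encodings: UTF-8 in, ISO-8859-1 out.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');  // "caf\xE9"

// Hypothetical helper: Unicode code point of a single UTF-8 character,
// obtained by converting to UCS-4BE (4 bytes, big-endian) and unpacking.
function utf8_code_point($char)
{
    $ucs4  = mb_convert_encoding($char, 'UCS-4BE', 'UTF-8');
    $parts = unpack('N', $ucs4);  // 'N' = unsigned 32-bit, big-endian
    return $parts[1];
}

$cp = utf8_code_point("\xC3\xA9");  // 233, i.e. U+00E9
```

The UCS-4BE round trip is one way to get at the Unicode value of a character, which is otherwise awkward with the plain string functions.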

 Multibyte strings are still second-class citizens in PHP, but saying it has no
support for them is just wrong; mbstring has been around for ages. There's even
an option (mbstring.func_overload) that replaces the builtin single-byte
functions with multibyte-aware equivalents.

 You can still work with UTF-8 strings without mbstring, anyway. It just
depends what operations you perform on them. Concatenation is unaffected, as is
printing. Counting characters requires a multibyte aware function, but if you
never use strlen() on the strings, it doesn't matter what encoding they're in.
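
A small sketch of both halves of that, without mbstring. It assumes the input is already valid UTF-8, and `utf8_char_count()` is a hypothetical helper written just for this example:

```php
<?php
$a = "na\xC3\xAFve ";              // "naïve " in UTF-8 (7 bytes, 6 characters)
$b = "\xE6\x97\xA5\xE6\x9C\xAC";   // "日本" in UTF-8 (6 bytes, 2 characters)

// Concatenation just joins the bytes, so two valid UTF-8 strings
// give a valid UTF-8 result.
$joined = $a . $b;

// Hypothetical helper: count characters by counting every byte that is
// NOT a UTF-8 continuation byte (continuation bytes look like 10xxxxxx).
// Assumes well-formed UTF-8; it does no validation.
function utf8_char_count($s)
{
    $count = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$byteLength = strlen($joined);           // 13
$charLength = utf8_char_count($joined);  // 8
```

This is the same sort of ord() trick mentioned at the top of the thread, but used on a string whose encoding is already known rather than to guess at one.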

 If you want regular expressions, then the PCRE regexes have the "u" modifier
that treats the input as UTF-8.
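
For instance, a minimal sketch of the difference the modifier makes (sample string and variable names are my own):

```php
<?php
$utf8 = "caf\xC3\xA9";   // "café" encoded as UTF-8

// Without "u", "." matches one byte at a time.
$byteMatches = preg_match_all('/./s', $utf8, $m1);   // 5

// With "u", the subject is treated as UTF-8 and "." matches one character.
$charMatches = preg_match_all('/./su', $utf8, $m2);  // 4

// Splitting on the empty pattern with "u" yields one array entry per character.
$chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);

// The flip side: with "u", a subject that is not valid UTF-8 makes the
// match fail outright. A lone 0xE9 byte is ISO-8859-1, not UTF-8.
$invalid = preg_match('/./u', "\xE9");  // false
```

So the modifier only helps if the string really is UTF-8; it is not a way of detecting or repairing an unknown encoding.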

 So it all looks pretty well covered.

 Perl only recently (in 5.8) finished the transition to natively supporting
utf8 strings (a process that began a long time ago). Strings in Perl are now
either a series of bytes of undefined encoding (i.e. C or PHP-style strings),
or have a utf8 flag set indicating they're UTF-8 encoded, which the builtin
string functions are aware of and so return the correct results in terms of

 That's one step up from PHP, since strings carry around some metadata with
them on their encoding, at least if they're UTF-8.


Re: is any work being done to fix/improve PHP's string handling beyond 8 bits?

Thanks again for all the help. You summarized the problem I faced last
year well. I didn't know about the Multibyte String section of the PHP
manual. I'm sad that the extension is optional. I'll spend tonight reading
that section.
