Weird loadHTML behaviour

Hi all,

I'm in the process of setting up a PHP script that reads a HTML file,
does a character conversion and then displays the contents of a single
HTML tag as follows:

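        // Convert the Latin-1 file so that non-ASCII characters become HTML entities
        $str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
                'HTML-ENTITIES', 'ISO-8859-1');

        // Dump the converted markup for inspection
        file_put_contents ('dmp.htm', $str);

        // loadHTML is an instance method; calling it statically fails on newer PHP
        $dom = new DOMDocument ();
        $dom->loadHTML ($str);
        $elem = $dom->getElementsByTagName ('h5');
        if ($elem->length) {
                $n = $elem->item (0)->nodeValue;
                var_dump (bin2hex ($n));
        }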

What's interesting is that the source HTML file is properly ISO-8859-1
encoded (which the contents of "dmp.htm" verify). The trouble starts
when I retrieve the contents of the first <h5> tag, which has an umlaut
in it. The umlaut comes out garbled: what used to be a "Ü" (capital U
umlaut, ISO-8859-1 0xdc) has become "Ãœ" (0xc3 0x9c, as the var_dump
confirms). Two things surprise me: that the character changes at all,
and that the umlaut is not entity-encoded as HTML-ENTITIES would
suggest. I use PHP version 5.2.1 on a linux

Any thoughts?

  Cheers, Christoph

Re: Weird loadHTML behaviour

On May 9, 12:43 am, wrote:

After some :-) research, it turns out that the contents of the first
<h5> tag have actually been re-encoded as UTF-8, hence the strange byte
sequence (0xc3 0x9c is the UTF-8 encoding of "Ü"). This raises the
question of whether UTF-8 is the default encoding for strings returned
by the DOM package, even when the input was entity-encoded via
HTML-ENTITIES. Is this a bug in DOMDocument or a feature?
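
For anyone who runs into the same thing, here is a minimal sketch of the
behaviour and one possible workaround. As far as I can tell, the DOM
extension hands node values back as UTF-8 regardless of the input
encoding, so the value has to be converted back if ISO-8859-1 is needed.
The sample markup and variable names are made up for illustration; the
entity is written out by hand here, on the assumption that the
HTML-ENTITIES step above produces an equivalent entity. It needs the dom
and mbstring extensions.

```php
<?php
// Hypothetical input: the <h5> text with the umlaut already entity-encoded.
$html = '<html><body><h5>&Uuml;ber</h5></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// The parser decodes the entity; the DOM returns the text as UTF-8.
$utf8 = $dom->getElementsByTagName('h5')->item(0)->nodeValue;
echo bin2hex($utf8), "\n";    // begins with c39c (UTF-8 for U+00DC)

// Convert back to ISO-8859-1 if the rest of the script expects Latin-1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n";  // begins with dc
```

If the page that displays the value is itself served as UTF-8, the
conversion back is unnecessary and the node value can be printed as is.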

  Cheers, Christoph
