Screenscraping UTF-8 characters problem

Hi! I'm having some problems correctly screenscraping and outputting
e.g. Chinese characters from a Google translator search result. The
output is always a garbled mess, not Chinese characters. German for
instance works fine. Thanks for any hints...!!

Some relevant parts of the PHP5 code:

header ('Content-type: text/html; charset=utf-8');
showResult( getTranslation('bird flu', 'zh-CN'), 'Chinese' );

function getTranslation($q, $lang)
{
    $out = '';
    // the Google page is supposed to be UTF-8 too:
    $in = getFileText( " |" . urlencode($lang) . "&text=" . urlencode($q) );
    preg_match('/<div id=result_box dir=ltr>(.*?)<\/div>/', $in, $out);

    $translation = $out[1]; // garbled!
    $translation = trim($translation);
    $translation = utf8_encode($translation); // garbled with or without this line...
    return $translation;
}


Re: Screenscraping UTF-8 characters problem

Philipp Lenssen wrote:

It seems to me that what you need are the multibyte functions. You should
replace the preg_match with the multibyte-compatible mb_ereg_match:
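
A minimal sketch of that suggestion, under a few assumptions: the sample string in $in is a made-up stand-in for the fetched page, and mb_ereg() is used instead of mb_ereg_match() because mb_ereg_match() only reports whether the pattern matched, while mb_ereg() also hands back the captured group. preg_match() with the /u modifier is shown next to it as an alternative that treats pattern and subject as UTF-8 without the extra extension.

```php
<?php
// Hypothetical stand-in for the fetched page; the real Google markup may differ.
$in = '<html><div id=result_box dir=ltr> 禽流感 </div></html>';

// Tell the multibyte regex engine that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

// Variant 1: mb_ereg() fills $regs with the captured groups.
$translation = '';
$regs = array();
if (mb_ereg('<div id=result_box dir=ltr>(.*?)</div>', $in, $regs)) {
    $translation = trim($regs[1]);
}

// Variant 2: preg_match() with the /u modifier (UTF-8 mode).
$viaPreg = '';
$out = array();
if (preg_match('/<div id=result_box dir=ltr>(.*?)<\/div>/u', $in, $out)) {
    $viaPreg = trim($out[1]);
}
```

Both variants only help if $in really is UTF-8. If the bytes arrive in another encoding, the match succeeds but the result still looks garbled when it is sent with a UTF-8 Content-type header.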

Note that the mb_* functions aren't part of the default installation. You
need to enable the mbstring extension first; check the installation instructions:
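
Whether the extension is available can be checked at runtime with extension_loaded('mbstring'). The sketch below goes one step further and is only a guess at the underlying cause: it assumes the page may be served in a Chinese legacy encoding (EUC-CN is used here purely as an example, the thread does not confirm it) and converts to UTF-8 only when the bytes are not already valid UTF-8. The helper name toUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: returns $text as UTF-8, converting from $assumedSource
// only when the bytes are not already valid UTF-8.
function toUtf8($text, $assumedSource)
{
    if (!extension_loaded('mbstring')) {
        return $text; // mbstring missing: leave the bytes untouched
    }
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, nothing to do
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

// Simulate a page body that arrived as EUC-CN (an assumption for illustration).
$raw = mb_convert_encoding('禽流感', 'EUC-CN', 'UTF-8');
$fixed = toUtf8($raw, 'EUC-CN');
```

This also hints at why utf8_encode() made no difference: it always treats its input as ISO-8859-1, so it cannot produce correct output from text that is already UTF-8 or that is in a Chinese encoding. The validity check is a heuristic, so where the response declares its charset, that declared value is the safer choice for $assumedSource.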

"I'm not a bad person, but apples are my livelihood." -Perttu Sirviö | Gedoon-S @ IRCnet | rot13(xvzzb@bhgbyrzcv.arg)
