Google and Russian text

I'd like to learn how Google handles non-ASCII text, especially Russian.
As you may know, Cyrillic has several possible encodings (KOI*, CP-1251,
UTF-8, ...). Is everything converted to one encoding before indexing? And as
a Google user, which encoding should I use for best results?

Re: Google and Russian text

noop wrote:

I too would be interested in advice on how to do Russian pages.  I have one
page successfully indexed and cached, marked up like the snippet below, but it
loads rather slowly because every character is expanded to seven characters.
There must be a better method.

<meta http-equiv="Content-Language" content="ru">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>&#1057;&#1040;&#1052;&#1052;&#1048;&#1058; etc </title>

I also tried a Babel Fish translation of several paragraphs.  A genuine
Russian speaker said the result was completely incomprehensible, so I
deleted it. Not much success in that area!

You can input things like &#1057;&#1040;&#1052;&#1052;&#1048;&#1058; as
search terms into Google and it works.
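
To put numbers on the seven-characters-per-letter expansion, here is a rough Python sketch (Python is my choice for illustration, not something used on the page). It rebuilds the numeric references from the title above and compares their size with the same word stored in UTF-8, Windows-1251 and KOI8-R. The variable names are made up for the example.

```python
# The word spelled by &#1057;&#1040;&#1052;&#1052;&#1048;&#1058;
word = "САММИТ"

# Numeric character references: each letter becomes "&#NNNN;",
# which is 7 ASCII characters for letters in the Cyrillic block.
ncr = "".join(f"&#{ord(ch)};" for ch in word)

# Size in bytes of the same six letters in each representation.
sizes = {
    "numeric references": len(ncr.encode("ascii")),
    "utf-8": len(word.encode("utf-8")),
    "windows-1251": len(word.encode("cp1251")),
    "koi8-r": len(word.encode("koi8_r")),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For this six-letter word the references take 42 bytes, UTF-8 takes 12, and either single-byte Cyrillic encoding takes 6.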

Best regards, Eric.

Re: Google and Russian text



What about UTF-8? I am using it for Polish text and it works OK.
The main drawback is that some programs (like Windows Notepad) add
three bytes at the beginning of the file (the UTF-8 byte order mark),
and they can be a problem sometimes.

Perhaps using a better editor will help.
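
The three extra characters at the start of the file are the UTF-8 byte order mark (bytes EF BB BF). If the file is processed by a script rather than an editor, one possible way to cope is sketched below in Python; the sample sentence and variable names are only illustrative.

```python
import codecs

text = "Zażółć gęślą jaźń"  # sample Polish text

# What a BOM-writing editor saves: the mark followed by the UTF-8 bytes.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")

# A plain "utf-8" decode keeps the mark as U+FEFF at the start of the string.
naive = with_bom.decode("utf-8")

# "utf-8-sig" drops the mark if it is present and changes nothing otherwise.
clean = with_bom.decode("utf-8-sig")
also_clean = text.encode("utf-8").decode("utf-8-sig")

print(with_bom[:3])        # the three leading bytes
print(naive == text)       # the naive decode still carries the mark
print(clean == text)
```

Decoding with `utf-8-sig` is safe for files with or without the mark, so it can be used without knowing which editor produced the file.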

-- - chemical calculators for labs and education
BATE - Base Acid Titration and Equilibria
program for pH calculations
CASC - Concentration and Solution Calculator
program for solution preparation and concentration conversions

Re: Google and Russian text

Sure, 7 characters is a bad idea :-)

Russian is NOT different from German or French or Polish -
each uses its own encoding
(Windows-1251 for Russian, Windows-1250 for Polish,
Windows-1252 for German and French),
where a national letter is represented as 1 byte (8 bits).
Again, same thing for each European language, including Russian.
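
A small Python sketch of the one-byte-per-letter point, using one sample letter per language (the letters are my own picks). It also shows that one and the same byte reads as a different letter under each code page, which is why the page has to say which encoding it uses.

```python
# One national letter per code page (illustrative choices).
samples = {
    "cp1251": "ж",  # Russian, Windows-1251
    "cp1250": "ł",  # Polish, Windows-1250
    "cp1252": "ß",  # German, Windows-1252
}

# Each letter occupies exactly one byte in its own encoding.
lengths = {enc: len(ch.encode(enc)) for enc, ch in samples.items()}

# The byte for Russian "ж" in Windows-1251 ...
byte = "ж".encode("cp1251")

# ... is a valid but different letter under the other two code pages.
readings = {enc: byte.decode(enc) for enc in samples}

print(lengths)
print(byte, readings)
```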

So people create Russian HTML _exactly_ the same way people
create German or Polish pages - they type REAL national letters and
not 7-character items:
 - a person who creates a French page switches the keyboard to "FR" and types
French text
 - a person who creates a Russian page switches the keyboard to "RU" and types
Russian text
 - same for Polish

No difference at all...

Please see the section of my site called
"For developers: how to create correct Russian Web page" -
but it's THE SAME approach for any other language - be it
Polish or French or German.

Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
Russian On-screen Keyboard:

Re: Google and Russian text


noop wrote:


That never ever worked :-) which is why no one uses it -
you can read comp.infosystems.www.authoring.html
about that.

Google converts everything (and not just for Russian) to UTF-8 -
all Google pages are in UTF-8 - easy to see - look at the
Encoding menu of your browser and see what encoding is selected.
But it has _nothing_ to do with what _you_ should use for best
results.

It really does not matter - Google does not care, say,
what encoding a Polish page is in - ISO-8859-2 or Windows-1250 -
or what encoding a Russian page is in - KOI8-R or Windows-1251 -
or what encoding a Japanese page is in - SJIS or EUC.

Google just cares that your site DOES specify its encoding!
Either via the HTTP header or via META ... charset= in the page itself
(in its HEAD section).
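
As a rough illustration of what "specifying the encoding" gives a program to work with, here is a Python sketch that pulls the charset out of a Content-Type value. The same value format appears in the HTTP header and in the content attribute of the META tag. The helper name `declared_charset` is hypothetical, and this is not a description of how Google itself does it.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the name and returns None
    # when no charset parameter is present.
    return msg.get_content_charset()


with_charset = declared_charset("text/html; charset=windows-1251")
without_charset = declared_charset("text/html")

print(with_charset)
print(without_charset)
```

When the value carries no charset, the function returns None, which is the case where a reader of the page is left to guess.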

Then Google will be able to convert it _correctly_ to UTF-8 -
be it a KOI8-R ---> UTF-8 conversion or
a Windows-1251 ---> UTF-8 conversion.
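
The conversion itself is mechanical once the source encoding is known. A minimal Python sketch, assuming the declared charset is simply passed in by name (the function `to_utf8` and the sample word are made up for the example):

```python
def to_utf8(raw: bytes, declared: str) -> bytes:
    """Decode bytes using the declared charset and re-encode them as UTF-8."""
    return raw.decode(declared).encode("utf-8")


word = "Привет"

# The same word as it would be stored by a KOI8-R site and a Windows-1251 site.
koi_bytes = word.encode("koi8_r")
win_bytes = word.encode("cp1251")

# With the right declaration both end up as identical UTF-8.
from_koi = to_utf8(koi_bytes, "koi8-r")
from_win = to_utf8(win_bytes, "windows-1251")

# With the wrong declaration the decode still "succeeds" but yields other letters.
garbled = koi_bytes.decode("cp1251")

print(from_koi == from_win == word.encode("utf-8"))
print(garbled == word)
```

The last two lines show why the declaration matters more than the choice of encoding: a wrong label does not raise an error, it just produces wrong text.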

Same goes for any other language, be it French or Japanese :)

As for Russian, most Russian sites nowadays - 99% - are
made in Windows-1251; again, Google has nothing to do with that.

Please see the
"For developers: how to make _correct_ Cyrillic (Russian) Web pages"
section on my site.

If you want to learn how to _post_ a Newsgroup message
with Russian text using Google Groups or a real News program
such as Outlook Express or Mozilla News/Thunderbird,
then see
'Russian in Browsers/Mail/News' section of my site.

Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
Russian On-screen Keyboard:

Re: Google and Russian text

Thanks for the info.
