Chinese text in HTML page and Byte-Order Mark

I've noticed that some pages use <span lang="zh" xml:lang="zh"> to embed  
Chinese text, but even simply embedding Chinese text in a UTF-8 HTML  
page seems to work fine as well, for instance:

Why then would this language declaration be necessary?

Another question: the above page validates without errors, but I get the
following warning:

Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to  
cause problems for some text editors and older browsers. You may want to  
consider avoiding its use until it is better supported.

How to remove the Byte-Order Mark?

Alfred Molon - Photos of Asia, Africa and Europe

Re: Chinese text in HTML page and Byte-Order Mark


Open and save it with a decent text editor that doesn't save the BOM.  
TextWrangler, f'rinstance.


"That excessive bail ought not to be required, nor excessive fines imposed,
nor cruel and unusual punishments inflicted"  --  Bill of Rights 1689

Re: Chinese text in HTML page and Byte-Order Mark


Look in your editor. In mine, you can choose:  



Re: Chinese text in HTML page and Byte-Order Mark



I see. I just noticed that my editor (Notepad++) also has that option.
What is the BOM needed for?

Alfred Molon - Photos of Asia, Africa and Europe

Re: Chinese text in HTML page and Byte-Order Mark


The lang attribute lets the reader (either human or mechanical) know
what language a piece of text is written in.  For example, a voice-based
browser needs to pronounce

  <h1 lang=en>Les Chats</h1>

and

  <h1 lang=fr>Les Chats</h1>

quite differently.
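
As a rough illustration of how a mechanical reader might pick up the
declared language, here is a minimal Python sketch using the standard
html.parser module. The names LangCollector and languages_of are made
up for this example, and the handling is deliberately simplified (void
elements such as <br> are not tracked correctly), so treat it as a
sketch rather than how any real browser or screen reader does it.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Hypothetical helper: collect (language, text) pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []   # language in effect for each open element
        self.found = []   # (lang or None, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if not lang and self.stack:
            # No lang on this element: inherit from the parent element.
            lang = self.stack[-1]
        self.stack.append(lang)

    def handle_endtag(self, tag):
        # Simplification: assumes well-nested markup without void elements.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((self.stack[-1] if self.stack else None, text))


def languages_of(html):
    """Return the (language, text) pairs found in an HTML fragment."""
    parser = LangCollector()
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    sample = "<h1 lang=en>Les Chats</h1><h1 lang=fr>Les Chats</h1>"
    for lang, text in languages_of(sample):
        print(lang, text)
```

A speech synthesizer could then choose a voice per pair; text with no
language in effect comes back with None, which is where guessing would
have to take over.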


Re: Chinese text in HTML page and Byte-Order Mark

2013-05-28 0:16, Alfred Molon wrote:


Yes, pages work without the lang attribute, but using it may have some  


According to accessibility guidelines, the language of text should be
declared, to help e.g. in speech synthesis. This applies to all texts,
including English-language texts. It would apply especially strongly to
Chinese texts, since the way "Chinese" characters (characters of Chinese
origin, used for writing Chinese, Japanese, and other languages) are
pronounced may essentially depend on language. But this is largely just
theory: speech synthesizers will guess the language or use a fixed
language or use the language selected by the user.

There are other reasons for declaring language, see
but I will just illustrate one of them:

When I view a page containing Chinese characters, on Firefox, those  
characters appear in my system in the MS PGothic font, when the page  
does not have any font settings. If the characters are inside an element  
to which lang=zh applies, they appear in the SimSun font instead. And if  
the attribute is lang=zh-TW or lang=zh-Hant, they appear in PMingLiu.  
The reason is that the attribute makes the browser apply different  
default fonts.

Nowadays, few authors leave fonts unspecified. The main reason is  
probably that most browsers have Times New Roman as the default font,  
and it is common knowledge, or prejudice, that it is unsuitable for web  
pages. So authors declare Arial, because someone told them it's cool, or
Verdana, since someone said it's even cooler. And because those fonts  
aren't really cool at all in normal font size, authors too often set  
font size to something barely legible, but I digress.

On the page you mentioned, the font family declaration in CSS is  
font-family: Verdana, Arial, Helvetica, sans-serif. Since none of the  
specific font families listed contains Chinese characters, the browser  
will use its definition for sans-serif and, if it does not contain them  
either, pick them up from some of the fonts in the system, using its own  
internal rules.

The moral is that when using Chinese characters, you should take them
into account when writing your font-family rule. This is not obligatory,  
but it's the right way to ensure (as far as possible) that the font used  
for them will be acceptable and will stylistically match the font used  
in the text otherwise.

And when you do so, the lang attribute does not matter in font selection  
- but it is advisable to use it for other reasons.


That's grossly outdated information, probably retained just because some  
people think there *might* be some browser in use that has problems with  
BOM. There isn't. Hasn't been for many years. Except perhaps in a  
museum, where Netscape 2 and IE 3 can be seen.

In the modern world, BOM is *good* even in UTF-8. It acts as a  
practically certain way of indicating that the page is UTF-8 encoded,  
even if HTTP headers are missing (e.g., because the page has been saved  

You may have problems if you have a BOM at the start of a PHP file. But  
that's something completely different.
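
To make the point about the BOM as an encoding indicator concrete: in
UTF-8 the BOM is the three bytes EF BB BF at the very start of the file.
The Python sketch below checks for them and decodes accordingly. The
function names and the windows-1252 fallback are assumptions made for
this example only; a real browser also consults HTTP headers and meta
tags before falling back to anything.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_html_bytes(data: bytes, fallback: str = "windows-1252") -> str:
    """Decode page bytes, trusting a UTF-8 BOM when one is present.

    The fallback encoding is a placeholder assumption, not a rule.
    """
    if has_utf8_bom(data):
        # "utf-8-sig" decodes as UTF-8 and drops the leading BOM.
        return data.decode("utf-8-sig")
    return data.decode(fallback, errors="replace")


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "中文".encode("utf-8")
    print(has_utf8_bom(with_bom))
    print(decode_html_bytes(with_bom))
```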


You could remove it by using an editor that can save in the "UTF-8, no  
BOM" format. But there is no reason to remove it.
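
If you do want to strip it without opening an editor (for the PHP case,
say), a few lines of Python will do. This is only a sketch: the function
name strip_utf8_bom is invented here, and it rewrites the file in place,
so try it on a copy first.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path) -> bool:
    """Remove a leading UTF-8 BOM from the file, in place.

    Returns True if a BOM was found and removed, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do; file is left untouched
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, "BOM removed" if strip_utf8_bom(name) else "no BOM")
```

Only the first three bytes are touched; the rest of the file, including
the Chinese text, is written back byte for byte.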

