Posted by Patrick Van Esch on March 13, 2005, 12:29 pm
Hello,
I have the following problem of principle:
when writing HTML pages containing ancient Greek, there are two
possibilities. One is to write the Unicode characters directly
(encoded as two bytes each) into the HTML source, and to save this
source not as ASCII text but as a Unicode text file (using 16 bits per
character, also for the Western ASCII characters, which are then
encoded as 0x00XX, with XX the ASCII code). The other is to write a
pure ASCII HTML source, where the Greek characters are all encoded as
&#XXXX; references. I even have a small computer program that converts
the former into the latter.
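A conversion of the kind described above can be sketched in a few lines of Python. This is not the poster's program, only a minimal illustration that relies on Python's standard "xmlcharrefreplace" error handler; the function name to_ascii_html and the sample word are made up for the example.

```python
def to_ascii_html(text: str) -> str:
    """Return text with every non-ASCII character replaced by a
    decimal numeric character reference such as &#955;."""
    # 'xmlcharrefreplace' is a standard codec error handler: characters
    # that cannot be encoded as ASCII become &#NNNN; references.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical sample: the Greek word "logos" inside a paragraph element.
sample = "<p>\u03bb\u03cc\u03b3\u03bf\u03c2</p>"
converted = to_ascii_html(sample)
print(converted)  # <p>&#955;&#972;&#947;&#959;&#962;</p>
```

The markup itself is untouched because it is already ASCII; only the Greek letters are rewritten.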
The funny thing is that a browser such as Netscape 7.2 seems to have
no problem accepting a Unicode-encoded source file and displays
everything all right.
Now, the discussion I'm having with other people is the following:
since it is easier to type the Unicode HTML source directly, is this, in
general, an acceptable thing to do, or is it (that's my viewpoint) a
totally unethical thing to do that simply works because of some
sloppiness in Netscape, while HTML source code was never intended to
be anything other than ASCII text in the first place? I would like
them to see that I should run their source files through my program,
which converts a Unicode file into an ASCII file with the true Unicode
characters (in casu ancient Greek symbols) replaced by &#XXXX; ASCII
character sequences; their point of view is that this is bullshit,
and that, given that it works in Netscape, it is a correct thing
to do.
So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?
Thanks for any learned enlightenment,
Patrick.
Posted by phil_gg04 on March 13, 2005, 2:00 pm
> Must HTML source code be an ASCII code, or is it now allowed to be
> UNICODE encoded text ?
Your web server specifies the character set in the headers of the HTTP
response that precede the actual HTML. For example, if it sends:
Content-Type: text/html; charset=iso-8859-1
then it is latin 1, whereas if it sends
Content-Type: text/html; charset=UTF-16
then it is 16-bit unicode.
So if you set up your web server appropriately you can certainly send
the greek in Unicode, and browsers will understand it.
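To make the relationship between the header and the body concrete, here is a minimal sketch that assembles a raw HTTP response by hand. The function name build_http_response and the sample markup are assumptions for illustration; a real web server sets this header through its own configuration. The point is only that the charset named in Content-Type must match the encoding actually used for the body bytes.

```python
def build_http_response(html_text: str, charset: str) -> bytes:
    """Build a raw HTTP response whose Content-Type header names the
    same charset that is used to encode the body."""
    body = html_text.encode(charset)
    head = (
        "HTTP/1.1 200 OK\r\n"
        f"Content-Type: text/html; charset={charset}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")  # header lines themselves are plain ASCII
    return head + body


page = "<p>\u03bb\u03cc\u03b3\u03bf\u03c2</p>"  # hypothetical sample markup
response = build_http_response(page, "utf-8")

# Split the header block from the body at the first blank line.
head, _, body = response.partition(b"\r\n\r\n")
print(head.decode("ascii"))
print(body.decode("utf-8") == page)
```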
If the server doesn't specify a character set you may be able to use a
META tag at the start of the document, but generally this will only
work to distinguish between character sets like UTF-8 and iso-8859-1,
where the "ASCII" characters overlap; a META tag will not help if you
are sending UTF-16 (I think).
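The overlap argument can be checked directly: a META tag consists only of ASCII characters, so its bytes are identical in any ASCII-compatible encoding but different in UTF-16. The sketch below only compares byte sequences; it does not model what any particular browser does when sniffing a document, and the helper name is_ascii_transparent is made up for the example.

```python
META = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'


def is_ascii_transparent(charset: str) -> bool:
    """True if ASCII-only text encodes to the same bytes as plain ASCII,
    so a reader assuming ASCII could still find the META tag."""
    return META.encode(charset) == META.encode("ascii")


for name in ("utf-8", "iso-8859-1", "utf-16-le"):
    print(name, is_ascii_transparent(name))
# utf-8 True
# iso-8859-1 True
# utf-16-le False
```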
Do read http://www.w3.org/TR/REC-html40/charset.html
--Phil.
Posted by Patrick Van Esch on March 14, 2005, 2:09 am
Thanks already for all the answers here, they are very enlightening!
I'm beginning to see a bit more clearly in this character jungle.
Patrick.
Posted by C A Upsdell on March 13, 2005, 3:33 pm
Patrick Van Esch wrote:
> So, what should be the outcome of this (academic) discussion ?
> Must HTML source code be an ASCII code, or is it now allowed to be
> UNICODE encoded text ?
HTML uses unicode.
Posted by Alan J. Flavell on March 13, 2005, 9:43 pm
On Sun, 13 Mar 2005, C A Upsdell wrote:
> Patrick Van Esch wrote:
> > So, what should be the outcome of this (academic) discussion ?
> > Must HTML source code be an ASCII code, or is it now allowed to be
> > UNICODE encoded text ?
>
> HTML uses unicode.
Anyone who *understood* what that cryptic answer meant, would not have
needed to ask the question in the first place!!!
I see that Henri Sivonen has offered a more constructive answer.
I might add that writing &#number; notations with ASCII characters
certainly produces a rather bullet-proof source, that can be calmly
passed through cross-platform transfers and so forth, in ways that
might result in trashed utf-8-encoded source. But frankly, if you
have a means to author documents using utf-8 encoding, and a proven
way to upload them to the server and serve them out properly, then
there's really nothing to gain from resorting to &#number; notations
in ASCII instead.
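That the two forms carry the same text can be illustrated with a small round trip. This is only a sketch with a made-up sample word: it encodes the same string once as ASCII with numeric references and once as utf-8, then decodes both and compares. Python's html.unescape stands in here for what an HTML parser does with &#number; notations.

```python
import html

text = "\u03bb\u03cc\u03b3\u03bf\u03c2"  # hypothetical sample: Greek "logos"

# Form 1: pure ASCII source with &#number; notations.
as_refs = text.encode("ascii", errors="xmlcharrefreplace")
# Form 2: the same characters stored directly as utf-8.
as_utf8 = text.encode("utf-8")

# Both decode back to the same characters.
from_refs = html.unescape(as_refs.decode("ascii"))
from_utf8 = as_utf8.decode("utf-8")
print(from_refs == from_utf8 == text)  # True

# The ASCII form is considerably longer for Greek text.
print(len(as_refs), len(as_utf8))  # 30 10
```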
My offering on this topic would be the charset checklist -
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
which offers a number of scenarios. But as time goes by, the earlier
techniques (coding in ascii and using &-notations) become less and
less /necessary/ to use, even though they continue to be entirely
/valid/ if you have some other reason to want to use them.
Bottom line: if the questioner's authoring software supports it, then
follow scenario 7 in the checklist - actual utf-8 coded characters.
As Henri says, Windows's internal representation of Unicode uses utf-16
(little-endian, if I'm not mistaken), but for use on the WWW I
would definitely prefer utf-8, which has been in use for quite a while
(it's even supported by that old dog Netscape 4.*, at least to a
degree). But read also the current "which charset" thread for remarks
about forms submission.
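For reference, the byte-level difference between these encodings can be seen with a single character. The sketch below uses Python's explicit "utf-16-le" and "utf-16-be" codecs (which write no byte order mark) and an arbitrarily chosen Greek alpha; it is an illustration only, not a statement about what any given editor writes to disk.

```python
import codecs

alpha = "\u03b1"  # Greek small letter alpha, U+03B1

print(alpha.encode("utf-16-le").hex())  # b103  (low byte first)
print(alpha.encode("utf-16-be").hex())  # 03b1  (high byte first)
print(alpha.encode("utf-8").hex())      # ceb1

# An ASCII letter takes two bytes in utf-16 but one in utf-8.
print("A".encode("utf-16-le").hex())    # 4100
print("A".encode("utf-8").hex())        # 41

# A utf-16 file usually starts with a byte order mark so that a reader
# can tell the two byte orders apart.
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())  # fffe feff
```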