
why does unicode.org offer many scripts if unicode is a single code for all characters?

comp.infosystems.www.authoring.html
Posted by lkrubner on May 27, 2005, 6:03 pm



Look at this page:

http://www.unicode.org/charts/

It almost looks as if there is a different encoding for each language.
Is that right?

Suppose I want to create weblog software that allows people to type
posts into a textarea on a form, and press a button and, voilà, their
post is live on the web? If I don't know what language they are
speaking, what encoding would I pick for the charset header of the
form?

I was trying to output some text as an RSS feed with a UTF-8 header and
I got an error by the RSS reader which said the feed was invalid
because it contained characters that were not in the UTF-8 standard.
I'd like to create a little script that went through my input and
stripped out the non-UTF-8 characters. But how can I do this if I can't
find a list of the UTF-8 characters?


Posted by Tim on May 28, 2005, 1:05 pm


On 27 May 2005 17:03:06 -0700,
lkrubner@geocities.com posted:

> Look at this page:
>
> http://www.unicode.org/charts/
>
> It almost looks as if there is a different encoding for each language.
> Is that right?

No, one encoding for the lot, but with different sections for each
language. Imagine it this way (with a terribly fake example).

Characters 1 - 255 had English language characters,
Characters 256 - 500 provided French language characters,
and so on.

That's the general idea.
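
To see the real numbers behind the fake example, here is a minimal Python sketch that prints the Unicode code point and official name of a few characters. The sample characters are picked purely for illustration; the real charts group characters by script rather than by language, so treat the "sections" as rough neighbourhoods, not per-language tables.

```python
import unicodedata

# Sample characters chosen only for illustration (written as escapes so the
# source file's own encoding does not matter).
SAMPLES = ["A", "\u00e9", "\u0416", "\u3042", "\u20ac"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in SAMPLES:
    print(describe(ch))
```

Every character gets exactly one number, whichever chart it happens to be drawn on.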

> Suppose I want to create weblog software that allows people to type
> posts into a textarea on a form, and press a button and, voilà, their
> post is live on the web? If I don't know what language they are
> speaking, what encoding would I pick for the charset header of the
> form?

Yes, nasty. It should be determined between web browser and web server,
though lots of people have both configured badly.

This is really a server issue, not HTML. So you're going to get more
useful answers elsewhere.

> I was trying to output some text as an RSS feed with a UTF-8 header and
> I got an error by the RSS reader which said the feed was invalid
> because it contained characters that were not in the UTF-8 standard.
> I'd like to create a little script that went through my input and
> stripped out the non-UTF-8 characters. But how can I do this if I can't
> find a list of the UTF-8 characters?

Look again at that unicode site, that lists what's covered in Unicode, and
somewhere you should also find information about UTF-8 encoding of the
characters. That's two separate things, by the way.

Try reading: <http://www.unicode.org/faq/utf_bom.html>
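
A small Python sketch of the "two separate things" point: one Unicode character has one code point, but its bytes depend on which encoding is used. The euro sign is just an example character; the helper name `encodings` is made up for this illustration.

```python
def encodings(ch: str) -> dict:
    """Map a few encoding names to the hex bytes they produce for ch."""
    names = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: ch.encode(name).hex(" ") for name in names}


euro = "\u20ac"  # EURO SIGN, a single Unicode character
print(f"code point: U+{ord(euro):04X}")
for name, hex_bytes in encodings(euro).items():
    print(f"{name:10} {hex_bytes}")
```

The code point stays U+20AC throughout; only the byte sequence changes (three bytes in UTF-8, two in UTF-16, four in UTF-32).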

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.

Posted by lkrubner on May 28, 2005, 11:37 am



> > Look at this page:
> >
> > http://www.unicode.org/charts/
> >
> > It almost looks as if there is a different encoding for each language.
> > Is that right?
>
> No, one encoding for the lot, but with different sections for each
> language. Imagine it this way (with a terribly fake example).
>
> Characters 1 - 255 had English language characters,
> Characters 256 - 500 provided French language characters,
> and so on.

If you're trying to offer support for English and European languages,
you would concentrate on the latin implementations?



Posted by Tim on May 29, 2005, 3:07 am


Unattributed authors wrote:

>>> http://www.unicode.org/charts/
>>>
>>> It almost looks as if there is a different encoding for each language.
>>> Is that right?

Tim wrote:

>> No, one encoding for the lot, but with different sections for each
>> language. Imagine it this way (with a terribly fake example).
>>
>> Characters 1 - 255 had English language characters, Characters 256 -
>> 500 provided French language characters, and so on.

On Sat, 28 May 2005 10:37:04 -0700, lkrubner wrote:

> If you're trying to offer support for English and European languages,
> you would concentrate on the latin implementations?

I'm not so sure that you're approaching this in the right manner.

a. There's one encoding for the lot. It's Unicode: you get everything
in the one package. Data comes in, the same data goes out, no need to
translate anything. That's the whole point of it: the end of all
the headaches of having to support a plethora of different encodings.

b. Writers in one language often do use something foreign, and you wouldn't
want to remove something that should have been left alone. How do you
tell if some strange character is an error, or was deliberately inserted
for some particular purpose?

I'd only be contemplating removing things that I knew were security
breaches, or caused some sort of non-security sort of problem.
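
As one hedged example of "things known to cause a problem": an RSS feed is XML, and XML 1.0 forbids most control characters whatever the encoding. The Python sketch below removes only those and leaves every other character alone. It assumes the trouble in the feed really is XML-illegal characters, which the thread does not confirm; the function name is invented for the illustration.

```python
import re

# Characters allowed by the XML 1.0 "Char" production: tab, LF, CR,
# U+0020..U+D7FF, U+E000..U+FFFD and U+10000..U+10FFFF.
# The pattern matches anything outside those ranges.
_XML_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text: str) -> str:
    """Drop characters that XML 1.0 does not permit; keep everything else."""
    return _XML_ILLEGAL.sub("", text)


print(repr(strip_xml_illegal("caf\u00e9\x00 and \x0bmore\n")))
```

Accented letters, non-Latin scripts and ordinary line breaks pass through untouched, so nothing a writer put there on purpose is lost.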

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.


Posted by Jukka K. Korpela on May 28, 2005, 1:39 pm


lkrubner@geocities.com wrote:

> http://www.unicode.org/charts/
>
> It almost looks as if there is a different encoding for each
> language. Is that right?

No, the impression is all wrong. The charts define code numbers for
characters; they do not fix the transfer encoding. Moreover, although
some blocks have been designed for some particular language, most
aren't.

> Suppose I want to create weblog software that allows people to type
> posts into a textarea on a form, and press a button and, voilà,
> their post is live on the web? If I don't know what language they
> are speaking, what encoding would I pick for the charset header of
> the form?

A form has no charset header of its own. The character encoding is
defined by the encoding of the page, in practice. The suitable encoding
in your situation is UTF-8.

> I was trying to output some text as an RSS feed with a UTF-8 header
> and I got an error by the RSS reader which said the feed was
> invalid because it contained characters that were not in the UTF-8
> standard.

Well, _was_ it? Does the routine that prints the data actually print
character data as UTF-8 encoded?
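
One way to answer that question is to look at the bytes the routine actually emits. The Python sketch below checks whether a byte string is well-formed UTF-8. The two sample inputs are assumptions for illustration: text that was stored as ISO-8859-1 but labelled UTF-8 is one common cause of this kind of feed error, though not necessarily the cause here.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (strict mode)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


good = "caf\u00e9".encode("utf-8")    # b'caf\xc3\xa9'
bad = "caf\u00e9".encode("latin-1")   # b'caf\xe9': a lone 0xE9 byte

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
```

If the check fails, the fix is to convert the data from its real encoding to UTF-8, not to delete the offending bytes.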

> I'd like to create a little script that went through my
> input and stripped out the non-UTF-8 characters.

Why? Throwing away all UTF-8 characters would mean throwing away all
characters. Do you really want to pass only noncharacters? :-)

> But how can I do
> this if I can't find a list of the UTF-8 characters?

I'm afraid your problem is mostly outside the scope of this group.
Finding a list of UTF-8 characters would surely not solve your problem
(unfortunately, since finding the list is easy: UTF-8 characters are
the same as Unicode characters, which are listed at the Unicode Web
site).

Basically, use UTF-8, Luke. Declare it, create your files in UTF-8, and
convert all data to it when needed. Regarding input data, do some
checks on it instead of just assuming it's UTF-8; for example, include
a hidden field with some special characters in its value, and make your
program check that those characters are correctly included in the form
data.
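
A rough Python sketch of the hidden-field check, under several assumptions: the sentinel value, the list of candidate encodings and the function name are all invented here, and the function expects the raw bytes of the hidden field after URL decoding, which a given web framework may or may not expose.

```python
from typing import Optional

# Hypothetical hidden-field value: e-acute followed by the euro sign.
SENTINEL = "\u00e9\u20ac"

# Assumed candidates, tried in order; UTF-8 first because it is what we asked for.
CANDIDATES = ("utf-8", "windows-1252")


def detect_form_encoding(raw_sentinel: bytes) -> Optional[str]:
    """Guess the submission's encoding from the hidden field's raw bytes."""
    for name in CANDIDATES:
        try:
            if raw_sentinel.decode(name) == SENTINEL:
                return name
        except UnicodeDecodeError:
            continue
    return None  # sentinel was mangled; treat the submission as suspect


print(detect_form_encoding(SENTINEL.encode("utf-8")))
print(detect_form_encoding(SENTINEL.encode("windows-1252")))
print(detect_form_encoding(b"??"))
```

Once the encoding is known, the rest of the form data can be decoded with it and stored as UTF-8; a `None` result means the characters were lost before they reached the server.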

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


