
Welsh language - ISO-8859-1 or Unicode ?

comp.infosystems.www.authoring.html
Subject Author Date
Welsh language - ISO-8859-1 or Unicode ? Simon 06-24-2008
Posted by Geoff Berrow on June 25, 2008, 1:41 pm
following:

>Presumably your newsreader thinks it needs a separate font to find
>suitable glyphs for these characters (mine does too).

Yup, just get a load of question marks in Agent.

--
Geoff Berrow 0110001001101100010000000110
001101101011011001000110111101100111001011
100110001101101111001011100111010101101011

Posted by Hendrik Maryns on June 26, 2008, 5:43 pm
On 25-06-08 18:01, Ben Bacarisse wrote as follows:
>
>> Harlan Messinger wrote:
>>> Blinky the Shark wrote:
>>>> Holy crap. I'm looking at two of your posts, and both in the body and in
>>>> the article's line in the headers pane, your name is not in the font I
>>>> have configured. And it's a *different* not-configured-by-me font in the
>>>> body than in the headers pane.
>>> I noticed the same thing, in Thunderbird.
>> His FROM line reads
>>
>> From: =?UTF-8?B?77yh772O772E772S772F772B772T44CA77yw772S772J772M772P772Q?=
>>
>> I don't know what to make of this.
>
> If I cut and paste to my utf-8-dump program:
>
> $ utf-8-dump -f '[%u] %n\n'
> Ａｎｄｒｅａｓ　
> [U+FF21] FULLWIDTH LATIN CAPITAL LETTER A
> [U+FF4E] FULLWIDTH LATIN SMALL LETTER N
> [U+FF44] FULLWIDTH LATIN SMALL LETTER D
> [U+FF52] FULLWIDTH LATIN SMALL LETTER R
> [U+FF45] FULLWIDTH LATIN SMALL LETTER E
> [U+FF41] FULLWIDTH LATIN SMALL LETTER A
> [U+FF53] FULLWIDTH LATIN SMALL LETTER S
> [U+3000] IDEOGRAPHIC SPACE
> [U+000A] <control>

Those are actually fullwidth Latin letters, which are used in Japanese
and Chinese systems: since the CJK symbols are wider, those systems have
wider Latin letters as well. It's a mean trick to make your name look
different without using HTML.

> Presumably your newsreader thinks it needs a separate font to find
> suitable glyphs for these characters (mine does too).

That's very probable, since most fonts won't contain those glyphs.
You'd need a Chinese/Japanese font which contains them.
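
The encoded From line quoted above can be unpacked with a few lines of
Python. This is only a minimal sketch, not the utf-8-dump program used
in the thread (its source is not shown). It assumes nothing beyond the
standard library (email.header.decode_header and unicodedata); the
variable names are illustrative.

```python
import unicodedata
from email.header import decode_header

# The From header value quoted in the thread (an RFC 2047 encoded word).
raw = "=?UTF-8?B?77yh772O772E772S772F772B772T44CA77yw772S772J772M772P772Q?="

# decode_header returns (bytes, charset) pairs for encoded words.
data, charset = decode_header(raw)[0]
name = data.decode(charset)

# List each code point with its Unicode name, similar in spirit to the dump above.
for ch in name:
    print(f"[U+{ord(ch):04X}] {unicodedata.name(ch, '<unnamed>')}")

# NFKC compatibility normalization folds fullwidth forms to ordinary letters
# and the ideographic space to a plain space.
folded = unicodedata.normalize("NFKC", name)
print(folded)
```

A client could apply the same NFKC folding before display or search, at
the cost of hiding what the sender actually typed.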

H.

Posted by Jukka K. Korpela on June 24, 2008, 1:37 pm
Scripsit Simon:

> I'm working on a team that is planning to add Welsh language support
> to a large existing IT system which is partially web-based and
> English-language-only so far.

Do you plan to add other languages later? Is this about names only or
also about prose texts? After all, ISO-8859-1 is insufficient even for
normal English prose; think about dashes and proper quotation marks.

> I've heard that 2 characters in Welsh
> (w-circumflex and y-circumflex) are not supported in our default
> ISO-8859-1 character set,

Right. They are included in ISO-8859-14 (a.k.a. ISO Latin 8, or
"Celtic"), but that's not a feasible option on the WWW (IE does not
recognize that encoding).
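
The coverage claims above are easy to check mechanically. The following
is a small sketch, assuming Python's bundled codecs are a fair proxy for
the character sets themselves; the helper name encodable and the sample
strings are made up for illustration.

```python
def encodable(text, codec):
    """Return True if every character in text can be encoded in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# W and Y with circumflex, upper and lower case (U+0174..U+0177).
welsh = "\u0174\u0175\u0176\u0177"
# En dash, em dash, and curly double quotation marks.
typographic = "\u2013\u2014\u201c\u201d"

samples = (("Welsh circumflex letters", welsh),
           ("dashes and quotes", typographic))
for label, sample in samples:
    for codec in ("iso-8859-1", "iso-8859-14", "utf-8"):
        print(f"{label:26} {codec:12} {encodable(sample, codec)}")
```

Under these assumptions only UTF-8 reports True for both samples;
ISO-8859-14 covers the Welsh letters but not the typographic
punctuation.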

> so a partial move to Unicode for internal
> storage of text might be required.

That might be easy, or it might be extremely complicated. But that's
really beyond the scope of these groups. As far as WWW authoring is
concerned, Unicode - specifically UTF-8 - is a good option, but you
could keep using ISO-8859-1 and represent those letters using character
references like &#373; for w with circumflex. But you might have to deal
with the encoding problem of the data bases involved, for example, and
with data entry.
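
As a rough illustration of the character-reference route, the sketch
below keeps the output in ISO-8859-1 and lets Python substitute numeric
references for anything the encoding lacks. The sample string is
invented; the errors="xmlcharrefreplace" handler and html.unescape are
standard library features.

```python
import html

# Sample text containing w and y with circumflex (U+0175, U+0177).
sample = "\u0175 and \u0177"

# Characters missing from ISO-8859-1 become decimal character references.
as_bytes = sample.encode("iso-8859-1", errors="xmlcharrefreplace")
print(as_bytes)  # b'&#373; and &#375;'

# A browser resolves the references when rendering; html.unescape does the
# same here to show that no information is lost.
restored = html.unescape(as_bytes.decode("iso-8859-1"))
print(restored == sample)  # True
```

Note that this handler does not escape markup-significant characters
such as & or <, so text destined for HTML would still need html.escape
applied first.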

> I haven't yet found a Welsh-language website that uses these 2
> characters, so are they actually used much in Welsh?

I don't know Welsh, but I expect those characters to be so rare that
using some clumsy notation like character references for them wouldn't
be a major problem.

> Is not supporting them likely to cause problems?

Some people might say that it is tolerable to omit the circumflex, but
it may be distinctive (i.e. the only difference between otherwise
identical words, though the context usually resolves the issue). And in
2008, I think it is inappropriate to add support for languages to IT
systems without supporting them properly, with all the characters needed
for their correct writing.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/


Posted by Simon on June 24, 2008, 2:00 pm
> Scripsit Simon:
>
> > I'm working on a team that is planning to add Welsh language support
> > to a large existing IT system which is partially web-based and
> > English-language-only so far.
>
> Do you plan to add other languages later? Is this about names only or
> also about prose texts? After all, ISO-8859-1 is insufficient even for
> normal English prose; think about dashes and proper quotation marks.
>
> > I've heard that 2 characters in Welsh
> > (w-circumflex and y-circumflex) are not supported in our default
> > ISO-8859-1 character set,
>
> Right. They are included in ISO-8859-14 (a.k.a. ISO Latin 8, or
> "Celtic"), but that's not a feasible option on the WWW (IE does not
> recognize that encoding).
>
> > so a partial move to Unicode for internal
> > storage of text might be required.
>
> That might be easy, or it might be extremely complicated. But that's
> really beyond the scope of these groups. As far as WWW authoring is
> concerned, Unicode - specifically UTF-8 - is a good option, but you
> could keep using ISO-8859-1 and represent those letters using character
> references like &#373; for w with circumflex. But you might have to deal
> with the encoding problem of the data bases involved, for example, and
> with data entry.
>
> > I haven't yet found a Welsh-language website that uses these 2
> > characters, so are they actually used much in Welsh?
>
> I don't know Welsh, but I expect those characters to be so rare that
> using some clumsy notation like character references for them wouldn't
> be a major problem.
>
> > Is not supporting them likely to cause problems?
>
> Some people might say that it is tolerable to omit the circumflex, but
> it may be distinctive (i.e. the only difference between otherwise
> identical words, though the context usually resolves the issue). And in
> 2008, I think it is inappropriate to add support for languages to IT
> systems without supporting them properly, with all the characters needed
> for their correct writing.
>
> --
> Jukka K. Korpela ("Yucca")
> http://www.cs.tut.fi/~jkorpela/
>

Thanks for your reply.

Unfortunately multi-lingual support has not really been a priority in
the system design up to now, although it has always been a possible
future requirement. The system is a complex mixture of databases,
Windows applications and web applications. I believe all the databases
and programming languages we use already support Unicode, so I would aim
to use that support, rather than character references which would be
clumsy as you say.



Posted by Jukka K. Korpela on June 24, 2008, 2:19 pm
Scripsit Simon:

> I believe all the databases and programming languages we use already
> support Unicode, so I would aim to use that support, rather than
> character references which would be clumsy as you say.

Sounds like a simple way to go then. It is surely simplest to use
Unicode throughout, especially if character data needs to be transferred
between applications as plain text (where no character references or
markup can be used). It's also simplest in data entry if people
immediately see what they have typed, and entering characters with
circumflex should not be a problem; you can e.g. use the keyboard layout
outlined at
http://en.wikipedia.org/wiki/Keyboard_layout#United_Kingdom_extended

Yet, it's always possible that some software component doesn't grok
Unicode. Let's hope such problems are solvable. The web-related
components shouldn't be a problem.
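
One way to look for a component that doesn't grok Unicode is a
round-trip test: push a string containing the Welsh letters through the
component and compare what comes back. The sketch below uses Python's
built-in sqlite3 module purely as a stand-in, since the thread does not
name the actual databases; the function name round_trips and the table
name are hypothetical.

```python
import sqlite3

def round_trips(text):
    """Store text in an in-memory SQLite table and check it reads back intact."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sample (s TEXT)")
        # Parameter binding avoids any manual quoting or encoding step.
        conn.execute("INSERT INTO sample VALUES (?)", (text,))
        (stored,) = conn.execute("SELECT s FROM sample").fetchone()
        return stored == text
    finally:
        conn.close()

# W and Y with circumflex, upper and lower case.
welsh = "\u0174\u0175\u0176\u0177"
print(round_trips(welsh))

# In UTF-8 each of these letters occupies two bytes.
print("\u0175".encode("utf-8"))  # b'\xc5\xb5'
```

The same comparison can be repeated against each real component
(database driver, file export, message queue) to find where a
non-Unicode step corrupts the text.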

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

