|
Posted by SwordAngel on December 17, 2004, 11:17 am
Hello,
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?
thx.
|
|
Posted by David Dorward on December 17, 2004, 7:28 pm
SwordAngel wrote:
> I'm looking for a program that converts characters of different
> encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
> escape sequences. Does anybody know where I can find one?
IIRC Tidy will do that.
http://tidy.sf.net/
--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/> Home is where the ~/.bashrc is
|
|
Posted by Bjoern Hoehrmann on December 17, 2004, 8:48 pm
* David Dorward wrote in comp.infosystems.www.authoring.html:
>SwordAngel wrote:
>> I'm looking for a program that converts characters of different
>> encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
>> escape sequences. Does anybody know where I can find one?
>
>IIRC Tidy will do that.
Well, yes, but only for character encodings it supports (and it does not
support any of the encodings SwordAngel listed to that extent). Windows
users can compile Tidy with an experimental feature that enables support
for all character encodings Windows / Internet Explorer support via the
TIDY_WIN32_MLANG_SUPPORT #define, but it is generally better to use
external tools such as iconv, piconv, uconv, recode, ... to convert the
document to UTF-8 and let Tidy process the document accordingly.
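
For readers without iconv at hand, the transcoding step can be sketched in Python. This is only an illustration of "convert to UTF-8 first", not how any of the tools named above work internally; the function name transcode_to_utf8 is made up for this example, and the codec names (euc_jp, big5, gb18030) are the ones Python's standard library uses.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration; source_encoding must be a
    codec name Python knows, e.g. "euc_jp", "big5" or "gb18030".
    """
    # decode() raises UnicodeDecodeError if the bytes are not valid
    # in the stated encoding, so a wrong guess fails loudly.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = "日本語".encode("euc_jp")  # stand-in for a document read from disk
    utf8_bytes = transcode_to_utf8(sample, "euc_jp")
    print(utf8_bytes.decode("utf-8"))
```

The UTF-8 output can then be handed to Tidy or any other tool that accepts UTF-8 input.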
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de 68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
|
|
Posted by Nick Kew on December 18, 2004, 12:42 am
>>IIRC Tidy will do that.
Indeed. I was on the point of suggesting an XML processor until I saw
that (libxml2 accepts HTML as well as XML input).
> Well, yes, but only for character encodings it supports (and it does not
> support any of the encodings SwordAngel listed to that extent).
Indeed, libxml2 (last time I checked) supports some but not all of
those encodings, so the same limitation applies.
Have you considered tying in iconv to Tidy to improve i18n support?
> but it is generally better to use
> external tools such as iconv, piconv, uconv, recode, ... to convert the
> document to UTF-8 and let Tidy process the document accordingly.
I believe OpenSP supports all the encodings named, though I'm
not entirely sure OTTOMH. So there may still be a one-stop
program for the conversion. But as Björn says, a transcoder
such as iconv is a more general solution.
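
As a rough one-stop alternative for the original question, the whole conversion (legacy encoding in, ampersand escapes out) can be sketched in Python using the standard xmlcharrefreplace error handler. The function name to_char_refs is invented for this example, and the sketch assumes the caller already knows the source encoding.

```python
def to_char_refs(data: bytes, source_encoding: str) -> str:
    """Return data as ASCII text with non-ASCII characters written as
    decimal numeric character references (&#NNNN;).

    Hypothetical helper for illustration only.
    """
    text = data.decode(source_encoding)
    # xmlcharrefreplace swaps each character that ASCII cannot
    # represent for its &#NNNN; reference; ASCII passes through.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    euc_jp_bytes = "<p>日本語</p>".encode("euc_jp")
    print(to_char_refs(euc_jp_bytes, "euc_jp"))
    # prints: <p>&#26085;&#26412;&#35486;</p>
```

Note that ASCII characters such as & and < are left untouched, so existing markup survives; the result is plain ASCII and no longer depends on the declared character encoding.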
--
Nick Kew
Nick's manifesto: http://www.htmlhelp.com/~nick/
|
|
Posted by Bjoern Hoehrmann on December 18, 2004, 10:35 am
* Nick Kew wrote in comp.infosystems.www.authoring.html:
>Have you considered tying in iconv to Tidy to improve i18n support?
I wrote an experimental iconv wrapper which is included in the source
distribution, but it is not plugged into the code, i.e., you need to
change a few things in order to use it. Development of these features
was put on hold until a better interface for pluggable transcoders for
Tidy has been developed (which has not happened yet).
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de 68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
|
|