
About charset setting and replacing

Posted by gmclee on July 14, 2006, 6:29 am


Hi there,
I am writing a program that loads HTML from a file and sends it
directly to IE, and I have run into a problem with the charset setting.
Most of the HTML files declare the charset "us-ascii", but my program
inserts some Unicode text into the HTML before sending it to IE. My
questions are:

1) Can I specify a different charset for a single element, e.g.
<span charset="UTF-8"> SOME UNICODE HERE</span>

2) If the answer to 1) is "no", is there any way to change the charset of
the original HTML? Because I have no HTML parser handy, I can only search
and replace the charset programmatically. I've checked several HTML files
and found that the charset declaration looks like this:

<META http-equiv=Content-Type content="text/html; charset=us-ascii">

So, to make the program replace the right text, I search for the
keyword "charset=" to get its position, then search for the position
of the following double quotation mark, and finally replace the substring
in between with UTF-8. Everything seems fine. However, I am worried that
there may be exceptions. Could any of these, for example, occur?

<META http-equiv=Content-Type content="text/html;" charset="us-ascii">

OR

<META http-equiv=Content-Type content='text/html;' charset='us-ascii'>

OR

<META http-equiv=Content-Type content='text/html; charset=us-ascii'>


Is there any better approach to my problem?

P.S. Someone suggested that I send the original HTML to IE and then call
IE's charset-setting function to change the charset. I tried that, but
after changing the charset, my Unicode text turned into meaningless
characters!

Thanks in advance.


Posted by Chris Morris on July 14, 2006, 6:47 am


gmclee@21cn.com writes:
> I am writing a program that loads HTML from a file and sends it
> directly to IE, and I have run into a problem with the charset setting.
> Most of the HTML files declare the charset "us-ascii", but my program
> inserts some Unicode text into the HTML before sending it to IE. My
> questions are:
>
> 1) Can I specify a different charset for a single element, e.g.
> <span charset="UTF-8"> SOME UNICODE HERE</span>

No.

> 2) If the answer to 1) is "no", is there any way to change the charset of
> the original HTML? Because I have no HTML parser handy, I can only search
> and replace the charset programmatically. I've checked several HTML files
> and found that the charset declaration looks like this:
>
> <META http-equiv=Content-Type content="text/html; charset=us-ascii">

The usual best solution is to set the real HTTP content type header.
Content-type: text/html; charset=UTF-8
This will override the <meta> element if there is one, so you don't need
to worry about the format.

Since any valid us-ascii character is also (the same) valid UTF-8
character you might as well do this all the time.

However, from the description you gave, it doesn't sound like you're
using HTTP.
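
For the case where the page is served over HTTP, the header approach described above can be sketched as a tiny WSGI application. This is only an illustration under stated assumptions: the names `PAGE` and `app` are hypothetical, and the sample page deliberately keeps its old us-ascii `<meta>` element to show that it is left untouched while the header declares UTF-8.

```python
# Minimal WSGI sketch: the charset is declared in the HTTP header,
# so the <meta> element inside the page does not need to be edited.
# PAGE and app are illustrative names, not part of the original thread.
PAGE = (
    "<html><head>"
    '<meta http-equiv=Content-Type content="text/html; charset=us-ascii">'
    "</head><body>caf\u00e9</body></html>"
)


def app(environ, start_response):
    # The body is really encoded as UTF-8, matching the header below.
    body = PAGE.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The standard library's `wsgiref.simple_server.make_server` can serve such an application for local testing.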

> So, to make the program replace the right text, I search for the
> keyword "charset=" to get its position, then search for the position
> of the following double quotation mark, and finally replace the substring
> in between with UTF-8. Everything seems fine. However, I am worried that
> there may be exceptions. Could any of these, for example, occur?
>
> <META http-equiv=Content-Type content="text/html;" charset="us-ascii">
> <META http-equiv=Content-Type content='text/html;' charset='us-ascii'>
No.

> <META http-equiv=Content-Type content='text/html; charset=us-ascii'>
Might happen.

Additionally, the attribute names and tag name may or may not be
(partially) capitalised, as may the charset value, and possibly other
bits. There may be a slash immediately before the end of the tag (if
it's an XHTML document rather than a HTML document). The order of the
attributes may be reversed, so:
<MeTA ConTenT='text/html; charset=US-ascII'
htTp-EQUiv="Content-Type" />
is an unusual combination of the above, but still perfectly legal...

You might also get cases which have nothing to do with a <meta>
element, but trigger your pattern matching anyway.
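
If search-and-replace is the only option, a somewhat more tolerant version than a plain search for "charset=" might look like the sketch below. It is an assumption-laden illustration, not a robust solution: the function name `set_meta_charset` is hypothetical, it only looks inside `<meta ...>` tags that mention Content-Type, and it can still be fooled by a `>` inside an attribute value or by tag-like text in comments and scripts.

```python
import re

# Matches a whole <meta ...> tag, in any letter case, possibly spanning lines.
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
# Matches charset=VALUE; VALUE ends at whitespace, a quote, ';', '/' or '>'.
CHARSET = re.compile(r"(charset\s*=\s*)[^\s\"';/>]+", re.IGNORECASE)


def set_meta_charset(html, new_charset="UTF-8"):
    """Replace the charset inside Content-Type <meta> tags only."""

    def fix_tag(match):
        tag = match.group(0)
        if "content-type" not in tag.lower():
            return tag  # some other <meta> element: leave it alone
        return CHARSET.sub(lambda m: m.group(1) + new_charset, tag, count=1)

    return META_TAG.sub(fix_tag, html)


if __name__ == "__main__":
    sample = ("<MeTA ConTenT='text/html; charset=US-ascII'\n"
              "htTp-EQUiv=\"Content-Type\" />")
    print(set_meta_charset(sample))
```

Because the replacement is restricted to matching tags, text such as "charset=us-ascii" in the document body is left unchanged.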

> Any better approach for my problem?

Setting the HTTP headers is the best solution. If you can't do that
then using a real HTML parser is likely to be more reliable than any
search-and-replace you put together.
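
As one possible illustration of the parser route, the sketch below uses Python's standard `html.parser` module to locate Content-Type `<meta>` elements and report the charset each one declares. The class and function names are hypothetical, and this only finds the declarations; rewriting the document is left out.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collects (raw tag text, charset value) for Content-Type <meta> tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        # The content value looks like "text/html; charset=us-ascii".
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset":
                self.found.append((self.get_starttag_text(), value.strip()))


def find_meta_charsets(html):
    finder = MetaCharsetFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

The parser deals with letter case, quoting style, attribute order and the XHTML trailing slash, which are the variations that make a hand-written search fragile.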

--
Chris

Posted by gmclee on July 14, 2006, 8:12 am



Chris Morris wrote:

> gmclee@21cn.com writes:
> > I am writing a program that loads HTML from a file and sends it
> > directly to IE, and I have run into a problem with the charset setting.
> > Most of the HTML files declare the charset "us-ascii", but my program
> > inserts some Unicode text into the HTML before sending it to IE. My
> > questions are:
> >
> > 1) Can I specify a different charset for a single element, e.g.
> > <span charset="UTF-8"> SOME UNICODE HERE</span>
>
> No.
>
> > 2) If the answer to 1) is "no", is there any way to change the charset of
> > the original HTML? Because I have no HTML parser handy, I can only search
> > and replace the charset programmatically. I've checked several HTML files
> > and found that the charset declaration looks like this:
> >
> > <META http-equiv=Content-Type content="text/html; charset=us-ascii">
>
> The usual best solution is to set the real HTTP content type header.
> Content-type: text/html; charset=UTF-8
> This will override the <meta> element if there is one, so you don't need
> to worry about the format.
>
> Since any valid us-ascii character is also (the same) valid UTF-8
> character you might as well do this all the time.
>
> However, from the description you gave, it doesn't sound like you're
> using HTTP.
I am writing a client that changes HTML dynamically. All the HTML files
are saved on the local hard disk; it has nothing to do with any network
protocol.

> > So, to make the program replace the right text, I search for the
> > keyword "charset=" to get its position, then search for the position
> > of the following double quotation mark, and finally replace the substring
> > in between with UTF-8. Everything seems fine. However, I am worried that
> > there may be exceptions. Could any of these, for example, occur?
> >
> > <META http-equiv=Content-Type content="text/html;" charset="us-ascii">
> > <META http-equiv=Content-Type content='text/html;' charset='us-ascii'>
> No.
>
> > <META http-equiv=Content-Type content='text/html; charset=us-ascii'>
> Might happen.
> Might happen.
>
> Additionally, the attribute names and tag name may or may not be
> (partially) capitalised, as may the charset value, and possibly other
> bits. There may be a slash immediately before the end of the tag (if
> it's an XHTML document rather than a HTML document). The order of the
> attributes may be reversed, so:
> <MeTA ConTenT='text/html; charset=US-ascII'
> htTp-EQUiv="Content-Type" />
> is an unusual combination of the above, but still perfectly legal...
I am not quite familiar with HTML. As you mentioned above, is the
following valid for both HTML and XHTML?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

> You might also get cases which have nothing to do with a <meta>
> element, but trigger your pattern matching anyway.
>
> > Any better approach for my problem?
>
> Setting the HTTP headers is the best solution. If you can't do that
> then using a real HTML parser is likely to be more reliable than any
> search-and-replace you put together.
>
Thanks. I see.


Posted by Chris Morris on July 14, 2006, 9:00 am


gmclee@21cn.com writes:
> I am not quite familiar with HTML,

See http://www.w3.org/TR/HTML4/ for the official specifications.

> As you mentioned above, is the
> following valid for both HTML and XHTML?
>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

No - this is valid in HTML, but not in XHTML. Internet Explorer does
not support XHTML and treats it as if it were HTML. You may find in
XHTML source documents something like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
which is valid in XHTML but not valid in HTML.

--
Chris

Posted by Jack on July 14, 2006, 6:54 am


gmclee@21cn.com wrote:
> Hi there, I am writing a program that loads HTML from a file and sends
> it directly to IE, and I have run into a problem with the charset
> setting. Most of the HTML files declare the charset "us-ascii", but my
> program inserts some Unicode text into the HTML before sending it to
> IE. My questions are:
>
> 1) Can I specify a different charset for a single element, e.g. <span
> charset="UTF-8"> SOME UNICODE HERE</span>

1. UTF-8 isn't a charset, it's an encoding.
2. The UTF-8 encoding includes and encompasses all of US-ASCII.
3. Encodings apply to pages, not to HTML fragments.

If you create a page that is encoded as UTF-8, and serve it as UTF-8,
US-ASCII characters will automatically be rendered correctly.
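
The relationship between the two encodings can be checked directly. The following is a small sketch with made-up sample strings and hypothetical helper names: pure US-ASCII text yields the same bytes under both encodings, while a page with inserted non-ASCII text can no longer be stored as US-ASCII but round-trips cleanly through UTF-8.

```python
def same_bytes_in_both(text):
    """True if the text encodes to identical bytes as US-ASCII and as UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


def fits_in_ascii(text):
    """True if every character of the text is a US-ASCII character."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Illustrative sample data, not taken from the thread.
ascii_html = "<p>Hello, world</p>"
page = ascii_html + "<p>\u4f60\u597d</p>"  # original markup plus inserted Unicode text

if __name__ == "__main__":
    print(same_bytes_in_both(ascii_html))
    print(fits_in_ascii(page))
    print(page.encode("utf-8").decode("utf-8") == page)
```

This is why relabelling an existing us-ascii page as UTF-8 does not change how its original content is read.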

What I don't understand is what you mean by "send it to IE directly".
Are you writing a server? If so, then you need to look into how to serve
pages encoded as UTF-8 (and that would be off-topic here).

--
Jack.
