Posted by gmclee on July 14, 2006, 8:12 am
Chris Morris wrote:
> gmclee@21cn.com writes:
> > I am writing a program that loads HTML from a file and sends it
> > directly to IE. I've run into a problem with the charset setting. Most
> > of the HTML files declare the charset "us-ascii", but for various
> > reasons some Unicode text is inserted into the HTML before it is sent
> > to IE. My questions are:
> >
> > 1) Can I specify a separate charset for a single element, e.g.
> > <span charset="UTF-8"> SOME UNICODE HERE</span>
>
> No.
>
> > 2) If the answer to 1) is no, is there any way to change the charset of
> > the original HTML? Since I have no HTML parser handy, I can only search
> > and replace the charset programmatically. I've checked several HTML
> > files and found the charset declared in this format:
> >
> > <META http-equiv=Content-Type content="text/html; charset=us-ascii">
>
> The usual best solution is to set the real HTTP content type header.
> Content-type: text/html; charset=UTF-8
> This will override the <meta> element if there is one, so you don't need
> to worry about the format.
>
> Since any valid us-ascii character is also a valid UTF-8 character
> (with the same encoding), you might as well do this all the time.
>
> However, from the description you gave, it doesn't sound like you're
> using HTTP.
I am writing a client that modifies HTML dynamically. All the HTML files
are saved on the local hard disk; it has nothing to do with any network
protocol.
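
Since the files come from the local disk, the point quoted above (us-ascii text is already valid UTF-8) can be checked directly. A minimal Python sketch, where the helper name and the sample strings are made up for illustration and are not taken from the program discussed here:

```python
# Hypothetical helper: name and sample text are illustrative only.
def ascii_bytes_unchanged_in_utf8(text: str) -> bool:
    """Return True if text encodes to identical bytes in us-ascii and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


original = "<p>Hello</p>"          # pure us-ascii markup
inserted = "<p>\u4f60\u597d</p>"   # Unicode text added later

# The ASCII part is byte-for-byte the same in both encodings.
same_bytes = ascii_bytes_unchanged_in_utf8(original)

# The combined document needs UTF-8, but its ASCII prefix is unchanged.
combined = original + inserted
encoded = combined.encode("utf-8")
prefix_intact = encoded.startswith(original.encode("ascii"))
```

So relabelling an existing us-ascii file as UTF-8 should not change how its existing content is read; only the inserted text depends on the new declaration.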
> > So, to make the program replace the right value, I search for the
> > keyword "charset=" to get its position, then search for the position
> > of the double quotation mark, and finally replace the substring with
> > UTF-8. Everything seems fine. However, I am worried that there may be
> > exceptions. Could these, for example, occur?
> >
> > <META http-equiv=Content-Type content="text/html;" charset="us-ascii">
> > <META http-equiv=Content-Type content='text/html;' charset='us-ascii'>
> No.
>
> > <META http-equiv=Content-Type content='text/html; charset=us-ascii'>
> Might happen.
>
> Additionally, the attribute names and tag name may or may not be
> (partially) capitalised, as may the charset value, and possibly other
> bits. There may be a slash immediately before the end of the tag (if
> it's an XHTML document rather than an HTML document). The order of the
> attributes may be reversed, so:
> <MeTA ConTenT='text/html; charset=US-ascII'
> htTp-EQUiv="Content-Type" />
> is an unusual combination of the above, but still perfectly legal...
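
The variations listed above (mixed case, either quote style, reversed attribute order, a trailing slash) can be tolerated to some degree by a case-insensitive pattern restricted to <meta> tags. A rough Python sketch follows; the names META_CHARSET and set_meta_charset are hypothetical. It is still search-and-replace, so it can be fooled by a <meta> inside a comment or script, or by an attribute value that contains ">".

```python
import re

# Group 1: everything from "<meta" up to and including "charset=" (and an
# optional opening quote). Group 2: the charset value itself.
# [^>]*? keeps the match inside a single tag; IGNORECASE handles <MeTA ...>.
META_CHARSET = re.compile(
    r"""(<meta\b[^>]*?\bcharset\s*=\s*["']?)([^\s"';>/]+)""",
    re.IGNORECASE,
)


def set_meta_charset(html: str, new_charset: str = "UTF-8") -> str:
    """Replace the charset value in the first matching <meta> tag only."""
    return META_CHARSET.sub(lambda m: m.group(1) + new_charset, html, count=1)
```

Text such as "charset=foo" outside a <meta> tag is left alone, because the pattern requires the tag to open first.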
I am not very familiar with HTML. Given what you mention above, is the
following valid for both HTML and XHTML?
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> You might also get cases which have nothing to do with a <meta>
> element, but trigger your pattern matching anyway.
>
> > Any better approach for my problem?
>
> Setting the HTTP headers is the best solution. If you can't do that
> then using a real HTML parser is likely to be more reliable than any
> search-and-replace you put together.
>
Thanks. I see.
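
For reference, the parser suggestion above might look roughly like this with Python's standard html.parser module. This is only a sketch under assumptions: the class and function names are invented for illustration, the whole document is assumed to fit in one string, and only the first <meta> that declares a charset is rewritten. A trailing "/>" is kept if the original tag had one.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Records where the first charset-declaring <meta> tag sits in the source."""

    def find_span(self, source: str):
        self.span = None
        # Index at which each line starts, to convert getpos() to an offset.
        self._line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self._line_starts.append(i + 1)
        self.feed(source)
        self.close()
        return self.span

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; <meta ... /> lands here too.
        if tag != "meta" or self.span is not None:
            return
        attrs = {name: (value or "") for name, value in attrs}
        declares = "charset" in attrs or (
            attrs.get("http-equiv", "").lower() == "content-type"
            and "charset=" in attrs.get("content", "").lower()
        )
        if declares:
            line, col = self.getpos()  # position of the tag's "<"
            start = self._line_starts[line - 1] + col
            self.span = (start, start + len(self.get_starttag_text()))


def declare_utf8(source: str) -> str:
    """Return source with its charset <meta> replaced by a UTF-8 declaration."""
    span = MetaCharsetFinder().find_span(source)
    if span is None:
        return source  # no declaration found; left to the caller
    start, end = span
    closing = " />" if source[start:end].endswith("/>") else ">"
    new_tag = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"' + closing
    return source[:start] + new_tag + source[end:]
```

Because the parser only reports real <meta> start tags, a stray "charset=" in body text or in a <title> does not trigger a replacement.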