HTML Tidy not changing the charset attribute



I'm in the process of refactoring a lot of HTML documents, and I'm
using HTML Tidy (version from 1 September 2005) to do part of this.

I run it with a configuration file like this:

 tidy -config tidy_cfg.txt index.htm

and here is the content of my config file:

// tidy configuration to clean the old pages
tidy-mark: no
wrap: 66
wrap-attributes: yes
indent: auto
indent-spaces: 2
output-xhtml: yes
doctype: loose
char-encoding: utf8
break-before-br: yes
clean: yes
logical-emphasis: yes
drop-font-tags: yes
enclose-text: yes
alt-text: " "
write-back: yes
error-file: tidy-errors
show-warnings: no
quiet: yes

Everything is perfect except for one thing, this line in the output:

<meta http-equiv="Content-Type" content="text/html;" />

It doesn't add the charset=utf-8 part.

Can anybody help me with this one?

Re: HTML Tidy not changing the charset attribute

In reply to sebzzz:

Can't you add it in afterwards using, say, sed?

Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.12-12mdksmp, up 22 days, 16:08.]

                          demiblog 0.2.0 Released

Re: HTML Tidy not changing the charset attribute

I probably could, but I have never worked with sed.

Do you know the easiest way to do that type of work (adding things
in a specific place in all HTML documents in a folder, and also its
sub-folders, their sub-folders, and so on)?

I'm asking because Tidy will only do part of the work I need to do.
I also have to remove all the presentational tags and attributes from
the pages (in other words, strip the pages down), including the tables
that are used for layout (how do I tell those apart from data tables?).

I thought about doing that with Python (which I'm in the process of
learning), but maybe another tool (like sed) would be better suited
for this job.

I know roughly what I need to do:

1- Find all HTML files in the folders
2- Do some file I/O and feed each file to sed, Python, or whatever else.
3- Repeatedly apply some regular expressions to the file to do the
things I want
4- Write the changed file, and go through all the files like that.

But I don't know how to actually do it, the syntax and everything.

If you can help me, I would really appreciate it.
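
The four steps listed above can be sketched in Python roughly as follows. This is only a minimal sketch, not a tested solution for these particular pages: the names META_RE, add_charset and fix_tree are made up for illustration, and it assumes the files are already UTF-8 (as char-encoding: utf8 in the config suggests) and that the meta tag looks like the one quoted in the first post. It only handles the charset fix; stripping layout tables is probably a job for a real HTML parser rather than regular expressions.

```python
import os
import re

# Hypothetical pattern: a Content-Type meta tag whose content value is
# "text/html" or "text/html;" with no charset. \s+ also matches line
# breaks, in case Tidy wrapped the attributes onto separate lines.
META_RE = re.compile(
    r'(<meta\s+http-equiv="Content-Type"\s+content="text/html);?\s*(")',
    re.IGNORECASE,
)


def add_charset(html, charset="utf-8"):
    """Return html with '; charset=...' added to a meta tag that lacks it."""
    return META_RE.sub(
        lambda m: f"{m.group(1)}; charset={charset}{m.group(2)}", html
    )


def fix_tree(root, charset="utf-8"):
    """Fix every .htm/.html file under root; return the paths rewritten."""
    changed = []
    # Step 1: os.walk visits root and all of its sub-folders.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".htm", ".html")):
                continue
            path = os.path.join(dirpath, name)
            # Step 2: read the file (assumed UTF-8; newline="" keeps the
            # original line endings untouched).
            with open(path, encoding="utf-8", newline="") as f:
                text = f.read()
            # Step 3: apply the regular expression.
            fixed = add_charset(text, charset)
            # Step 4: write the file back only if something changed.
            if fixed != text:
                with open(path, "w", encoding="utf-8", newline="") as f:
                    f.write(fixed)
                changed.append(path)
    return changed
```

Calling fix_tree(".") from the top folder after running Tidy would rewrite the affected files in place, so it is worth trying it on a copy of the pages first. A tag that already carries a charset is left alone.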
