Question regarding Encode

Hello! If this is the wrong group my apologies. I'll accept pointing
in the right direction if there is one.

We have several users in our company cutting and pasting from Word
into our CMS and now have the need to convert from "Windows-1252" to
UTF-8.

For in-place editing I have been using:
perl -MEncode=from_to -i -pe 'from_to($_, "windows-1252", "utf-8")' file1.txt file2.txt

I am now needing to convert multiple files in a dir and another
developer mentioned that if UTF-8 and Windows-1252 are intermixed then
there could be some confusion of the two character sets together.
Transliteration was suggested..


for example.

What I am wondering is if that is indeed the case. I don't want to
have to resort to transliteration if it isn't necessary.

Maybe I need some kind of check to see if a file is encoded a certain
way before figuring out how to jump into it. I can't ever remember
using Encode before and now we need it on a massive scope.

Any advice would be appreciated.

Flames go quietly to /dev/null

Re: Question regarding Encode

I'm not quite sure what the concerns are here, but it sounds a lot like
superstition. If each file is consistent within itself, then from_to
will work perfectly well; you can use Encode::Guess to figure out
whether a file is UTF8 or 1252, and since you're only using UTF8 or an
8bit superset of ASCII, it should be 100% reliable. It would be best to
feed the whole file to guess_encoding in one go (use File::Slurp rather
than <> or -p), and specify UTF8 first on the list, so that pure ASCII
is guessed as utf8 rather than 1252 (since either is valid, and you
don't need to re-encode that file).
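
A minimal sketch of the per-file check described above, assuming each
file is consistently either UTF-8 (plain ASCII included) or Windows-1252.
Note that it does not call guess_encoding: it swaps in a strict UTF-8
decode as the test, on the assumption that real Windows-1252 text with
accented or "smart" characters is rarely also valid UTF-8. The name
to_utf8_bytes is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return UTF-8 octets for $bytes, which is assumed
# to be entirely UTF-8 (or ASCII) or entirely Windows-1252.
sub to_utf8_bytes {
    my ($bytes) = @_;

    # Strict decode: FB_CROAK dies on any malformed UTF-8 sequence,
    # LEAVE_SRC keeps $bytes from being modified by the check.
    my $is_utf8 = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $bytes if $is_utf8;    # already fine, no need to re-encode

    # Otherwise treat the whole input as Windows-1252.
    return encode('UTF-8', decode('cp1252', $bytes));
}
```

Feed it the raw bytes of a whole file and write the result back. Input
that is already UTF-8 comes out unchanged, so running it twice over the
same directory should be harmless.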

If some files contain some portions in UTF8 and some portions in 1252,
then you have a serious problem whatever tool you use. My suggestion
would be to attempt to find blocks you can split the file into, where
each block is guaranteed to have a consistent encoding. Then you can
pass these blocks to guess_encoding individually.
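
One way the block-splitting idea could look, assuming (and this is only
an assumption about the data) that a single line never mixes the two
encodings. As a stand-in for guess_encoding, each line is tested with a
strict UTF-8 decode and only the lines that fail are treated as
Windows-1252. fix_mixed_lines is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: normalise a buffer to UTF-8 one line at a time.
sub fix_mixed_lines {
    my ($bytes) = @_;
    my $out = '';

    # Split after each "\n" so the line endings are preserved.
    for my $line (split /(?<=\n)/, $bytes) {
        # Lines that are already valid UTF-8 (or ASCII) pass through.
        my $valid = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC); 1 };
        $out .= $valid
            ? $line
            : encode('UTF-8', decode('cp1252', $line));
    }
    return $out;
}
```

If a single line can contain both encodings this will still garble that
line, so it is worth spot-checking the output before trusting it on a
whole directory.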


'Deserve [death]? I daresay he did. Many live that deserve death. And some die
that deserve life. Can you give it to them? Then do not be too eager to deal
out death in judgement. For even the very wise cannot see all ends.'
I've seen things you people wouldn't believe: attack ships on fire off
the shoulder of Orion; I watched C-beams glitter in the dark near the
Tannhauser Gate. All these moments will be lost, in time, like tears in rain.
Time to die.                                         

Re: Question regarding Encode

On Tue, 08 Jul 2008 14:47:55 -0700, williams.wilkie wrote:

That looks OK.

I'm not sure I correctly understand the problem. Do you mean some files
are in one encoding and some in another (an issue, but probably solvable),
or that some files use multiple encodings within one file (a very big
problem)?

Please don't do that. That sort of thing will bite you in the ass.

There are some heuristic algorithms to do just that, but to be honest I
would assume all data is in the same encoding unless you have proof
otherwise. If it isn't, your CMS *REALLY* screwed up.


Leon Timmermans

Re: Question regarding Encode

I have seen this before with other CMSs where someone types something
and then cuts and pastes from Word, and then the data is mixed when
stored in MySQL. MySQL doesn't care what you have it encoded in, but
the problem comes when automated routines create XML files that are
then stored with mixed encoding (CMS data stored into MySQL, another
routine generates static XML files from the faulty data for usage by
other places).

Certainly makes the point that the data needs to be validated before
going into the db, but I can feel the poster's pain regarding this
issue.

Maybe specifying your IN and OUT filehandles as ':bytes' would help
(to preserve data and inhibit automated encoding that may result in
unexpected changes to your already formatted data). Once you read it
in, use the transliteration method you described before to change
things. I'm not a huge fan of using that method either, but that's the
way it was done not too many years ago.
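
For what it's worth, a small sketch of reading and writing a file as
untouched bytes. It uses the ':raw' layer rather than ':bytes', which is
my substitution and not what was suggested above; read_raw and write_raw
are made-up names.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a whole file as raw octets, with no
# encoding or newline translation applied by PerlIO.
sub read_raw {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;    # slurp mode: read everything in one go
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Hypothetical helper: write octets back out exactly as given.
sub write_raw {
    my ($path, $bytes) = @_;
    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} $bytes;
    close $out or die "Cannot close $path: $!";
    return;
}
```

Whatever conversion is applied in between then sees exactly the bytes
that are on disk.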

I'd like to see other suggestions on this one too.

Re: Question regarding Encode

On Fri, 11 Jul 2008 06:53:39 -0700, worldcyclist wrote:

Actually that's not true. MySQL has excellent support for various
encodings and collations. See chapter 9 of the MySQL reference manual for
more information on that. Most programmers don't seem to use it though.

Full agreement there.

Leon Timmermans