trouble processing non-English text



I am trying to process some Greek text using Perl.  Strangely, I can
print out the text properly but when I try to assign the text to a
variable or do some processing, it fails.

The data file is:


My program is:

#!/usr/bin/perl -w
use strict;
use encoding "greek";

my %symbols = ();

open(FILE, "$file");

while (my $line = <FILE>) {

    my @fields = split(/\s+/, $line);

    my $num_fields = @fields;

    if ($num_fields == 2) {

        my $freq = shift(@fields);
        my $word = shift(@fields);

        print "$word\n";

        my @letters = split(//, $word);

        foreach my $letter (@letters) {
            $symbols = 1;

            print "$letter -> $letter_test\n";
        }

        print "\n";
    }
}

The output is:

=EF=BF=BD ->
=EF=BF=BD ->
=EF=BF=BD ->
=EF=BF=BD ->
=EF=BF=BD ->
=EF=BF=BD ->

=EF=BF=BD ->
=EF=BF=BD ->
=EF=BF=BD ->
=EF=BF=BD ->

I've done some reading on the web and I still can't figure out what's
wrong.

I'd appreciate any help.  Thanks!

Re: trouble processing non-English text

DavidK wrote:


In what sense does it fail?

What does `echo $LANG` show you?


Re: trouble processing non-English text

> #!/usr/bin/perl -w

'use warnings' is preferred to -w nowadays.

> use encoding "greek";

Don't do that. In principle 'encoding' specifies the encoding of your
*source* file, and also pushes encoding layers onto STDIN and STDOUT; it
has no effect on other filehandles. In practice it has never worked
properly and should be avoided.

I don't know how your data file is encoded, but AFAIK "greek" is not a
valid encoding name. You might have meant "iso-8859-7", which I believe
is the usual pre-Unicode encoding for Greek, or you might have meant
"UTF-8" (or you might have meant something else entirely). You will need
to find out which.
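
One rough way to find out is to test whether the raw bytes decode as strict UTF-8. The sketch below is only a heuristic, and guess_encoding() is a hypothetical helper name, not part of any module. It assumes the only other candidate is iso-8859-7; plain ASCII data would be reported as UTF-8 as well.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report 'UTF-8' if the bytes are well-formed UTF-8,
# otherwise fall back to the assumed alternative, iso-8859-7.
sub guess_encoding {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 'UTF-8' : 'iso-8859-7';
}

# The two letters alpha and beta, first as UTF-8 bytes (two bytes each),
# then as ISO-8859-7 bytes (one byte each).
print guess_encoding("\xCE\xB1\xCE\xB2"), "\n";
print guess_encoding("\xE1\xE2"), "\n";
```

In practice you would read the first few kilobytes of the file in binary mode and pass them to the helper.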

> open(FILE, "$file");

Always check the return value of open.
Use 3-arg open instead of magic 2-arg open, unless you've got a good
reason not to.
Don't quote variables when you don't need to.
Use lexical filehandles instead of global barewords.
In your case, you want to push an encoding PerlIO layer when you open
the file.

    open(my $FILE, "<:encoding(iso-8859-7)", $file)
        or die "can't open '$file': $!";

You might also consider using the 'autodie' module, which will do the
'or die' check for you.
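
For illustration, here is a minimal sketch of the same open under autodie. The temporary file and its contents are made up so the example runs on its own; the iso-8859-7 layer is only the assumption from above and should be swapped for whatever the data really uses.

```perl
use strict;
use warnings;
use autodie;                      # open/close now die with a message on failure
use File::Temp qw(tempfile);

# Made-up sample data: one "frequency word" line stored as ISO-8859-7 bytes.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "12 \xE1\xE2\n";  # 0xE1 0xE2 are alpha and beta in ISO-8859-7
close $tmp_fh;

open(my $in, '<:encoding(iso-8859-7)', $file);   # no explicit "or die" needed
my $line = <$in>;
close $in;

my ($freq, $word) = split ' ', $line;

# Show code points rather than the letters, so STDOUT needs no encoding layer.
print join(' ', map { sprintf 'U+%04X', ord } split //, $word), "\n";
```

With the layer in place, $word holds two characters rather than two raw bytes, which is what split(//, ...) needs in order to yield whole letters.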

> my $num_fields = @fields;
>
> if ($num_fields == 2) {

There's no need for this. '==' gives scalar context to both sides, so

    if (@fields == 2) {

will suffice.
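
A small sketch of the context rule, using a made-up input line:

```perl
use strict;
use warnings;

my @fields = split ' ', "42 word";

# In a numeric comparison the array is evaluated in scalar context,
# which yields its element count.
if (@fields == 2) {
    print "two fields\n";
}

my $count   = @fields;    # scalar assignment: the element count
my ($first) = @fields;    # list assignment: the first element
print "$count $first\n";
```

The parentheses around $first are what switch the assignment to list context.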

> print "$letter -> $letter_test\n";

Where does $letter_test come from? Did you actually run the code you
posted?

Quoted text here. Click to load it

This suggests your file is not in ISO8859-7, but in some multi-byte
encoding like UTF-8 or UTF-16. If you're on a Unix machine it's probably
UTF-8.
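
To see why a multi-byte encoding produces one broken symbol per "letter", compare splitting raw bytes with splitting decoded characters. This is a sketch with a hard-coded two-letter sample, not a reproduction of the exact program above:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xCE\xB1\xCE\xB2";        # UTF-8 bytes for alpha, beta
my $chars = decode('UTF-8', $bytes);   # the same text as two characters

my @raw     = split //, $bytes;   # four elements, each a lone byte
my @letters = split //, $chars;   # two elements, each a whole letter

printf "%d bytes, %d letters\n", scalar @raw, scalar @letters;
```

A lone byte such as 0xCE is not valid UTF-8 on its own, so a UTF-8 terminal typically shows it as a replacement character.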


Re: trouble processing non-English text

Thanks for the responses!

My $LANG variable is set to en_US.UTF-8.

The file I thought was in ISO8859-7 is actually UTF-8.  I should have
been opening the file with

    open(my $FILE, "<:encoding(UTF-8)", $file)
        or die "can't open '$file': $!";

I also had to format the output with

binmode STDOUT, ":utf8";

to view it properly.
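
Putting the pieces together, this is roughly what the working version looks like. It is a sketch rather than my exact program: the sample data is made up, and keying %symbols by letter is what the loop was meant to do before I trimmed it. I use the stricter ":encoding(UTF-8)" layer on STDOUT here, which should behave the same as ":utf8" for output.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up sample data: "frequency word" per line, written out as UTF-8.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':encoding(UTF-8)';
print {$tmp_fh} "10 \x{03BA}\x{03B1}\x{03B9}\n3 \x{03C4}\x{03BF}\n";
close $tmp_fh or die "can't close '$file': $!";

binmode STDOUT, ':encoding(UTF-8)';   # encode characters on the way out

my %symbols;

open(my $in, '<:encoding(UTF-8)', $file)
    or die "can't open '$file': $!";

while (my $line = <$in>) {
    my @fields = split ' ', $line;
    next unless @fields == 2;

    my ($freq, $word) = @fields;
    print "$word\n";

    # Each element is now a whole Greek letter, not a single byte.
    foreach my $letter (split //, $word) {
        $symbols{$letter} = 1;
    }
}
close $in;

print "$_\n" for sort keys %symbols;
```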

Thanks again.  It seems to be working now.  Thank you, Ben, for the Perl
style tips.

I'm sorry about the confusing source code.  I tried to simplify it and
I removed some lines by mistake.
