Getting plain text out of a spreadsheet

Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

Threaded View
Hi all,

I am stumped.  It appears, Excel, Numbers and other spreadsheet programs
like to turn text fields into unicode. As a result, I get spaced out
of the form

   h e l l o   w o r l d

I would like to find a way to turn them into plain text.  I have tried
tricks like

decode('utf8', $str);

but it appears to return the same string back.

I checked the length and sure enough, the weird string is twice as long as
number of characters.  What is the best way to get my "hello world" back

I do not know what encoding scheme is being used.  I noticed that Numbers
turns all text fields into unicode.  On the other hand, Excel only does
when you use characters like " .



+--------------------------[ SERVER SIGNATURE ]-----+
| Article posted via Web Developer's USENET Archive |
|          |
| Web and RSS gateway to your favorite newsgroup -  |
|              comp.lang.perl.modules               |

Re: Getting plain text out of a spreadsheet

On Fri, 3 Dec 2010, KateYoak wrote:

Quoted text here. Click to load it

Taling twice as much space is not a feature of UTF-8. Rather, the
characters you are probably expecting (Latin letters, digits, ...)
look the same in UTF-8 as in ASCII. If it were Unicode, it would look
quite normal with the possible exception of few non-ASCII characters
replaced by weird 2- or 3-character sequences.

It looks more like UTF-16 or UCS-2 (which should be identical for your
needs). Try that. If it does not work, we would have to look into the data
to find out what it is.  

Helmut Richter

Re: Getting plain text out of a spreadsheet

responding to
KateYoak wrote:

Quoted text here. Click to load it

Yes! UCS-2 was the right answer. UTF-16 died:  UTF-16:Unrecognised BOM 4b .

Now I have to figure out if there is a generalized way of dealing with
this.  I
want to be able to process spreadsheets updloaded with different software.
M$Excel use something different from Numbers?  from OpenOffice?  etc.
Though I
have only seen two behaviors thus far: normal and spaced out characters.
intend to test and see.  A generalized solution might look something like

  my $l = length($str);
  $str = try_next_encoding($str);
  return $str if length($str) < $l;
  return $original_str; #unmolested

Any better ideas?


Re: Getting plain text out of a spreadsheet

On Fri, 3 Dec 2010, KateYoak wrote:

Quoted text here. Click to load it

I was not aware that BOM (byte-order mark) are treated differently in
these two codes.

Now, for ISO-8859-1 characters (and in particular for ASCII characters
which are a subset of the ISO-8859-1 characters), an UCS-2 character is
just a 16-bit representation of the 8 bit that make up the character. If
you are sure that there are no characters that take more than 8 bits, you
can perhaps just throw away the 8 zero bits that make the difference:

  $text =~ s/0//g;

And, if there is a BOM, throw it away as well. It should always be at the

  if ($text =~ /^(67|76)/) {
    $text = $';

This simplified version will, of course, not notice problems if your text
is either not UCS-2 or UTF-16 or if it contains characters outside the
ISO-8859-1 character set (i.e. "smart quotes", dashes, ...). On the other
hand, if you just decode after input but do not encode prior to output,
you might run into the same problems.

Helmut Richter

Site Timeline