Reading contents of an excel file from a test file




I am writing a script to read various file types (doc, xls, pdf, html
etc.) and search for certain keywords. Without worrying about the file
formats, I ran 'findstr' as a system call from Perl with the keyword for
each file and redirected the output to a text file. The results were not
as bad as I had expected :)
In the text file created, I have the list of all the files and the
original text from each file where the search string occurs, but
this file contains some Unicode characters which prevent it from being
read and processed properly. :(
I basically get a lot of "" in the result text file, which makes Perl
act weird.

Can you please suggest a way to read a text file which has Unicode
characters in it?
I do NOT want to create separate parsers for the different file types
(things like ParseExcel) as that would increase the complexity and
need a lot of effort.



Re: Reading contents of an excel file from a test file

Sorry guys, but the Unicode characters, which appear as rectangles
in my text file (they all look the same), are not printed when posting
a message on this forum!


Re: Reading contents of an excel file from a test file

Top-posting corrected. Please don't top-post.


I'm pretty sure that current versions of Perl are happy to process
Unicode. See:
    perldoc perlunicode
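
As a minimal sketch of what that looks like in practice: Perl can decode a file while reading it if you name the encoding on the open. The sample data below is made up, and UTF-8 is only an assumption; the redirected findstr output may well be in some other encoding, in which case the layer name would need to change.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample data: UTF-8 encoded bytes standing in for a real file.
my $bytes = encode('UTF-8', "caf\x{e9} keyword \x{2713}\n");

# Open an in-memory filehandle with a decoding layer. A real script would
# pass a file name here instead of \$bytes.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "Cannot open: $!";
my $line = <$in>;
close $in;
chomp $line;

# $line now holds characters rather than bytes, so length() and regexes
# work per character.
my $char_count = length $line;
my $found      = $line =~ /keyword/ ? 1 : 0;

print "chars=$char_count found=$found\n";
```

With this sample the script prints chars=14 found=1, even though the accented letter and the check mark each take more than one byte on disk.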

If I wanted to ignore characters that are outside the ASCII printable
set then I'd investigate Perl's 'tr'. `perldoc perlop` suggests
      tr/a-zA-Z/ /cs; # change non-alphas to single space
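
Adapting that perlop idiom to the printable ASCII range might look like the sketch below. The helper name printable_only and the sample line are made up for illustration; the only real construct is tr with the /c (complement) and /s (squeeze) modifiers.

```perl
use strict;
use warnings;

# Replace every run of characters outside printable ASCII (0x20..0x7E)
# with a single space. Works on a copy, so the caller's string is untouched.
sub printable_only {
    my ($text) = @_;
    $text =~ tr/\x20-\x7E/ /cs;
    return $text;
}

# Hypothetical line of search output with stray binary bytes in it.
my $raw   = "report.xls: total\x{01}\x{02}\x{FF}cost 42";
my $clean = printable_only($raw);

print "$clean\n";
```

Here the three stray bytes collapse into one space, giving "report.xls: total cost 42". Note that newlines are also outside this range, so this is best applied one line at a time after chomp.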


I suspect there's no guarantee that arbitrary file types will store your
keywords in a recognisable form. A file might store "KEYWORD" as
"KExxxxxYWxxxxxOxxxxRD" for example. I'd guess this is particularly
likely in PDF, especially if the text is kerned. Some might use UTF-8
encoding, others might use UTF-16 or some non-Unicode encoding. Some might
compress or encode the text so it no longer appears in ASCII.
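
To illustrate the encoding point only, here is a rough sketch that looks for a keyword in raw bytes under a few encodings. The function name encodings_containing and the sample data are invented for the example, and the list of encodings is just a guess at what might turn up. It does nothing for text that is compressed or split into fragments.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the names of the encodings in which $keyword occurs in the raw bytes.
# This is a plain byte search; it does not check character alignment.
sub encodings_containing {
    my ($data, $keyword) = @_;
    my @hits;
    for my $enc ('UTF-8', 'UTF-16LE', 'UTF-16BE') {
        my $needle = encode($enc, $keyword);
        push @hits, $enc if index($data, $needle) >= 0;
    }
    return @hits;
}

# Hypothetical sample: a keyword stored as UTF-16LE between some binary bytes.
my $sample = "\x{01}\x{02}" . encode('UTF-16LE', 'budget') . "\x{03}";
my @found  = encodings_containing($sample, 'budget');

print "found as: @found\n";
```

For this sample only UTF-16LE matches, because the zero bytes between the letters stop a plain ASCII search from seeing "budget" at all.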


You are using Google Groups and it seems to think your character set is
Latin1 not Unicode. Your posting has this header:
   Content-Type: text/plain; charset="iso-8859-1"

Possibly you are viewing your "text file" in an application that is not
Unicode aware or is not using a font that has glyphs for the particular
Unicode characters in the file.
