Reading contents of an excel file from a test file

Posted by Mick on May 15, 2007, 2:49 am

Hi

I am writing a script to read various file types (doc, xls, pdf, html
etc.) and search for certain keywords. Without caring about the file
formats, I used a 'findstr' system call from Perl with the keyword for
each file and redirected the output to a text file. The results were not
as bad as I had expected :)
The text file created lists all the files along with the original text
from each file where the search string occurs, but this file has some
Unicode characters which prevent it from being read and processed
properly. :(
I basically get a lot of "" in the result text file, which makes Perl
act weird.

Can you please suggest a way to read a text file which has Unicode
characters??
I do NOT want to create separate parsers for the different file types
(things like ParseExcel) as that would increase the complexity and
need a lot of effort.

Cheeeeers!!

KRN!!?!


Posted by Mick on May 15, 2007, 4:16 am

Sorry guys, but the Unicode characters, which appear as rectangles in
my text file (they all look the same), are not printed when posting a
message on this forum!!




Posted by Ian Wilson on May 15, 2007, 5:36 am

Top-posting corrected, Please don't top-post.

Mick wrote:
>
>
>> I am writing a script to read various file types (doc, xls, pdf,
>> html etc.) and search for certain keywords. Without caring about the
>> file formats, I used a 'findstr' system call from Perl with the
>> keyword for each file and redirected the output to a text file. The
>> results were not as bad as I had expected :) The text file created
>> lists all the files along with the original text from each file
>> where the search string occurs, but this file has some Unicode
>> characters which prevent it from being read and processed
>> properly. :( I basically get a lot of " " in the result text file,
>> which makes Perl act weird.
>>
>> Can you please suggest a way to read a text file which has Unicode
>> characters??

I'm pretty sure that current versions of Perl are happy to process
Unicode; see `perldoc perlunicode`.
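
As a rough illustration of that point, here is a minimal sketch of reading a file through a decoding layer so that Perl sees characters rather than raw bytes. The helper name read_decoded_lines and the file name results.txt are made up for the example, and the encoding passed in is an assumption: it has to match whatever actually wrote the file.

```perl
use strict;
use warnings;

# Read a text file through a decoding layer and return its lines as
# Perl character strings. The encoding name is an assumption supplied
# by the caller; it must match how the file was really written.
sub read_decoded_lines {
    my ($path, $encoding) = @_;
    open(my $fh, "<:encoding($encoding)", $path)
        or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close($fh);
    chomp @lines;
    return @lines;
}

# Hypothetical usage: results.txt stands in for the findstr output file.
# for my $line (read_decoded_lines('results.txt', 'UTF-8')) {
#     print "match: $line\n" if $line =~ /keyword/i;
# }
```

If the guessed encoding is wrong, Perl will typically warn about malformed data or substitute replacement characters, which is itself a useful hint.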

If I wanted to ignore characters that are outside the ASCII printable
set then I'd investigate Perl's 'tr'. `perldoc perlop` suggests:
    tr/a-zA-Z/ /cs; # change non-alphas to single space
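
A variant of that perlop example, widened from letters to the whole printable ASCII range, might look like the sketch below. The function name squash_non_printable is invented for illustration. Note that newlines also fall outside the range, so this is meant to be applied to one chomped line at a time.

```perl
use strict;
use warnings;

# Replace every run of characters outside printable ASCII (space through
# tilde) with a single space, using the same /c (complement) and
# /s (squash) flags as the perlop example. Newlines count as
# non-printable here, so chomp the line first.
sub squash_non_printable {
    my ($text) = @_;
    (my $clean = $text) =~ tr/\x20-\x7E/ /cs;
    return $clean;
}

# A run of two control bytes and one "rectangle" collapses to one space.
print squash_non_printable("ab\x00\x01\x{25A1}cd"), "\n";   # prints "ab cd"
```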

>> I do NOT want to create separate parsers for the
>> different file types (things like ParseExcel) as it will increase
>> the complexity and will need a lot of effort.
>>

I suspect there's no guarantee that arbitrary file types will store your
keywords in a recognisable form. A file might store "KEYWORD" as
"KExxxxxYWxxxxxOxxxxRD", for example. I'd guess this is particularly
likely in PDF, especially if it is kerning text. Some formats might use
UTF-8 encoding, others might use UTF-16 or some non-Unicode encoding.
Some might compress or encode the text so it no longer appears in ASCII.
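
To make the encoding point concrete, the sketch below simulates text stored as UTF-16LE and shows that a byte-level search misses the keyword until the data is decoded. The sample string is invented, and UTF-16LE is only one of the possibilities mentioned above; it says nothing about what any particular doc, xls or pdf file really contains.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulate how a file might store the text as UTF-16LE: each ASCII
# letter ends up followed by a NUL byte.
my $stored = encode('UTF-16LE', 'some KEYWORD here');

# A byte-level search misses the keyword because of the interleaved NULs.
my $raw_hit = ($stored =~ /KEYWORD/) ? 1 : 0;

# After decoding with the matching encoding, the same search succeeds.
my $decoded_hit = (decode('UTF-16LE', $stored) =~ /KEYWORD/) ? 1 : 0;

print "raw bytes: $raw_hit, decoded: $decoded_hit\n";   # raw bytes: 0, decoded: 1
```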

>
> sorry guys.. but the unicode characters as they appear as rectangles
> in my text file (all appear as the same), are not printed when
> posting a message on this forum!!
>

You are using Google Groups and it seems to think your character set is
Latin-1, not Unicode. Your posting has this header:
    Content-Type: text/plain; charset="iso-8859-1"

Possibly you are viewing your "text file" in an application that is not
Unicode aware or is not using a font that has glyphs for the particular
Unicode characters in the file.
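
One way to find out what the rectangles really are, independent of the viewer or font, is to list the code points of anything outside printable ASCII. This is only a sketch: the name odd_code_points is made up, and it assumes the string has already been decoded into characters (on raw bytes it would report byte values instead).

```perl
use strict;
use warnings;

# Return a "U+XXXX" label for every character in a decoded string that
# falls outside printable ASCII (space through tilde).
sub odd_code_points {
    my ($text) = @_;
    return map  { sprintf('U+%04X', ord($_)) }
           grep { /[^\x20-\x7E]/ }
           split //, $text;
}

# Example: a NUL and a WHITE SQUARE hidden among ordinary letters.
print join(' ', odd_code_points("a\x00b\x{25A1}")), "\n";   # U+0000 U+25A1
```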
