Full text search in PDF and Word files?

I need to perform full text searches on a batch of PDF and Word files.  
What is the best way to go?

After some research, I'm thinking of extracting the plain text from the
files with "pdftotext" and "catdoc", harmonizing the various possible
encodings to UTF-8, storing the text in a MySQL database, and then
using the full text search capabilities of MySQL.
Do you think that would work well? I am told that the files are mostly  
text and won't be longer than 30 pages.
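
For reference, here is a rough sketch of that pipeline in Python. It is only an outline under some assumptions: the table name "documents", its columns, and the list of fallback encodings are made up for illustration, and the %s placeholder follows the style used by the common MySQL drivers for Python. Both tools can be asked to emit UTF-8 directly ("pdftotext -enc UTF-8", "catdoc -d utf-8"), so the decoding fallback is just a safety net. Note that catdoc only reads the legacy .doc format, and that a FULLTEXT index on an InnoDB table needs MySQL 5.6 or later (older versions need MyISAM).

```python
import subprocess
from pathlib import Path

# Fallback encodings, tried in order. latin-1 accepts any byte sequence,
# so it acts as the last resort. (Assumed list; adjust to your files.)
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def extraction_command(path):
    """Return the argv that dumps a file's text to stdout as UTF-8."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # "-" as the output file makes pdftotext write to stdout.
        return ["pdftotext", "-enc", "UTF-8", str(path), "-"]
    if suffix == ".doc":
        # -d sets the charset of catdoc's output.
        return ["catdoc", "-d", "utf-8", str(path)]
    raise ValueError(f"unsupported file type: {path}")


def to_utf8_text(raw):
    """Decode extractor output, trying each candidate encoding in turn."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 decodes any bytes")


def extract_text(path):
    """Run the matching extractor and return the document text as str."""
    result = subprocess.run(
        extraction_command(path), check=True, capture_output=True
    )
    return to_utf8_text(result.stdout)


# Hypothetical schema: one row per file, with a FULLTEXT index on the text.
CREATE_TABLE = """
CREATE TABLE documents (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    path VARCHAR(512) NOT NULL,
    body MEDIUMTEXT NOT NULL,
    FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
"""

INSERT_DOCUMENT = "INSERT INTO documents (path, body) VALUES (%s, %s)"

SEARCH = (
    "SELECT path FROM documents "
    "WHERE MATCH(body) AGAINST (%s IN NATURAL LANGUAGE MODE)"
)
```

With a driver cursor, each file would be loaded with something like cursor.execute(INSERT_DOCUMENT, (path, extract_text(path))) and searched with cursor.execute(SEARCH, (term,)). Keep in mind that MySQL ignores stopwords and very short words in full text indexes by default.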


My email address doesn't ride a horse.

Re: Full text search in PDF and Word files?

I do this with Oracle Text. The documents are not stored in the
database; Oracle is only used to index them (I store a file path and
filename). Of course I use Oracle for other things too, but this has
been a superb solution for me, and faster than you could ever believe.

Essentially you get to search unlimited documents in their native
format without actually having to do any real work for it.
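
A rough idea of what such a setup can look like, with the SQL held in Python strings. This is a guess at the general shape, not my exact configuration: the table and index names are made up, and it assumes the predefined CTXSYS.FILE_DATASTORE and CTXSYS.AUTO_FILTER preferences, which make Oracle Text read each file from the absolute path stored in the indexed column and filter it from its native format. Check what your Oracle release offers, since datastore and filter options have changed between versions.

```python
# Hypothetical table: one row per document, holding only its location.
CREATE_TABLE = """
CREATE TABLE doc_files (
    id   NUMBER PRIMARY KEY,
    path VARCHAR2(1024) NOT NULL
)
"""

# The index is built on the path column; with a file datastore, Oracle Text
# opens the file at that path and indexes its contents, not the path string.
CREATE_INDEX = """
CREATE INDEX doc_files_ctx ON doc_files (path)
    INDEXTYPE IS CTXSYS.CONTEXT
    PARAMETERS ('DATASTORE CTXSYS.FILE_DATASTORE FILTER CTXSYS.AUTO_FILTER')
"""

# CONTAINS returns a relevance score; > 0 means the document matched.
SEARCH = "SELECT path FROM doc_files WHERE CONTAINS(path, :query) > 0"


def search_request(term):
    """Pair the search statement with its bind variables for a driver call."""
    return SEARCH, {"query": term}
```

A driver cursor would then run cursor.execute(*search_request("invoice")). The bound value is interpreted as an Oracle Text query expression, so operators and special characters in user input need care.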
