pdf to text

I am looking for a way to convert PDF files into text content. I don't  
care about layout or formatting, just the plain text that I can use to  
search against in a database.

I've looked into the pdftotext tool from:

It works fine when I use it from the command line. However, if I issue
the same command via a system() call, there are major problems that
cause the server to crash. (I don't know why; no error messages are
being generated anywhere.)

I am looking to use this when a PDF file is uploaded via a form and  
store the text in a database for a search function.
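
For reference, the kind of command-line call in question can be sketched as a small shell function. This is only a sketch: the function name extract_text and the exit code 66 are made up for illustration, and it assumes pdftotext (xpdf or poppler) is on the PATH.

```shell
# extract_text PDF: print the plain text of PDF on stdout.
# The function name and the exit code 66 are illustrative choices.
extract_text() {
    pdf=$1

    # Fail early with a clear message if the upload is missing or unreadable.
    if [ ! -r "$pdf" ]; then
        echo "cannot read: $pdf" >&2
        return 66
    fi

    # Assumption: pdftotext is installed and visible on PATH.
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found on PATH" >&2
        return 127
    fi

    # -enc UTF-8 : encode the output as UTF-8
    # -nopgbrk   : do not insert form-feed characters between pages
    # "-"        : write the text to stdout instead of a file
    pdftotext -enc UTF-8 -nopgbrk "$pdf" -
}
```

Something like `extract_text upload.pdf > upload.txt` would then produce the text to load into the database.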


-- Justin

Re: pdf to text

Justin Koivisto wrote:
I vaguely remember using Ghostscript for that...  


Re: pdf to text

I have not used pdftotext via system(). However, you could try different
versions of pdftotext. In my experience, results can vary quite a bit from
one version to another, and different versions should be easy to find.
It's also possible to use ascii2txt, which depends on Ghostscript I
think. When I tried it I got into a muddle of versions though, and
pdftotext was much easier.

Re: pdf to text

What do you mean when you say the server crashes? The Apache process  
dies? The entire machine locks up? The server physically falls off the  
rack and lands on the floor?

How about doing an experiment where you use system() to call a shell  
script that sets up some debugging and dumps the environment, and see  
what you come up with?
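
A rough sketch of such a wrapper follows. The function name debug_run and the log path are hypothetical; the idea is just to record who the command runs as, where, with what environment and limits, and what it printed and returned.

```shell
#!/bin/sh
# debug_run LOGFILE COMMAND [ARGS...]
# Append the environment, the command's output, and its exit status to LOGFILE.
# The function name and the log location are illustrative, not a standard tool.
debug_run() {
    log=$1
    shift   # the remaining arguments are the command to run

    {
        echo "=== $(date) ==="
        echo "user: $(id)"
        echo "cwd:  $(pwd)"
        echo "PATH: $PATH"
        echo "--- environment ---"
        env
        echo "--- limits ---"
        ulimit -a
        echo "--- command: $* ---"
    } >>"$log" 2>&1

    # Run the command, capturing both stdout and stderr in the log.
    "$@" >>"$log" 2>&1
    exit_code=$?

    echo "--- exit status: $exit_code ---" >>"$log"
    return "$exit_code"
}
```

Put that in a script that ends with something like `debug_run /tmp/pdftotext-debug.log pdftotext "$1" "$2"`, point system() at the script, and then compare the log against what you get when running the same script from your own shell. Differences in user, PATH, or limits are the usual suspects.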

