specific text extraction from pdf

I've researched a lot but still haven't found a solution. Let me
explain.

A PDF file is uploaded. The file can be laid out in a million ways,
right? I'm talking about its layout. What I need to do is fetch
every odd line of the text (but only the paragraph text; extracting
text from a PDF often also returns text that is, for example, inside
an image) and cover that line with black, so that the line is no
longer readable.

Or maybe I want to do the same, but for every odd word in the text.
As you can see, there are three steps:

1) Extract the text from the PDF.
2) Analyse it: which text is "real" text, and which is unimportant
(table of contents, image captions, text inside images, page headers,
etc.).
3) Rewrite the PDF file exactly as it was, but apply the text
manipulation while rewriting (black out every second line, for
example).
How can I solve this? After a lot of research, I'm fairly sure
that this is close to mission impossible.
The most advanced tool I could find for this kind of manipulation is
PDFlib, and especially the TET library. But I couldn't find a good way
to analyse text in the way I described above. Has anybody out there
worked with something like this, and can you give me some advice on
how to proceed?

Re: specific text extraction from pdf

Aka Unknown wrote:
[quoted text omitted]

In a semantic sense, the PDF format sucks. It doesn't handle any concept
like "paragraph", "table" or "row". It just handles little boxes that
happen to contain text and, gracefully positioned around the page, look
like a document to human eyes. It's great for printing, but totally
useless for automated information exchange.
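To illustrate what "little boxes" means in practice, here is a rough
sketch of how rows might be rebuilt from positioned fragments. The
Fragment class, the group_into_lines helper and the 2-point tolerance
are assumptions made for the example; a real extractor will report
positions in its own format.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the box
    y: float  # baseline, measured from the bottom of the page


def group_into_lines(fragments, tolerance=2.0):
    """Cluster fragments whose baselines differ by at most `tolerance` points."""
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0] - frag.y) <= tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    # Within a line, order the fragments from left to right.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


fragments = [
    Fragment("world", 60.0, 700.0),
    Fragment("Second", 10.0, 686.0),
    Fragment("Hello", 10.0, 700.5),
    Fragment("line", 60.0, 686.0),
]
print(group_into_lines(fragments))
```

This only copes with a single column of horizontal text. Multiple
columns, tables and rotated text would each need extra handling.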

I've learnt to never say never, but I don't think that what you're trying
to do is feasible, unless you find a very good third-party tool that
implements the PDF equivalent of an OCR utility for pictures. Google has
one (you can see an HTML version of indexed PDFs), but even Google's
utility works poorly on most documents.


Extracting the text itself is quite easy... provided that the text
boxes are generated in the reading order.
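Whether that condition holds for a given page could be checked with
something along these lines. This is just a sketch for single-column,
left-to-right text; the in_reading_order helper and the (x, y) tuples
with y measured from the bottom of the page are assumptions made for
the example.

```python
def in_reading_order(boxes, tolerance=2.0):
    """True if every box is right of, or below, the one emitted before it."""
    for (x0, y0), (x1, y1) in zip(boxes, boxes[1:]):
        same_line = abs(y0 - y1) <= tolerance
        if same_line and x1 < x0:
            return False  # jumps back to the left on the same line
        if not same_line and y1 > y0:
            return False  # jumps back up the page
    return True


# Boxes in the order they appear in the file, as (x, y) positions.
print(in_reading_order([(10, 700), (60, 700), (10, 686)]))
print(in_reading_order([(60, 700), (10, 700), (10, 686)]))
```

If the check fails, the boxes have to be sorted by position before
the extracted text can be read as sentences.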


Editing an existing PDF should be doable, see:

http://www.setasign.de/products/pdf-php-solutions/fpdi/

I'm not sure though about the possibility of removing existing parts
(beyond drawing a white rectangle on top).
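For reference, drawing a filled rectangle comes down to a few
standard PDF operators (rg, re and f). The sketch below only builds
that operator text for every odd line box. The cover_ops helper and
the box coordinates are invented for the example, and the resulting
string would still have to be added to the page content by whatever
PDF library is used.

```python
def cover_ops(boxes, rgb=(0, 0, 0)):
    """Build PDF content-stream operators that fill each (x, y, w, h) box."""
    r, g, b = rgb
    ops = ["q", f"{r:g} {g:g} {b:g} rg"]           # save state, set fill colour
    for x, y, w, h in boxes:
        ops.append(f"{x:g} {y:g} {w:g} {h:g} re")  # add a rectangle path
        ops.append("f")                            # fill it
    ops.append("Q")                                # restore graphics state
    return "\n".join(ops)


# Hypothetical bounding boxes of three consecutive text lines, in points.
line_boxes = [(72, 716, 451, 14), (72, 702, 451, 14), (72, 688, 451, 14)]
print(cover_ops(line_boxes[0::2]))  # cover the 1st and 3rd line in black
```

Note that this only paints over the text. The original text stays in
the file underneath and can still be extracted or copied.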


-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin-finish humor site: http://www.demogracia.com
