- Posted on
- specific text extraction from pdf
- Aka Unknown
August 11, 2009, 9:44 am
A PDF file is uploaded. The file can be laid out in a million ways,
right? I'm talking about its layout. What I need to do is fetch
every odd line of the text (but only the paragraph text; extracting text
from a PDF often means you also get text that is, for example, inside
an image) and cover that line with black, so that the line is
no longer readable.
Or maybe I want to do the same but for each odd word in the
As you understand, it is about:
1) Extract the text from the PDF.
2) Analyse it: which text is "real" text, and which text is unimportant
(table of contents, text that explains an image, text inside an image,
page header, etc.).
3) Rewrite the PDF file with exactly the same layout, but while rewriting
the file, apply the text manipulation (black over every second line for
How can I solve this? I'm quite sure now, after having researched a
lot, that this is almost mission impossible.
The most advanced tool I could find for this kind of manipulation is
PDFlib, and especially its TET library. But I couldn't find a good way
to analyse text in the way I described above. Has anybody out there
worked with something like this, and can you give me some advice on
how to proceed?
August 11, 2009, 10:22 am
Re: specific text extraction from pdf
Aka Unknown wrote:
In a semantic sense, the PDF format sucks. It doesn't handle any concept
like "paragraph", "table" or "row". It just handles little boxes that
happen to contain text and, gracefully positioned around the page, look
like a document to human eyes. It's great for printing, but totally
useless for automated information exchange.
I've learnt to never say never but I don't think that what you're trying
to do is feasible, unless you find a very good third party tool that
implements the PDF equivalent of an OCR utility for pictures. Google has
one (you can see an HTML version of indexed PDFs) but even Google's
utility works poorly on most documents.
Extracting text itself is quite easy... given that the text boxes are
generated in the reading order.
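When the boxes are not stored in reading order, one rough workaround is to sort them by position. The sketch below is only an illustration: the `TextBox` class and the `reading_order` function are hypothetical names, the coordinates are assumed to come from whatever extraction tool is in use, and it assumes a single-column page with the usual PDF origin at the bottom left.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical container for one extracted text fragment."""
    text: str
    x: float  # left edge, in PDF points
    y: float  # baseline, in PDF points (larger y is higher on the page)


def reading_order(boxes, line_tolerance=2.0):
    """Group boxes into lines by baseline, then read top-to-bottom, left-to-right."""
    lines = []
    # Visit boxes from the top of the page downwards.
    for box in sorted(boxes, key=lambda b: -b.y):
        # A box joins the current line if its baseline is close to that
        # line's first box; otherwise it starts a new line.
        if lines and abs(lines[-1][0].y - box.y) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within each line, order fragments from left to right.
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x))
            for line in lines]


boxes = [
    TextBox("world", 60, 700),
    TextBox("Second", 10, 686),
    TextBox("Hello", 10, 700.5),
    TextBox("line", 60, 686),
]
print(reading_order(boxes))  # ['Hello world', 'Second line']
```

This breaks down on multi-column pages, tables and captions, which is exactly the semantic information the format does not carry.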
Editing an existing PDF should be doable, see:
I'm not sure though about the possibility of removing existing parts
(beyond drawing a white rectangle on top).
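As a minimal sketch of the "rectangle on top" approach, the function below builds the PDF content-stream operators that paint filled black rectangles over every second line. The function name `cover_ops` and the input format (one `(x, y, width, height)` tuple per line, in PDF points) are assumptions for illustration; how the resulting operators get appended to a page depends on the PDF library used, and the covered text would most likely still be extractable from the file.

```python
def cover_ops(line_boxes, every=2, start=0):
    """Return content-stream operators that black out selected lines.

    line_boxes: list of (x, y, width, height) tuples in reading order.
    every/start: cover the lines whose index i satisfies i % every == start,
    so the defaults cover the 1st, 3rd, 5th, ... lines.
    """
    ops = ["q", "0 g"]  # q: save graphics state; 0 g: fill colour black
    for i, (x, y, w, h) in enumerate(line_boxes):
        if i % every == start:
            # re: append a rectangle to the path; f: fill it
            ops.append(f"{x:g} {y:g} {w:g} {h:g} re f")
    ops.append("Q")  # restore graphics state
    return "\n".join(ops)


lines = [(72, 700, 451, 12), (72, 686, 451, 12), (72, 672, 300, 12)]
print(cover_ops(lines))
```

For the three sample lines this emits rectangles for the first and third only, wrapped in `q` ... `Q` so the fill colour does not leak into the rest of the page.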
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin-finish humor site: http://www.demogracia.com