Looking for way to automate PDF index (or menu) generation

I'm looking to create a PHP script that will automatically generate an
index/menu/list (whatever) based on the PDF files that are within a
particular directory.  I would like the script to be able to parse out the
title, description, author(s), and date from the documents, and use that
information to create the index.

Any suggestions on how to do something like this?



Re: Looking for way to automate PDF index (or menu) generation

In reply to Mike:

You must first be able to parse the binary data in the PDF file.  A Google
search for pdf2txt, pdf2text, or pdftotext will give you some options.

Open the directory (opendir/readdir) to get its contents.  Then, for each
PDF file, convert it to text or HTML, parse the information you need from
that output, and generate your index.

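/* part of code from something similar, updated for PHP 7+ */
         echo '<form method="post"><br><select name="csvfile">';
         if ($csvs = @opendir("./")) {
                 // compare with false so a file named "0" does not end the loop
                 while (false !== ($filename = readdir($csvs))) {
                         // eregi() was removed in PHP 7; preg_match() with /i replaces it
                         if (preg_match('/^[A-Z0-9_]+\.pdf$/i', $filename))
                                 echo "<option value=\"$filename\">$filename</option>";
                 }
                 /* put code to parse and display the file info here */
                 closedir($csvs);
         }
         echo "</select><br><p>";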
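
The "list the directory, then convert each PDF" steps can be sketched roughly as below. This is only a sketch: the helper names list_pdfs() and pdftotext_command() are made up for illustration, and it assumes the pdftotext binary is installed on the server and that shell_exec() is allowed.

```php
<?php
// Return the PDF files in $dir, sorted by name (extension match ignores case).
function list_pdfs(string $dir): array
{
    $pdfs = [];
    foreach (scandir($dir) ?: [] as $filename) {
        if (preg_match('/\.pdf$/i', $filename) && is_file("$dir/$filename")) {
            $pdfs[] = $filename;
        }
    }
    sort($pdfs);
    return $pdfs;
}

// Build the shell command that prints a PDF's text to stdout.
// Assumption: pdftotext is installed and on the PATH; "-" means "write to stdout".
function pdftotext_command(string $path): string
{
    return 'pdftotext ' . escapeshellarg($path) . ' -';
}

// Hypothetical usage: fetch the text of every PDF in the current directory.
// foreach (list_pdfs('./') as $pdf) {
//     $text = shell_exec(pdftotext_command("./$pdf"));
// }
```

Pulling a title or author out of the converted text is still up to you; the first non-empty line is sometimes the title, but that depends entirely on how the documents are laid out.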


PDF to HTML Conversion Tools

* Adobe's online PDF Converter will convert PDF to HTML, for free, one file at a
time.

* pdftohtml is a free, open source converter that runs on the Unix command line.
It is sometimes incorporated by search engines as part of the indexing process.

* Xpdf is free, open source software that includes a viewer and components for
parsing PDF documents.

* Adobe recommends BCL Magellan, which converts PDF files to HTML, preserving the
structure of the page, graphics, lines, hyperlinks and so on.

* Clickcat-P2H is another converter program, which offers a downloadable trial
version and various special features.

* VeryPDF.com's PDF2HTML application can do on-the-fly or batch conversions. It
also has a free trial version, and source code is available.
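
For the title/author/date part of the question, the Xpdf package listed above also ships a pdfinfo command that prints the document's metadata as "Key: value" lines. A rough sketch of reading that from PHP follows; parse_pdfinfo() and pdf_metadata() are illustrative names, and it assumes pdfinfo is installed and shell_exec() is allowed. Fields such as Title, Subject and Author only appear if whoever created the PDF filled them in.

```php
<?php
// Parse "Key:   value" lines, as printed by pdfinfo, into an associative array.
function parse_pdfinfo(string $output): array
{
    $info = [];
    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first colon only; values such as dates contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $info[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $info;
}

// Run pdfinfo on one file and return its metadata, or [] if the call fails.
// Assumption: the pdfinfo binary is installed and on the PATH.
function pdf_metadata(string $path): array
{
    $output = shell_exec('pdfinfo ' . escapeshellarg($path));
    return is_string($output) ? parse_pdfinfo($output) : [];
}
```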

good luck...

Michael Austin.
Consultant - Available.
Donations welcomed. http://www.firstdbasource.com/donations.html

Re: Looking for way to automate PDF index (or menu) generation

Thanks for the help!  I used your suggestion to create a "hacked" approach
for now:

1. Create a '.txt' file for every PDF; the contents of the '.txt' file are
simply the title of the corresponding PDF.
2. Name the PDFs and '.txt' files based on the date of the article.
3. Using PHP and opendir/readdir, create a table from all the '.txt' files,
sorted by filename (which contains the date).  The basename of each file,
minus the extension, is the date, which goes into the first column of the
table.  The title is read from the contents of each '.txt' file with the
file_get_contents() function and becomes the second column.
4. Both the first column (date) and second column (title) are hyperlinked to
the corresponding PDF file.
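
Roughly, the four steps above could look like the following. This is a sketch rather than my exact script: build_index_table() is just an illustrative name, and it assumes the files are named with a date format that sorts correctly as text (for example 2005-03-01.txt next to 2005-03-01.pdf).

```php
<?php
// Build an HTML table from the '.txt' title files in $dir.
// Assumption: each "<date>.txt" sits next to a "<date>.pdf".
function build_index_table(string $dir): string
{
    $txts = glob(rtrim($dir, '/') . '/*.txt') ?: [];
    sort($txts); // filenames are the dates, so this orders the rows by date

    $rows = '';
    foreach ($txts as $txt) {
        $date  = basename($txt, '.txt');
        $title = trim((string) file_get_contents($txt));
        $href  = htmlspecialchars(rawurlencode($date . '.pdf'));
        // Both columns link to the PDF that shares the '.txt' file's basename.
        $rows .= sprintf(
            "<tr><td><a href=\"%s\">%s</a></td><td><a href=\"%s\">%s</a></td></tr>\n",
            $href,
            htmlspecialchars($date),
            $href,
            htmlspecialchars($title)
        );
    }
    return "<table>\n" . $rows . "</table>\n";
}
```

Calling echo build_index_table('.'); in the page then prints the whole table.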

Of course, I have to manually rename the files using the article date, and I
also have to manually create the '.txt' files containing the title of each
PDF, but at least I was able to get something up and running yesterday.

I'm gonna look into automating the process using one of the tools that you
mention.  Main problem is that I'm not sure if my web hosting provider
allows binaries.
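
One way to find out, as far as PHP itself is concerned, is to check whether shell_exec() has been disabled in the host's configuration. A small sketch (can_call() is a made-up helper name); note that even if the call is permitted, the converter binary still has to be installed or uploaded separately.

```php
<?php
// Report whether a function such as shell_exec() can be called on this host.
// $disabled defaults to the host's disable_functions setting from php.ini.
function can_call(string $function, ?string $disabled = null): bool
{
    $disabled = $disabled ?? (string) ini_get('disable_functions');
    $blocked  = array_map('trim', explode(',', $disabled));
    return function_exists($function) && !in_array($function, $blocked, true);
}

echo can_call('shell_exec') ? "shell_exec is available\n" : "shell_exec is disabled\n";
```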

I'm just getting started with PHP, but it sure is easy to use so far.

Thanks again,

