Importing pages

Hi all

I've written a content management system that I'm now selling to my
customers.  It's very nice when we have a blank canvas of a site, but a pain
in the arse when there is already a site in place.

What I'm in the process of *trying* to put together is a script that would
do the following:

A simple form where you put the address of the site with the static pages

The script then spiders through the site, takes everything between <body>
and </body> and chucks the rest away

It would then take out all class attributes and all embedded styling such
as font tags, but leave tables, <p>, <h?> and so on

This would leave a very plain page of HTML that would be inserted into a
database.  CSS would control the fonts etc.  I'm aware that there would need
to be some tidying up if there was any javascript or anything and also some
basic formatting.
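
The extract-and-strip steps described above could be sketched roughly as follows, using only Python's standard-library `html.parser`. This is a minimal sketch under stated assumptions, not a finished importer: the tag and attribute sets and the names `BodyCleaner` and `clean_body` are made up for illustration, and a page with no explicit <body> tag would come back empty.

```python
from html import escape
from html.parser import HTMLParser

# All four sets are illustrative assumptions, not a complete policy.
UNWRAP_TAGS = {"font", "center", "span"}   # drop the tag, keep its text
SKIP_TAGS = {"script", "style"}            # drop the tag and its content
DROP_ATTRS = {"class", "style", "align", "bgcolor", "color", "face", "size"}
VOID_TAGS = {"br", "hr", "img", "input"}   # elements that have no end tag


class BodyCleaner(HTMLParser):
    """Collect a plain version of whatever sits between <body> and </body>."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.out = []
        self.in_body = False
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif not self.in_body:
            return
        elif tag in SKIP_TAGS:
            self.skipping += 1
        elif not self.skipping and tag not in UNWRAP_TAGS:
            # Re-emit the tag with presentational attributes removed.
            kept = "".join(
                f" {name}" if value is None
                else f' {name}="{escape(value, quote=True)}"'
                for name, value in attrs
                if name not in DROP_ATTRS
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif not self.in_body:
            return
        elif tag in SKIP_TAGS:
            self.skipping = max(0, self.skipping - 1)
        elif not self.skipping and tag not in UNWRAP_TAGS | VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_body and not self.skipping:
            # Text was unescaped by the parser, so escape it again on output.
            self.out.append(escape(data, quote=False))


def clean_body(html):
    parser = BodyCleaner()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out).strip()
```

Calling `clean_body(page_source)` on a fetched page would give the stripped markup ready for the database. Real-world pages with broken nesting would probably need a more forgiving parser than this.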

What I want to know is:

1.  Has it been done and, if so, where might I find something like this?
2.  Might it have any commercial value to other developers?

Regarding 2, I'm thinking of how much time something like this might save
me if I have to convert anything more than a few pages of static HTML into
something that I can put in a database.

Your thoughts would be appreciated.


Re: Importing pages

Bad idea.  As of HTML 4.0, <head> and <body> tags are optional...
Also, why spider the site, if you can (theoretically, at least)
crawl the local file system?
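
Both points can be illustrated with a short Python sketch: walking a local copy of the site instead of spidering it, and falling back gracefully when a page has no <body> tag. The function names, the file extensions and the list of head-only elements are assumptions made for the example, and regular expressions are only a rough stand-in for a real HTML parser here.

```python
import re
from pathlib import Path

# Content of <body>, tolerating a missing </body> at the end of the file.
BODY_RE = re.compile(
    r"<body\b[^>]*>(.*?)(?:</body\s*>|\Z)", re.IGNORECASE | re.DOTALL
)
# Fallback for pages without <body>: remove things that belong in the head.
HEAD_ONLY_RE = re.compile(
    r"<(title|style|script)\b[^>]*>.*?</\1\s*>"
    r"|<(?:meta|link|base)\b[^>]*>"
    r"|<!DOCTYPE[^>]*>"
    r"|</?(?:html|head)\b[^>]*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_body(html):
    """Return the body markup, whether or not the page spells out <body>."""
    match = BODY_RE.search(html)
    if match:
        return match.group(1).strip()
    return HEAD_ONLY_RE.sub("", html).strip()


def iter_html_files(root):
    """Yield every .html/.htm file below root, in a stable order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            yield path
```

Looping over `iter_html_files(site_root)` and passing each file's text to `extract_body` would cover static sites you have file access to; server-generated pages would still need fetching over HTTP.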

The spidering part along with storing in databases is what search
engines do.  What you need to add is the processing in-between.
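
To make the "processing in-between" concrete, here is a rough Python sketch of the storage end with a pluggable processing step, using the standard-library `sqlite3` module. The table layout and the names `open_store` and `import_pages` are hypothetical; the original poster's CMS schema is not known.

```python
import sqlite3


def open_store(db_path=":memory:"):
    """Open (or create) a database with a hypothetical one-table schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def import_pages(conn, pages, process):
    """Run process() on each (url, raw_html) pair and store the result.

    A page imported twice overwrites its earlier row.
    """
    count = 0
    for url, raw_html in pages:
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (url, process(raw_html)),
        )
        count += 1
    conn.commit()
    return count
```

Whatever does the cleaning is passed in as `process`, so the spidering and storage code stays the same while the clean-up rules change.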

Developers, I doubt it.  Content managers, possibly...

