September 22, 2004, 5:34 pm
I've written a content management system that I'm now selling to my
customers. It's very nice when we have a blank canvas of a site, but a pain
in the arse when there is already a site in place.
What I'm in the process of *trying* to put together is a script that would
do the following:
- A simple form where you enter the address of the site with the static pages.
- The script then spiders through the site, takes everything between <body>
  and </body>, and chucks the rest away.
- It then strips out all class definitions and all embedded styling, such as
  font tags, but leaves tables, <p>, <H?> etc.
This would leave a very plain page of HTML that would be inserted into a
database. CSS would control the fonts etc. I'm aware that there would need
What I want to know is:
1. Has it been done and, if so, where might I find something like this?
2. Might it have any commercial value to other developers?
Regarding 2, I'm thinking how much time something like this might save me if
I have to convert anything more than a few pages of static HTML into
something that I can put in a database.
Your thoughts would be appreciated.
- Nikolai Chuvakhin
September 22, 2004, 5:16 pm
Re: Importing pages
Bad idea. As of HTML 4.0, <head> and <body> tags are optional...
Also, why spider the site, if you can (theoretically, at least)
crawl the local file system?
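To illustrate the file-system route, here is a minimal sketch in Python (the thread names no language, so that choice is an assumption). The helper name `find_html_files` and the `.html`/`.htm` extension filter are hypothetical, and it assumes you have a local copy of the site's document root.

```python
from pathlib import Path

# Assumption: static pages are the files ending in .html or .htm.
HTML_EXTENSIONS = {".html", ".htm"}


def find_html_files(root):
    """Yield every HTML file below `root`, walking subdirectories too."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in HTML_EXTENSIONS:
            yield path
```

Something like `for page in find_html_files("/var/www/oldsite"): ...` (the path is made up) would then visit each page without any HTTP requests or link parsing.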
The spidering part along with storing in databases is what search
engines do. What you need to add is the processing in-between.
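A rough sketch of that in-between processing, again in Python and using only the standard library's `html.parser`. The names `BodyCleaner` and `clean_html` and the three tag/attribute sets are assumptions made for illustration, not anything from the original script. Rather than searching for <body> and </body>, which may be absent, it discards head-type content (title, script, style) and the wrapper tags wherever they occur.

```python
from html import escape
from html.parser import HTMLParser

# Assumptions: which tags and attributes count as "presentational".
SKIP_CONTENT = {"title", "script", "style"}   # drop the tag and its contents
DROP_TAGS = {"html", "head", "body", "meta", "link", "base",
             "font", "basefont", "center"}    # drop the tag, keep its contents
DROP_ATTRS = {"class", "style"}               # strip these from kept tags


class BodyCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in DROP_TAGS:
            return
        parts = [tag]
        for name, value in attrs:
            if name in DROP_ATTRS:
                continue
            parts.append(name if value is None
                         else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br />: emit once, with no end tag.
        if tag not in SKIP_CONTENT:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in DROP_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives with entities decoded, so re-escape it.
            self.out.append(escape(data, quote=False))


def clean_html(source):
    """Return the page's content with wrappers and styling removed."""
    parser = BodyCleaner()
    parser.feed(source)
    parser.close()
    return "".join(parser.out).strip()
```

The string returned by `clean_html(page_source)` is what would go into the database. Comments and doctype declarations are dropped as a side effect, and badly malformed pages would probably need a more forgiving parser than this.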
Developers, I doubt it. Content managers, possibly...