Save web page as text file

I need to save the contents of a web page as text in a string.
With file_get_contents I can save the HTML code, but I need only the output
of the pages.
Is it possible to do that with PHP? How?

Thanks to everyone.

Re: Save web page as text file

On Sun, 2010-03-07 at 12:29 +0100, Leo.C wrote:
yes. With strip_tags()
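
A minimal sketch of that suggestion, assuming the page is static HTML. The helper name html_to_text() and the sample markup are made up for illustration; on a live page, $html would come from file_get_contents($url).

```php
<?php
// Hypothetical helper (not from the thread): reduce static HTML to its text.
function html_to_text(string $html): string
{
    // strip_tags() drops the markup but keeps the text between the tags.
    $text = strip_tags($html);
    // Turn entities such as &amp; back into plain characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return trim($text);
}

// Sample markup stands in for the result of file_get_contents($url).
$html = '<html><body><h1>Title</h1><p>Fish &amp; chips</p></body></html>';
echo html_to_text($html), "\n";
```

Note that strip_tags() inserts no separators, so text from adjacent elements runs together.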

Re: Save web page as text file

> yes. With strip_tags()

No, it's not so easy, because the web page is generated with JavaScript and
Flash code, and the output that I need isn't in the HTML code.

Re: Save web page as text file

Leo.C wrote:
Then no, you can't. PHP does not interpret JavaScript or Flash.
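
To illustrate the point, here is a small sketch with made-up markup. The text that the script would write in a browser never appears, because PHP treats the script as plain text and does not run it.

```php
<?php
// Made-up page: the visible text is produced by JavaScript in the browser.
$html = '<body><div id="out"></div>'
      . '<script>document.write("Hello from JavaScript");</script></body>';

// strip_tags() only deletes the tags; the script is never executed,
// so what remains is the script source, not the output it would produce.
$text = strip_tags($html);
echo $text, "\n";
```

What comes back is the script's source code, which is also why stripping tags alone is not enough to clean up a page that contains scripts.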

Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.

Re: Save web page as text file

On 07/03/2010 13:46, Leo.C wrote:

You are suggesting that you want to write a web browser in PHP with a
JavaScript runtime and a Flash plugin but, of course, that cannot be true.
My wild guess is that you want a tool to generate screenshots of live
sites so you can offer previews in a link directory or something
similar. You won't be able to do it in PHP. You need to install a real
browser on your server (a browser with a GUI, such as Firefox) and set
up an automated tool to launch URLs and grab screenshots. It can be done,
and it has actually been done before, but it's a complex and
resource-consuming task, and PHP is totally unsuitable since it's not
adequate for controlling GUI apps and accessing the desktop.

-- - Álvaro G. Vicario - Burgos, Spain

Re: Save web page as text file

Leo.C wrote:
It looked like an interesting problem, so I gave it a shot. The following
does not save to a file, but it will remove all tags.

Hope it gives you a good start. BTW, I used:

Webbots, Spiders, and Screen Scrapers
by Michael Schrenk

as a resource (the modified functions came from the book)

I am sure there is a better way, but this is what I came up with :D

############## CODE BELOW ###########################

<?php
$url = '../index.html';
$web_page = file_get_contents($url);
// Remove all JavaScript
$noscript = remove($web_page, "<script", "</script>");
// Strip out all HTML formatting
$noformat = strip_tags($noscript);

$noformat = str_replace("\t", "", $noformat);     // Remove tabs
$noformat = str_replace("&nbsp;", "", $noformat); // Remove non-breaking spaces
$noformat = str_replace("\n", "", $noformat);     // Remove line feeds
echo $noformat;

function remove($string, $open_tag, $close_tag)
{
     # Get array of things that should be removed from the input string
     $remove_array = parse_array($string, $open_tag, $close_tag);

     # Remove each occurrence of each array element from string;
     # str_replace() accepts an array of search strings, so no loop is needed
     $string = str_replace($remove_array, "", $string);

     return $string;
}

function parse_array($string, $beg_tag, $close_tag)
{
     # The outer parentheses are the regex delimiters.
     # s: dot matches newlines, i: case-insensitive, U: ungreedy
     preg_match_all("($beg_tag(.*)$close_tag)siU", $string, $matching_data);
     return $matching_data[0];
}
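
Since the code above only echoes the result, here is one possible way to write the text to a file. It is a sketch, not part of the original post: save_text() and the output path are hypothetical, and the sample string stands in for $noformat.

```php
<?php
// Hypothetical helper: write extracted text to a file, return true on success.
function save_text(string $text, string $path): bool
{
    // Collapse runs of whitespace into single spaces so words stay separated.
    $clean = trim(preg_replace('/\s+/', ' ', $text));
    // file_put_contents() returns the number of bytes written, or false.
    return file_put_contents($path, $clean . "\n") !== false;
}

// Stand-in for $noformat; the path is an assumption.
$path = sys_get_temp_dir() . '/page.txt';
$ok = save_text("Title\n\tSome   text", $path);
```

Collapsing whitespace to a single space, rather than deleting tabs and line feeds outright, keeps words on neighbouring lines from being glued together.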
