extraction of web page data

Hi all!

I have to extract data from a remote page. Using the cURL functions, I
was able to fetch the page both as a string and into a file. My
problem now is how to extract the plain text from that saved file, or
from the string containing the page data.


Re: extraction of web page data

Aditi Jindal wrote:

I'm not fully sure what you're asking, but if you mean extracting
data from HTML documents, you can use the DOM functions. E.g.:


$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.php.net/');

// Print the text of every link on the page
$links = $dom->getElementsByTagName('a');
foreach ($links as $i) {
    echo $i->textContent . PHP_EOL;
}


Alternatively, you can bypass the hassle of parsing third-party invalid
HTML with phpQuery:

http://code.google.com/p/phpquery/

-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://bits.demogracia.com
-- My bain-marie humor site: http://www.demogracia.com

Re: extraction of web page data

On Feb 12, 3:59 pm, "Álvaro G. Vicario" wrote:

Hello Alvaro, thanks for your reply. :)
I tried the code but it is not showing anything; only a blank page is
displayed. So I tried cURL functions such as curl_init() and
curl_setopt(), and I was able to store the page in a text file. Up to
here everything works. But the real problem is how to fetch the required
data from the text file and store it in the database.
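
One possible cause of the blank page (an assumption, since no error output was posted) is that the server does not let DOMDocument open remote URLs directly. A minimal sketch of a workaround is to fetch the page with cURL and hand the resulting string to loadHTML(). The helper names fetchPage() and extractLinkTexts() and the file name page.txt are made up for illustration.

```php
<?php
// Hypothetical helper: fetch a page into a string with cURL.
// It is only defined here, not called; calling it needs the curl extension.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $html;
}

// Hypothetical helper: return the text of every <a> element in an HTML string.
function extractLinkTexts(string $html): array
{
    $dom = new DOMDocument;
    $dom->loadHTML($html); // real-world pages may trigger parser warnings here

    $texts = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $texts[] = trim($link->textContent);
    }
    return $texts;
}

// The same call works on a page saved earlier:
// extractLinkTexts(file_get_contents('page.txt'));
$sample = '<html><body><a href="/docs">Docs</a> <a href="/faq">FAQ</a></body></html>';
print_r(extractLinkTexts($sample));
```

Once the values are in a plain PHP array like this, storing them in a database is an ordinary INSERT loop.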

Re: extraction of web page data

Aditi Jindal wrote:

You'll have to parse the file, just like you would any other file.

If the HTML is well formed, Alvaro's suggestion should work fine.
However, if it strays too far from the spec, the DOM code can have
trouble handling it (no surprise there!).
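
As a rough illustration of what "trouble" looks like in practice: DOMDocument usually still builds a tree from sloppy markup, but it raises a warning for each problem it finds. A sketch, assuming you would rather collect those problems than have them printed, is to switch on libxml's internal error handling around the load. The sample markup is invented for the example.

```php
<?php
// Deliberately sloppy markup: unclosed <p> and <b>, plus a stray </div>.
$html = '<html><body><p>First<p>Second <b>bold</div></body></html>';

$previous = libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$dom = new DOMDocument;
$loaded = $dom->loadHTML($html);
$errors = libxml_get_errors();                // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the earlier setting

// Report whatever the parser complained about (wording depends on the libxml version).
foreach ($errors as $error) {
    echo 'line ' . $error->line . ': ' . trim($error->message) . PHP_EOL;
}

// The parser repairs the tree as best it can, so the paragraphs are still reachable.
foreach ($dom->getElementsByTagName('p') as $p) {
    echo trim($p->textContent) . PHP_EOL;
}
```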

Of course, there are also other ways to parse a page - strcmp(),
preg_match and the like.  But some end up being much more complicated to

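For comparison, here is a sketch of the regular-expression route using preg_match_all(). The pattern and sample row are assumptions made for the example; a pattern like this breaks easily on nested tables or unusual markup, which is why the DOM approach is usually less work.

```php
<?php
// Invented sample row; real pages will be messier.
$html = '<tr><td><font size=3>129433</font></td><td><font size=3>Muss</font></td></tr>';

// Capture whatever sits between <td ...> and </td>.
// "s" lets . match newlines, "i" ignores tag case.
preg_match_all('~<td\b[^>]*>(.*?)</td>~si', $html, $matches);

// $matches[1] holds the captured cell contents, still wrapped in <font> tags,
// so strip the remaining tags and surrounding whitespace.
$cells = array_map(function ($cell) {
    return trim(strip_tags($cell));
}, $matches[1]);

print_r($cells);
```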
Jerry Stuckle
JDS Computer Training Corp.

Re: extraction of web page data


  echo print_r($i, true) . "<br>";


Thank you for that example, as I'm just starting to feel the need for such
things in my project.

I understand that the example is getting the <a ...> tag elements. What I
need is to get the text value from within such 'td' tags as:

<tr bgcolor="#FFFFFF">
<td><font face="arial,helvetica" size=3>129433</font></td>
<td><font face="arial,helvetica" size=3>Muss</font></td>
<td><font face="arial,helvetica" size=3>Daniel R</font></td>
<td><font face="arial,helvetica" size=3>06/30/2008</font></td>

I would appreciate some guidance on how to extract/process the particular
Object of interest generated with getElementsByTagName('td') such as:

DOMElement Object

Member #LastFirstExpiration129433
Daniel R

The many Objects don't have an index of any kind to work with as an array
does, so I'm at a bit of a loss as to how to proceed with processing the
necessary DOMElement Object data.

Thank you.
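
A possible way to get at the cells: the object returned by getElementsByTagName() is a DOMNodeList, which has a length property and an item($n) method that serve as the positional index. Calling getElementsByTagName('td') on each row element, rather than on the whole document, keeps the cells grouped per row. The sketch below assumes every data row has the four columns shown above; the array keys are made-up labels based on the column headings.

```php
<?php
$html = <<<HTML
<table>
<tr bgcolor="#FFFFFF">
<td><font face="arial,helvetica" size=3>129433</font></td>
<td><font face="arial,helvetica" size=3>Muss</font></td>
<td><font face="arial,helvetica" size=3>Daniel R</font></td>
<td><font face="arial,helvetica" size=3>06/30/2008</font></td>
</tr>
</table>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);

$rows = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = $tr->getElementsByTagName('td'); // DOMNodeList limited to this row
    if ($cells->length < 4) {
        continue; // skip header rows or rows with a different layout
    }
    // textContent returns the text of the cell including nested tags such as <font>.
    $rows[] = [
        'member'     => trim($cells->item(0)->textContent),
        'last'       => trim($cells->item(1)->textContent),
        'first'      => trim($cells->item(2)->textContent),
        'expiration' => trim($cells->item(3)->textContent),
    ];
}

print_r($rows);
```

Each entry in $rows is then an ordinary associative array, so it can be looped over or inserted into a database like any other data.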

Re: extraction of web page data


Hey! Check this link and download the book: http://www.schrenk.com/nostar =

Re: extraction of web page data


Have a look at strip_tags() to strip out the HTML.
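
A minimal sketch of that approach, assuming the page is already in a string. One thing to watch for: strip_tags() removes the tags themselves but keeps the text inside <script> and <style> blocks, so the example drops those blocks first with a simple pattern (an assumption that is good enough for typical pages, not a full parser).

```php
<?php
// Invented sample page.
$html = '<html><head><style>td { color: red; }</style></head>'
      . '<body><p>Hello <b>world</b> &amp; welcome.</p></body></html>';

// Remove <script> and <style> blocks entirely, contents included.
$clean = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~si', '', $html);

$text = strip_tags($clean);                              // drop the remaining tags
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');  // turn &amp; back into &
$text = trim(preg_replace('/\s+/', ' ', $text));         // collapse runs of whitespace

echo $text . PHP_EOL;
```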

