Parsing a website - strategy


Recently I got a project to pull info from different websites and put
it into a DB.
Now I was wondering what the best technique is to implement something
like that.

How should I open the pages from other websites: with fopen, through a
socket, or with curl?

After that, what is the fastest way to parse a whole page for info, and
of course to parse it a few times to get different info from the same
page?


Re: Parsing a website - strategy

In reply to aka_eu:

Either way works; it depends on which website you are accessing and what
you need to do. If your answer to any of the following questions is yes,
then use curl: Will your script need to auto-submit any forms to these
websites? Do any of the sites use cookies? If a page is inaccessible, do
you need to know why?

file_get_contents is the easiest way, but it tells you little about why
a webpage was inaccessible, and without a stream context it only
performs simple GET requests.
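
To illustrate the simple route, here is a minimal sketch. The helper name
fetch_simple and the example URL are made up for illustration. Note that
on failure all you get back is false (turned into null here), which is
the "not informative" part.

```php
<?php
// Hypothetical helper: fetch a URL (or a local path) with file_get_contents.
// Returns the body as a string, or null on any kind of failure.
function fetch_simple(string $url): ?string
{
    // @ suppresses the PHP warning; false is all we learn about the failure.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Usage (placeholder URL):
// $html = fetch_simple('http://www.example.com/');
// if ($html === null) { echo "Could not fetch the page\n"; }
```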

Curl has comprehensive error reporting, and you can post forms using
curl_setopt with CURLOPT_POST and CURLOPT_POSTFIELDS. It can deal with
cookie-based websites, pretend it's a browser/bot, and has plenty of
other useful features.
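
A rough sketch of the curl route, assuming the curl extension is
installed. The helper name fetch_page, its parameters, the cookie file
and the user-agent string are all made up for illustration; adjust the
options to whatever the target site needs.

```php
<?php
// Hypothetical helper: fetch a page with curl and report why it failed.
// $postFields, $cookieFile and the user-agent string are illustrative only.
function fetch_page(string $url, ?array $postFields = null, ?string $cookieFile = null): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // give up after 30 seconds
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleBot/1.0)');

    if ($cookieFile !== null) {
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);   // write cookies here when the handle is released
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);  // send cookies read from here
    }
    if ($postFields !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }

    $body = curl_exec($ch);

    // The handle is released automatically when $ch goes out of scope.
    return [
        'body'   => $body === false ? null : $body,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),   // 0 if no response was received
        'error'  => $body === false ? curl_error($ch) : null, // human-readable failure reason
    ];
}

// Usage (placeholder URL and form fields):
// $r = fetch_page('http://www.example.com/login', ['user' => 'me'], '/tmp/cookies.txt');
// if ($r['body'] === null) { echo "Failed: {$r['error']}\n"; }
```

Returning the status code and the error string together is what lets the
caller tell "site is down" apart from "page not found".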

You could do all this yourself using sockets, but it's already been done
with curl and it's sooo tedious.

For parsing the page, best use DOM.

I've seen some people use regular expressions to do it but the regexes
soon end up being a nightmare to maintain or change when the website
inevitably changes. But if you're only looking for a few pieces of
information from a few sites preg_match could work.
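
For the "few pieces of information" case, a preg_match sketch might look
like this. The HTML fragment and the patterns are made up for
illustration; real pages will need their own patterns.

```php
<?php
// Made-up HTML fragment; in practice this would be the fetched page.
$html = '<html><head><title>Example Shop</title></head>'
      . '<body><span class="price">19.99</span></body></html>';

// Pull out the <title> text. i = case-insensitive, s = dot also matches newlines.
$title = preg_match('~<title>(.*?)</title>~is', $html, $m) ? trim($m[1]) : null;

// Pull out the price. This breaks as soon as the markup changes.
$price = preg_match('~<span class="price">([\d.]+)</span>~', $html, $m) ? (float) $m[1] : null;
```

Each new field means another hand-written pattern tied to the exact
markup, which is where the maintenance pain comes from.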

With DOM you parse the page into a DOM tree using
DOMDocument->loadHTML(), then use the DOM methods and XPath to get what
you want. Especially XPath....
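
Something along these lines, as a minimal sketch: the HTML and the XPath
expressions are invented for the example. The page is parsed once and
then queried as many times as needed for different pieces of info.

```php
<?php
// Made-up HTML; in practice this would be the fetched page.
$html = '<html><body>'
      . '<ul class="products">'
      . '<li><a href="/item/1">Widget</a> <span class="price">19.99</span></li>'
      . '<li><a href="/item/2">Gadget</a> <span class="price">5.50</span></li>'
      . '</ul></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);             // parse once
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$products = [];
foreach ($xpath->query('//ul[@class="products"]/li') as $li) {
    // Queries relative to the current <li> node.
    $link  = $xpath->query('./a', $li)->item(0);
    $price = $xpath->query('./span[@class="price"]', $li)->item(0);
    $products[] = [
        'name'  => trim($link->textContent),
        'url'   => $link->getAttribute('href'),
        'price' => trim($price->textContent),
    ];
}
```

When the site changes its layout, usually only the XPath strings need
updating, not the surrounding code.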

Don't know if it's the fastest to execute at runtime, but if anyone knows
a more flexible, useful way of data mining, I need to know.

The DOM method getElementById doesn't work unless the page has a proper
doctype (which most webpages lack). [link missing] explains the problem
and the solutions, and has a straightforward example of using XPath as
well. [link missing] is a good XPath tutorial: ugly site, but there are
plenty of good examples to learn from and an interactive lab.
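
One common way around the getElementById problem is to match on the id
attribute with XPath instead. This is a sketch of that general idea, not
necessarily the solution described on the page mentioned above; the HTML
and the id value are made up.

```php
<?php
// Made-up HTML without a doctype.
$html = '<html><body><div id="main"><p>Hello</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// Match on the id attribute directly instead of relying on getElementById().
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//*[@id="main"]')->item(0);
$text  = $node !== null ? trim($node->textContent) : null;
```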


