crawling the net...



I'm making a program to crawl the internet. It works by retrieving all the
links on a page, downloading the page behind each link, and then retrieving
all the links again. (If there are better ways, I'd like to hear them.)
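
Roughly, the loop looks like this (just a sketch; the start URL is made up
and ExtractLinks stands in for whatever link extraction I end up using):

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class Crawler
{
    static void Main()
    {
        Queue<string> pending = new Queue<string>();
        HashSet<string> seen = new HashSet<string>();
        pending.Enqueue("http://example.com/");  // made-up start page

        using (WebClient client = new WebClient())
        {
            while (pending.Count > 0)
            {
                string url = pending.Dequeue();
                if (!seen.Add(url))
                    continue;  // skip pages we've already visited

                string html = client.DownloadString(url);
                foreach (string link in ExtractLinks(html))
                    pending.Enqueue(link);
            }
        }
    }

    // Crude href="" extraction; a real crawler would use an HTML parser.
    static IEnumerable<string> ExtractLinks(string html)
    {
        foreach (Match m in Regex.Matches(html, "href=\"([^\"]+)\""))
            yield return m.Groups[1].Value;
    }
}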

My problem is relative links (like "../../wohoo.asp"). What is the smartest
way to get the full URL? Do I have to parse the relative link in relation to
the URL of the page where it was found and then concatenate them? Does anyone
know how other search engines/crawlers walk the net?

Thanks  :)


Re: crawling the net...

ask josephsen posted:


You should have posted this on:


It would've been more on-topic _there_.


Re: crawling the net...

Hi Ask,

You could try using the features of Path.GetFullPath, which collapses /../
and /./ and returns the proper path.  However, it insists on adding the
application path, so you will need to do something like

string newUrl = ...

It will switch the / to \ though. Oh, and remove the http:// from the url.
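
Something like this end to end, for example (a rough sketch; the URLs are
made-up examples and Resolve is just an illustrative name, not anything
from the framework):

using System;
using System.IO;

class UrlResolver
{
    static string Resolve(string baseUrl, string relativeLink)
    {
        const string scheme = "http://";

        // Remove the http:// prefix and the document name, keeping
        // only the directory part of the base URL.
        string basePath = baseUrl.Substring(scheme.Length);
        int lastSlash = basePath.LastIndexOf('/');
        if (lastSlash >= 0)
            basePath = basePath.Substring(0, lastSlash);

        // Path.GetFullPath collapses /../ and /./, but roots the result
        // at the application's current directory, so strip that off again.
        string appDir = Path.GetFullPath(".");
        string full = Path.GetFullPath(Path.Combine(basePath, relativeLink));
        string trimmed = full.Substring(appDir.Length).TrimStart('\\');

        // It also switches / to \, so switch them back and re-add http://.
        return scheme + trimmed.Replace('\\', '/');
    }

    static void Main()
    {
        Console.WriteLine(Resolve("http://example.com/a/b/page.asp",
                                  "../../wohoo.asp"));
        // prints http://example.com/wohoo.asp (on Windows)
    }
}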


There are plenty of web crawlers out there; just do a web search on
"web crawler" and "web bot".

Happy coding!
Morten Wennevik [C# MVP]

Re: crawling the net...

I'm not developing web crawlers, but a quick thought of mine:

string link = "../../wohoo.asp";
string thisPageURL = " ";
string[] linkParts = System.Text.RegularExpressions.Regex.Split(link,
@"\x2E\x2E/"); // split on ../
string[] URLParts = System.Text.RegularExpressions.Regex.Split(thisPageURL,
"/"); // split on /

linkParts.Length - 1 will now contain the wanted number of "../" "directory
recursions", and the last element will be the wanted page. The URL to the new
page is then concatenated from the URLParts array, excluding the last
linkParts.Length elements, and appending the last element in linkParts.

Just a quick shot at a solution...
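
To make that concrete, a runnable version of the same idea (thisPageURL
here is a made-up example):

using System;
using System.Text.RegularExpressions;

class RelativeLinks
{
    static void Main()
    {
        string link = "../../wohoo.asp";
        string thisPageURL = "http://example.com/a/b/page.asp"; // made up

        // \x2E is an escaped ".", so the pattern matches "../" literally.
        string[] linkParts = Regex.Split(link, @"\x2E\x2E/");
        string[] urlParts  = Regex.Split(thisPageURL, "/");

        // linkParts.Length - 1 is the number of "../" recursions;
        // the last element of linkParts is the wanted page.
        int up = linkParts.Length - 1;

        // Drop the page name plus one directory per "../",
        // then re-append the target page.
        int keep = urlParts.Length - 1 - up;
        string newUrl = string.Join("/", urlParts, 0, keep)
                        + "/" + linkParts[linkParts.Length - 1];

        Console.WriteLine(newUrl);  // http://example.com/wohoo.asp
    }
}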


"ask josephsen" <jaj(((a)))> wrote in message

Re: crawling the net...

ask josephsen <jaj(((a)))> spoke thus:


(You could look at how wget is implemented.  Or, better, just USE wget.)

Your post is off-topic for comp.lang.c++.  Please visit /
for posting guidelines and frequently asked questions.  Thank you.

Christopher Benson-Manica  | I *should* know what I'm talking about - if I
ataru(at)    | don't, I need to know.  Flames welcome.

Re: crawling the net...

You might want to look into widely available
open-source software like libwww that already
handles many details of interacting with
standard items like HTTP and URLs. In particular:

 > The webbot is a very fast Web walker with
 > support for regular expressions, SQL logging
 > facilities, and many other features. The
 > webbot comes with the libwww codebase.
 > It can be used to check links, find bad HTML,
 > map out a web site, download images, etc.

from: /

libwww works on both Windows and Unix.
