FAQ 9.5 How do I extract URLs?


This is an excerpt from the latest version of perlfaq9.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .


9.5: How do I extract URLs?

    You can easily extract all sorts of URLs from HTML with
    "HTML::SimpleLinkExtor" which handles anchors, images, objects, frames,
    and many other tags that can contain a URL. If you need anything more
    complex, you can create your own subclass of "HTML::LinkExtor" or
    "HTML::Parser". You might even use "HTML::SimpleLinkExtor" as an example
    for something specifically suited to your needs.
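
    As a rough illustration of the module-based approach, the sketch below
    feeds a small, made-up HTML string to "HTML::SimpleLinkExtor" (a CPAN
    module, so it must be installed separately) and asks for all links as
    well as the links from particular tags. The sample document and the
    variable names are assumptions for the example only.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;   # CPAN module, not part of core Perl

# Hypothetical sample document, for illustration only.
my $html = <<'HTML';
<html><body>
<a href="http://www.perl.org/">Perl</a>
<img src="/images/camel.png" alt="camel">
</body></html>
HTML

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

my @all    = $extor->links;  # every link found, regardless of tag
my @anchor = $extor->a;      # only links from <a> tags
my @images = $extor->img;    # only links from <img> tags

print "$_\n" for @all;
```

    No base URL is given to the constructor here, so relative links should
    come back as they were written in the document.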

    You can use "URI::Find" to extract URLs from an arbitrary text document.
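
    A minimal sketch of that, assuming "URI::Find" is installed from CPAN:
    the callback receives each URI as an object along with the text as it
    appeared, and whatever the callback returns replaces that text in the
    document. The sample text and the @found array are made up for the
    example.

```perl
use strict;
use warnings;
use URI::Find;   # CPAN module, not part of core Perl

# Hypothetical sample text, for illustration only.
my $text = <<'TEXT';
The FAQ lives at http://faq.perl.org and modules are on
https://metacpan.org/ for anyone to download.
TEXT

my @found;
my $finder = URI::Find->new( sub {
    my ( $uri, $orig_text ) = @_;   # URI object, text as it appeared
    push @found, "$uri";            # stringify the URI object
    return $orig_text;              # leave the document unchanged
} );

# find() takes a reference to the text and returns how many URIs it saw.
my $count = $finder->find( \$text );

print "$count URLs found\n";
print "$_\n" for @found;
```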

    Less complete solutions involving regular expressions can save you a lot
    of processing time if you know that the input is simple. One solution
    from Tom Christiansen runs 100 times faster than most module-based
    approaches but only extracts URLs from anchors where the first attribute
    is HREF and there are no other attributes.

            #!/usr/bin/perl -n00
            # qxurl - tchrist@perl.com
            print "$2\n" while m{
                    < \s*
                      A \s+ HREF \s* = \s* (["']) (.*?) \g1
                    \s* >
            }gsix;
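
    To see the limitation described above, the following sketch applies the
    same pattern to a made-up string instead of reading paragraphs with
    -n00. It needs Perl 5.10 or later for "\g1" and uses no modules. The
    sample input and the @urls array are assumptions for the example.

```perl
use strict;
use warnings;

# Hypothetical sample input; the pattern is the one from the FAQ answer.
my $html = <<'HTML';
<a href="http://www.perl.org/">Perl</a>
<A HREF='http://www.cpan.org/'>CPAN</A>
<a class="ext" href="http://example.com/">HREF is not the first attribute</a>
HTML

my @urls;
while ( $html =~ m{
        < \s*
          A \s+ HREF \s* = \s* (["']) (.*?) \g1
        \s* >
    }gsix )
{
    push @urls, $2;   # $2 is the text between the matching quotes
}

print "$_\n" for @urls;
```

    Only the first two anchors are reported; the third is skipped because
    HREF is not its first attribute. An anchor with further attributes after
    HREF is riskier still, since the non-greedy ".*?" can run past the
    closing quote and capture more than the URL.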

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
