FAQ 6.4 How do I match XML, HTML, or other nasty, ugly things with a regex?

This is an excerpt from the latest version of perlfaq6.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .


6.4: How do I match XML, HTML, or other nasty, ugly things with a regex?

    (contributed by brian d foy)

    If you just want to get work done, use a module and forget about the
    regular expressions. The "XML::Parser" and "HTML::Parser" modules are
    good starts, although each namespace has other parsing modules
    specialized for certain tasks and different ways of doing it. Start at
    CPAN Search ( http://search.cpan.org ) and wonder at all the work people
    have done for you already! :)

    The problem with things such as XML is that they have balanced text
    containing multiple levels of balanced text, but sometimes it isn't
    balanced text, as in an empty tag ("<br/>", for instance). Even then,
    things can occur out-of-order. Just when you think you've got a pattern
    that matches your input, someone throws you a curveball.
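
    As a small illustration of such a curveball, here is a sketch of what
    happens when a simple pattern meets nested tags. The sample string and
    the variable names are made up for this example; they are not part of
    the FAQ.

```perl
use strict;
use warnings;

# Hypothetical sample input: one <div> nested inside another.
my $html = '<div>outer <div>inner</div> tail</div>';

# Non-greedy: stops at the FIRST closing tag, which belongs to the
# inner <div>, so the capture ends too early.
my ($naive) = $html =~ m{<div>(.*?)</div>}s;

# Greedy: runs to the LAST closing tag. That looks right here, but it
# would swallow everything between two sibling <div> elements.
my ($greedy) = $html =~ m{<div>(.*)</div>}s;

print "naive:  $naive\n";
print "greedy: $greedy\n";
```

    Neither quantifier knows which closing tag belongs to which opening
    tag, and that pairing is exactly what a parser keeps track of for you.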

    If you'd like to do it the hard way, scratching and clawing your way
    toward a right answer but constantly being disappointed, besieged by bug
    reports, and weary from the inordinate amount of time you have to spend
    reinventing a triangular wheel, then there are several things you can
    try before you give up in frustration:

    *   Solve the balanced text problem from another question in perlfaq6

    *   Try the recursive regex features in Perl 5.10 and later. See perlre.

    *   Try defining a grammar using Perl 5.10's "(?DEFINE)" feature.

    *   Break the problem down into sub-problems instead of trying to use a
        single regex

    *   Convince everyone not to use XML or HTML in the first place
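
    To give a feel for the recursive regex and "(?DEFINE)" suggestions in
    the list above, here is a minimal sketch of a toy grammar. It assumes
    elements with no attributes and ignores comments, CDATA sections, and
    entities, so it is nowhere near a real XML or HTML parser. The rule
    name "element", the capture name "name", and the sample strings are
    invented for this example.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns and (?(DEFINE)...) need Perl 5.10+

# A toy grammar: one rule, "element", that calls itself for nested tags.
my $well_formed = qr{
    \A (?&element) \z

    (?(DEFINE)
        (?<element>
            < \w+ \s* / >                  # empty element such as <br/>
          |
            < (?<name> \w+ ) >             # opening tag, name remembered
            (?: [^<]++ | (?&element) )*    # text or nested elements
            < / \k<name> >                 # closing tag with the same name
        )
    )
}x;

# Hypothetical sample inputs.
my @samples = (
    '<a><b>text</b><br/></a>',    # properly nested
    '<a><b>text</a></b>',         # tags closed out of order
    '<a href="x">text</a>',       # attribute: outside this toy grammar
);

for my $sample (@samples) {
    my $verdict = $sample =~ $well_formed ? 'matches' : 'does not match';
    say "$sample $verdict";
}
```

    Only the first sample matches. The third one is perfectly ordinary
    markup that the toy grammar rejects, which is the sort of gap you
    keep finding when you go down this road.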

    Good luck!


The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
