I have a content management system that stores links inside the content
field in the database, and I need to verify that those links are correct.
What I need is a PHP script that queries the database and then parses
the content field to find all the <a href> tags, extracting each href
attribute value and its link text.

Does anyone have a way of doing this or a regex to do this?


Re: Parsing content for links

Tony wrote:
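// $matches[1] holds the href values, $matches[2] the link texts
preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"']+" .
    "(.*?)[\"']+.*?>" . "([^<]+|.*?)?<\/a>/",
    $html, $matches);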
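
For reference, here is a minimal sketch of how preg_match_all() lays out its results in the default PREG_PATTERN_ORDER mode. The pattern is a simplified stand-in rather than the one above, and $html is a made-up sample string; in practice it would hold the content field read from the database.

```php
<?php
// Hypothetical sample content, standing in for a row's content field.
$html = '<p><a href="http://example.com/">Example</a> and '
      . '<a class="x" href=\'http://example.org/a\'>Other</a></p>';

// Simplified stand-in pattern (an assumption, not the poster's regex):
// group 1 = quote character, group 2 = href value, group 3 = link text.
$pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';

// Pass $matches without "&": call-time pass-by-reference is an error
// in current PHP versions, and the function already takes it by reference.
$count = preg_match_all($pattern, $html, $matches);

// In PREG_PATTERN_ORDER, each group gets its own list across all matches.
$hrefs = $matches[2];
$texts = $matches[3];

print_r(array_combine($hrefs, $texts));
```

Each entry of $hrefs lines up with the entry of $texts at the same index, so the two lists can be walked together when checking the links.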

Arjen - My site about dogs

Re: Parsing content for links

Tony wrote:
Yeah, regex would be easiest, and there should be plenty out there,  
but I might do something like this:

$re = '%
<a[^<>]+            # href may or may not come first
href=([\'"])        # capture single/double quote

(                   # capture the URI
    # match a valid URI
    [\w.-]+:(?://)?     # scheme
    [^?#"\']+           # authority and path

    # possible query string and fragment
    (?: \? [^#"\']+ )?
    (?: \# [^"\']+ )?
)
\1                  # captured quote from above
[^<>]*              # possible remaining attributes
>( .*? )            # allow for nested tags
</a>                # closing <a> tag
%xis';

The match for the URI would be in $match[2] and the text for the <a>  
tag is in $match[3].

Just use this $re var in the preg_* functions.
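
A minimal sketch of what that might look like with preg_match_all() and PREG_SET_ORDER, so each match comes back as one array. The helper name extract_links(), its return shape, and the sample string are illustrative assumptions, and the pattern is a shortened variant that accepts any quoted href value rather than validating the URI.

```php
<?php
// Hypothetical helper: returns a list of ['href' => ..., 'text' => ...] pairs.
function extract_links(string $html): array
{
    // Shortened variant of the extended-mode pattern (an assumption).
    $re = '%
        <a\s[^<>]*?          # attributes before href, if any
        href=([\'"])         # group 1: opening quote
        ( [^\'"<>]+ )        # group 2: the href value
        \1                   # same quote closes the value
        [^<>]*               # possible remaining attributes
        >( .*? )             # group 3: link text, nested tags allowed
        </a>
    %xis';

    $links = [];
    if (preg_match_all($re, $html, $found, PREG_SET_ORDER)) {
        foreach ($found as $match) {
            $links[] = [
                'href' => $match[2],
                // Drop nested markup so only the visible text remains.
                'text' => trim(strip_tags($match[3])),
            ];
        }
    }
    return $links;
}

// Made-up sample input.
$sample = '<a href="http://example.com/?q=1#top" title="t"><b>Bold</b> link</a>';
print_r(extract_links($sample));
```

Running each database row's content field through a helper like this gives a flat list of href/text pairs that can then be checked one by one.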

Hope this helps,

