Please test/review my Regex to locate hyperlink in text

My requirements are as follows:

1) Find all hyperlinks in a given text document.
2) Parse certain links and replace them if need be (in a separate
parser function).
3) Any attributes given in the hyperlink can be ignored and will be
faithfully returned in the matches.
4) All JavaScript and the like is pre-stripped, so the text can be
assumed to be 'safe' (if it is not safe, it is not the job of this
regex to handle it).

// -----------------------------------
// my pattern...
$pattern = '/<a (.*?)href=[\"\']??(.*?)\/\/(.*?)[\s\"\'](.*?)>(.*?)<\/a>/i';

// the callback function
$body = preg_replace_callback($pattern, 'my_parser', $body);

// -----------------------------------

The way I see it, this should work for...

- <a href=''>some text</a>
- <a href="">some text</a>
- <a>some text</a>
- <a href=' '>some text</a>
- <a href=" ">some text</a>
- <a href= some text</a>

- <a href='' target=_blank>some text</a>
- <a href="" target=_blank>some text</a>
- <a target=_blank>some text</a>
- <a href=' ' target=_blank>some text</a>
- <a href=" " target=_blank>some text</a>
- <a href= target=_blank>some text</a>

Can you poke holes in my regex, please? :)
Any suggestions/better regexes?

Many thanks


Re: Please test/review my Regex to locate hyperlink in text

On 21/02/2011 7:06, Simon wrote:
If you are looking for <a> tags then it isn't a plain text document,
it's an HTML document. Unless it's just an exercise to learn how to use
regular expressions, you can simply do something like this:


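$url = '';

$html = file_get_contents($url);
$doc = new DOMDocument;
// Parse the fetched markup; without this the document stays empty
$doc->loadHTML($html);

// Walk every <a> element and print its text and href
$links = $doc->getElementsByTagName('a');
foreach($links as $a){
    echo $a->nodeValue . ': ' . $a->getAttribute('href') . PHP_EOL;
}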


Afterwards, you can analyse URLs with parse_url():
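
A minimal sketch of that step, assuming a placeholder URL (the address
and the variable names $href and $parts are illustrative only, not taken
from the original post):

```php
<?php
// Placeholder URL, used only for illustration.
$href = 'http://www.example.com/path/page.php?id=42#top';

// parse_url() returns an associative array of the components it finds,
// or false if the URL is seriously malformed.
$parts = parse_url($href);

if ($parts !== false) {
    echo $parts['scheme'] . PHP_EOL;   // http
    echo $parts['host'] . PHP_EOL;     // www.example.com
    echo $parts['path'] . PHP_EOL;     // /path/page.php
    echo $parts['query'] . PHP_EOL;    // id=42
    echo $parts['fragment'] . PHP_EOL; // top
}
```

Inside the foreach loop, the same call could be applied to each
$a->getAttribute('href') to decide which links need rewriting. Keys for
components that are absent from a URL are simply not set, so real code
would probably check them with isset() first.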

