Posted by Ben Bullock on April 14, 2008, 12:29 am
> my $Regex1 = qr
;
> # This regex says "find a string which is probably a URL minus the 'http://'
> # part; save any such found string as a backreference":
> my $Regex1 = qr
;
Also, here [a-z0-9-] (ignoring case) is enough. Your regex will
get things which aren't valid URLs. The following catches anything
valid:
my $validdns = '[0-9a-z-]';
m/\b(($validdns\.)$validdns)\b/i # Catches any valid thing.
> sg;
> print ($_);
You can just say
print;
here if you like.
Posted by Robbie Hatley on April 14, 2008, 1:40 pm
"Ben Bullock" wrote:
>
> > # This regex says "find a string which is probably a URL minus the 'http://'
> > # part; save any such found string as a backreference":
> > my $Regex1 = qr
>
> ... [a-z0-9-] (ignoring case) is enough. Your regex will
> get things which aren't valid URLs. The following catches anything
> valid:
>
> my $validdns = '[0-9a-z-]';
> m/\b(($validdns\.)$validdns)\b/i # Catches any valid thing.
I can see that your pattern looks for just the dns part
of the url, which has fewer valid characters; but since it
doesn't look for "/", it will convert this string:
references in Sec 35.74 paragraph B
to
references in Sec http://35.74 paragraph B
I believe you're right in that it will find most valid dns
strings; but it also catches things that aren't part of URLs
at all (such as numbers with decimal points), and it rejects
certain well-formed domain strings (such as "j.qbc.net.ca",
which fails the "" assertion).
My pattern at least insists on "stuff.stuff/stuff", so it
rejects "35.74". It rejects domain-level URLs and only
linkifys document-level URLs. That may be a blessing or
a curse, depending on your expectations.
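To make the difference concrete, here is a minimal sketch of the two behaviours. The patterns are assumptions, not the ones posted: the quoted regexes carry no quantifiers in this copy of the thread, so "+" has been added to each label, and the names $dns_only, $with_path and first_match are hypothetical.

```perl
use strict;
use warnings;

# ASSUMPTION: one label is one or more DNS-legal characters.
my $label = qr/[0-9a-z-]+/i;

# "label.label" only: no slash is required after the dotted part.
my $dns_only  = qr{\b($label\.$label)\b};

# "stuff.stuff/stuff": a slash and at least one non-space character must follow.
my $with_path = qr{\b($label\.$label/\S+)};

# Return the first captured match, or undef when nothing matches.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

my $text = 'references in Sec 35.74 paragraph B';
for my $pair (['dns_only', $dns_only], ['with_path', $with_path]) {
    my ($name, $re) = @$pair;
    my $hit = first_match($re, $text);
    print "$name: ", (defined $hit ? $hit : '(no match)'), "\n";
}
```

Requiring the slash is what keeps decimal numbers out, and it is also what makes the second pattern skip domain-level strings such as "www.acme.com".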
Also, both your pattern and mine are broken in that they match
http://www.asdf.com/qwer.html, and indeed convert it to
http://http://www.asdf.com/qwer.html .
Oops! What was really intended was to find "bare" URLs
(without "http://") and tack "http://" on the beginning.
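The double-prefix bug is easy to reproduce. The sketch below is only an illustration under assumptions: the suffix pattern is a simplified stand-in, and the sub names add_scheme_naive and add_scheme_guarded are hypothetical. The guard is a negative lookbehind that refuses to start a match right after a character that could belong to a scheme or host name.

```perl
use strict;
use warnings;

# ASSUMPTION: simplified "host.domain/path" pattern, not the original regex.
my $label  = qr/[0-9A-Za-z-]+/;
my $suffix = qr{(?:$label\.)+$label/\S+};

# Naive version: prepends http:// wherever the suffix pattern matches,
# including inside a URL that already has a scheme.
sub add_scheme_naive {
    my ($text) = @_;
    $text =~ s{($suffix)}{http://$1}g;
    return $text;
}

# Guarded version: the match may not begin directly after a letter, digit,
# dot, slash, colon, or hyphen, so "www" in "http://www..." is skipped.
sub add_scheme_guarded {
    my ($text) = @_;
    $text =~ s{(?<![0-9A-Za-z./:-])($suffix)}{http://$1}g;
    return $text;
}

for my $line ('http://www.asdf.com/qwer.html', 'see www.asdf.com/qwer.html') {
    print "naive:   ", add_scheme_naive($line),   "\n";
    print "guarded: ", add_scheme_guarded($line), "\n";
}
```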
Ok, this should do the trick; it blends features from your
approach and mine, and solves the bugs I just mentioned,
as well as some other bugs I've noticed:
#!/usr/bin/perl
# linkify.perl
# Converts any text document into an HTML document with all of the contents of
# the original, but with any HTTP URLs converted to clickable hyperlinks.
# First print the standard opening lines of an HTML file.
# The title will be "Linkifyed HTML Document",
# the body text is in a "div" element,
# and the paragraphs will have 5-pixel margins on all 4 sides:
use strict;
use warnings;
# Print initial tags for HTML file:
print ("<html>\n");
print ("<head>\n");
print ("<title>Linkifyed HTML Document</title>\n");
print ("<style>p</style>\n");
print ("</head>\n");
print ("<body>\n");
print ("<div>\n");
print ("<pre>\n");
# A valid URL must consist solely of the following 82 characters
#
# alphanumeric:  [:alnum:]     62
# reserved:      ;/?:@=&        7
# anchor-id:     #              1
# encoding:      %              1
# special:       $_.+!*'(),-   11
# Total:                       82
#
# Make a non-interpolated string version of a character class
# consisting of the above 82 URL-legal characters:
# Make a non-interpolated string version of a regex specifying
# a cluster of 1-63 DNS-valid characters:
my $Dns = q<[0-9A-Za-z-]>;
# Make a non-interpolated string version of a regex specifying
# a URL header:
my $Header = q<s?https?://>;
# Make a non-interpolated string version of a regex specifying
# a URL suffix:
my $Suffix = qq<(?:$Dns\.)$Dns/$Legal+>;
# This regex says "find a string which is probably a URL suffix,
# at start of line, and save any such found suffix as a backreference":
my $Regex1 = qr;
# This regex says "find a string which is probably a URL suffix,
# preceded by some space, and save any such found suffix as a backreference":
my $Regex2 = qr;
# This regex says "find a string which is probably a URL with header,
# and save any such found URL as a backreference":
my $Regex3 = qr;
# Now loop through all lines of text in the original file. First add http:// to
# any URLs that need it; then wrap all URLs in "a" elements, with the
# URL used as both the text and the "href" attribute of the "a" element:
#print $Regex1,"\n";
#print $Regex2,"\n";
#print $Regex3,"\n";
while (<>)
{
# Tack 'http://' onto the beginning of any strings which are
# probably URLs but lack 'http://':
$_ =~ s; # No sense using g here (beginning of line only).
#print ("Regex1 matched ", $&, "\n");
$_ =~ sg; # This one could be anywhere on the line.
#print ("Regex2 matched ", $&, "\n");
# Wrap each found URL in an html anchor element with the found URL used both
# as the "href" attribute and as the text:
$_ =~ s{<a href="$1">$1</a>}g;
#print ("Regex3 matched ", $&, "\n");
# Print the edited line. If the line did not contain a URL, it will be
# printed unexpurgated. To redirect output to a file, use ">" on the
# command line.
print;
}
# Print element-closure tags for pre, div, body, html:
print ("</pre>\n");
print ("</div>\n");
print ("</body>\n");
print ("</html>\n");
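In this copy of the thread the bodies of the qr and s operators in the listing are missing, as is the definition of $Legal, so the script cannot run as shown. The following is a hedged sketch of the same two-step idea, not a recovery of the original code. Everything marked ASSUMPTION is a guess based on the comments in the listing: the $Legal definition, the {1,63} and "+" quantifiers, the lookbehind guard, the single $Bare pattern standing in for $Regex1 and $Regex2, and the sub name linkify_line.

```perl
use strict;
use warnings;

# ASSUMPTION: character class built from the 82 URL-legal characters
# enumerated in the listing's comment block.
my $Legal  = q<[[:alnum:];/?:@=&#%$_.+!*'(),-]>;

# ASSUMPTION: one DNS label is 1-63 letters, digits, or hyphens.
my $Dns    = q<[0-9A-Za-z-]{1,63}>;

# URL header; this sketch handles only http and https.
my $Header = q<https?://>;

# URL suffix: dotted labels, a slash, then URL-legal characters.
# The doubled backslash keeps a literal "\." in the interpolated string.
my $Suffix = qq<(?:$Dns\\.)+$Dns/$Legal+>;

# ASSUMPTION: a bare suffix is one not preceded by anything that could be
# part of a header or host name (covers start of line and after space).
my $Bare = qr{(?<![0-9A-Za-z./:-])($Suffix)};

# A complete URL, header included.
my $Full = qr{($Header$Suffix)};

# Add http:// to bare suffixes, then wrap every complete URL in an anchor.
sub linkify_line {
    my ($line) = @_;
    $line =~ s{$Bare}{http://$1}g;
    $line =~ s{$Full}{<a href="$1">$1</a>}g;
    return $line;
}

# Small demonstration on fixed sample lines.
my @samples = (
    'see www.asdf.com/qwer.html now',
    'http://www.asdf.com/qwer.html',
    'references in Sec 35.74 paragraph B',
);
print linkify_line($_), "\n" for @samples;
```

In a filter script the same sub would be called on each line read with while (<>), between the opening and closing HTML tags printed above. Because $Legal includes ".", "," and ")", punctuation that directly follows a URL in running text is captured as part of it.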
Posted by Ben Bullock on April 14, 2008, 7:34 pm
On Mon, 14 Apr 2008 10:40:57 -0700, Robbie Hatley wrote:
> "Ben Bullock" wrote:
>> ... [a-z0-9-] (ignoring case) is enough. Your regex will get
>> things which aren't valid URLs. The following catches anything valid:
>>
>> my $validdns = '[0-9a-z-]';
>> m/\b(($validdns\.)$validdns)\b/i # Catches any valid thing.
>
> I can see that your pattern looks for just the dns part of the url,
> which has fewer valid characters; but since it doesn't look for "/", it
> will convert this string:
>
> references in Sec 35.74 paragraph B
>
> to
>
> references in Sec http://35.74 paragraph B
>
> I believe you're right in that it will find most valid dns strings; but
> it also catches things that aren't part of URLs at all (such as numbers
> with decimal points), and it rejects certain well-formed domain strings
> (such as "j.qbc.net.ca", which fails the "" assertion).
Well OK but if I was going to do this for real, I would use something like
/\b(($validdns\.)(com|net|org|us|uk|ca|jp))\b/i
or similar (I haven't checked this regex with the machine yet but
hopefully you get the picture).
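For what it is worth, a checked version of the suffix-list idea might look like the sketch below. It is an illustration only: the "+" quantifiers are assumptions (the pattern as quoted has none), and the names @tlds, $domain and find_domain are hypothetical.

```perl
use strict;
use warnings;

# The suffixes from the pattern above, kept in a list so it is easy to extend.
my @tlds    = qw(com net org us uk ca jp);
my $tld_alt = join '|', map { quotemeta } @tlds;

# ASSUMPTION: one or more dotted labels followed by a listed suffix.
my $domain = qr/\b((?:[0-9a-z-]+\.)+(?:$tld_alt))\b/i;

# Return the first domain-like string found, or undef.
sub find_domain {
    my ($text) = @_;
    return $text =~ $domain ? $1 : undef;
}

for my $text ('visit www.example.com today', 'Sec 35.74 paragraph B') {
    my $hit = find_domain($text);
    print "$text => ", (defined $hit ? $hit : '(no match)'), "\n";
}
```

A list like this keeps numbers such as "35.74" out, but any suffix that is not listed is silently skipped.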
> My pattern at least insists on "stuff.stuff/stuff", so it rejects
> "35.74". It rejects domain-level URLs and only linkifys document-level
> URLs. That may be a blessing or a curse, depending on your
> expectations.
I hadn't really thought this through carefully, I just wanted to make the
point that the &$% stuff is not valid as part of the web address.
> Also, both your pattern and my are broken in that they match
> http://www.asdf.com/qwer.html, and indeed convert it to
> http://http://www.asdf.com/qwer.html .
Mine doesn't do anything at all, I'm not sure it even compiles!
Posted by Robbie Hatley on April 15, 2008, 4:35 pm
"Ben Bullock" wrote:
> Well OK but if I was going to do this for real, I would use something like
> /\b(($validdns\.)(com|net|org|us|uk|ca|jp))\b/i
> or similar (I haven't checked this regex with the machine yet but
> hopefully you get the picture).
The problem with "(com|net|org|us|uk|ca|jp)" or similar is that there are
hundreds or thousands of such valid domain suffixes. You're forgetting "es"
(Spain), "ru" (Russia), "ua" (Ukraine), not to mention "mil", "gov", "edu",
"biz", "info", etc. That's part of why my URL-matching regex was so vague.
> I just wanted to make the point that the &$% stuff is not valid as part of the
> web address.
Those characters all appear in web addresses. For instance, "?" marks the
start of the query string, and "&" separates the name=value parameters passed
to a server-side script (PHP, Perl, etc.) within it. If you reject such
characters, you reject many valid URLs. Just look at any YouTube URL. This one,
for example:
http://uk.youtube.com/watch?v=I9ciR9qR1dU&feature=bz303
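A short sketch of what happens to that URL under each character set. The two classes and the sub names host_only and whole_url are assumptions chosen for illustration; the wider class is the 82-character set from the linkify script's comments.

```perl
use strict;
use warnings;

# DNS-legal characters only.
my $dns_chars = qr{[0-9A-Za-z-]};

# ASSUMPTION: the wider set of URL-legal characters (82 in total).
my $url_chars = qr{[[:alnum:];/?:\@=&#%\$_.+!*'(),-]};

# Capture only the dotted host name that follows http://.
sub host_only {
    my ($text) = @_;
    return $text =~ m{http://((?:$dns_chars+\.)+$dns_chars+)} ? $1 : undef;
}

# Capture the whole run of URL-legal characters, scheme included.
sub whole_url {
    my ($text) = @_;
    return $text =~ m{(http://$url_chars+)} ? $1 : undef;
}

my $url = 'http://uk.youtube.com/watch?v=I9ciR9qR1dU&feature=bz303';
print "host_only: ", host_only($url), "\n";
print "whole_url: ", whole_url($url), "\n";
```

With the DNS-only class the capture stops at the first "/", so the path and query string are lost; the wider class keeps them.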
Maybe what you meant is that such characters are invalid in domain names;
but I was trying to capture and linkify document URLs, not domain names or
domain-level URLs such as "http://www.acme.com/". Trying to concoct a
foolproof RE that captures every valid URL and rejects every invalid one
is a real piece of work. And any such "perfect" URL-matching RE would
quickly become obsolete anyway as the Internet changes over time.
Hence I tend to go for a vague RE that I believe captures every valid
document URL, at the cost of occasionally capturing a few invalid ones.
Unless someone knows a better approach.
--
Cheers,
Robbie Hatley
lonewolf aatt well dott com
www dott well dott com slant user slant lonewolf slant
Posted by Abigail on April 15, 2008, 4:54 pm
Robbie Hatley (see.my.signature@for.my.email.address) wrote on VCCCXLI
`'
`' Maybe what you meant is that such characters are invalid in domain names;
`' but I was trying to capture and linkify document URLs, not domain names or
`' domain-level URLs such as "http://www.acme.com/". Trying to concoct a
`' foolproof RE that captures every valid URL and rejects every invalid one
`' is a real piece of work. And any such "perfect" URL-matching RE would
`' quickly become obsolete anyway as the Internet changes over time.
`' Hence I tend to go for a vague RE that I believe captures every valid
`' document URL, at the cost of occasionally capturing a few invalid ones.
`' Unless someone knows a better approach.
You mean, something like:
(?:(?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?](?:(?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)|(?:(?:nntp)://(?:(?:(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z]))|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]+)))?)/(?:(?:[a-zA-Z][-A-Za-z0-9.+_]*))(?:/(?:[0-9]+))?))|(?:(?:file)://(?:(?:(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z]))|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+))|localhost)?)(?:/(?:(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),:@&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:/(?:(?:[-a-zA-Z0-9$_.+!*'(),:@&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)))))|(?:(?:ftp)://(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'();:&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))(?:)@)?(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:/(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))(?:;type=(?:[AIai]))?))?)|(?:(?:tel):(?:(?:(?:[+](?:[0-9\-.()]+)(?:;isub=[0-9\-.()]+)?(?:;postd=[0-9\-.()*#ABCDwp]+)?(?:(?:;(?:phone-context)=(?:(?:(?:[+][0-9\-.()]+)|(?:[0-9\-.()*#ABCDwp]+))|(?:(?:[!'E-OQ-VX-Z_e-oq-vx-z~]|(?:%(?:2[124-7CFcf]|3[AC-Fac-f]|4[05-9A-Fa-f]|5[1-689A-Fa-f]|6[05-9A-Fa-f]|7[1-689A-Ea-e])))(?:[!'()*\-.0-9A-Z_a-z~]+|(?:%(?:2[1-9A-Fa-f]|3[AC-Fac-f]|[4-6][0-9A-Fa-f]|7[0-9A-Ea-e])))*)))|(?:;(?:tsp)=(?:
|(?:(?:(?:[A-Za-z](?:(?:(?:[-A-Za-z0-9]+))[A-Za-z
0-9])?)(?:[.](?:[A-Za-z](?:(?:(?:[-A-Za-z0-9]+))[A-Za-z0-9])?))*))))|(?:;(?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*)(?:=(?:(?:(?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*)(?:[?](?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*))?)|(?:%22(?:(?:%5C(?:[a-zA-Z0-9\-_.!~*'()]|(?:%[a-fA-F0-9][a-fA-F0-9])))|[a-zA-Z0-9\-_.!~*'()]+|(?:%(?:[01][a-fA-F0-9])|2[013-9A-Fa-f]|[3-9A-Fa-f][a-fA-F0-9]))*%22)))?))*)|(?:[0-9\-.()*#ABCDwp]+(?:;isub=[0-9\-.()]+)?(?:;postd=[0-9\-.()*#ABCDwp]+)?(?:;(?:phone-context)=(?:(?:(?:[+][0-9\-.()]+)|(?:[0-9\-.()*#ABCDwp]+))|(?:(?:[!'E-OQ-VX-Z_e-oq-vx-z~]|(?:%(?:2[124-7CFcf]|3[AC-Fac-f]|4[05-9A-Fa-f]|5[1-689A-Fa-f]|6[05-9A-Fa-f]|7[1-689A-Ea-e])))(?:[!'()*\-.0-9A-Z_a-z~]+|(?:%(?:2[1-9A-Fa-f]|3[AC-Fac-f]|[4-6][0-9A-Fa-f]|7[0-9A-Ea-e])))*)))(?:(?:;(?:phone-context)=(?:(?:(?:[+][0-9\-.()]+)|(?:[0-9\-.()*#ABCDwp]+))|(?:(?:[!'E-OQ-VX-Z_e-oq-vx-z~]|(?:%(?:2[124-7CFcf]|3[AC-Fac-f]|4[05-9A-Fa-f]|5[1-689A-Fa-f]|6[05-9A-Fa-f]|7[1-689A-Ea-e])))(?:[!'()*\-.0-9A-Z_a-z~]+|(?:%(?:2[1-9A-Fa-f]|3[AC-Fac-f]|[4-6][0-9A-Fa-f]|7[0-9A-Ea-e])))*)))|(?:;(?:tsp)=(?:
|(?:(?:(?:[A-Za-z](?:(?:(?:[-A-Za-z0-9]+))[A-Za-z0-9])?)(?:[.](?:[A-Za-z](?:(?:(?:[-A-Za-z0-9]+))[A-Za-z0-9])?))*))))|(?:;(?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*)(?:=(?:(?:(?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*)(?:[?](?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*))?)|(?:%22(?:(?:%5C(?:[a-zA-Z0-9\-_.!~*'()]|(?:%[a-fA-F0-9][a-fA-F0-9])))|[a-zA-Z0-9\-_.!~*'()]+|(?:%(?:[01][a-fA-F0-9])|2[013-9A-Fa-f]|[3-9A-Fa-f][a-fA-F0-9]))*%22)))?))*))))|(?:(?:fax):(?:(?:(?:[+](?:[0-9\-.()]+)(?:;isub=[0-9\-.()]+)?(?:;tsub=[0-9\-.()]+)?(?:;postd=[0-9\-.()*#ABCDwp]+)?(?:(?:;(?:phone-context)=(?:(?
:(?:[+][0-9\-.()]+)|(?:[0-9\-.()*#ABCDwp]+))|(?:(?:[!'E-OQ-VX-Z_e-oq-vx-z~]|(?:%(?:2[124-7CFcf]|3[AC-Fac-f]|4[05-9A-Fa-f]|5[1-689A-Fa-f]|6[05-9A-Fa-f]|7[1-689A-Ea-e])))(?:[!'()*\-.0-9A-Z_a-z~]+|(?:%(?:2[1-9A-Fa-f]|3[AC-Fac-f]|[4-6][0-9A-Fa-f]|7[0-9A-Ea-e])))*)))|(?:;(?:tsp)=(?:
|(?:(?:(?:[A-Za-z](?:(?:(?:[-A-Za-z0-9]+))[A-Za-z0-9])?)(?:[.](?:[A-Za-z](?:(?:(?:[-A-Za-z0-9]+))[A-Za-z0-9])?))*))))|(?:;(?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*)(?:=(?:(?:(?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*)(?:[?](?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*))?)|(?:%22(?:(?:%5C(?:[a-zA-Z0-9\-_.!~*'()]|(?:%[a-fA-F0-9][a-fA-F0-9])))|[a-zA-Z0-9\-_.!~*'()]+|(?:%(?:[01][a-fA-F0-9])|2[013-9A-Fa-f]|[3-9A-Fa-f][a-fA-F0-9]))*%22)))?))*)|(?:[0-9\-.()*#ABCDwp]+(?:;isub=[0-9\-.()]+)?(?:;tsub=[0-9\-.()]+)?(?:;postd=[0-9\-.()*#ABCDwp]+)?(?:;(?:phone-context)=(?:(?:(?:[+][0-9\-.()]+)|(?:[0-9\-.()*#ABCDwp]+))|(?:(?:[!'E-OQ-VX-Z_e-oq-vx-z~]|(?:%(?:2[124-7CFcf]|3[AC-Fac-f]|4[05-9A-Fa-f]|5[1-689A-Fa-f]|6[05-9A-Fa-f]|7[1-689A-Ea-e])))(?:[!'()*\-.0-9A-Z_a-z~]+|(?:%(?:2[1-9A-Fa-f]|3[AC-Fac-f]|[4-6][0-9A-Fa-f]|7[0-9A-Ea-e])))*)))(?:(?:;(?:phone-context)=(?:(?:(?:[+][0-9\-.()]+)|(?:[0-9\-.()*#ABCDwp]+))|(?:(?:[!'E-OQ-VX-Z_e-oq-vx-z~]|(?:%(?:2[124-7CFcf]|3[AC-Fac-f]|4[05-9A-Fa-f]|5[1-689A-Fa-f]|6[05-9A-Fa-f]|7[1-689A-Ea-e])))(?:[!'()*\-.0-9A-Z_a-z~]+|(?:%(?:2[1-9A-Fa-f]|3[AC-Fac-f]|[4-6][0-9A-Fa-f]|7[0-9A-Ea-e])))*)))|(?:;(?:tsp)=(?:
|(?:(?:(?:[A-Za-z](?:(?:(?:[-A-Za-z0-9]+))[A-Za-z0-9])?)(?:[.](?:[A-Za-z](?:(?:(?:[-A-Za-z0-9]+))[A-Za-z0-9])?))*))))|(?:;(?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*)(?:=(?:(?:(?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*)(?:[?](?:(?:[!'*\-.0-9A-Z_a-z~]+|%(?:2[13-7ABDEabde]|3[0-9]|4[1-9A-Fa-f]|5[
AEFaef]|6[0-9A-Fa-f]|7[0-9ACEace]))*))?)|(?:%22(?:(?:%5C(?:[a-zA-Z0-9\-_.!~*'()]|(?:%[a-fA-F0-9][a-fA-F0-9])))|[a-zA-Z0-9\-_.!~*'()]+|(?:%(?:[01][a-fA-F0-9])|2[013-9A-Fa-f]|[3-9A-Fa-f][a-fA-F0-9]))*%22)))?))*))))|(?:(?:prospero)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z]))|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]+)))?/(?:(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),?:@&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:/(?:(?:[-a-zA-Z0-9$_.+!*'(),?:@&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))(?:(?:;(?:(?:[-a-zA-Z0-9$_.+!*'(),?:@&]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)=(?:(?:[-a-zA-Z0-9$_.+!*'(),?:@&]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))|(?:(?:tv):(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?))?)|(?:(?:telnet)://(?:(?:(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),;?&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))(?::(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),;?&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?)@)?(?:(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z]))|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]+)))?)(?:/)?)|(?:(?:news):(?:(?:[*]|(?:(?:[-a-zA-Z0-9$_.+!*'(),;/?:&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))+@(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z]))|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))|(?:[a-zA-Z][-A-Za-z0-9.+_]*))))|(?:(?:wais)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z]))|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]+)))?/(?:(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))(?:[?](?:(?:(?:[-a-zA-Z0-9$_.+!*'(),;:@&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))|/(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))/(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))|(?:(?:gopher)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])
)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(
?:(?:[0-9]+)))?/(?:(?:(?:[0-9+IgT]))(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),:@&=]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))))|(?:(?:pop)://(?:(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),&=~]+|(?:%[a-fA-F0-9][a-fA-F0-9]))+))(?:;AUTH=(?:[*]|(?:(?:(?:[-a-zA-Z0-9$_.+!*'(),&=~]+|(?:%[a-fA-F0-9][a-fA-F0-9]))+)|(?:[+](?:APOP|(?:(?:[-a-zA-Z0-9$_.+!*'(),&=~]+|(?:%[a-fA-F0-9][a-fA-F0-9]))+))))))?@)?(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z]))|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]+)))?))
I don't believe in capturing a few invalid ones - nor in rejecting valid ones.
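A pattern of that size is far easier to read when assembled from small named pieces. The sketch below rebuilds only a simplified version of the http branch, following the shape visible at the start of the pattern above (host labels or dotted-decimal address, optional port, optional path and query). It is not the code that produced the pattern, it omits the ";" path parameters, and all variable and sub names are hypothetical.

```perl
use strict;
use warnings;

# Host name pieces: labels start and end with an alphanumeric character;
# the last label must start with a letter.
my $label    = qr{[a-zA-Z0-9](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?};
my $toplabel = qr{[a-zA-Z](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?};
my $host     = qr{(?:$label\.)*$toplabel\.?};
my $ipv4     = qr{[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+};
my $port     = qr{:[0-9]*};

# Path and query pieces: unreserved characters or %XX escapes.
my $pchar    = qr{[a-zA-Z0-9\-_.!~*'():\@&=+\$,]|%[a-fA-F0-9]{2}};
my $segment  = qr{$pchar*};
my $path     = qr{$segment(?:/$segment)*};
my $query    = qr{(?:[;/?:\@&=+\$,a-zA-Z0-9\-_.!~*'()]|%[a-fA-F0-9]{2})*};

# The assembled pattern for http URLs only.
my $http_url = qr{http://(?:$host|$ipv4)(?:$port)?(?:/$path(?:\?$query)?)?};

# True when the whole string is an http URL under this simplified grammar.
sub is_http_url {
    my ($text) = @_;
    return $text =~ /\A$http_url\z/ ? 1 : 0;
}

for my $candidate ('http://uk.youtube.com/watch?v=I9ciR9qR1dU&feature=bz303',
                   'http://www.acme.com/',
                   'http://35.74/x') {
    print is_http_url($candidate) ? 'valid:   ' : 'invalid: ', $candidate, "\n";
}
```

Because the final host label must begin with a letter and a numeric address needs four parts, "35.74" is rejected as a host, while query-string characters such as "?" and "&" are still accepted after the path.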
Abigail
--
$_ = "\x3C\x3C\x45\x4F\x54"; s/<<EOT/<<EOT/e; print;
Just another Perl Hacker
EOT