WWW::Mechanize doesn't always follow_link(text

Subject Author Date
WWW::Mechanize doesn't always follow_link(text M.O.B. i L. 04-20-2008
Posted by Ben Morrow on April 28, 2008, 2:16 pm
[sorry about the doubled .sig in my previous post. I'll try not to let
it happen again... :(]

> Ben Morrow schreef:
> > Dr.Ruud:
> >> Martijn Lievaart:
>
> >>> ISO-Latin-1
> >>
> >> Normally called "ISO 8859-1" or "ISO Latin 1" or just "Latin-1".
> >
> > They are different... ISO Latin 1 is a character set (an unordered
> > collection of characters). ISO-8859-1 is a particular encoding of that
> > character set as 8-bit integers. There are others; in particular some
> > EBCDIC codepages.
>
> "ISO-8859-1" wasn't mentioned in the part that you quote, so I don't see
> what you mean with "They".

I'm confused. You said "ISO 8859-1" and "ISO Latin 1" as though they
were equivalent, which they aren't. If you're trying to make
"ISO 8859-1" (sans hyphen) equivalent to "ISO Latin 1" but "ISO-8859-1"
(with hyphen) not, then I'd call that more than a little confusing. For
a start, how would you interpret "ISO 8859-9"? As the Latin-9 character
set used by ISO-8859-15, or as the ISO-8859-9 encoding of the Latin-5
character set?

FWIW, Perl agrees with me:

~% perl -MEncode -le'print Encode::resolve_alias "ISO 8859-9"'
iso-8859-9
~% perl -MEncode -le'print Encode::resolve_alias "ISO Latin-9"'
iso-8859-15

though allowing 'Latin-N' to mean 'the usual 8859-N encoding of the
Latin-9 character set' is arguably only increasing the confusion between
the two.
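
To make the ambiguity concrete, here is a minimal sketch, assuming only the core Encode module. The helper name `codepoint_of` is made up for illustration and is not part of Encode. It decodes the same byte under the two encodings that "ISO 8859-9" could be taken to mean:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one byte under the named encoding and
# return the Unicode code point of the resulting character.
sub codepoint_of {
    my ($encoding, $byte) = @_;
    return ord decode($encoding, $byte);
}

my $byte = "\xA4";
for my $encoding ('iso-8859-9', 'iso-8859-15') {
    printf "%-12s 0x%02X -> U+%04X\n",
        $encoding, ord $byte, codepoint_of($encoding, $byte);
}
```

Byte 0xA4 should come out as U+00A4 (currency sign) under iso-8859-9 and as U+20AC (euro sign) under iso-8859-15, so guessing the wrong reading of the name silently changes the text.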

Ben

--
For the last month, a large number of PSNs in the Arpa[Inter-]net have been
reporting symptoms of congestion ... These reports have been accompanied by an
increasing number of user complaints ... As of June,... the Arpanet contained
47 nodes and 63 links. [ftp://rtfm.mit.edu/pub/arpaprob.txt] * ben@morrow.me.uk

Posted by RedGrittyBrick on April 26, 2008, 6:20 pm
szr wrote:
> RedGrittyBrick wrote:
>> szr wrote:
>>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>>>
>> s/ASCII/Unicode/
>
> No, it's ASCII.

Lots of people make this mistake. As your first reference says, ASCII is
a 7-bit character set and does not define a character at code-point 160.

> Extended Ascii to be precise.

To be imprecise!

There are many different, incompatible character sets and encodings that
claim to be "Extended ASCII".

Read http://en.wikipedia.org/wiki/Extended_ascii
Especially
http://en.wikipedia.org/wiki/Extended_ascii#Character_set_confusion

See 160 = "lowercase a acute" in these "Extended ASCII" tables:

http://www.webopedia.com/TERM/E/extended_ASCII.html
http://www.telacommunications.com/nutshell/extascii.htm
http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm
http://telecom.tbi.net/asc-ibm.html
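
The point can be checked with a short sketch, assuming only the core Encode module. The helper name `byte_as_codepoint` is made up for illustration. It asks what byte 160 means under strict ASCII, ISO-8859-1, and the IBM PC code page 437 that many "Extended ASCII" tables actually show:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a single byte under the named encoding.
# Returns the Unicode code point, or undef when the encoding defines
# no character for that byte (decode croaks and eval catches it).
sub byte_as_codepoint {
    my ($encoding, $byte) = @_;
    my $char = eval {
        decode($encoding, $byte, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $char ? ord $char : undef;
}

for my $encoding (qw(ascii iso-8859-1 cp437)) {
    my $cp = byte_as_codepoint($encoding, "\xA0");
    printf "%-10s %s\n", $encoding,
        defined $cp ? sprintf('U+%04X', $cp) : 'undefined';
}
```

This should report byte 160 as undefined in ascii, as U+00A0 (no-break space) in iso-8859-1, and as U+00E1 (lowercase a acute) in cp437.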

--
RGB

Posted by Dr.Ruud on April 28, 2008, 3:23 am
szr schreef:
> RedGrittyBrick:
>> szr:

>>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>>
>> s/ASCII/Unicode/
>
> No, it's ASCII. Extended Ascii to be precise.

The ASCII character set is a 7-bit code and it contains 128 characters,
not more.
See also `man ascii`.

--
Affijn, Ruud

"Gewoon is een tijger."


Posted by Dr.Ruud on April 28, 2008, 3:15 am
RedGrittyBrick schreef:
> szr:

>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>
> s/ASCII/Unicode/

Exactly. ISO-8859-* too.

--
Affijn, Ruud

"Gewoon is een tijger."

Posted by Martijn Lievaart on April 28, 2008, 4:55 am
On Mon, 28 Apr 2008 09:15:20 +0200, Dr.Ruud wrote:

> RedGrittyBrick schreef:
>> szr:
>
>>> He's after a '&nbsp;', which is a non-breaking space, which is ASCII
>>> 0xA0 hex or 160 dec. '&nbsp;' can even be re-written as '&#160;'.
>>
>> s/ASCII/Unicode/
>
> Exactly. ISO-8859-* too.

No, no, HTML uses Unicode codepoints (which in this case coincide, but
that's beside the (code)point).
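
A rough sketch of the distinction, assuming only the core Encode module. The helper `expand_decimal_refs` is hypothetical and handles decimal references only (a real program would more likely use an HTML entity decoder). The reference names a code point, and the bytes only appear once an encoding is chosen:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: expand decimal numeric character references
# such as &#160; into the character with that Unicode code point.
sub expand_decimal_refs {
    my ($html) = @_;
    $html =~ s/&#(\d+);/chr $1/ge;
    return $html;
}

my $text = expand_decimal_refs('a&#160;b');
printf "code point: U+%04X\n", ord substr($text, 1, 1);

# The same character becomes different byte sequences per encoding.
for my $encoding ('iso-8859-1', 'UTF-8') {
    my $bytes = encode($encoding, $text);
    printf "%-10s %s\n", $encoding,
        join ' ', map { sprintf('%02X', ord $_) } split //, $bytes;
}
```

The code point should be U+00A0 in both cases, while the encoded bytes are 61 A0 62 in iso-8859-1 and 61 C2 A0 62 in UTF-8.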

M4
