HTTP::Request::Common::POST and UTF-8

Posted by Stephen Collyer on September 27, 2005, 12:21 pm

I'm passing a string containing UTF-8 to HTTP::Request::Common::POST
and the UTF-8 seems to be destroyed during the encoding required
for application/x-www-form-urlencoded. (I get 4 UTF-8 chars encoded
to two spaces, AFAICS).

Is this a known problem with LWP?

Any suggestions for a quick fix?

Steve Collyer


Posted by Alan J. Flavell on September 27, 2005, 12:56 pm

On Tue, 27 Sep 2005, Stephen Collyer wrote:

> I'm passing a string containing UTF-8 to HTTP::Request::Common::POST
> and the UTF-8 seems to be destroyed during the encoding required
> for application/x-www-form-urlencoded.

I don't know the answer to your question, but, in principle the web
specifications say that application/x-www-form-urlencoded is only
guaranteed to support us-ascii. You and I know, in practical terms,
that it may not be as bad as that, and when executing a GET we'd have
no other choice; but if you're using POST rather than GET, then you
might be advised to use multipart/form-data instead.
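
A minimal sketch of the multipart/form-data route with HTTP::Request::Common, for anyone following along. The URL and the field name `comment` are made up for illustration, and encoding the value to UTF-8 octets beforehand is an assumption about what the receiving side expects:

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Request::Common qw(POST);

# Hypothetical endpoint and field name, for illustration only.
my $url  = 'http://www.example.com/form';
my $text = "caf\x{e9} \x{263a}";    # Perl character string

# Encode to UTF-8 octets ourselves so that no wide characters
# reach the request body.
my $bytes = encode('UTF-8', $text);

# Content_Type => 'form-data' asks POST to build a multipart body
# instead of application/x-www-form-urlencoded.
my $req = POST $url,
    Content_Type => 'form-data',
    Content      => [ comment => $bytes ];

print $req->as_string;
```

Note that nothing here tells the server which charset the part is in; that still has to be agreed with the receiving end.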

Have you proved that the transaction that you're trying to carry out
can be successfully initiated "by hand" or from a web browser, before
you try to implement it from LWP? Just to be sure you're looking in
the right place for the problem, I mean.

> Is this a known problem with LWP?

If I was confronted with this problem, I'd write a short test-case to
investigate what was happening. If the test didn't reveal the problem
to me, I'd consider posting the complete code here.

Does Perl know that this is a utf-8 text string i.e. in the sense of
the Unicode support that is in Perl 5.8+ versions? Or are you handing
it around as binary, or what?
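
To make the distinction concrete, here is a small sketch using only the core Encode module. It contrasts a string of UTF-8 octets with the decoded character string; the variable names are arbitrary:

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

my $bytes = "\xC3\xA9";                 # two octets: the UTF-8 encoding of U+00E9
my $chars = decode('UTF-8', $bytes);    # one character as far as Perl is concerned

printf "bytes: length %d, is_utf8 %d\n", length($bytes), is_utf8($bytes) ? 1 : 0;
printf "chars: length %d, is_utf8 %d\n", length($chars), is_utf8($chars) ? 1 : 0;

# Encoding the character string again gives back the original octets.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $bytes;
```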

Sorry I can't be of more help "off the top of my head".


Posted by Stephen Collyer on September 27, 2005, 7:08 pm

Alan J. Flavell wrote:

> I don't know the answer to your question, but, in principle the web
> specifications say that application/x-www-form-urlencoded is only
> guaranteed to support us-ascii. You and I know, in practical terms,
> that it may not be as bad as that, and when executing a GET we'd have
> no other choice; but if you're using POST rather than GET, then you
> might be advised to use multipart/form-data instead.

1. Yes, I've read your nice web page on the matter, so AFAICS it
should be possible

2. I'm currently constrained to application/x-www-form-urlencoded
but, yes, it may make more sense to use multipart/form-data.

> Have you proved that the transaction that you're trying to carry out
> can be successfully initiated "by hand" or from a web browser, before
> you try to implement it from LWP? Just to be sure you're looking in
> the right place for the problem, I mean.

Yes, this is working code that I'm reworking for UTF-8 support.
So I know precisely where the problem is.

> If the test didn't reveal the problem
> to me, I'd consider posting the complete code here.

I've investigated to the point that I can see that the problem
seems to occur at line 53 of URI::_query::query_form:

53:b $self->query(@query ? join('&', @query) : undef);

This routine escapes the data in the POST content array, and
all seems well up to line 53 where it sets the content of the
query. When I look at $self->query(), all UTF-8 chars seem to have
been converted to +. This looks bizarre as it's only doing a join.

I need to investigate this further - it should be easy enough to
cook up a small example to reproduce if it is indeed a bug.
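
A small test case along these lines might look like the sketch below. The URL and field name are invented, and the second variant (encoding to UTF-8 octets before calling POST) is only an assumption about what the receiving side wants. What the first variant prints depends on the installed URI version, so no output is claimed for it:

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Request::Common qw(POST);

# Hypothetical URL and field name, for illustration only.
my $url  = 'http://www.example.com/form';
my $text = "caf\x{e9} \x{263a}";    # character string containing a wide character

# Variant 1: hand POST the character string as-is (the case under suspicion).
my $from_chars = POST $url, [ comment => $text ];

# Variant 2: encode to UTF-8 octets first, so only bytes 0-255 get escaped.
my $from_bytes = POST $url, [ comment => encode('UTF-8', $text) ];

print "chars: ", $from_chars->content, "\n";
print "bytes: ", $from_bytes->content, "\n";
```

If the two lines differ, the character string is being mangled somewhere between POST and the final query string.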

This is using perl, v5.8.3

>
> Does Perl know that this is a utf-8 text string i.e. in the sense of
> the Unicode support that is in Perl 5.8+ versions? Or are you handing
> it around as binary, or what?

Yes, these are marked as UTF-8 according to Encode::is_utf8.

Steve Collyer


Posted by Alan J. Flavell on September 27, 2005, 9:31 pm

On Tue, 27 Sep 2005, Stephen Collyer wrote:

> I've investigated to the point that I can see that the problem
> seems to occur at line 53 of URI::_query::query_form:
>
> 53:b $self->query(@query ? join('&', @query) : undef);

Please understand that I'm thinking aloud here: I don't have the
answer, but, as no-one else has stepped in, I thought my ponderings
might just be helpful.

Hmmm, the version of _query.pm that I'm looking at here (which
might be old) invokes URI::Escape::escapes

Looking at http://search.cpan.org/~gaas/URI-1.35/URI/Escape.pm
it appears there are two different functions, for escaping in an
8-bit context and for escaping in a utf8 context. As it says, they
produce different results, even for the characters from 128-255.

However, if I look at the URI/Escape.pm that's installed hereabouts,
it describes itself as Revision 3.21, and shows no sign of being
capable of escaping any character above 255.
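
For illustration, the difference between the two functions can be seen with a single character in the 128-255 range. This sketch assumes a URI::Escape recent enough to export `uri_escape_utf8`, which the Revision 3.21 mentioned above apparently is not:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_escape_utf8);

my $e_acute = "\xE9";    # U+00E9, a single character below 256

# 8-bit context: the character is escaped as the one byte 0xE9.
my $as_latin1 = uri_escape($e_acute);

# UTF-8 context: the character is first encoded as UTF-8 (0xC3 0xA9),
# then each byte is escaped.
my $as_utf8 = uri_escape_utf8($e_acute);

print "uri_escape:      $as_latin1\n";    # %E9
print "uri_escape_utf8: $as_utf8\n";      # %C3%A9
```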

> This routine escapes the data in the POST content array, and
> all seems well up to line 53 where it sets the content of the
> query.

Seems to me that one needs to take a look whether there's any
machinery, in the version that you're using, for invoking the
utf8-context escapes, and, if so, how to trigger it. I'm not by any
means certain that the mere utf8-ness of a string would be the right
lever to trigger this, to be honest.

> When I look at $self->query(), all UTF-8 chars seem to have
> been converted to +. This looks bizarre as it's only doing a join.

My hunch is that they've been offered to a routine that can only
escape the characters 0-255.
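
To show what that would mean in practice, here is a sketch of an escape table with one entry per code point 0-255. It is an illustrative table built for this example, not the module's own code:

```perl
use strict;
use warnings;

# Illustrative lookup table: one escape per code point 0..255.
my %escapes = map { chr($_) => sprintf('%%%02X', $_) } 0 .. 255;

my $e_acute = "\x{e9}";      # code point 233, inside the table
my $smiley  = "\x{263a}";    # code point 9786, outside the table

for my $char ($e_acute, $smiley) {
    my $escaped = exists $escapes{$char} ? $escapes{$char} : '(no entry)';
    printf "U+%04X => %s\n", ord($char), $escaped;
}
```

Any substitution that relies on such a table has nothing sensible to put in place of a character above 255.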

hope this is vaguely useful at least.


Posted by Stephen Collyer on September 28, 2005, 1:02 pm

Alan J. Flavell wrote:

>>When I look at $self->query(), all UTF-8 chars seem to have
>>been converted to +. This looks bizarre as it's only doing a join.
>
>
> My hunch is that they've been offered to a routine that can only
> escape the characters 0-255.

AFAICS this isn't the case, but I may be wrong.

For what it's worth, the following regex on line 16 of package
URI::_query seems to be causing the problem.

$q =~ s/([^$URI::uric])/$URI::Escape::escapes{$1}/go;

I've no idea what's going on exactly but it looks to me
like the escaping is occurring twice for some reason, with
the UTF8 intact the first time, but destroyed the second time,
after passing through the substitution above.
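
One possible way around this, sketched on the assumption that the receiving side expects UTF-8: encode every key and value to octets first, so that whatever escaping happens later only ever sees bytes 0-255. The helper `form_encode_utf8` is hypothetical and hand-rolled for this example; it is not part of LWP or URI:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build an application/x-www-form-urlencoded
# body from key/value pairs, going through UTF-8 octets.
sub form_encode_utf8 {
    my @pairs = @_;
    my @out;
    while (my ($key, $value) = splice @pairs, 0, 2) {
        for ($key, $value) {
            $_ = encode('UTF-8', $_);    # characters -> octets
            # Percent-escape every byte except unreserved ones and space.
            s/([^A-Za-z0-9\-_.~ ])/sprintf('%%%02X', ord $1)/ge;
            tr/ /+/;                     # form encoding writes space as "+"
        }
        push @out, "$key=$value";
    }
    return join '&', @out;
}

my $body = form_encode_utf8(name => "caf\x{e9}", note => "a b");
print "$body\n";    # name=caf%C3%A9&note=a+b
```

The resulting string could then be passed to the request as ready-made content rather than as an array of pairs, which keeps it out of the substitution above.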

Steve Collyer
