CGI module and UTF-8


We already had a discussion about using the CGI module when the
interaction is to be in UTF-8; see the earlier message and the
messages before and after it. I summarise:

If output to the form (that is, to STDOUT) and input from the form (that
is, from STDIN) use different encodings, the last input values cannot be
used as the next default values, as is normally the case when the CGI
module is used.

One way around this is to use binary I/O in both directions and do
*all* encoding/decoding in the script. For instance, when option
"Location" shall be defaulted to "München", one has to write

  use Encode;
  $Muenchen = encode('utf8', 'München');
  $cgi->textfield(-name => 'Location', -value => $Muenchen, -size => 40)

The simpler

  $cgi->textfield(-name =>'Location', -value => 'München', -size => 40)

would not work because the letter 'ü' cannot be output in binary mode.
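
As a minimal sketch of the binary-I/O approach, the following shows the
round trip for a single value. It assumes the script source is saved as
UTF-8 and declares "use utf8"; the variable names are illustrative, and the
incoming bytes are hard-coded rather than read from a real form. It also
uses the strict 'UTF-8' encoding name instead of the lax 'utf8' used above.

```perl
use strict;
use warnings;
use utf8;                          # the source file itself is UTF-8 (assumption)
use Encode qw(encode decode);

# Text string as used inside the script: 7 characters.
my $default = 'München';

# Outgoing: turn the text into UTF-8 bytes before it is written in binary mode.
my $out_bytes = encode('UTF-8', $default);

# Incoming: a form value arrives as raw bytes; decode it before using it as text.
my $in_bytes = "M\xC3\xBCnchen";   # stand-in for what the browser sends back
my $in_text  = decode('UTF-8', $in_bytes);

# prints "7 characters, 8 bytes"
printf "%d characters, %d bytes\n", length($in_text), length($out_bytes);
```

The point of the design is that every value crosses the script boundary as
bytes and is converted exactly once in each direction.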

Another way is to convince the CGI module to consistently use UTF-8 code.
There are two ways to do that:

 - use one of binmode(STDIN,":utf8") or binmode(STDIN,":encoding(utf8)")
   so that the conversion from a byte string to a text string is already
   done during input

 - use the -utf8 pragma of the CGI module
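
To illustrate what the first option does, this sketch reads the same bytes
once without and once with a decoding layer. The in-memory filehandle is
only a stand-in for STDIN, chosen to keep the example self-contained, and
read_with_layer is a hypothetical helper. In a real script the layer would
be set with binmode(STDIN, ...) before the CGI module reads its input, and
the second option would be requested with "use CGI qw(-utf8);".

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "München", as they would arrive on STDIN.
my $bytes = "M\xC3\xBCnchen\n";

# Read the bytes through the given I/O layer and return the first line.
sub read_with_layer {
    my ($layer) = @_;
    open(my $fh, "<$layer", \$bytes)
        or die "cannot open in-memory handle: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $raw     = read_with_layer(':raw');              # stays a byte string
my $decoded = read_with_layer(':encoding(UTF-8)');  # becomes a text string

# prints "raw: 8, decoded: 7"
printf "raw: %d, decoded: %d\n", length($raw), length($decoded);
```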

Now the documentation says:

| -utf8
| This makes CGI.pm treat all parameters as UTF-8 strings. Use this with care,
| as it will interfere with the processing of binary uploads. It is better to
| manually select which fields are expected to return utf-8 strings and
| convert them using code like this:
|    use Encode;
|    my $arg = decode utf8=>param('foo');

The last line looks easy but is in fact cumbersome: you need to go through all
parameters and *set* the parameter to its decoded value. Otherwise the
decoding is not used when the value is reused as default for the next
iteration of the form.
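
The loop described here could look roughly like the following sketch.
decode_all_params is a hypothetical helper, not part of the CGI module; it
only assumes that the query object offers param() with the calling
conventions of CGI.pm (no argument: list of names; one argument in list
context: all values; -name/-values: set the values).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode every parameter value of a CGI-like query object and store the
# decoded values back, so that they are reused as defaults in the next form.
sub decode_all_params {
    my ($q) = @_;
    for my $name ($q->param) {
        # List context returns all values of a multi-valued field.
        my @values = map { decode('UTF-8', $_) } $q->param($name);
        $q->param(-name => $name, -values => \@values);
    }
    return $q;
}
```

It would be called once after creating the query object, for example as
decode_all_params($cgi). File-upload fields would probably have to be
skipped, since decoding their value replaces the upload handle with a plain
string; newer CGI.pm versions also warn about param() in list context and
offer multi_param() for that purpose.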

I have done several tests in the meantime, and I feel that it is just the
other way round: fiddling with the binmode of STDIN causes problems with
binary uploads (for ":encoding(utf8)" this is not surprising, but simple
unchecked ":utf8" doesn't work either); however, using the -utf8 pragma does
not! It behaves as one would have expected: obviously the analysis of the
input file is done prior to all encoding considerations and the effect of the
-utf8 pragma takes place *after* the binary data has already been removed or
recognised as already processed.  At least, I was not able to corrupt any
binary file data uploaded together with UTF-8 coded form data using the -utf8
pragma.

This would be great news for people with internationalised websites: the
CGI module is not restricted to 256-character codes. Why "*would* be great
news"? As long as the documentation warns about it, one cannot use this
valuable feature with a good conscience.

Or are there reasons not to use the -utf8 pragma when binary data are
uploaded?

Helmut Richter
