DBD::Oracle, Unicode, non-UTF8-non-ASCII strings

Do you have a question? Post it now! No Registration Necessary.  Now with pictures!


I am using DBD::Oracle with a client and database characterset of AL32UTF8.

My problem is that DBD::Oracle assumes utf8-encoded strings in prepare().  
If the input strings are not utf8, and contain non-ASCII characters, then  
they are *not* automatically converted to utf8 by DBD::Oracle. They are  
offered to Oracle as if they were utf8, and Oracle processes them even if  
they are not valid, maybe even returning them to Perl as such; Perl is  
then left with invalid utf8-flagged strings simply actually latin1  

To be very clear: this is *not* an issue of wrong character encodings  
during any time before the calls to DBI/DBD::Oracle (though it is,  

I seem to remember, although I cannot find it now, that this behaviour  
also was well documented in DBD::Oracle or at some other place, so I am  
not so much complaining about DBD::Oracle (though I'm a bit miffed - this  
would seem to be trivially fixable in there...).

Here is an example:

     my $sql="select 'a \xe4 a' from dual";    # \xe4 is the german "a  
     diag(DBI::data_string_desc($sql));   # UTF8 off, non-ASCII, 24  
characters 24 bytes
     my $sth=$dbh->prepare($sql);
     my ($rc)=$sth->fetchrow();
     diag("is_utf8: ".utf8::is_utf8($rc)." valid:  
'".utf8::valid($rc)."'");  # is_utf8: 1 valid: ''
     diag(DBI::data_string_desc($rc));   # UTF8 on but INVALID encoding,  
ASCII, 3 characters 5 bytes

The output:

     # UTF8 off, non-ASCII, 24 characters 24 bytes
     # is_utf8: 1 valid: ''
     Malformed UTF-8 character (unexpected non-continuation byte 0x20,  
immediately after start byte 0xe4) in pattern match (m//) at  
/usr/lib/perl5/site_perl/5.8.3/x86_64-linux-thread-multi/DBI.pm line 1117.
     # UTF8 on but INVALID encoding, ASCII, 3 characters 5 bytes

This also highlights one nastiness of the situation: no error or warning  
whatsoever about the situation is displayed by the usual  
prepare/execute/fetch calls. The "Malformed ..."-line above is from inside  
DBI::data_string_desc, where the broken string is used in a regexp  
(regexp's seem to be one place where this is noticed). So it's purely by  
chance that there is a notice about this problem in this test script.

What to do? I have dozens, maybe 100 scripts that would be perfectly  
utf8-capable except for this issue. They use proper :encoding() on their  
file handles etc. . Is there some flag I can activate in DBD::Oracle or  
DBI which automatically recodes "UTF8 off, non-ASCII" strings to utf8? A  
simple Encode::decode("utf8",$sql) fixes the issue of course, but I am not  
really looking forward to changing all those scripts, maybe missing lots  
of stuff along the way or introducing errors.

We have a module that handles low level DB stuff like opening the DB  
handle. So if there was a fix which could take place *once* after opening  
the DB, that would be great.


    $DBD::Oracle::VERSION = '1.23';
    $DBI::VERSION = "1.52";
    Oracle 10

Thanks in advance!

Site Timeline