utf-8 of a string

I am asking part one of my questions here because I do not know where
else to ask it.
   I have been sent a string in a language whose alphabet
   I do not know. How can I find the UTF-8 representation
   of this string?

Part two of my doubts is related to perl. But I haven't really got
around to grappling with it because right now, part one is an obstacle
to me.

The next step in part two would be to read a CLOB field in a database,
and grep it in a perl script to check whether the above string appears
in it. I can read this CLOB field and write it to an excel sheet, and
the excel sheet shows the CLOB data to me. I would like to check
whether this CLOB data contains the string which has been sent to me.

Is there any forum-FAQ or code where I can find some answers/pointers
to my questions?
Please advise.

Re: utf-8 of a string

On Tue, 28 Jun 2011, dn.perl@gmail.com wrote:

This can hardly be answered without knowing what kind of alphabet or code
the data is in. For converting, there are ready-made tools (e.g. recode), but
programming your own in Perl using the Encode module is not difficult, and
is more adaptable to the idiosyncrasies of your data, which may not exactly
adhere to one code. Before you can use Encode sensibly, you need to
understand the difference between characters and bytes in Perl, which is
explained in perlunitut.
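
To make the character/byte distinction concrete, here is a minimal sketch
using Encode. The source encoding (ISO-8859-1) and the sample bytes are
assumptions for illustration only; substitute whatever encoding the data
really is in.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the incoming bytes are ISO-8859-1 (illustrative only).
my $bytes = "caf\xE9";                      # 4 bytes; 0xE9 is e-acute in Latin-1

my $chars = decode('ISO-8859-1', $bytes);   # bytes -> Perl character string
my $utf8  = encode('UTF-8', $chars);        # characters -> UTF-8 bytes

# The character count stays at 4; the UTF-8 form needs 5 bytes.
printf "%d characters, %d UTF-8 bytes: %s\n",
    length($chars), length($utf8),
    join ' ', map { sprintf '%02X', ord($_) } split //, $utf8;
```

The same two calls work for any encoding Encode supports; only the name
passed to decode() changes.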

Formerly, there was comp.std.internat dealing, among other things, with
such issues. Due to very little traffic, the group has been closed down by
the usenet bigwigs, so there is no longer an appropriate group AFAIK.

Could you make one of your files publicly accessible (e.g. on the Web) and
briefly describe the problem here? It will be off-topic here, but if the
thread remains short it is less of a nuisance than a lengthy discussion
about where it should go.

Helmut Richter

Re: utf-8 of a string

Simple. Use one of the modules that do encoding conversions: decode the
string from whatever encoding it is in and then convert it to UTF-8.
Years ago I used Text::Iconv very successfully.

You don't know the encoding of your string? Well, then you have a real
problem and can stop right there. Guessing an encoding based on the mere
binary data is an AI project.
I seem to remember that years ago someone mentioned a module that
heuristically guesses which encoding (and language?) a particular byte
string may have. But even in the best of cases that is just a guess.
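
One candidate for such a module is Encode::Guess, which ships with Encode.
Whether it is the module remembered above is an assumption. The sketch
below uses only its default suspects (ASCII, UTF-8, and BOM-marked
UTF-16/32); the sample bytes are made up, and the result is still only a
guess.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical input: the UTF-8 bytes of a six-letter Cyrillic word.
my $bytes = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Returns an encoding object on success, or a plain error string
# when nothing fits or more than one suspect fits.
my $guess = guess_encoding($bytes);

if (ref $guess) {
    my $chars = $guess->decode($bytes);
    printf "guessed %s, %d characters\n", $guess->name, length $chars;
}
else {
    print "no unambiguous guess: $guess\n";
}
```

Further suspects can be passed as extra arguments to guess_encoding(), but
the more suspects there are, the more likely the answer is ambiguous.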

The index() function will do that perfectly fine:

    index STR,SUBSTR
          The index function searches for one string within another[...]
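
A minimal sketch of that check, assuming both the CLOB contents and the
search string arrive as UTF-8 bytes (the sample data and variable names are
made up). Decoding both sides first means index() compares characters with
characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical data; in practice $clob_bytes would come from the database.
my $clob_bytes   = "log entry: \xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82 world";
my $needle_bytes = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Assumption: both are UTF-8. Decode so that both are character strings.
my $clob   = decode('UTF-8', $clob_bytes);
my $needle = decode('UTF-8', $needle_bytes);

# index() returns -1 if absent, otherwise the 0-based character offset.
my $pos = index($clob, $needle);
print $pos >= 0 ? "found at character offset $pos\n" : "not found\n";
```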

The Perl FAQ is part of any standard Perl installation and is sitting
right there on your hard drive, see "perldoc perldoc". You are looking
for the perlfaq page or the -q option.


Re: utf-8 of a string

This is a standard application of n-grams / trigrams. A well-trained
trigram database can make a very good guess at the encoding and
language of your input text. Training the thing is the challenge.
There might be research databases available, I just noticed a Google
Labs one and a Microsoft Research one in a search, but I don't know
how easy they are to use.
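
As a toy illustration of the idea only (not a usable detector), the sketch
below builds trigram counts from two tiny made-up samples and picks the
profile that shares the most trigrams with the input. The helper names and
the scoring rule are my own assumptions; a real system needs far more
training text and a better distance measure.

```perl
use strict;
use warnings;

# Count overlapping character trigrams in a string.
sub trigrams {
    my ($text) = @_;
    my %count;
    $count{ substr($text, $_, 3) }++ for 0 .. length($text) - 3;
    return \%count;
}

# Score: number of trigram occurrences in $text that the profile also contains.
sub score {
    my ($profile, $text) = @_;
    my $seen = trigrams($text);
    my $hits = 0;
    $hits += $seen->{$_} for grep { exists $profile->{$_} } keys %$seen;
    return $hits;
}

# Toy "training" samples, far too small for real use.
my %profile = (
    english => trigrams('the quick brown fox jumps over the lazy dog and then rests'),
    german  => trigrams('der schnelle braune fuchs springt und dann ruht er sich aus'),
);

my $input = 'then the dog rests';
my ($best) = sort { score($profile{$b}, $input) <=> score($profile{$a}, $input) }
             keys %profile;
print "best match: $best\n";
```

The same counting works on raw bytes instead of characters, which is how
one would apply it to guessing an encoding rather than a language.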

Text::NSP (N-Gram Statistics Project) seems relevant, but I haven't
used it.

did n-gram analysis of all his email for a few years to ID language

Re: utf-8 of a string

Do you really have just a single string in an unknown alphabet?
If so, why is asking whoever sent it to you not a solution?

Or are you expecting to be sent multiple strings?  Are they all going
to be in the same alphabet, or are they going to be in different
alphabets?  What assumptions, if any, can you reasonably make?
Are there any limitations on the possible set of encodings?

How exactly are these strings represented?  Where are they coming
from?  Why isn't whatever entity is providing them to you also
telling you how to interpret them?

There is no basis in the information you've given us for guessing
how to decode these strings.  They may be in the information you
haven't given us.

Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
