Fun With Unicode

I had a list of names, with added detritus, which I wanted to clean up
and collate alphabetically. However, some of these names have some
characters which cannot be expressed in ASCII or ISO-8859-1, and
indeed I found out that the text in question is encoded in UTF-8.
For test purposes, here is a drastically shortened version of the
list:

     Ed Gooz Unblock
     Emirjon Fishta Unblock
     Nathan Gutierrez Unblock
     Amanda Alatti Unblock
     Yuri Aleksei Carrión Belliard Unblock
     Zackry Wallace-Bell Unblock
     Collin Tierney Unblock
     Frederic Moseley Jr. Unblock
     İrfan Qureyş Unblock
     Arthur Vullamparthi Unblock
     Kate Onthetimeline Unblock
     Mary Elizabeth Blackley Unblock
     Lisa Lauchstedt Unblock
     Padraic O'Driscoll Unblock
     Ragu PG Unblock
     Tammy Houghtaling Unblock
     Sifokl AlSifokli Unblock
     Nékoé Mīkûriá Unblock
     John Froex Unblock
     Chasity Ahmad Unblock

So I wrote a script to clean and collate that list.

But my first few attempts were *NOT* successful, so I had to
work to find solutions for some problems.

Firstly, I discovered that the first line of the original file
started with a "Byte Order Mark" ("BOM"). My attempts to remove
it were failing. But after doing some research I discovered that
the representation of the BOM inside Perl is *NOT* the same as the
bytes in the file. A UTF-8 BOM at the beginning of a file is the
byte sequence EF BB BF, but once the input has been decoded, the
representation in Perl is the single character "\x{FEFF}", or
"\N{BOM}" for short.
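
The difference between the bytes in the file and the decoded character can be seen in a small sketch. It assumes the core Encode module and a made-up sample line; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they would sit in a UTF-8 file that starts with a BOM.
my $bytes = "\xEF\xBB\xBF" . "Ed Gooz";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
printf "first character: U+%04X\n", ord($chars);

# After decoding, the BOM is one character, so this is what must be matched.
(my $clean = $chars) =~ s/^\x{FEFF}//;
print "cleaned: $clean\n";
```

A pattern such as s/^\xEF\xBB\xBF// would only match if the input had been read as raw bytes, without a decoding layer.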

Then I struggled with pattern matches to end-of-line that weren't
working, before I realized each line had an invisible "\x0d" on
the end of it, because chomp was removing the "\x0a" from "\x0d\x0a"
and leaving the "\x0d" behind.
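
The chomp behaviour can be reproduced in isolation. This is a minimal sketch, assuming a system where "\n" is "\x0a"; strip_eol is a hypothetical helper name used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip any mix of trailing CR and LF characters,
# so both DOS-style and Unix-style lines come out clean.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\x0d\x0a]+\z//;
    return $line;
}

my $dos_line = "Ed Gooz Unblock\x0d\x0a";

# chomp removes only a trailing $/ (by default "\n", i.e. "\x0a" here),
# so the "\x0d" is still there afterwards.
my $chomped = $dos_line;
chomp $chomped;
print "chomp leaves CR behind: ",
    ($chomped =~ /\x0d\z/ ? "yes" : "no"), "\n";

my $stripped = strip_eol($dos_line);
print "strip_eol leaves CR behind: ",
    ($stripped =~ /\x0d\z/ ? "yes" : "no"), "\n";
```

On Perl 5.10 or later, s/\R\z// should be another option, since \R matches a generic line ending including the CR LF pair.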

Then I struggled with the name "İrfan Qureyş" being sorted to the
end instead of between H and J where it belongs, until I realized
that Perl was sorting by codepoint ordinal instead of alphabetically.
But then I discovered Unicode::Collate.
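
The two orderings can be compared side by side. This is a small sketch using a few names from the list, assuming the default collation table that ships with Unicode::Collate; the non-ASCII letters are written as escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use Unicode::Collate;

# "\x{130}" is LATIN CAPITAL LETTER I WITH DOT ABOVE,
# "\x{15F}" is LATIN SMALL LETTER S WITH CEDILLA.
my $irfan = "\x{130}rfan Qurey\x{15F}";
my @names = ("John Froex", $irfan, "Zackry Wallace-Bell", "Frederic Moseley Jr.");

# Built-in sort compares code points; U+0130 is above every ASCII letter,
# so the name lands after "Zackry".
my @by_codepoint = sort @names;

# Unicode::Collate compares collation keys, where the dotted capital I
# sorts together with the other forms of "i".
my @collated = Unicode::Collate->new->sort(@names);

binmode STDOUT, ':encoding(UTF-8)';
print "code point order: ", join(", ", @by_codepoint), "\n";
print "collated order:   ", join(", ", @collated), "\n";
```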

Having cleared those issues up, my script looks like this:

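#! /usr/bin/perl
use v5.14;
use strict;
use warnings;
use open qw( :encoding(UTF-8) :std );  # Decode input, encode output as UTF-8.
use Unicode::Collate;
use charnames ":short";

sub process_line (_) {
    my $line = shift;                       # Work on a copy of the line.
    $line =~ s/[\x0a\x0d]+$//;              # Get rid of newline (Windows OR Unix).
    $line =~ s/^\x{FEFF}//;                 # Get rid of BOM (U+FEFF) at start of line (if any).
    $line =~ s/^\s*(.+?)\s+Unblock$/$1/;    # Get rid of leading/trailing space & junk.
    return $line . "\x0a";                  # Add Unix-style newline.
}

# Clean every input line, then print the lines in collated order.
print Unicode::Collate->new->sort( map { process_line($_) } <> );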

and the cleaned & sorted version of the name list above looks like this:

Amanda Alatti
Arthur Vullamparthi
Chasity Ahmad
Collin Tierney
Ed Gooz
Emirjon Fishta
Frederic Moseley Jr.
İrfan Qureyş
John Froex
Kate Onthetimeline
Lisa Lauchstedt
Mary Elizabeth Blackley
Nathan Gutierrez
Nékoé Mīkûriá
Padraic O'Driscoll
Ragu PG
Sifokl AlSifokli
Tammy Houghtaling
Yuri Aleksei Carrión Belliard
Zackry Wallace-Bell

Everything is sorted to the right place, even "İrfan Qureyş".

As always, I'm open to better ways of implementing things.
Does anyone see any improvements that could be made to
the above script?

Robbie Hatley
Midway City, CA, USA
perl -le 'print "4o6e7o4f0w5llc7m"'

Re: Fun With Unicode

Windows programs have the annoying habit of writing a BOM at the start
of UTF-8 encoded files even though UTF-8 has no byte order that would
need marking.

Yes. BOM is U+FEFF.  

UTF-8 is a way to map Unicode code points (e.g. U+FEFF) to sequences of
bytes (e.g. EF BB BF). The mapping has a number of desirable properties
(e.g. characters U+0000 .. U+007F are encoded as a single byte and
compatible with US-ASCII, sequences are self-terminating, partial
sequences are detectable), but it is not a byte-by-byte mapping of a
binary representation of the Unicode code point. You can't just
concatenate the hex values of some UTF-8 bytes to get a Unicode code
point; you have to compute it properly.
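
As a rough illustration of what "compute it properly" means for the BOM, here is a sketch that decodes one three-byte sequence by hand. The helper name decode_utf8_3byte is made up for this example, and it handles only the three-byte form; real code would normally rely on Encode or an I/O layer.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one three-byte UTF-8 sequence by hand.
# Bit layout: 1110xxxx 10xxxxxx 10xxxxxx (16 payload bits in total).
sub decode_utf8_3byte {
    my ($lead, @rest) = @_;
    die "expected exactly three bytes\n" unless @rest == 2;
    die "not a three-byte lead byte\n" unless ($lead & 0xF0) == 0xE0;
    for my $byte (@rest) {
        die "not a continuation byte\n" unless ($byte & 0xC0) == 0x80;
    }
    # Drop the marker bits and join the payload bits, most significant first.
    return (($lead & 0x0F) << 12) | (($rest[0] & 0x3F) << 6) | ($rest[1] & 0x3F);
}

printf "EF BB BF -> U+%04X\n", decode_utf8_3byte(0xEF, 0xBB, 0xBF);
```

The lead byte EF contributes 0xF, BB contributes 0x3B and BF contributes 0x3F, which combine to 0xFEFF rather than 0xEFBBBF.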


   _  | Peter J. Holzer    | The curse of electronic word processing:
|_|_) |                    | you keep polishing your text until the
| |   |                    | constituent parts of the sentence no
__/   |                    | longer fits together. -- Ralph Babel

Re: Fun With Unicode

But not appropriately.

No, that's the transform, not the code point, for U+FEFF "ZERO WIDTH
NO-BREAK SPACE".

What gives you that idea? It has never been and never will be.

That's because you're using a system for which \n is LF (0a), not CR LF
(0d 0a).

See section 6, "Byte order mark (BOM)", in RFC 3629.

Shmuel (Seymour J.) Metz, SysProg and JOAT

Unsolicited bulk E-mail subject to legal action.  I reserve the
right to publicly post or ridicule any abusive E-mail.  Reply to
domain Patriot dot net user shmuel+news to contact me.  Do not
reply to

Re: Fun With Unicode

Yes, you're dealing with a complete idiocy: a UTF-8 encoded BOM.

UTF-8 doesn't require a byte order marker, because it is specified down
to the bit level; there is no choice of byte order. The most significant
part of a code point is encoded in the earliest UTF-8 byte.
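
The fixed bit order can be illustrated by going the other way, from code point to bytes. This is only a sketch for the three-byte range; encode_utf8_3byte is a hypothetical helper, not a library function.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a code point in U+0800..U+FFFF as three
# UTF-8 bytes. The order of the bytes is fixed by the encoding itself.
sub encode_utf8_3byte {
    my ($cp) = @_;
    die "outside the three-byte range\n" if $cp < 0x800 || $cp > 0xFFFF;
    die "surrogates are not encodable\n" if $cp >= 0xD800 && $cp <= 0xDFFF;
    return (
        0xE0 | ($cp >> 12),            # most significant 4 bits come first
        0x80 | (($cp >> 6) & 0x3F),    # then the middle 6 bits
        0x80 | ($cp & 0x3F),           # least significant 6 bits come last
    );
}

my @bytes = encode_utf8_3byte(0xFEFF);
printf "U+FEFF -> %s\n", join ' ', map { sprintf '%02X', $_ } @bytes;
```

Since there is only one possible byte order, the EF BB BF prefix tells a reader nothing about endianness; at most it hints that the file is UTF-8.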

Re: Fun With Unicode

On 4/14/2015 1:16 PM, Kaz Kylheku wrote:

For some reason Bill Gates & Company thought they knew better,
so they made Notepad and other MS software add a BOM to the beginning
of every utf8-encoded text file. Go figure.

Perhaps they're using it as a marker saying "This is a Unicode file"?

Robbie Hatley
Midway City, CA, USA
perl -le 'print "4o6e7o4f0w5llc7m"'
