Random string from selected Unicode character set (test data)

I am implementing the script from generatedata.com

But I would like for it to also display Unicode chars, so I can test
other languages.

I have looked everywhere, but can't seem to find a PHP function that
lets me do something like

$outstring .= make_unicode_random('katakana');

thereby selecting things from the code points U+30A0 .. U+30FF


The strings could be defined more broadly like "japanese" etc for
language separation. Not important.

Point is to get a string of 1..n chars of a certain language group, be
it Greek, European accents, Japanese, Mandarin, Hangul etc

I have found a project "babel" which is a .NET application. Not sure
if it's open sourced.

Does anyone have some pointers to this project?

Re: Random string from selected Unicode character set (test data)

Horst Lemminger wrote:
I would simply make an indexed array of all the characters you want to
select from, and then randomly index into that.
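
The table approach described above might look something like this. It is only a minimal sketch: the table contents, the $tables variable and the random_from_table() name are made up for illustration, random_int() assumes PHP 7 or later, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical lookup tables: one indexed array of characters per set.
// Real tables would be much larger (possibly generated by another script).
$tables = [
    'greek'    => ['α', 'β', 'γ', 'δ', 'ε'],
    'katakana' => ['ア', 'イ', 'ウ', 'エ', 'オ'],
];

// Build a string of $length characters by randomly indexing into $chars.
function random_from_table(array $chars, int $length): string
{
    $out  = '';
    $last = count($chars) - 1;
    for ($i = 0; $i < $length; $i++) {
        $out .= $chars[random_int(0, $last)];
    }
    return $out;
}

echo random_from_table($tables['katakana'], 8), "\n";
```

Because each array element is already a complete UTF-8 character, no encoding step is needed at run time.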

I've done similar in the past by writing a program to write the source
code of a large lookup table.

Sometimes the 'table' approach is just simpler (and much faster) than an
algorithmic approach  unless you are REALLY strapped for code/static

Which is unlikely to be the case with a typical LAMP type installation.

Use separate tables for each character set.

Not me

To people who know nothing, anything is possible.
To people who know too much, it is a sad fact
that they know how little is really possible -
and how hard it is to achieve it.

Re: Random string from selected Unicode character set (test data)


First, get a function that takes a code point and generates the UTF-8
bytes for that code point.  See below.  I chose to put the data in
urlencode format (e.g. "%E3%82%A0" for U+30A0) and then urldecode()
it to a raw byte sequence.  It may not be super-efficient but it
works.  Note that the code point is passed as an integer: a hex
literal (beginning with "0x") matches the U+ notation most directly,
although decimal works also.  UTF-8 encoding is essentially a lot of
bit shifting and masking.

This function is great as a base for generating code charts (that
is, *ALL* of the characters in a certain range, in a table labelled
with the codepoint, to test out your fonts).

Second, make some tables so that, given a language, you can determine
some code point range(s) which contain characters from that language.
The unicode.org tables of blocks of characters and what scripts
they belong to (Blocks.txt and Scripts.txt in the Unicode Character
Database) may be useful here.

Third, make a function that picks some random code points from a
particular language range and outputs them.
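
The second and third steps could be sketched roughly as follows. Everything named here is an assumption for illustration: the $ranges table, its keys, and the random_unicode_string() and cp_to_utf8() names are not from any library. cp_to_utf8() is a stand-in encoder built on html_entity_decode() so the block runs without anything else; mb_chr() would also do where mbstring is available. The array destructuring in foreach assumes PHP 7.1 or later.

```php
<?php
// Hypothetical table: name => list of [first, last] code point ranges.
// Whole Unicode blocks can contain unassigned code points, so some of
// these are narrowed to the assigned letters.
$ranges = [
    'katakana' => [[0x30A0, 0x30FF]],
    'hiragana' => [[0x3041, 0x3096]],
    'greek'    => [[0x0391, 0x03A1], [0x03A3, 0x03A9], [0x03B1, 0x03C9]],
    'hangul'   => [[0xAC00, 0xD7A3]],
];

// Stand-in encoder: decode a numeric character reference to UTF-8 bytes.
function cp_to_utf8(int $cp): string
{
    return html_entity_decode('&#' . $cp . ';', ENT_NOQUOTES, 'UTF-8');
}

// Pick $length random code points, uniformly over all listed ranges.
function random_unicode_string(array $ranges, int $length): string
{
    $total = 0;
    foreach ($ranges as [$first, $last]) {
        $total += $last - $first + 1;
    }
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        $offset = random_int(0, $total - 1);
        // Walk the ranges until the offset falls inside one of them.
        foreach ($ranges as [$first, $last]) {
            $size = $last - $first + 1;
            if ($offset < $size) {
                $out .= cp_to_utf8($first + $offset);
                break;
            }
            $offset -= $size;
        }
    }
    return $out;
}

echo random_unicode_string($ranges['katakana'], 10), "\n";
```

Broader groups such as "japanese" could be made by merging the range lists of several entries before calling the function.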

Be sure to declare the character set of your web page (if that's
where the output is going) as UTF-8 in the HTTP headers:

    header('Content-type: text/html; charset=UTF-8');


function codepoint_to_utf8($code)
{
    /* Build the bytes in urlencode format, then decode to raw bytes. */
    if ($code <= 0x7f) {
        $s = sprintf("%%%02X", $code);
    } else if ($code <= 0x7ff) {
        $s = sprintf("%%%02X%%%02X",
            ((($code >>  6) & 0x1f)) | 0xc0,
            ((($code >>  0) & 0x3f)) | 0x80);
    } else if ($code <= 0xffff) {
        $s = sprintf("%%%02X%%%02X%%%02X",
            ((($code >> 12) & 0x0f)) | 0xe0,
            ((($code >>  6) & 0x3f)) | 0x80,
            ((($code >>  0) & 0x3f)) | 0x80);
    } else if ($code <= 0x1fffff) {
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X",
            ((($code >> 18) & 0x07)) | 0xf0,
            ((($code >> 12) & 0x3f)) | 0x80,
            ((($code >>  6) & 0x3f)) | 0x80,
            ((($code >>  0) & 0x3f)) | 0x80);
    } else if ($code <= 0x3ffffff) {
        /* actually, this is beyond legal Unicode */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X%%%02X",
            ((($code >> 24) & 0x03)) | 0xf8,
            ((($code >> 18) & 0x3f)) | 0x80,
            ((($code >> 12) & 0x3f)) | 0x80,
            ((($code >>  6) & 0x3f)) | 0x80,
            ((($code >>  0) & 0x3f)) | 0x80);
    } else if ($code <= 0x7fffffff) {
        /* actually, this is beyond legal Unicode */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X%%%02X%%%02X",
            ((($code >> 30) & 0x01)) | 0xfc,
            ((($code >> 24) & 0x3f)) | 0x80,
            ((($code >> 18) & 0x3f)) | 0x80,
            ((($code >> 12) & 0x3f)) | 0x80,
            ((($code >>  6) & 0x3f)) | 0x80,
            ((($code >>  0) & 0x3f)) | 0x80);
    } else {
        /* too large to encode at all */
        $s = '';
    }
    return urldecode($s);
}
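
To see the bit shifting and masking at work, the three-byte case can be traced by hand for U+30A0, the example code point mentioned earlier in this post. This is just an illustration of that one branch; the variable names $b1, $b2 and $b3 are made up here.

```php
<?php
// U+30A0 lies between 0x800 and 0xffff, so it takes three bytes.
$code = 0x30A0;                       // same value as decimal 12448

$b1 = (($code >> 12) & 0x0f) | 0xe0;  // top 4 bits  -> 1110xxxx
$b2 = (($code >>  6) & 0x3f) | 0x80;  // next 6 bits -> 10xxxxxx
$b3 = ( $code        & 0x3f) | 0x80;  // low 6 bits  -> 10xxxxxx

// urlencode form first, then the raw bytes.
$encoded = sprintf("%%%02X%%%02X%%%02X", $b1, $b2, $b3);
$bytes   = urldecode($encoded);

echo $encoded, "\n";          // %E3%82%A0
echo bin2hex($bytes), "\n";   // e382a0
```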

                UTF-8 Character set (U+0100 - U+01FF)
        _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F

   10_   Ā   ā   Ă   ă   Ą   ą   Ć   ć   Ĉ   ĉ   Ċ   ċ   Č   č   Ď   ď
   11_   Đ   đ   Ē   ē   Ĕ   ĕ   Ė   ė   Ę   ę   Ě   ě   Ĝ   ĝ   Ğ   ğ
   12_   Ġ   ġ   Ģ   ģ   Ĥ   ĥ   Ħ   ħ   Ĩ   ĩ   Ī   ī   Ĭ   ĭ   Į   į
   13_   İ   ı   Ĳ   ĳ   Ĵ   ĵ   Ķ   ķ   ĸ   Ĺ   ĺ   Ļ   ļ   Ľ   ľ   Ŀ
   14_   ŀ   Ł   ł   Ń   ń   Ņ   ņ   Ň   ň   ŉ   Ŋ   ŋ   Ō   ō   Ŏ   ŏ
   15_   Ő   ő   Œ   œ   Ŕ   ŕ   Ŗ   ŗ   Ř   ř   Ś   ś   Ŝ   ŝ   Ş   ş
   16_   Š   š   Ţ   ţ   Ť   ť   Ŧ   ŧ   Ũ   ũ   Ū   ū   Ŭ   ŭ   Ů   ů
   17_   Ű   ű   Ų   ų   Ŵ   ŵ   Ŷ   ŷ   Ÿ   Ź   ź   Ż   ż   Ž   ž   ſ
   18_   ƀ   Ɓ   Ƃ   ƃ   Ƅ   ƅ   Ɔ   Ƈ   ƈ   Ɖ   Ɗ   Ƌ   ƌ   ƍ   Ǝ   Ə
   19_   Ɛ   Ƒ   ƒ   Ɠ   Ɣ   ƕ   Ɩ   Ɨ   Ƙ   ƙ   ƚ   ƛ   Ɯ   Ɲ   ƞ   Ɵ
   1A_   Ơ   ơ   Ƣ   ƣ   Ƥ   ƥ   Ʀ   Ƨ   ƨ   Ʃ   ƪ   ƫ   Ƭ   ƭ   Ʈ   Ư
   1B_   ư   Ʊ   Ʋ   Ƴ   ƴ   Ƶ   ƶ   Ʒ   Ƹ   ƹ   ƺ   ƻ   Ƽ   ƽ   ƾ   ƿ
   1C_   ǀ   ǁ   ǂ   ǃ   Ǆ   ǅ   ǆ   Ǉ   ǈ   ǉ   Ǌ   ǋ   ǌ   Ǎ   ǎ   Ǐ
   1D_   ǐ   Ǒ   ǒ   Ǔ   ǔ   Ǖ   ǖ   Ǘ   ǘ   Ǚ   ǚ   Ǜ   ǜ   ǝ   Ǟ   ǟ
   1E_   Ǡ   ǡ   Ǣ   ǣ   Ǥ   ǥ   Ǧ   ǧ   Ǩ   ǩ   Ǫ   ǫ   Ǭ   ǭ   Ǯ   ǯ
   1F_   ǰ   Ǳ   ǲ   ǳ   Ǵ   ǵ   Ƕ   Ƿ   Ǹ   ǹ   Ǻ   ǻ   Ǽ   ǽ   Ǿ   ǿ
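
A chart like the one above could be produced with something along these lines. This is a rough sketch, not the script that generated that chart: code_chart() and cp_to_utf8() are hypothetical names, and cp_to_utf8() is a stand-in encoder (html_entity_decode() on a numeric character reference) so that the block runs without the function from this post.

```php
<?php
// Stand-in encoder: decode a numeric character reference to UTF-8 bytes.
function cp_to_utf8(int $cp): string
{
    return html_entity_decode('&#' . $cp . ';', ENT_NOQUOTES, 'UTF-8');
}

// Return a plain-text chart with 16 code points per row.
function code_chart(int $start, int $end): string
{
    $out = sprintf("UTF-8 Character set (U+%04X - U+%04X)\n", $start, $end);

    // Column header: _0 .. _F
    $out .= '      ';
    for ($col = 0; $col < 16; $col++) {
        $out .= sprintf('  _%X', $col);
    }
    $out .= "\n";

    // One row per group of 16 code points, labelled with the high digits.
    for ($row = $start & ~0xf; $row <= $end; $row += 16) {
        $out .= sprintf('%5X_', $row >> 4);
        for ($col = 0; $col < 16; $col++) {
            $out .= '   ' . cp_to_utf8($row + $col);
        }
        $out .= "\n";
    }
    return $out;
}

echo code_chart(0x0100, 0x01FF);
```

Columns only line up if the font gives every character the same width, which is part of what such a chart is testing.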
