Telling Unicode and real & characters apart.

Hi there. I've written a simple program that makes a simple GET form
with a text input box and displays $_GET["foo"] when submitted.

Using Windows Character Map, I pasted in the Cyrillic capital "Ya" (the
backward R) and it came out as "&#1071;". So far so good.

Then I sent in "[R] &#1071;" (the [R] is the Cyrillic character again).

That came out as "&#1071; &#1071;". How can I please tell the
difference between the Cyrillic and the character sequence '&', '#',

It seems to me that the '&' character should be transformed into
"&amp;" just like the Cyrillic characters. Perhaps I have misunderstood
something along the way.


Re: Telling Unicode and real & characters apart.

In reply to Louise GK:

The recommendation seems to be to use UTF-8 for the page that contains the form.
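
As a rough illustration of why UTF-8 helps, the sketch below compares what a GET query string would carry for the pasted Ya and for the typed entity text when the form page is UTF-8. It assumes PHP 7 or later (for the "\u{...}" escape), and the variable names are made up for this example.

```php
<?php
// Sketch only: what a GET query string carries when the form page is UTF-8.
// Assumes PHP 7+; $ya and $typed are illustrative names, not from the thread.

$ya    = "\u{042F}";  // CYRILLIC CAPITAL LETTER YA, stored here as UTF-8 bytes
$typed = '&#1071;';   // the seven ASCII characters a user might type literally

// urlencode() percent-encodes the bytes, much as a browser does for form fields.
$ya_on_wire    = urlencode($ya);
$typed_on_wire = urlencode($typed);

echo $ya_on_wire, "\n";     // %D0%AF
echo $typed_on_wire, "\n";  // %26%231071%3B
```

Because the two inputs arrive as different byte sequences, the server can tell them apart without any guessing.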


Re: Telling Unicode and real & characters apart.


 What encoding is the page with the form in?

 Some browsers will, if the page is in an encoding that does not contain the
character being pasted in, convert the character to an HTML character entity.
The result is then indistinguishable from pasting in the character entity itself.
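
 The fallback described above can be imitated in a few lines. This is only a sketch of the idea, not what any particular browser does internally: the helper names utf8_code_point() and simulate_entity_fallback() are hypothetical, and as a simplification every code point above U+00FF is treated as "not in the page encoding", which only approximates a Western European page.

```php
<?php
// Hypothetical helpers that imitate the browser fallback; not a real browser API.

// Decode one UTF-8 encoded character into its Unicode code point.
function utf8_code_point(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0]; // plain ASCII
    }
    // Keep only the payload bits of the lead byte, then append 6 bits per
    // continuation byte.
    $code_point = $bytes[0] & (0xFF >> ($count + 1));
    for ($i = 1; $i < $count; $i++) {
        $code_point = ($code_point << 6) | ($bytes[$i] & 0x3F);
    }
    return $code_point;
}

// Simplifying assumption: anything above U+00FF is unrepresentable and is
// replaced by a decimal character reference.
function simulate_entity_fallback(string $utf8_text): string
{
    return preg_replace_callback(
        '/[^\x{00}-\x{FF}]/u',
        function (array $match): string {
            return '&#' . utf8_code_point($match[0]) . ';';
        },
        $utf8_text
    );
}

$pasted_character = simulate_entity_fallback("\u{042F}"); // user pasted a Ya
$typed_entity     = simulate_entity_fallback('&#1071;');  // user typed the entity

var_dump($pasted_character === $typed_entity); // bool(true)
```

 Once both inputs have collapsed to the same string, nothing on the server side can recover which one the user actually entered.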

 Try the code below (filename: form_encoding.php), pasting a Ya followed by the
literal text "&#1071;" into the input box.

 Note what happens when you switch page encodings and resubmit the text;
iso-8859-15 doesn't contain a Ya, so the browser tries to make the best of an
impossible situation and sends the HTML character entity representation instead.

 The other two encodings, utf-8 and iso-8859-5, do contain Ya, so you get the
correct behaviour, i.e. a Ya followed by the literal text of the HTML entity.

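<?php
// Only accept the three encodings offered by the form, because the value is
// echoed into an HTTP header and into the page.
$allowed  = array('iso-8859-15', 'utf-8', 'iso-8859-5');
$encoding = (isset($_GET['encoding']) && in_array($_GET['encoding'], $allowed, true))
    ? $_GET['encoding'] : 'iso-8859-15';
header("Content-type: text/html; charset=$encoding");
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
 <title>form encoding</title>
</head>
<body>
<form method="get" action="form_encoding.php">
 <input type="radio" name="encoding" value="iso-8859-15" id="encoding-iso-8859-15">
  <label for="encoding-iso-8859-15">iso-8859-15 (Western European)</label><br>

 <input type="radio" name="encoding" value="utf-8" id="encoding-utf-8">
  <label for="encoding-utf-8">utf-8 (Unicode)</label><br>

 <input type="radio" name="encoding" value="iso-8859-5" id="encoding-iso-8859-5">
  <label for="encoding-iso-8859-5">iso-8859-5 (Cyrillic)</label><br>

 <input type="submit" value="Set Encoding">
</form>

<p>Encoding: <?php print $encoding; ?></p>

<form method="get" action="form_encoding.php">
 <input type="hidden" name="encoding" value="<?php print $encoding; ?>">
 <input type="text" name="input">
 <input type="submit">
</form>
<?php
// Echo the submitted text back, escaped for the current page encoding.
if (isset($_GET['input']) && is_string($_GET['input']))
    print htmlspecialchars($_GET['input'], ENT_QUOTES, $encoding);
?>
</body>
</html>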
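
 To isolate what the final htmlspecialchars() call contributes, here is a minimal sketch that assumes the page is served as UTF-8 and that the browser therefore sent the Ya as itself. The variable names are illustrative only, and PHP 7 or later is assumed for the "\u{...}" escape.

```php
<?php
// Sketch, assuming a UTF-8 page: the pasted Ya arrives as a real character,
// while the typed entity arrives as plain ASCII text.
$received = "\u{042F} &#1071;";

// Escaping turns the literal '&' into '&amp;', so the typed text is shown
// as typed instead of being rendered as a second Ya.
$escaped = htmlspecialchars($received, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // Я &amp;#1071;
```

 This is the "&" to "&amp;" transformation the original question expected; it happens on output, and it only helps when the two inputs were still distinct on arrival.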

Andy Hassall
