Multiple coding systems, and filesystems

On some of my course pages, I quote (with attribution)
small sections of Wikipedia and the like.  E.g., the top

has "entropia" in Greek font,

has the o-umlaut from German, and

has a Japanese font.  What is the correct "coding system"
(if that is the term) to use so that I could quote all
three of these on the same HTML page?

And can the HTML-page be set up so that it will validate?

Actually, I'm ahead of myself.  In the past I've cut and
pasted a snippet from, say, wiki/entropy, into an Emacs
buffer, adjoined a "From Wiktionary http://..." and attempted
to save the buffer.  Sometimes Emacs asked me which coding
system to use, and I don't know how to placate it.

If I'm using multiple coding systems on the same webpage,
do I have to save the different snippets in different files
stored with different coding systems, and then

        <!--#include ...  -->

each of them into one webpage?  Or can the file system
permit a file that simultaneously has Greek, German and
Japanese characters?

FWIW, my home OS is Mac OS X and I need to upload my webpages
to school.  The math dept. server is probably running
Unix; when I manipulate the HTML files (when at work), I'm
using Emacs running on a Solaris (Unix) system.

  Prof. Jonathan King  (gentsquash)
  Mathematics dept, Univ. of Florida

Re: Multiple coding systems, and filesystems

Tue, 3 Jun 2008 14:08:25 -0700 (PDT)

Files store bytes.  How those bytes are interpreted is
up to the application reading them.  Characters are encoded into
bytes using coding schemes, each of which can generally represent
only the characters of a specific character set.  The Unicode
character set aims to cover the characters of virtually every
writing system, so if you use a UTF (Unicode Transformation Format)
variant you can have all the characters you need in a single file.  So make
sure your text editor supports reading/saving files using UTF-8, for


Re: Multiple coding systems, and filesystems


Technically, it has the word in Greek _characters_ (letters). This is
the key issue; fonts are secondary. The page has a style sheet that
makes special suggestions on the font of such words, in a most confusing
and tricky way.

The proper _character encoding_ is UTF-8 in such cases. As soon as you
have Japanese, Greek, and umlaut Latin letters on one page, that's
definitely the best option. If there were just a few "special"
characters, you could present them using entity references like &ouml;
or character references like &#261;, but this gets clumsy (or requires
suitable software for generating them) if you have full sentences that
consist of "special" characters.
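
If you do want character references for a handful of characters, a short script can generate them. The following is a minimal Python sketch, assuming you have the text as a Unicode string; the function name to_char_refs is hypothetical. It relies on the standard "xmlcharrefreplace" error handler, which replaces each non-ASCII character with a decimal character reference.

```python
import html


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("Gödel"))               # G&#246;del

# Going the other way: resolve entity and character references to characters.
print(html.unescape("G&ouml;del &#261;"))  # Gödel ą
```

The result is pure ASCII, so it survives whatever encoding the file is saved or served in, at the cost of being unreadable in the source for long passages.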

It's not possible (in practice on web pages) to switch the character
encoding in the middle of an HTML document.
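
To see why, here is a small Python sketch of what happens when snippets saved in different encodings are glued into one document. The sample words and the helper decodes_cleanly are assumptions made for illustration. A reader that takes the whole thing as UTF-8 hits byte sequences that are not valid UTF-8.

```python
# Greek snippet saved in ISO-8859-7, Japanese snippet saved in UTF-8
# (illustrative words, not the actual quoted text).
greek_iso = "εντροπία".encode("iso-8859-7")
japanese_utf8 = "エントロピー".encode("utf-8")

# What a naive "include" of the two files would produce.
mixed = greek_iso + b" " + japanese_utf8

# The same two snippets, both saved as UTF-8.
all_utf8 = "εντροπία".encode("utf-8") + b" " + japanese_utf8


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_cleanly(mixed, "utf-8"))     # False
print(decodes_cleanly(all_utf8, "utf-8"))  # True
```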

UTF-8, if Emacs can really produce it. The version of Emacs I've been
using does not deal with "special" characters, but I recently looked at
the newest version of Emacs for Windows, and it seems to have
impressive support for "special" characters.

Note that the server should be configured to send an appropriate HTTP
header. You normally do this by adding something to your .htaccess file,
and in practice you need to use the same encoding for all ".html" files
in a directory (folder), though you could use, for example, ISO-8859-1
for ".html" and UTF-8 for ".htm" files.
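
To check what a server actually declares, you can look at the Content-Type header it sends. Below is a minimal Python sketch that only parses a header value you have already obtained (for instance from your browser's page-info dialog); the function name charset_from_content_type is hypothetical, and nothing here contacts a server.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None when missing.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

If this gives None or something other than utf-8 for a page you saved as UTF-8, the server configuration is what needs adjusting, not the file.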

No, it won't work that way, even if your server supports SSI includes.
They result in a single document, which can have one encoding only. (I
won't mention <iframe>, because it's really a poor hack for things like
this, but it performs sort-of include where the included document is
displayed "autonomously" inside the main canvas and may have a different

A nice mess :-) but it should be manageable when using UTF-8. When
uploading with FTP, use binary (not ASCII) mode, since no character
conversion should be performed - the data is already in a
system-independent encoding.

Jukka K. Korpela ("Yucca")

Re: Multiple coding systems, and filesystems

On Wed, 4 Jun 2008, Jukka K. Korpela wrote:

A better idea is to separate content-type and charset.
For example, use "utf8" for UTF-8 and "iso1" for ISO-8859-1.
On Apache, you can write into your .htaccess file:

  Options      +Multiviews
  DefaultType  text/html
  AddCharset   iso-8859-1  iso1
  AddCharset   utf-8       utf8

Name the files as "mypage.html.iso1" and "anotherpage.html.utf8"
or simply as "mypage.iso1" and "anotherpage.utf8";
and don't forget "stylesheet.css.utf8".

In the URLs, omit ".iso1" and ".utf8" of course:

  <a href="mypage.html">
  <a href="anotherpage.html">

/* One wonders if you need ISO-8859-1 at all
when you can have documents in UTF-8. */

Re: Multiple coding systems, and filesystems

On Tue, 3 Jun 2008, wrote:

Use Unicode in the encoding ("charset") UTF-8:

Choose UTF-8 for the web.

Yes - with Unicode.

Either use a UTF-8 locale such as

  export  LC_ALL="en_US.UTF-8"
  export    LANG="en_US.UTF-8"

or write all non-ASCII characters as character references

