<!doctype debiandoc public "-//DebianDoc//DTD DebianDoc//EN"
[
   <!entity % languages system "languages.ents"> %languages;
   <!entity % examples system "examples.ents"> %examples;
]>
<debiandoc>
<book>


<titlepag>
  <title>Introduction to i18n</title>
  <author>
    <name>Tomohiro KUBOTA</name>
    <email>kubota@debian.org</email>
  </author>
  <version><date></version>
  <abstract>
    This document describes the basic concepts of i18n
    (internationalization), how to write internationalized
    software, and how to modify existing software to internationalize it.
    Handling of characters is discussed in detail.
    It also includes a few case studies in which the author internationalized
    software such as TWM.
  </abstract>
  <copyright>
    <copyrightsummary>
      Copyright &copy; 1999-2001 Tomohiro KUBOTA.
      For chapters and sections whose original author is not KUBOTA,
      the authors of them have copyright.  Their names are written
      at the top of the chapter or the section.
    </copyrightsummary>
    <p>
      This manual is free software; you may redistribute it and/or modify it
      under the terms of the GNU General Public License as published by the
      Free Software Foundation; either version 2, or (at your option) any
      later version.
    </p>
    <p>
      This is distributed in the hope that it will be useful, but
      <em>without any warranty</em>; without even the implied warranty of
      merchantability or fitness for a particular purpose.  See the GNU
      General Public License for more details.
    </p>
    <p>
      A copy of the GNU General Public License is available as
      <tt>/usr/share/common-licenses/GPL</tt> in the Debian GNU/Linux 
      distribution or on the World Wide Web at 
      <url id="http://www.gnu.org/copyleft/gpl.html">.
      You can also obtain it by writing to the Free
      Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
      02111-1307, USA.
    </p>
  </copyright>
</titlepag>

<toc detail="sect1">

<chapt id="scope"><heading>About This Document</heading>

<sect id="scope2"><heading>Scope</heading>

<P>
This document describes the basic ideas of I18N for
programmers and package maintainers of Debian GNU/Linux and
other UNIX-like platforms.
Its aim is to offer an introduction to 
the basic concepts, to character codes, and to the points that need care
when one writes I18N-ed software or
an I18N patch for existing software.  There is a great deal of
know-how and there are many case studies on the internationalization of
software.  This document also tries to describe the
actual situation and the existing problems for each language and country.
</P>

<P>
This document stresses a few minimum requirements: 
that characters must be displayed with fonts of the
proper charset (users of the software must at least be 
able to guess what is written), 
that it must be possible to input characters from the keyboard, and
that software must not destroy characters.
I am trying to describe how to satisfy these requirements.
</P>

<P>
This document is strongly related to programming 
languages such as C and standardized I18N methods such as
LOCALE and <prgn>gettext</prgn>.
</P>

<sect id="newversion"><heading>New Versions of This Document</heading>

<P>
The current version of this document is available
on the 
<url id="http://www.debian.org/doc/ddp" 
name="DDP (Debian Documentation Project)"> page.
</P>

<p>
Note that the author rewrote this document in November 2000.
</p>

<sect id="feedback"><heading>Feedback and Contributions</heading>

<P>
This document needs contributions, especially for the 
chapter on each language (<ref id="languages">)
and the chapter on instances of I18N (<ref id="examples">). 
These chapters consist of contributions.
</P>

<P>
Otherwise, this will be a document on Japanization only, 
because the original author, Tomohiro KUBOTA 
(<email>kubota@debian.org</email>), 
speaks Japanese and lives in Japan.
</P>

<P>
<ref id="spanish"> is written by 
Eusebio C Rufian-Zilbermann <email>eusebio@acm.org</email>.
</P>

<P>
Discussions are held on the <tt>debian-devel@lists.debian.org</tt> mailing list.
(Might <tt>debian-doc</tt> or <tt>debian-i18n</tt> be more suitable?)
</P>

<chapt id="intro"><heading>Introduction</heading>

<sect id="intro-concepts"><heading>General Concepts</heading>

<P>
A Debian system includes a great deal of software.  Though much of it
can process, output, and input text data, some
of these programs assume that text is written in English (ASCII).
For people who use non-English languages, these programs are
hardly usable.  Moreover, though much software can handle
not only ASCII but also ISO-8859-1, a certain amount of it
cannot handle multibyte characters for CJK (Chinese, Japanese,
and Korean) languages, nor combining characters for Thai and so on.
</P>

<P>
So far, people who use non-English languages have given up
using their native languages and have accepted computers as they are.
However, we should throw away this wrong idea now.  
It is nonsense that a person who
wants to use a computer has to learn English in advance.
</P>

<P>
I18N is needed in the following areas.
<list>
 <item>Display characters for users' native languages.
 <item>Input characters for users' native languages.
 <item>Handle files written in popular encodings
       <footnote>
        There are a few terms related to character code,
	such as character set, character code, charset,
	encoding, codeset, and so on.  These words are explained
	later.
       </footnote>
       used for users' native languages.
 <item>Use characters for users' native languages for file names
       and so on.
 <item>Print out characters for users' native languages.
 <item>Display the software's messages in users' native languages.
 <item>Make formats for input/output of numbers, dates, money, and so on
       obey the customs of users' native cultures.
 <item>Make classification and sorting rules of characters obey the customs
       of users' native cultures.
 <item>Use typesetting and hyphenation rules for the users' native
       languages.
 <item>and so on.
</list>
This document puts stress on the first three items, because
they are the basis for the other items.  Another
reason is that you cannot use software lacking the first 
three items at all, whereas you can use software lacking the other items,
though inconveniently.  This document will also mention translation of
messages (item 6), which is often called 'I18N'.  Note that
the author regards the use of the term 'I18N' to mean translation
and <prgn>gettext</prgn>ization as completely wrong.  The reason
is the same one for which the author did not include
translation and <prgn>gettext</prgn>ization in the important first
three items.
</P>

<P>
Imagine a word processor which can display error
and help messages in your native language but cannot process
text in your native language.  You will easily understand that such a word
processor is not usable.  On the other hand, a word processor which
can process your native language but displays error and help messages
only in English is usable, though it is not convenient.
Before we think of developing convenient software, we have to
think of developing usable software.
</P>

<P>
The following terminology is widely used.
<list>
 <item>I18N (internationalization) means modification of software
       or related technologies so that the software can potentially
       handle multiple languages, customs, and so on in the world.
 <item>L10N (localization) means implementation of a specific language
       for already internationalized software.
</list>
However, this terminology is valid for only one specific model
out of the few models which we should consider for I18N.
Now I will introduce a few models other than this I18N-L10N model.
<taglist>
 <tag>a. <strong>L10N</strong> (localization)-model</tag>
   <item><p>
	This model supports two languages or character codes:
	English (ASCII) and one other specific one.  Examples of
	software developed using this model are
	Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual
	Emacs), a text editor which can input and output Japanese text files, and
	Hanterm, an X terminal emulator which can display and input
	Korean characters via a few Korean encodings.
	Since every programmer has his/her own mother tongue,
	there are numerous L10N patches and L10N software packages 
	written to satisfy the programmer's own needs.
   </p></item>
 <tag>b. <strong>I18N</strong> (internationalization)-model</tag>
   <item><p>
	This model supports many languages, but only two 
	of them, English (ASCII) and one other, at the same time. 
	One has to specify the 'other' language by the <tt>LANG</tt> 
	environment variable or the like.
	The above I18N-L10N model can be regarded as a part of
	this I18N-model.
	<prgn>gettext</prgn>ization is categorized into the I18N-model.
   </p></item>
 <tag>c. <strong>M17N</strong> (multilingualization)-model</tag>
   <item><p>
	This model supports many languages at the same time.
	For example, Mule (MULtilingual Enhancement to GNU Emacs) 
	can handle a text file which contains multiple languages,
	for example, a paper on the differences between Korean and Chinese
	whose main text is written in Finnish.  GNU Emacs 20 and 
	XEmacs now include Mule.
	Note that the M17N-model can only be applied to character-related
	areas.  For example, it is nonsense to display a message
	like 'file not found' in many languages at the same time.
	Unicode and UTF-8 are technologies which can be used for
	this model.
	<footnote>
	  I recommend not implementing Unicode and UTF-8 directly.
	  Instead, use LOCALE technology, and your software will
	  support not only UTF-8 but also many other encodings 
	  in the world.  If you implement UTF-8 directly,
	  your software can handle UTF-8 only.  Such software
	  is not convenient.
	</footnote>
   </p></item>
</taglist>
</P>

<P>
Generally speaking, the M17N-model is the best and the I18N-model is 
the next best.  The L10N-model is the worst and you should not adopt it 
except in a few fields where the I18N and M17N-models are very difficult, 
like DTP and X terminal emulators.
In other words, text-processing software that can handle
many languages at the same time is 'better' than software that can handle 
only two (English and one other).
</P>

<P>
Now let me classify approaches to supporting non-English languages
from another viewpoint.
<taglist>
 <tag>A. Implementation <em>without</em> Knowledge on Each Language</tag>
   <item><p>
	This approach utilizes standardized methods supplied 
	by the kernel or libraries.  The most important one is 
	<strong>locale</strong> technology, which includes 
	<strong>locale categories</strong>, conversion between 
	<strong>multibyte</strong> and <strong>wide
	characters</strong> (<tt>wchar_t</tt>), and so on.
	Another important technology is <prgn>gettext</prgn>.
	The advantages of this approach are (1) that when the kernel or
	libraries are upgraded, the software will automatically
	support additional languages, (2) that programmers need
	not know each language, and (3) that a user can switch the behavior
	of software with a common method, like the <tt>LANG</tt> variable.
	The disadvantage is that there are categories or fields where
	a standardized method is not available.  For example, there
	are no standardized methods for text typesetting rules such
	as line-breaking and hyphenation.
   </p></item>
 <tag>B. Implementation Using Knowledge on Each Language</tag>
   <item><p>
	This approach directly implements information about 
	each language based on the knowledge of programmers and 
	contributors.  L10N almost always uses this approach.
	The advantage of this approach is that a detailed and strict
	implementation is possible beyond the fields where
	standardized methods are available, such as auto-detection
	of the encodings of text files to be read.  Language-specific
	problems can be perfectly solved (of course, this depends on
	the skill of the programmer).  The disadvantages are
	(1) that the number of supported languages is restricted
	by the skill or the interest of the programmers or the
	contributors, (2) that labor which should be united and
	concentrated on upgrading the kernel or libraries is dispersed
	among many software packages, that is, the wheel is re-invented,
	and (3) that a user has to learn how to configure each piece of software,
	such as the <tt>LESSCHARSET</tt> variable, the <tt>.emacs</tt> file, 
	and so on.
	This approach can cause problems: for example, GNU roff
	(before version 1.16) assumes <tt>0xad</tt> to be a hyphen
	character, which is valid only for ISO-8859-1.
	However, a majestic M17N software package such as Mule can be
	built by strongly pursuing this approach.
   </p></item>
</taglist>
</P>

<P>
Using this classification, let me consider the L10N, I18N, and M17N-models
from the programmer's point of view.
</P>

<P>
The L10N-model can be realized using only the programmer's own knowledge of his/her
language (i.e. approach B).  Since the motivation of L10N is
usually to satisfy the programmer's own need, extensibility to
other languages is often ignored.
Though L10N-ed software is primarily useful for people who 
speak the same language as the programmer, it is sometimes 
useful for other people whose coding system is similar to 
the programmer's.  For example, software which 
doesn't recognize EUC-JP but doesn't break EUC-JP will not 
break EUC-KR either. 
</P>

<P>
The main part of the I18N-model is, in the case of a C program, achieved using
standardized LOCALE technology and <prgn>gettext</prgn>.
The LOCALE approach is classified into I18N because functions
related to LOCALE change their behavior according to the current locales
for six categories, which are set by <tt>setlocale()</tt>.
Namely, approach A is emphasized for I18N.  For fields where
standardized methods are not available, however, approach B
cannot be avoided.  Even in such a case, the developers should
take care that support for new languages can be easily added
later, even by other developers.
</P>
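
<P>
As a minimal sketch of the LOCALE approach described above, the following
C fragment adopts the user's locale with <tt>setlocale()</tt> and lists
the locale in effect for each of the six categories.  The function names
<tt>init_locale</tt> and <tt>print_locale_categories</tt> are
hypothetical names chosen for this illustration, and <tt>LC_MESSAGES</tt>
is assumed to be available (it is provided by POSIX rather than ISO C).
</P>

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Adopt the locale named by the user's environment for all categories.
 * If that locale is not available, fall back to the "C" locale.
 * Returns the name of the locale now in effect. */
const char *init_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, "C");   /* the "C" locale always exists */
    assert(name != NULL);
    return name;
}

/* Print the locale currently selected for each of the six categories. */
void print_locale_categories(void)
{
    static const struct { int id; const char *label; } cats[] = {
        { LC_COLLATE,  "LC_COLLATE"  },
        { LC_CTYPE,    "LC_CTYPE"    },
        { LC_MESSAGES, "LC_MESSAGES" },  /* POSIX, not ISO C */
        { LC_MONETARY, "LC_MONETARY" },
        { LC_NUMERIC,  "LC_NUMERIC"  },
        { LC_TIME,     "LC_TIME"     },
    };
    size_t i;

    for (i = 0; i < sizeof cats / sizeof cats[0]; i++)
        printf("%-12s %s\n", cats[i].label, setlocale(cats[i].id, NULL));
}
```

<P>
A program would typically call <tt>init_locale()</tt> once at startup,
before any locale-dependent function is used.
</P>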

<P>
The M17N-model can be achieved using international encodings such
as ISO 2022 and Unicode.  Though you can hard-code these encodings
in your software (i.e. approach B), I recommend using standardized
LOCALE technology.  However, using international encodings
is not sufficient to achieve the M17N-model.  You will have to prepare
a mechanism to switch <strong>input methods</strong>.  You will also want
to prepare an encoding-guessing mechanism for input files,
such as <prgn>jless</prgn> and <prgn>emacs</prgn> have.
Mule is the best software which has achieved M17N (though it does not
use LOCALE technology).
</P>

<sect id="intro-organization"><heading>Organization</heading>

<P>
Let's preview the contents of each chapter in this document.
</P>

<P>
As I wrote, this document puts stress on the correct handling of 
characters and character codes for users' native
languages.  To this end, I will start the main body
of this document by discussing basic, important concepts of
characters in <ref id="coding">.  Since this chapter introduces
many terms, all of you will need to read this chapter.
The next chapter, <ref id="codes">, introduces many national
and international standards of <em>coded character sets</em>
and <em>encodings</em>.  I think most of you can do without
reading this chapter, since <em>LOCALE</em> technology
enables us to develop international software without knowledge
of these character sets and encodings.  However, knowing
about these standards will help you to
understand the merit and necessity of LOCALE technology.
</P>

<P>
The following chapter, <ref id="languages">,
gives detailed information on 
each language.  This information will help people who develop
high-quality text-processing software such as DTP software and Web browsers.
</P>

<P>
The chapter <ref id="locale"> describes the most important
concept for I18N.  Not only concepts but also many important
C functions are introduced in this chapter.
</P>

<P>
The next few chapters, <ref id="output">, <ref id="input">,
<ref id="internal">, and <ref id="internet">, cover important 
and frequent applications of LOCALE technology.
You can find solutions for typical I18N problems in these
chapters.
</P>

<P>
You may need to develop software using some special libraries
or languages other than C/C++.  The chapters <ref id="library">
and <ref id="otherlanguage"> are written for such purposes.
</P>

<P>
The next chapter, <ref id="examples">, is a collection of case studies.
Both generic and special technologies are discussed.
You can also contribute by writing a section for this chapter.
</P>

<P>
You may want to study more. 
The last chapter, <ref id="reference">, is supplied for this purpose.
Some of the references listed in the chapter are very important.
</P>


<chapt id="coding"><heading>Important Concepts for Character Coding Systems</heading>

<P>
The character coding system is one of the fundamental elements of
software and information processing.  
Without proper handling of character codes, your software is
far from being internationalized.
Thus the author begins this document with a discussion of character
codes.
</P>

<P>
In this chapter, basic concepts such as <em>coded character set</em>
and <em>encoding</em> are introduced.  These terms will be needed
to read this document and other documents on internationalization
and character codes including Unicode.
</P>


<sect id="coding-general-term"><heading>Basic Terminology</heading>

<P>
I begin this chapter by defining a few very important words.
</P>

<P>
As many people point out, there is confusion in terminology, since
words are used in various different ways.  The author does not
want to add new terminology to a confusing ocean of 
various terminologies.  Instead, the terminology of RFC 2130 is
adopted in this document, with the one exception of the term 'character
set'.
</P>

<P>
 <taglist>
  <tag><strong>Character</strong>
    <item><p>
          A character is an individual unit of which sentences and text
          consist.  A character is an abstract notion.
    </p></item>
  <tag><strong>Glyph</strong>
    <item><p>
          A glyph is a specific instance of a character.  <em>Character</em>
	  and <em>glyph</em> are a pair of words.  Sometimes a character
	  has multiple glyphs (for example, '$' may have one or two vertical
	  bars; Arabic characters have four glyphs each;
	  some CJK ideograms have many glyphs).  Sometimes two or more
	  characters form one glyph (for example, the ligature of 'fi').
	  In most cases, text data, which are intended to carry not
	  visual information but abstract ideas, don't have to include
	  information on glyphs, since the difference between glyphs does
	  not affect the meaning of the text.  However, the distinction
	  between different glyphs for a single CJK ideogram may
	  sometimes be important for proper nouns such as names of
	  persons and places.  There is so far no standardized method
	  for plain text to carry information on glyphs.  This
	  means that plain text cannot be used in some special fields
	  such as citizen registration systems, serious DTP such as
	  newspaper systems, and so on.
    </p></item>
  <tag><strong>Encoding</strong>
    <item><p>
          An encoding is a rule by which characters and texts are
	  expressed as combinations of bits or bytes in order to
	  handle characters in computers.  The terms <em>character 
	  coding system</em>, <em>character code</em>, <em>charset</em>, 
	  and so on are used with the same meaning.  
	  Basically, an <em>encoding</em> deals with 
	  <em>characters</em>, not <em>glyphs</em>.
	  There are many official and de-facto standards of encodings
	  such as ASCII, ISO 8859-{1,2,...,15}, 
	  ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2}, 
	  EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620, 
	  VISCII, VSCII, so-called 'CodePages', UTF-7, UTF-8, UTF-16LE, 
	  UTF-16BE, KOI8-R, and so on.
	  To construct an encoding, we have to consider the
	  following concepts.  (Encoding = one or more 
	  CCS + one CES).
    </p></item>
  <tag><strong>Character Set</strong>
    <item><p>
          A character set is a set of characters.  This determines
	  the range of characters which the encoding can handle.
	  In contrast to a <em>coded character set</em>, this is often
	  called a <em>non-coded character set</em>.
    </p></item>
  <tag><strong>Coded Character Set (CCS)</strong>
    <item><p>
          Coded character set (CCS) is a term defined in RFC 2130 
	  and means a character set in which every character
	  is assigned a unique number by some method.  There are many national
	  and international standards for CCS.
	  Many national standards for CCS adopt
	  a way of coding that obeys international
	  standards such as ISO 646 or ISO 2022.  ASCII, BS 4730,
	  JISX 0201 Roman, and so on are examples of ISO-646 variants.  All 
	  ISO-646 variants, ISO 8859-*, JISX 0208, JISX 0212, KSX 1001,
	  GB 2312, CNS 11643, CCCII, TIS 620, TCVN 5712, and so on are 
	  examples of ISO 2022-compliant CCS.  VISCII and Big5 are 
	  examples of non-ISO 2022-compliant 
	  CCS.  UCS-2 and UCS-4 (ISO 10646) are also examples of CCS.
    </p></item>
  <tag><strong>Character Encoding Scheme (CES)</strong>
    <item><p>
          Character Encoding Scheme is also a term defined in RFC 2130.
	  It denotes a method of constructing an encoding from one or 
	  more CCS.  This is important when two or more CCS are used 
	  to construct an encoding.  
	  ISO 2022 is a method of constructing an encoding from
	  one or more ISO 2022-compliant CCS.  ISO 2022 is a very
	  complex system, and subsets of ISO 2022 are usually used,
	  such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII
	  and KSX 1001), and so on.  CES is not important for 
	  encodings with only one 8bit CCS.
	  The UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be
	  regarded as CES whose CCS is Unicode or ISO 10646.
    </p></item>
</taglist>
</P>
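
<P>
The relation 'Encoding = CCS + CES' can be illustrated with a small C
sketch.  It takes one code point of the CCS JISX 0208 and expresses it
under two different CES: EUC-JP, which sets the high bit of both bytes,
and ISO 2022-JP, which surrounds the unchanged bytes with escape
sequences.  The type <tt>struct jisx0208</tt> and the two function names
are hypothetical and exist only for this illustration; real software
should rely on LOCALE technology rather than on such hand-written
converters.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* One JISX 0208 code point: two 7-bit bytes, each in 0x21..0x7e. */
struct jisx0208 {
    unsigned char hi, lo;
};

static int in_graphic_range(unsigned char b)
{
    return b >= 0x21 && b <= 0x7e;
}

/* EUC-JP: the same code point with the high bit set on both bytes.
 * Writes 2 bytes to out and returns the number of bytes written. */
size_t encode_euc_jp(struct jisx0208 c, unsigned char *out)
{
    assert(in_graphic_range(c.hi) && in_graphic_range(c.lo));
    out[0] = (unsigned char)(c.hi | 0x80);
    out[1] = (unsigned char)(c.lo | 0x80);
    return 2;
}

/* ISO 2022-JP: ESC $ B selects JISX 0208, the two bytes follow
 * unchanged, and ESC ( B switches back to ASCII.
 * Writes 8 bytes to out and returns the number of bytes written. */
size_t encode_iso2022_jp(struct jisx0208 c, unsigned char *out)
{
    assert(in_graphic_range(c.hi) && in_graphic_range(c.lo));
    out[0] = 0x1b; out[1] = 0x24; out[2] = 0x42;   /* ESC $ B */
    out[3] = c.hi;
    out[4] = c.lo;
    out[5] = 0x1b; out[6] = 0x28; out[7] = 0x42;   /* ESC ( B */
    return 8;
}
```

<P>
For the Hiragana character 'GA' (<tt>0x24 0x2c</tt> in JISX 0208), the
first function yields <tt>0xa4 0xac</tt>, while the second yields the
original two bytes between the two escape sequences.
</P>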

<P>
Some other terms related to character codes are also commonly used.
</P>

<P>
<strong>Character code</strong> is a widely-used term meaning 
<em>encoding</em>.  This is a primitive and crude term for
the way a computer handles characters by assigning numbers.
For example, <em>character code</em> can mean <em>encoding</em>
and can also mean <em>coded character set</em>.  Thus this term can
be used only when both of them can be regarded as belonging to 
the same category.  This term should be avoided in serious
discussions.  This document will not use this term hereafter.
</P>

<P>
<strong>Codeset</strong> is a term for <em>encoding</em>
or <em>character encoding scheme</em>.
<footnote>
 This document used the term <em>codeset</em> before November 2000
 to mean <em>encoding</em>.  I changed the terminology since
 I could not find the term <em>codeset</em> in documents written
 in English (I adopted this word from a book in Japanese).
 <em>Encoding</em> seems more popular.
</footnote>
</P>

<P>
<strong>charset</strong> is also a commonly used term.
It is used very widely, for example, in MIME (as in 
<tt>Content-Type: text/plain; charset=iso-8859-1</tt>),
in XLFD (X Logical Font Description) font names 
(the CharSetRegistry and CharSetEncoding fields), and so on.
Note that <em>charset</em> in MIME means <em>encoding</em>,
while <em>charset</em> in an XLFD font name means <em>coded character
set</em>.  This is very confusing.  In this document,
<em>charset</em> and <em>character set</em> are used in
the XLFD sense, since I think <em>character set</em> should
mean a set of characters, not an encoding.
</P>

<P>
Ken Lunde's "CJKV Information Processing" uses the term
<strong>encoding method</strong>.  He says that
ISO-2022, EUC, Big5, and Shift-JIS are examples of 
<em>encoding methods</em>.  It seems that his <em>encoding
method</em> is <em>CES</em> in this document.  However,
we should note that Big5 and Shift-JIS are encodings
while ISO-2022 and EUC are not.
<footnote>
During I18N programming, we will frequently meet EUC-JP
or EUC-KR, while we will rarely meet EUC as such.  I think it is
not appropriate to stress EUC, a class of encodings, over
EUC-JP, EUC-KR, and so on, which are concrete encodings.  It is just like
regarding ISO 8859 as a concrete encoding, though ISO 8859 is
the class of encodings ISO 8859-{1,2,...,15}.
</footnote>
</P>

<P>
<url id="http://www.unicode.org/unicode/reports/tr17/"
name="Character Encoding Model, Unicode Technical Report #17">
(hereafter, <em>"the Report"</em>) suggests a five-level model.
<list>
  <item>ACR: abstract character repertoire
  <item>CCS: Coded Character Set
  <item>CEF: Character Encoding Form
  <item>CES: Character Encoding Scheme
  <item>TES: Transfer Encoding Syntax
</list>
</P>

<P>
<strong>TES</strong> is also suggested in RFC 2130.  Some examples of 
TES are <em>base64</em>, <em>uuencode</em>, <em>BinHex</em>, 
<em>quoted-printable</em>, <em>gzip</em>, and so on.
A TES is a transformation of encoded data which may (or may not) include
textual data.  Thus, TES is not a part of character encoding.
However, TES is important in data exchange over the Internet.
</P>

<P>
When using a computer, we rarely have occasion to deal with 
<strong>ACR</strong>.
It is true that CJK people have their national standards of
ACR (for example, standards for ideograms which can be used for
personal names) and that some of us may need to handle these ACR with
computers (for example, in a citizen registration system), but this is too
heavy a theme for this document, because there are no
standardized or recommended methods for handling these ACR.  You may
have to build the whole system for such purposes.  Good luck!
</P>

<P>
<strong>CCS</strong> in <em>"the Report"</em> is the same as what I described 
in this document.
It has concrete examples: ASCII, ISO 8859-{1,2,...,15}, JISX 0201,
JISX 0208, JISX 0212, KSX 1001, KSX 1002, GB 2312, Big5, 
CNS 11643, TIS 620, VISCII, TCVN 5712, UCS-2, UCS-4, and so on.
Some of them are national standards, some are international
standards, and others are de-facto standards.
</P>

<P>
<strong>CEF</strong> and <strong>CES</strong> in <em>"the Report"</em>
correspond to <strong>CES</strong> in this document.
This document does not distinguish these two, since I think this
causes no inconvenience.  An encoding with a significant CEF doesn't
have a significant CES (in the sense of <em>"the Report"</em>), and 
vice versa.  Then why should we have to distinguish these two?
The only exception is the UTF-16 series.  In the UTF-16 series, 
UTF-16 is a CEF and UTF-16BE is a CES.  This is the only case where
both of these levels are needed.
</P>

<P>
Now, <strong>CES</strong> is a concrete concept with concrete
examples: ASCII, ISO 8859-{1,2,...,15}, EUC-JP, EUC-KR, ISO 2022-JP,
ISO 2022-JP-1, ISO 2022-JP-2, ISO 2022-CN, ISO 2022-CN-EXT, 
ISO 2022-KR, ISO 2022, VISCII, UTF-7, UTF-8, UTF-16LE, UTF-16BE, 
and so on.  These are encodings themselves.
</P>

<P>
The most important concept in this section is the distinction between
<em>coded character set</em> and <em>encoding</em>.  A <em>coded
character set</em> is a component of an <em>encoding</em>.  Text data
are expressed in an <em>encoding</em>, not in a <em>coded character set</em>.
</P>


<sect id="stateful"><heading>Stateless and Stateful</heading>

<P>
To construct an encoding from two or more CCS, a CES has to supply 
a method of avoiding collisions between these CCS.
There are two ways to do that.  One is to give all characters
in all the CCS unique code points.  The other is to
allow characters from different CCS to have the same
code point and to use a code such as an escape sequence to switch the
<strong>SHIFT STATE</strong>, that is, to select one character set.
</P>

<P>
An encoding with shift state is called <strong>STATEFUL</strong> and
one without shift state is called <strong>STATELESS</strong>.
</P>

<P>
Examples of stateful encodings are: ISO 2022-JP, ISO 2022-KR,
ISO 2022-INT-1, ISO 2022-INT-2, and so on. 
</P>

<P>
For example, in ISO 2022-JP, the two bytes <tt>0x24 0x2c</tt> may mean
the Japanese Hiragana character 'GA' or the two ASCII characters
'$' and ',', depending on the shift state.
</P>
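
<P>
The effect of the shift state can be sketched in C as follows.  The
function <tt>count_iso2022jp_chars</tt> is a hypothetical helper written
for this illustration only.  It assumes a simplified ISO 2022-JP that
knows just four escape sequences (ESC ( B and ESC ( J for one-byte
sets, ESC $ @ and ESC $ B for two-byte sets) and it counts characters
while tracking the shift state.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in an ISO 2022-JP byte sequence of length len.
 * Only four escape sequences are recognized (a simplification).
 * Returns -1 if the input is truncated or uses an unknown escape. */
long count_iso2022jp_chars(const unsigned char *s, size_t len)
{
    int double_byte = 0;    /* shift state; the initial state is ASCII */
    long count = 0;
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        if (s[i] == 0x1b) {
            /* An escape sequence changes the state but is no character. */
            if (i + 2 >= len)
                return -1;
            if (s[i + 1] == 0x24 && (s[i + 2] == 0x40 || s[i + 2] == 0x42))
                double_byte = 1;            /* ESC $ @  or  ESC $ B */
            else if (s[i + 1] == 0x28 && (s[i + 2] == 0x42 || s[i + 2] == 0x4a))
                double_byte = 0;            /* ESC ( B  or  ESC ( J */
            else
                return -1;
            i += 3;
        } else if (double_byte) {
            if (i + 1 >= len)
                return -1;                  /* truncated two-byte character */
            i += 2;
            count++;
        } else {
            i += 1;
            count++;
        }
    }
    return count;
}
```

<P>
With this sketch, the bare bytes <tt>0x24 0x2c</tt> are counted as two
characters, while the same two bytes placed between ESC $ B and ESC ( B
are counted as one.
</P>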

<sect id="multibyte"><heading>Multibyte encodings</heading>

<P>
Encodings are classified into multibyte ones and the others,
according to the relationship between the number of characters and the number of
bytes in the encoding.
</P>

<P>
In a non-multibyte encoding, one character is always expressed
by one byte.  In a multibyte encoding, on the other hand, one character
may be expressed by one or more bytes.  Note that the number of bytes
per character is not necessarily fixed even within a single encoding.
</P>
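
<P>
Because the number of bytes per character varies, characters should be
counted with the standard multibyte functions and not with
<tt>strlen()</tt>.  The following is a minimal sketch, assuming that the
string is encoded in the encoding of the current <tt>LC_CTYPE</tt>
locale; <tt>count_characters</tt> is a hypothetical name chosen for this
illustration.
</P>

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in the multibyte string s, interpreted in the
 * encoding of the current LC_CTYPE locale.  Returns (size_t)-1 if s
 * contains an invalid or incomplete multibyte sequence. */
size_t count_characters(const char *s)
{
    mbstate_t state;
    size_t bytes_left, n, count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);    /* start in the initial shift state */
    bytes_left = strlen(s) + 1;         /* include the terminating null */

    for (;;) {
        n = mbrtowc(NULL, s, bytes_left, &state);
        if (n == 0)
            break;                      /* reached the terminating null */
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        s += n;                         /* skip the bytes of one character */
        bytes_left -= n;
        count++;
    }
    return count;
}
```

<P>
In the default "C" locale the result equals the number of bytes for an
ASCII string.  After <tt>setlocale(LC_CTYPE, "")</tt> under a locale with
a multibyte encoding, the result can be smaller than the value returned
by <tt>strlen()</tt>.
</P>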

<P>
Examples of multibyte encodings are: EUC-JP, EUC-KR, ISO 2022-JP,
Shift-JIS, Big5, UHC, UTF-8, and so on.  Note that all of UTF-* are
multibyte.
</P>

<P>
Examples of non-multibyte encodings are: ISO 8859-1, ISO 8859-2,
TIS 620, VISCII, and so on.
</P>

<P>
Note that even in a non-multibyte encoding, the number of characters
and the number of bytes may differ if the encoding is stateful.
</P>

<P>
Ken Lunde's "CJKV Information Processing"
<footnote>
ISBN 1-56592-224-7, O'Reilly, 1999
</footnote>
classifies encoding methods
into the following three categories:
<list>
  <item>modal
  <item>non-modal
  <item>fixed-length
</list>
<em>Modal</em> corresponds to <em>stateful</em> in this document.
The other two are <em>stateless</em>, where <em>non-modal</em> is 
<em>multibyte</em> and <em>fixed-length</em> is
<em>non-multibyte</em>.  However, I think <em>stateful</em> - 
<em>stateless</em> and <em>multibyte</em> - <em>non-multibyte</em>
are independent concepts.
<footnote>
There are, however, no existing encodings which are stateful and
non-multibyte.
</footnote>
</P>

<sect id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading>

<P>
One ASCII character is always expressed by one byte
and occupies one column on the console or in X terminal emulators
(with a fixed-width font for X).
One must not make such an assumption in I18N programming
and has to clearly distinguish the numbers of bytes, characters,
and columns.
</P>

<P>
As for the relationship between characters and bytes,
in multibyte encodings, two or more bytes may be needed
to express one character.  In stateful encodings, escape
sequences do not correspond to any character.
</P>

<P>
The number of columns is not defined in any standard.  However,
CJK ideograms, Japanese Hiragana and Katakana,
and Korean Hangul usually occupy two columns on the console or in X terminal emulators.
Note that 'Full-width forms' in the UCS-2 and UCS-4 coded character sets
occupy two columns and 'Half-width forms' occupy one column.
Combining characters used for Thai and so on can be regarded as
zero-column characters.  Though there are no standards, you can
use <tt>wcwidth()</tt> and <tt>wcswidth()</tt> to obtain the number of columns.
See <ref id="output-console-column"> for details.
</P>
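
<P>
As a rough sketch of how the number of columns might be obtained, the
following C function converts a multibyte string into wide characters
and passes it to <tt>wcswidth()</tt>.  The name
<tt>display_columns</tt> is hypothetical, and the sketch assumes a
system on which <tt>wcswidth()</tt> is declared when
<tt>_XOPEN_SOURCE</tt> is defined and on which <tt>mbstowcs()</tt>
accepts a null destination to report the required length.
</P>

```c
#define _XOPEN_SOURCE 700   /* assumption: needed to declare wcswidth() */
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Return the number of columns occupied by the multibyte string s in
 * the current locale, or -1 if s cannot be converted, memory runs out,
 * or s contains a non-printable character. */
int display_columns(const char *s)
{
    size_t n;
    wchar_t *ws;
    int width;

    assert(s != NULL);
    n = mbstowcs(NULL, s, 0);       /* number of wide characters needed */
    if (n == (size_t)-1)
        return -1;

    ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return -1;

    mbstowcs(ws, s, n + 1);         /* convert, including the terminator */
    width = wcswidth(ws, n);        /* -1 for non-printable characters */
    free(ws);
    return width;
}
```

<P>
For an ASCII string in the "C" locale the result equals the number of
bytes.  In a CJK locale the result is expected to be larger than the
number of characters.
</P>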






<chapt id="codes"><heading>Coded Character Sets And Encodings in the World</heading>

<P>
This chapter introduces major coded character sets and encodings.
Note that you don't have to know the details of these
character codes if you use LOCALE and <tt>wchar_t</tt> technology.
</P>

<P>
However, this knowledge will help you to understand why the numbers
of bytes, characters, and columns should be counted separately,
why <tt>strchr()</tt> and so on should not be used, why you should
use LOCALE and <tt>wchar_t</tt> technology instead of hard-coded
processing of existing character codes, and so on.
</P>

<P>
The variety of character sets and encodings will tell you about
the struggles of people around the world to handle their own languages on
computers.  CJK people in particular had no choice but to work out various
technologies to use a large number of characters within ASCII-based computer
systems.
</P>

<P>
If you are planning to develop text-processing software
beyond the fields which the LOCALE technology covers, you will
have to understand the following descriptions very well.
These fields include automatic detection of the encoding
used for an input file (most Japanese-capable text viewers
such as <prgn>jless</prgn> and <prgn>lv</prgn> have this mechanism)
and so on.
</P>


<sect id="ascii"><heading>ASCII and ISO 646</heading>

<P>
<strong>ASCII</strong> is both a CCS and an encoding.
ASCII is a 7bit code and contains 94 printable characters, which are
encoded in the region of <tt>0x21</tt>-<tt>0x7e</tt>.
</P>

<P>
<strong>ISO 646</strong> is the international standard version of ASCII.
The following 12 characters,
<list>
   <item>0x23 (number),
   <item>0x24 (dollar),
   <item>0x40 (at),
   <item>0x5b (left square bracket),
   <item>0x5c (backslash),
   <item>0x5d (right square bracket),
   <item>0x5e (caret),
   <item>0x60 (backquote),
   <item>0x7b (left curly brace),
   <item>0x7c (vertical line),
   <item>0x7d (right curly brace), and
   <item>0x7e (tilde)
</list>
are called <strong>IRV</strong> (International Reference Version)
and the other 82 (94 - 12 = 82) characters are called
<strong>BCT</strong> (Basic Code Table).
Characters at the IRV positions can differ between countries.
Here are a few examples of versions of ISO 646.
<list>
  <item>UK version (BS 4730): 0x23 is pound currency mark, and so on.
  <item>US version (ASCII)
  <item>Japanese version (JISX 0201 Roman): 0x5c is yen currency mark, and
        so on.
  <item>Italian version (UNI 0204-70): 0x7b is 'a' with grave accent, and
        so on.
  <item>French version (NF Z 62-010): 0x7b is 'e' with acute accent, and
        so on.
</list>
</P>
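<P>
As a minimal sketch, the 12 positions listed above can be put into a
small table to test whether a byte, or a whole string, reads the same
in every national version of ISO 646.  The function names
<tt>iso646_is_variant_position()</tt> and
<tt>iso646_is_invariant_string()</tt> are made up for this illustration.
</P>

```c
#include <stdbool.h>
#include <stddef.h>

/* The 12 code positions listed above whose characters may differ
   between national versions of ISO 646. */
static const unsigned char iso646_variant_positions[] = {
    0x23, 0x24, 0x40, 0x5b, 0x5c, 0x5d,
    0x5e, 0x60, 0x7b, 0x7c, 0x7d, 0x7e
};

/* Returns true if byte c may show a different character depending on
   which national version of ISO 646 is in use.
   (Hypothetical helper name, for illustration only.) */
bool iso646_is_variant_position(unsigned char c)
{
    size_t i;

    for (i = 0; i < sizeof iso646_variant_positions; i++) {
        if (iso646_variant_positions[i] == c)
            return true;
    }
    return false;
}

/* Returns true if no byte of the NUL-terminated string s is one of
   the 12 variable positions.  Bytes above 0x7f are not examined,
   since they are outside ISO 646. */
bool iso646_is_invariant_string(const char *s)
{
    for (; *s != '\0'; s++) {
        if (iso646_is_variant_position((unsigned char)*s))
            return false;
    }
    return true;
}
```

<P>
For example, a C expression such as <tt>a[0]</tt> contains 0x5b and 0x5d
and therefore would look different on a terminal which uses another
national version.
</P>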

<P>
As far as I know, all encodings (besides EBCDIC) in the world 
are compatible with ISO 646.
</P>

<P>
Characters in 0x00 - 0x1f, 0x20 (space), and 0x7f (delete) are control characters.
</P>

<P>
Nowadays the use of encodings incompatible with ASCII is not
encouraged, and thus ISO 646-* (other than the US version) should not
be used.  One reason is that when a string is converted into
Unicode, the converter cannot know whether the IRV characters should be
converted into characters with the same shapes or into characters with
the same codes.  Another reason is that source code
is written in ASCII, and source code must be readable anywhere.
</P>


<sect id="iso8859"><heading>ISO 8859</heading>

<P>
<strong>ISO 8859</strong> is both a series of CCS and a series of
encodings.  It is an extension of ASCII that uses all 8 bits.
An additional 96 printable characters, encoded in 0xa0 - 0xff, are
available besides the 94 ASCII printable characters.
</P>

<P>
There are 10 variants of ISO 8859 (in 1997).
<taglist>
 <tag>ISO-8859-1  Latin alphabet No.1 (1987)</tag>
      <item>characters for western European languages
 <tag>ISO-8859-2  Latin alphabet No.2 (1987)</tag>
      <item>characters for central European languages
 <tag>ISO-8859-3  Latin alphabet No.3 (1988)</tag>
 <tag>ISO-8859-4  Latin alphabet No.4 (1988)</tag>
      <item>characters for northern European languages
 <tag>ISO-8859-5  Latin/Cyrillic alphabet (1988)</tag>
 <tag>ISO-8859-6  Latin/Arabic alphabet (1987)</tag>
 <tag>ISO-8859-7  Latin/Greek alphabet (1987)</tag>
 <tag>ISO-8859-8  Latin/Hebrew alphabet (1988)</tag>
 <tag>ISO-8859-9  Latin alphabet No.5 (1989)</tag>
      <item>same as ISO-8859-1 except for Turkish instead of Icelandic
 <tag>ISO-8859-10 Latin alphabet No.6 (1993)</tag>
      <item>Adds Inuit (Greenlandic) and Sami (Lappish) letters to ISO-8859-4
</taglist>
</P>

<P>
A detailed explanation is found at
<url id="http://park.kiev.ua/mutliling/ml-docs/iso-8859.html">.
</P>


<sect id="iso-2022"><heading>ISO 2022</heading>

<P>
Using ASCII and ISO 646, we can use 94 characters at most.
Using ISO 8859, the number increases to 190 (= 94 + 96).
However, we may want to use many more characters.
Or, we may want to use several of these character sets, not just one.
One answer is ISO 2022.
</P>

<P>
<strong>ISO 2022</strong> is an international standard of CES.
ISO 2022 defines a few requirements that a CCS must satisfy to be a
member of ISO 2022-based encodings.  It also defines very
extensive (and complex) rules for combining these CCS into one
encoding.  Many encodings such as EUC-*, ISO 2022-*,
compound text,
<footnote>
 Compound text is a standard for text exchange between X clients.
</footnote>
and so on can be regarded as subsets of ISO 2022.
ISO 2022 is so complex that you may not be able to understand all of it.
That is OK; what is important here is the concept of ISO 2022:
building an encoding by switching between various (ISO 2022-compliant)
coded character sets.
</P>

<P>
The sixth edition of ECMA-35 is fully identical with
ISO 2022:1994 and you can find the official document
at <url id="http://www.ecma.ch/stand/ECMA-035.HTM">.
</P>

<P>
ISO 2022 has two versions, 7bit and 8bit.  The 8bit version
is explained first.  The 7bit version is a subset
of the 8bit version.
</P>

<P>
The 8bit code space is divided into four regions,
<list>
 <item>0x00 - 0x1f: C0 (Control Characters 0),
 <item>0x20 - 0x7f: GL (Graphic Characters Left),
 <item>0x80 - 0x9f: C1 (Control Characters 1), and
 <item>0xa0 - 0xff: GR (Graphic Characters Right).
</list>
</P>
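<P>
The division of the 8bit code space given above can be written down
directly.  The following sketch classifies one byte into its region;
the names <tt>enum iso2022_region</tt> and
<tt>iso2022_region_of()</tt> are hypothetical and used only here.
</P>

```c
/* The four regions of the 8bit code space in ISO 2022. */
enum iso2022_region {
    REGION_C0,  /* 0x00 - 0x1f: control characters 0      */
    REGION_GL,  /* 0x20 - 0x7f: graphic characters, left  */
    REGION_C1,  /* 0x80 - 0x9f: control characters 1      */
    REGION_GR   /* 0xa0 - 0xff: graphic characters, right */
};

/* Returns the region to which the given byte belongs.
   (Hypothetical helper name, for illustration only.) */
enum iso2022_region iso2022_region_of(unsigned char byte)
{
    if (byte <= 0x1f)
        return REGION_C0;
    if (byte <= 0x7f)
        return REGION_GL;
    if (byte <= 0x9f)
        return REGION_C1;
    return REGION_GR;
}
```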

<P>
GL and GR are the regions into which (printable) character sets are mapped.
</P>

<P>
Next, all character sets, for example, ASCII, ISO 646-UK,
and JIS X 0208, are classified into the following four categories:
<list>
 <item>(1) character set with 1-byte 94-character,
 <item>(2) character set with 1-byte 96-character,
 <item>(3) character set with multibyte 94-character, and
 <item>(4) character set with multibyte 96-character.
</list>
</P>

<P>
Characters in 94-character sets are mapped
into 0x21 - 0x7e.  Characters in 96-character sets are
mapped into 0x20 - 0x7f.
</P>

<P>
For example, ASCII, ISO 646-UK, and JISX 0201 Katakana
are classified into (1); ISO 8859-* are classified into (2);
and JISX 0208 Japanese Kanji, KSX 1001 Korean, and
GB 2312-80 Chinese are classified into (3).
</P>

<P>
The mechanism to map these character sets into GL and GR is
a bit complex.  There are four buffers, G0, G1, G2, and G3.  
A character set is <strong>designated</strong> into one of these buffers 
and then a buffer is <strong>invoked</strong> into GL or GR.
</P>

<P>
The control sequences that 'designate' a character set into a
buffer are defined as follows.
</P>

<P>
<list>
 <item>A sequence to designate a character set with 1-byte 94-character
    <list>
     <item>into G0 set is: ESC 0x28 F,
     <item>into G1 set is: ESC 0x29 F,
     <item>into G2 set is: ESC 0x2a F, and
     <item>into G3 set is: ESC 0x2b F.
    </list>
 <item>A sequence to designate a character set with 1-byte 96-character
    <list>
     <item>into G1 set is: ESC 0x2d F,
     <item>into G2 set is: ESC 0x2e F, and
     <item>into G3 set is: ESC 0x2f F.
    </list>
 <item>A sequence to designate a character set with multibyte 94-character
    <list>
     <item>into G0 set is: ESC 0x24 0x28 F
       (exception: 'ESC 0x24 F' for F = 0x40, 0x41, 0x42.),
     <item>into G1 set is: ESC 0x24 0x29 F,
     <item>into G2 set is: ESC 0x24 0x2a F, and
     <item>into G3 set is: ESC 0x24 0x2b F.
    </list>
 <item>A sequence to designate a character set with multibyte 96-character
    <list>
     <item>into G1 set is: ESC 0x24 0x2d F,
     <item>into G2 set is: ESC 0x24 0x2e F, and
     <item>into G3 set is: ESC 0x24 0x2f F.
    </list>
</list>
where 'F' is determined for each character set:
<list>
 <item>character set with 1-byte 94-character
    <list>
     <item>F=0x40 for ISO 646 IRV: 1983
     <item>F=0x41 for BS 4730 (UK)
     <item>F=0x42 for ANSI X3.4-1968 (ASCII)
     <item>F=0x43 for NATS Primary Set for Finland and Sweden
     <item>F=0x49 for JIS X 0201 Katakana
     <item>F=0x4a for JIS X 0201 Roman (Latin)
     <item>and more
    </list>
 <item>character set with 1-byte 96-character
    <list>
     <item>F=0x41 for ISO 8859-1 Latin-1
     <item>F=0x42 for ISO 8859-2 Latin-2
     <item>F=0x43 for ISO 8859-3 Latin-3
     <item>F=0x44 for ISO 8859-4 Latin-4
     <item>F=0x46 for ISO 8859-7 Latin/Greek
     <item>F=0x47 for ISO 8859-6 Latin/Arabic
     <item>F=0x48 for ISO 8859-8 Latin/Hebrew
     <item>F=0x4c for ISO 8859-5 Latin/Cyrillic
     <item>and more
    </list>
 <item>character set with multibyte 94-character
    <list>
     <item>F=0x40 for JISX 0208-1978 Japanese
     <item>F=0x41 for GB 2312-80 Chinese
     <item>F=0x42 for JISX 0208-1983 Japanese
     <item>F=0x43 for KSC 5601 Korean
     <item>F=0x44 for JISX 0212-1990 Japanese
     <item>F=0x45 for CCITT Extended GB (ISO-IR-165)
     <item>F=0x47 for CNS 11643-1992 Set 1 (Taiwan)
     <item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan)
     <item>F=0x49 for CNS 11643-1992 Set 3 (Taiwan)
     <item>F=0x4a for CNS 11643-1992 Set 4 (Taiwan)
     <item>F=0x4b for CNS 11643-1992 Set 5 (Taiwan)
     <item>F=0x4c for CNS 11643-1992 Set 6 (Taiwan)
     <item>F=0x4d for CNS 11643-1992 Set 7 (Taiwan)
     <item>and more
    </list>
</list>
The complete list of these coded character sets is found at
<url id="http://www.itscj.ipsj.or.jp/ISO-IR/"
 name="International Register of Coded Character Sets">.
</P>
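<P>
The designation rules above are regular enough to be generated by a
small function.  The following is only a sketch: the names
<tt>enum cs_kind</tt> and <tt>iso2022_designate()</tt> are invented for
this example, and ESC is assumed to be the usual control code 0x1b.
</P>

```c
#include <stddef.h>

/* The four categories of character sets. */
enum cs_kind {
    CS_94,        /* 1-byte 94-character set    */
    CS_96,        /* 1-byte 96-character set    */
    CS_94_MULTI,  /* multibyte 94-character set */
    CS_96_MULTI   /* multibyte 96-character set */
};

/* Writes to out[] the escape sequence which designates the character
   set (kind, final byte f) into buffer G0..G3 (g = 0..3).  out[] must
   have room for 4 bytes.  Returns the length of the sequence, or 0 if
   the combination is not in the list above (a 96-character set cannot
   be designated into G0).
   (Hypothetical helper name, for illustration only.) */
size_t iso2022_designate(enum cs_kind kind, int g, unsigned char f,
                         unsigned char out[4])
{
    static const unsigned char inter94[4] = { 0x28, 0x29, 0x2a, 0x2b };
    static const unsigned char inter96[4] = { 0x00, 0x2d, 0x2e, 0x2f };
    int is96 = (kind == CS_96 || kind == CS_96_MULTI);
    int multi = (kind == CS_94_MULTI || kind == CS_96_MULTI);
    size_t n = 0;

    if (g < 0 || g > 3 || (is96 && g == 0))
        return 0;

    out[n++] = 0x1b;                 /* ESC */
    if (multi)
        out[n++] = 0x24;
    /* Exception: the short form 'ESC 0x24 F' is used for F = 0x40,
       0x41, 0x42 when a multibyte 94-character set goes into G0. */
    if (!(kind == CS_94_MULTI && g == 0 && f >= 0x40 && f <= 0x42))
        out[n++] = is96 ? inter96[g] : inter94[g];
    out[n++] = f;
    return n;
}
```

<P>
For example, designating ASCII (F=0x42) into G0 gives
'ESC 0x28 0x42', and designating JISX 0208-1983 (F=0x42) into G0 gives
the short form 'ESC 0x24 0x42'.
</P>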

<P>
The control codes that 'invoke' one of G{0123} into GL or GR
are defined as follows.
<list>
 <item>A control code to invoke G0 into GL is: SI (Shift In), also
       called LS0 (Locking Shift 0)
 <item>A control code to invoke G1 into GL is: SO (Shift Out), also
       called LS1 (Locking Shift 1)
 <item>A control code to invoke G2 into GL is: LS2 (Locking Shift 2)
 <item>A control code to invoke G3 into GL is: LS3 (Locking Shift 3)
 <item>A control code to invoke one character 
                         in G2 into GL is: SS2 (Single Shift 2)
 <item>A control code to invoke one character 
                         in G3 into GL is: SS3 (Single Shift 3)
 <item>A control code to invoke G1 into GR is: LS1R (Locking Shift 1 Right)
 <item>A control code to invoke G2 into GR is: LS2R (Locking Shift 2 Right)
 <item>A control code to invoke G3 into GR is: LS3R (Locking Shift 3 Right)
</list>
<footnote>
The values of these control codes are: SI (LS0) is 0x0f, SO (LS1) is 0x0e,
LS2 is ESC 0x6e, LS3 is ESC 0x6f, LS1R is ESC 0x7e, LS2R is ESC 0x7d,
and LS3R is ESC 0x7c.  SS2 is 0x8e and SS3 is 0x8f in the 8bit version
(ESC 0x4e and ESC 0x4f in the 7bit version).
</footnote>
</P>

<P>
Note that a code in a character set invoked into GR is
or-ed with 0x80.
</P>

<P>
ISO 2022 also defines <strong>announcer</strong> codes.  For example,
'ESC 0x20 0x41' means 'Only the G0 buffer is used.  G0 is already
invoked into GL'.  This simplifies the coding system.  Even this
announcer can be omitted if the people who exchange data agree.
</P>

<P>
7bit version of ISO 2022 is a subset of 8bit version.  It does not
use C1 and GR.
</P>

<P>
Explanation on C0 and C1 is omitted here.
</P>



<sect1 id="euc"><heading>EUC (Extended Unix Code)</heading>

<P>
<strong>EUC</strong> is a CES which is a subset of the 8bit version
of ISO 2022, except for its usage of the SS2 and SS3 codes.  In ISO 2022
these codes invoke G2 and G3 into GL, but in EUC they invoke
G2 and G3 into GR.
<strong>EUC-JP</strong>, <strong>EUC-KR</strong>, <strong>EUC-CN</strong>,
and <strong>EUC-TW</strong> are widely used encodings
which use EUC as their CES.
</P>

<P>
EUC is stateless.
</P>

<P>
EUC can contain 4 CCS by using G0, G1, G2, and G3.
Though there is no requirement that ASCII is designated to G0,
I don't know any EUC codeset in which ASCII is not designated to G0.
</P>

<P>
For EUC with ASCII in G0, all codes other than ASCII are encoded
in 0x80 - 0xff, and the encoding is upward compatible with ASCII.
</P>

<P>
Expressions for characters in G0, G1, G2, and G3 character sets
are described below in binary:
<list>
 <item>G0: 0???????
 <item>G1: 1??????? [1??????? [...]]
 <item>G2: SS2 1??????? [1??????? [...]]
 <item>G3: SS3 1??????? [1??????? [...]]
</list>
where SS2 is 0x8e and SS3 is 0x8f.
</P>
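<P>
Because EUC is stateless, the length of a character can be decided
from its first byte alone.  The following sketch does this for EUC-JP
and counts characters separately from bytes.  It assumes the usual
EUC-JP assignment (G1 is the 2-byte JISX 0208, G2 is the 1-byte
JISX 0201 Kana, G3 is the 2-byte JISX 0212); the function names
are hypothetical.
</P>

```c
#include <stddef.h>

#define EUC_SS2 0x8e  /* single shift 2: the next character is from G2 */
#define EUC_SS3 0x8f  /* single shift 3: the next character is from G3 */

/* Returns the number of bytes of the EUC-JP character which starts at
   s (n bytes are available), or 0 if the bytes do not form a complete
   character.  Assumption: G1 and G3 are 2-byte sets and G2 is a
   1-byte set, as in EUC-JP.  The check of trailing bytes is loose:
   any byte in 0xa1 - 0xfe is accepted.
   (Hypothetical helper name, for illustration only.) */
size_t eucjp_char_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                  /* G0: ASCII              */
    if (s[0] == EUC_SS2)
        len = 2;                   /* G2: SS2 + 1 byte       */
    else if (s[0] == EUC_SS3)
        len = 3;                   /* G3: SS3 + 2 bytes      */
    else if (s[0] >= 0xa1 && s[0] <= 0xfe)
        len = 2;                   /* G1: 2 bytes            */
    else
        return 0;                  /* not a valid first byte */
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if (s[i] < 0xa1 || s[i] > 0xfe)
            return 0;
    }
    return len;
}

/* Returns the number of characters in the n bytes at s,
   or -1 if an invalid or incomplete sequence is found. */
long eucjp_count_chars(const unsigned char *s, size_t n)
{
    long count = 0;

    while (n > 0) {
        size_t len = eucjp_char_len(s, n);
        if (len == 0)
            return -1;
        s += len;
        n -= len;
        count++;
    }
    return count;
}
```

<P>
Real programs should leave this work to the LOCALE technology
(for example <tt>mblen()</tt>); the sketch only shows why the number
of bytes and the number of characters differ.
</P>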



<sect1 id="iso2022set"><heading>ISO 2022-compliant Character Sets</heading>

<P>
There are many national and international standards of coded
character sets (CCS).  Some of them are ISO 2022-compliant
and can be used in ISO 2022 encoding.
</P>

<P>
Each ISO 2022-compliant CCS falls into one of the following classes:
<list>
  <item>94 characters
  <item>96 characters
  <item>94x94x94x... characters
</list>
</P>

<P>
The most famous 94 character set is US-ASCII.  Also, all
ISO 646 variants are ISO 2022-compliant 94 character sets.
</P>

<P>
All ISO 8859-* character sets are ISO 2022-compliant
96 character sets.
</P>

<P>
There are many 94x94 character sets.  All of them are related to
CJK ideograms.
<taglist>
  <tag><strong>JISX 0208</strong> (aka JIS C 6226)
  <item><p>National standard of Japan.  1978 version contains 6802 characters
        including Kanji (ideogram), Hiragana, Katakana, Latin, Greek,
	Cyrillic, numeric, and other symbols.  The current (1997) version
	contains 7102 characters.</p>
  <tag><strong>JISX 0212</strong>
  <item><p>National standard of Japan.  6067 characters (almost all of them
        are Kanji).  This character set is intended to be used in
        addition to JISX 0208.</p>
  <tag><strong>JISX 0213</strong>
  <item><p>National standard of Japan.  Released recently.
        Intended to be used in addition to JISX 0208.  Shares many
        characters with JISX 0212.</p>
  <tag><strong>KSX 1001</strong> (aka KSC 5601)
  <item><p>National standard of South Korea.  8224 characters including
        2350 Hangul, Hanja (ideogram), Hiragana, Katakana, Latin,
        Greek, Cyrillic, and other symbols.  Hanja are ordered by
        reading, and Hanja with multiple readings are coded multiple times.</p>
  <tag><strong>KSX 1002</strong>
  <item><p>National standard of South Korea.  7659 characters including
        Hangul and Hanja.  Intended to be used in addition to KSX 1001.</p>
  <tag><strong>KPS 9566</strong>
  <item><p>National standard of North Korea. Similar to KSX 1001.</p>
  <tag><strong>GB 2312</strong>
  <item><p>National standard of China.  7445 characters including
        6763 Hanzi (ideogram), Latin, Greek, Cyrillic, Hiragana,
	Katakana, and other symbols.</p>
  <tag><strong>GB 7589</strong> (aka GB2)
  <item><p>National standard of China.  7237 Hanzi.  Intended to be
        used in addition to GB 2312.</p>
  <tag><strong>GB 7590</strong> (aka GB4)
  <item><p>National standard of China.  7039 Hanzi.  Intended to be
        used in addition to GB 2312 and GB 7589.</p>
  <tag><strong>GB 12345</strong> (aka GB/T 12345, GB1 or GBF)
  <item><p>National standard of China.  7583 characters.  Traditional
        character version corresponding to the simplified characters
        of GB 2312.</p>
  <tag><strong>GB 13131</strong> (aka GB3)
  <item><p>National standard of China.  Traditional
        character version corresponding to the simplified characters
        of GB 7589.</p>
  <tag><strong>GB 13132</strong> (aka GB5)
  <item><p>National standard of China.  Traditional
        character version corresponding to the simplified characters
        of GB 7590.</p>
  <tag><strong>CNS 11643</strong>
  <item><p>National standard of Taiwan.  Has 7 planes.  Planes 1 and
        2 include all characters included in Big5.  Plane 1 includes
        6085 characters including Hanzi (ideogram), Latin, Greek,
        and other symbols.  Plane 2 includes 7650 characters.  The number
        of characters is 6184 for plane 3, 7298 for plane 4, 8603 for
        plane 5, 6388 for plane 6, and 6539 for plane 7.</p>
</taglist>
</P>

<P>
There is also a 94x94x94 character set, <strong>CCCII</strong>.
It is a national standard of Taiwan.  It currently includes 73400
characters.  (The number is increasing.)
</P>

<P>
Non-ISO 2022-compliant character sets are introduced later in
<ref id="othercodes">.
</P>

<sect1 id="iso2022enc"><heading>ISO 2022-compliant Encodings</heading>

<p>
There are many ISO 2022-compliant encodings which are subsets
of ISO 2022.
</p>

<P>
<taglist>
  <tag><strong>Compound Text</strong>
  <item><p>
        This is used by X clients to communicate with each other,
        for example, for copy and paste.
	</P>
  <tag><strong>EUC-JP</strong>
  <item><p>An EUC encoding with the ASCII, JISX 0208, JISX 0201 Kana,
        and JISX 0212 coded character sets.  There are many systems
        which do not support JISX 0201 Kana and JISX 0212.
        Widely used in Japan on POSIX systems.
        </p>
  <tag><strong>EUC-KR</strong>
  <item><p>An EUC encoding with ASCII and KSX 1001.
        </p>
  <tag><strong>CN-GB</strong> (aka EUC-CN)
  <item><p>An EUC encoding with ASCII and GB 2312.
        The most popular encoding in P. R. China.  This encoding
        is sometimes referred to simply as 'GB'.
        </p>
  <tag><strong>EUC-TW</strong>
  <item><p>An extended EUC encoding with ASCII, CNS 11643 plane 1,
        and the other planes (2-7) of CNS 11643.
        </p>
  <tag><strong>ISO 2022-JP</strong>
  <item><p>Described in RFC 1468.
        </p>
	<P>***** Not written yet *****</P>
  <tag><strong>ISO 2022-JP-1</strong> (upward compatible with ISO 2022-JP)
  <item><p>Described in RFC 2237.
        </p>
	<P>***** Not written yet *****</P>
  <tag><strong>ISO 2022-JP-2</strong> (upward compatible with ISO 2022-JP-1)
  <item><p>Described in RFC 1554.
        </p>
	<P>***** Not written yet *****</P>
  <tag><strong>ISO 2022-KR</strong>
  <item><p>aka Wansung. Described in RFC 1557.
        </p>
	<P>***** Not written yet *****</P>
  <tag><strong>ISO 2022-CN</strong>
  <item><p>Described in RFC 1922.
        </p>
	<P>***** Not written yet *****</P>
  <tag><strong>ISO 2022-CN-EXT</strong> (upward compatible with ISO 2022-CN)
  <item><p>
        </p>
</taglist>
</P>

<P>
Non-ISO 2022-compliant encodings are introduced later in
<ref id="othercodes">.
</P>

<sect id="unicodes"><heading>ISO 10646 and Unicode</heading>

<P>
ISO 10646 and Unicode are another standard, intended to let us
develop international software easily.  The special features
of this new standard are:
<list>
  <item>A single unified CCS which intends to include all characters
        in the world.  (ISO 2022 consists of multiple CCS.)
  <item>The character set intends to cover all conventional
        (or <em>legacy</em>) CCS in the world.
        <footnote>
         This is obviously not true for CNS 11643, because
         CNS 11643 contains 48711 characters while Unicode 3.0.1
         contains 49194 characters, only 483 more than CNS 11643.
        </footnote>
  <item>Compatibility with ASCII and ISO 8859-1 is considered.
  <item>Chinese, Japanese, and Korean ideograms are unified.
        This comes from a limitation of Unicode.
        This is not a merit.
</list>
</P>

<P>
ISO 10646 is an official international standard.  Unicode is
developed by the
<url id="http://www.unicode.org" name="Unicode Consortium">.
The two are almost identical.  Indeed, they are exactly
identical at the code points which are available in both standards.
Unicode is updated from time to time; the newest version is 3.0.1.
</P>

<sect1 id="unicodes-ccs"><heading>UCS as a Coded Character Set</heading>

<P>
ISO 10646 defines two CCS (coded character sets), <strong>UCS-2</strong>
and <strong>UCS-4</strong>.  UCS-2 is a subset of UCS-4.
</P>

<P>
UCS-4 is a 31bit CCS.  These 31 bits are divided into 7, 8, 8, and 8 bits,
and each part has its own name.
<list>
  <item>The top 7 bits are called <strong>Group</strong>.
  <item>The next 8 bits are called <strong>Plane</strong>.
  <item>The next 8 bits are called <strong>Row</strong>.
  <item>The lowest 8 bits are called <strong>Cell</strong>.
</list>
The first plane (Group = 0, Plane = 0) is called <strong>BMP</strong>
(Basic Multilingual Plane) and UCS-2 is the same as BMP.
Thus, UCS-2 is a 16bit CCS.
</P>
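<P>
The division of a UCS-4 code point into group, plane, row, and cell
is plain bit manipulation.  A minimal sketch follows; the names
<tt>struct ucs4_parts</tt>, <tt>ucs4_split()</tt>, and
<tt>ucs4_is_bmp()</tt> are invented for this example.
</P>

```c
#include <stdint.h>

/* The four parts of a 31bit UCS-4 code point. */
struct ucs4_parts {
    unsigned group;  /* top 7 bits    */
    unsigned plane;  /* next 8 bits   */
    unsigned row;    /* next 8 bits   */
    unsigned cell;   /* lowest 8 bits */
};

/* Splits the code point c into its four parts.
   (Hypothetical helper name, for illustration only.) */
struct ucs4_parts ucs4_split(uint32_t c)
{
    struct ucs4_parts p;

    p.group = (c >> 24) & 0x7f;
    p.plane = (c >> 16) & 0xff;
    p.row   = (c >> 8) & 0xff;
    p.cell  = c & 0xff;
    return p;
}

/* Returns nonzero if c is in the BMP (Group = 0, Plane = 0),
   that is, if c is also a UCS-2 code point. */
int ucs4_is_bmp(uint32_t c)
{
    return c <= 0xffff;
}
```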

<P>
Code points in UCS are often expressed as <strong>u+<tt>????</tt></strong>, 
where <tt>????</tt> is hexadecimal expression of the code point.
</P>

<P>
Characters in the range u+0021 - u+007e are the same as in ASCII and
characters in the range u+00a0 - u+00ff are the same as in ISO 8859-1.
Thus it is very easy to convert between ASCII or ISO 8859-1 and UCS.
</P>

<P>
Unicode (version 3.0.1) uses a subset of UCS-4, about 20 bits wide, as a CCS.
<footnote>
 Exactly speaking, u+000000 - u+10ffff, that is, the BMP plus the
 20bit space that surrogate pairs can address.
</footnote>
</P>

<P>
The unique feature of these CCS compared with other CCS is their
<em>open repertoire</em>.  They continue to be developed even after
they are released.  Characters will be added in the future.
However, characters that are already coded will not be changed.
Unicode version 3.0.1 includes 49194 distinct coded characters.
</P>

<sect1 id="unicode-ces"><heading>UTF as Character Encoding Schemes</heading>

<P>
A few CES are used to construct encodings which use UCS as
a CCS.  They are <strong>UTF-7</strong>, <strong>UTF-8</strong>, 
<strong>UTF-16</strong>, <strong>UTF-16LE</strong>, and 
<strong>UTF-16BE</strong>.  UTF means Unicode (or UCS)
Transformation Format.
Since these CES always take UCS as the only CCS, they are also
names for encodings.
<footnote>
 Compare UTF and EUC.  There are a few variants of EUC whose CCS
 are different (EUC-JP, EUC-KR, and so on).  This is why we cannot
 call EUC an encoding.  In other words, the name 'EUC' alone
 cannot specify an encoding.  On the other hand, 'UTF-8'
 is the name of a specific, concrete encoding.
</footnote>
</P>

<sect2 id="unicode-utf8"><heading>UTF-8</heading>

<P>
UTF-8 is an encoding whose CCS is UCS-4.  UTF-8
is designed to be upward compatible with ASCII.
UTF-8 is a multibyte encoding, and the number of bytes needed to express
one character ranges from 1 to 6.
</P>

<P>
Conversion from UCS-4 to UTF-8 is performed using a 
simple conversion rule.
<example>
UCS-4 (binary)                       UTF-8 (binary)
00000000 00000000 00000000 0???????  0???????
00000000 00000000 00000??? ????????  110????? 10??????
00000000 00000000 ???????? ????????  1110???? 10?????? 10??????
00000000 000????? ???????? ????????  11110??? 10?????? 10?????? 10??????
000000?? ???????? ???????? ????????  111110?? 10?????? 10?????? 10?????? 10??????
0??????? ???????? ???????? ????????  1111110? 10?????? 10?????? 10?????? 10?????? 10??????
</example>
Note that the shortest representation must always be used, even though
the longer forms could also express smaller UCS values.
</P>
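<P>
The conversion table above translates almost line by line into C.
The following sketch encodes one UCS-4 value into the 1 to 6 byte
form described here and always chooses the shortest form.
The function name <tt>ucs4_to_utf8()</tt> is hypothetical.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes the UCS-4 value c as UTF-8 into out[], which must have room
   for 6 bytes.  The shortest form is always chosen.  Returns the
   number of bytes written, or 0 if c does not fit in 31 bits.
   (Hypothetical helper name, for illustration only.) */
size_t ucs4_to_utf8(uint32_t c, unsigned char out[6])
{
    size_t len, i;
    unsigned char lead;

    if (c < 0x80) {
        out[0] = (unsigned char)c;       /* 0???????             */
        return 1;
    } else if (c < 0x800) {
        len = 2; lead = 0xc0;            /* 110????? + 1 trailer  */
    } else if (c < 0x10000) {
        len = 3; lead = 0xe0;            /* 1110???? + 2 trailers */
    } else if (c < 0x200000) {
        len = 4; lead = 0xf0;            /* 11110??? + 3 trailers */
    } else if (c < 0x4000000) {
        len = 5; lead = 0xf8;            /* 111110?? + 4 trailers */
    } else if (c < 0x80000000u) {
        len = 6; lead = 0xfc;            /* 1111110? + 5 trailers */
    } else {
        return 0;
    }

    /* Fill the trailing bytes from the end; each 10?????? byte
       carries 6 bits of the code point. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3f));
        c >>= 6;
    }
    out[0] = (unsigned char)(lead | c);  /* the remaining high bits */
    return len;
}
```

<P>
For example, <tt>u+3042</tt> becomes the three bytes
<tt>0xe3 0x81 0x82</tt>.
</P>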

<P>
UTF-8 seems to be one of the major candidates for the standard codeset
of the future.  For example, the Linux console and xterm support UTF-8.
The Debian <package>locales</package> package (version 2.1.97-1)
contains the <tt>ko_KR.UTF-8</tt> locale.  I think the number of UTF-8
locales will increase.
</P>

<sect2 id="unicode-utf16"><heading>UTF-16</heading>

<P>
UTF-16 is an encoding whose CCS is 20bit Unicode.
</P>

<P>
Characters in the BMP are expressed using the 16bit value of
their code point in the Unicode CCS.  There are two ways to express
a 16bit value in an 8bit stream.  Some of you may have heard the word
<em>endian</em>.  <em>Big endian</em> means that the octets which
make up a multi-octet datum are arranged
from the most significant octet to the least significant one.
<em>Little endian</em> is the opposite.  For example,
the 16bit value <tt>0x1234</tt> is expressed as
<tt>0x12 0x34</tt> in
big endian and <tt>0x34 0x12</tt> in little endian.
</P>

<P>
UTF-16 supports both byte orders.  Thus, the Unicode character
<tt>u+1234</tt> can be expressed either as <tt>0x12 0x34</tt>
or as <tt>0x34 0x12</tt>.  In exchange, UTF-16 texts
have to have a <strong>BOM (Byte Order Mark)</strong> at their
beginning.  The Unicode character <tt>u+feff</tt> (zero width no-break
space) is called BOM when it is used to indicate the byte order
(endian) of a text.  The mechanism is easy: in big endian,
<tt>u+feff</tt> will be <tt>0xfe 0xff</tt> while it will be
<tt>0xff 0xfe</tt> in little endian.  Thus you can tell
the byte order of the text by reading the first two bytes.
<footnote>
 I heard that BOM is merely a suggestion by a vendor.
 Read <url id="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
 name="Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux">
 for detail.
</footnote>
</P>
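<P>
Detecting the byte order from the BOM, as described above, takes only
a comparison of the first two bytes.  The following is a sketch with
hypothetical names (<tt>utf16_detect_bom()</tt> and
<tt>utf16_read_unit()</tt>).
</P>

```c
#include <stddef.h>

enum utf16_order {
    UTF16_UNKNOWN,        /* no BOM found                  */
    UTF16_BIG_ENDIAN,     /* text starts with 0xfe 0xff    */
    UTF16_LITTLE_ENDIAN   /* text starts with 0xff 0xfe    */
};

/* Looks at the first two bytes of a UTF-16 text and returns the byte
   order indicated by the BOM (u+feff).
   (Hypothetical helper name, for illustration only.) */
enum utf16_order utf16_detect_bom(const unsigned char *s, size_t n)
{
    if (n >= 2) {
        if (s[0] == 0xfe && s[1] == 0xff)
            return UTF16_BIG_ENDIAN;
        if (s[0] == 0xff && s[1] == 0xfe)
            return UTF16_LITTLE_ENDIAN;
    }
    return UTF16_UNKNOWN;
}

/* Reads one 16bit value from the two bytes at s in the given byte
   order.  Any order other than little endian is read as big endian. */
unsigned utf16_read_unit(const unsigned char *s, enum utf16_order order)
{
    if (order == UTF16_LITTLE_ENDIAN)
        return (unsigned)s[0] | ((unsigned)s[1] << 8);
    return ((unsigned)s[0] << 8) | (unsigned)s[1];
}
```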

<P>
Characters outside the BMP are expressed using a <strong>surrogate
pair</strong>.  Code points <tt>u+d800</tt> - <tt>u+dfff</tt>
are reserved for this purpose.  First, 0x10000 is subtracted from the
Unicode code point, which leaves a 20bit value, and these 20 bits
are divided into two sets of 10 bits.  The upper 10 bits
are mapped to the 10bit space of <tt>u+d800</tt> - <tt>u+dbff</tt>.
The lower 10 bits are mapped to the 10bit space of <tt>u+dc00</tt> -
<tt>u+dfff</tt>.  Thus UTF-16 can express 20bit Unicode characters.
</P>
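<P>
The surrogate pair calculation can be sketched as follows.  Note that
0x10000 is subtracted from the code point before it is split into
two sets of 10 bits, so that <tt>u+10000</tt> - <tt>u+10ffff</tt>
becomes a 20bit value.  The function names are hypothetical.
</P>

```c
#include <stdint.h>

/* Converts the Unicode code point c into UTF-16 code units.
   Returns the number of 16bit units written to out[] (1 or 2), or 0
   if c is itself a surrogate code point or is above u+10ffff.
   (Hypothetical helper name, for illustration only.) */
int ucs_to_utf16(uint32_t c, uint16_t out[2])
{
    if (c < 0x10000) {
        if (c >= 0xd800 && c <= 0xdfff)
            return 0;                /* reserved for surrogates */
        out[0] = (uint16_t)c;        /* BMP: used as it is      */
        return 1;
    }
    if (c > 0x10ffff)
        return 0;
    c -= 0x10000;                                /* now a 20bit value */
    out[0] = (uint16_t)(0xd800 + (c >> 10));     /* upper 10 bits     */
    out[1] = (uint16_t)(0xdc00 + (c & 0x3ff));   /* lower 10 bits     */
    return 2;
}

/* Combines a surrogate pair back into one code point.  The caller
   must pass high in u+d800 - u+dbff and low in u+dc00 - u+dfff. */
uint32_t utf16_surrogates_to_ucs(uint16_t high, uint16_t low)
{
    return 0x10000 + ((((uint32_t)high - 0xd800) << 10)
                      | ((uint32_t)low - 0xdc00));
}
```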

<sect2 id="unicode-utf16bele"><heading>UTF-16BE and UTF-16LE</heading>

<P>
UTF-16BE and UTF-16LE are variants of UTF-16 which are limited to
big endian and little endian, respectively.
</P>


<sect2 id="unicode-utf7"><heading>UTF-7</heading>

<P>
UTF-7 is designed so that Unicode can be transmitted over
a 7bit communication path.
</P>

<P>***** Not written yet *****</P>

<sect2 id="unicode-ucs"><heading>UCS-2 and UCS-4 as encodings</heading>

<P>
Though I introduced UCS-2 and UCS-4 as CCS, they can also be used as encodings.
</P>

<P>
In the UCS-2 encoding, each UCS-2 character is expressed in two bytes.
In the UCS-4 encoding, each UCS-4 character is expressed in four bytes.
</P>

<sect1 id="unicode-problem"><heading>Problems on Unicode</heading>

<P>
No standard is free from politics and compromise.
Though the concept of a single unified CCS for all characters in the
world is very nice, Unicode had to consider compatibility
with preceding international and local standards.  Moreover,
contrary to that ideal concept, the Unicode people gave too much weight
to efficiency.  IMHO, the surrogate pair is a mess caused by the shortage
of 16bit code space.  I will introduce a few problems of Unicode.
</P>

<sect2 id="unihan"><heading>Han Unification</heading>

<P>
This is the point on which Unicode is criticized most strongly
by many Japanese people (and also by Korean and Chinese people,
I suppose).
</P>

<P>
The region 0x4e00 - 0x9fff in UCS-2 is used for East Asian
ideographs (Japanese Kanji, Chinese Hanzi, and Korean Hanja).
There are similar characters
in these four character sets.  (There are two sets of Chinese characters:
simplified Chinese used in P. R. China and traditional Chinese used in
Taiwan.)  To reduce the number of ideograms to be encoded
(the region for these characters can contain only 20992 characters,
while the Taiwanese CNS 11643 standard alone contains 48711 characters),
these similar characters are treated as the same character.
This is Han Unification.
</P>

<P>
However, these characters are not exactly the same.  If the fonts for
these characters are made from the Chinese ones, Japanese people will
regard them as wrong characters, though they may be able to read them.
The Unicode people think that these unified characters are the same
character with different glyphs.
</P>

<P>
An example of Han Unification is available at
<url id="http://charts.unicode.org/unihan/unihan.acgi$0x9AA8" name="U+9AA8">.
This is a Kanji character for 'bone'.
<url id="http://charts.unicode.org/unihan/unihan.acgi$0x8FCE" name="U+8FCE">
is another example, a Kanji character for 'welcome'.  The part
from the left side to the bottom side is the 'run' radical.  The 'run'
radical is used in many Kanji and all of them have the same problem.
<url id="http://charts.unicode.org/unihan/unihan.acgi$0x76F4" name="U+76F4">
is another example, a Kanji character for 'straight'.
I, a native Japanese speaker, cannot recognize the Chinese version
at all.
<footnote>
  Unicode's <url id="http://www.unicode.org/unicode/faq/han_cjk.html"
   name="FAQ - Han and CJK -"> page says that
  <em>
  These differences of writing style are within the general 
  range of allowable differences within each typographic tradition. 
  </em>  However, this description is wrong.  If you want to develop
  Unicode software and want your software to be accepted by CJK
  people, you have to think about this problem.
</footnote>
</P>

<P>
Unicode font vendors will have difficulty deciding which glyphs to
choose for these characters: simplified Chinese, traditional Chinese,
Japanese, or Korean.  One method is to supply four fonts: a simplified
Chinese version, a traditional Chinese version, a Japanese version, and
a Korean version.
A commercial OS vendor can release localized versions of its OS;
for example, the Japanese version of MS Windows can include a Japanese
version of the Unicode font (this is exactly what they are doing).
However, what should XFree86 or Debian do?  I don't know...
<footnote>
  XFree86 4.0 includes Japanese and Korean versions of ISO 10646-1 fonts.
</footnote>
</P>

<sect2 id="combining"><heading>Combining Characters</heading>

<P>
Unicode has a way to synthesize an accented character by combining
an accent symbol and a base character.  For example, combining 'a' and
'~' makes 'a' with tilde.  Two or more accent symbols can be added to
a base character.
</P>

<P>
Languages such as Thai need combining characters.  Combining characters
are the only method to express characters in these languages.
</P>

<P>
However, a few problems arise.
<taglist>
 <tag>Duplicate Encoding</tag>
    <item>
    There are multiple ways to express the same character.
    For example, u with umlaut can be expressed as <tt>u+00fc</tt>
    and also as <tt>u+0075</tt> + <tt>u+0308</tt>.
    How can we implement 'grep' and so on?
 <tag>Open Repertoire</tag>
    <item>
    The number of expressible characters grows without limit.
    Even characters that do not exist can be expressed.
</taglist>
</P>
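<P>
The duplicate encoding problem can be seen in a few lines of C.
A naive comparison of code points treats <tt>u+00fc</tt> and
<tt>u+0075</tt> + <tt>u+0308</tt> as different strings, so a tool like
'grep' has to bring both sides into one form first.  The following is
a sketch only: the function name <tt>compose_known_pairs()</tt> is
hypothetical, and the table is deliberately tiny (three example
entries), while real data has to come from the Unicode character
database.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* A deliberately tiny, illustrative table of
   { base character, combining character, precomposed character }.
   These three entries are examples only, not a complete table. */
static const uint32_t compose_table[][3] = {
    { 0x0075, 0x0308, 0x00fc },  /* u + combining diaeresis */
    { 0x0061, 0x0303, 0x00e3 },  /* a + combining tilde     */
    { 0x0065, 0x0301, 0x00e9 },  /* e + combining acute     */
};

/* Rewrites the n code points at s in place, replacing each pair
   found in compose_table by its precomposed character.
   Returns the new number of code points.
   (Hypothetical helper name, for illustration only.) */
size_t compose_known_pairs(uint32_t *s, size_t n)
{
    size_t in = 0, out = 0, k;
    const size_t entries = sizeof compose_table / sizeof compose_table[0];

    while (in < n) {
        uint32_t c = s[in++];

        if (in < n) {
            for (k = 0; k < entries; k++) {
                if (compose_table[k][0] == c &&
                    compose_table[k][1] == s[in]) {
                    c = compose_table[k][2];
                    in++;        /* the combining character is consumed */
                    break;
                }
            }
        }
        s[out++] = c;
    }
    return out;
}
```

<P>
After both the pattern and the text have passed through such a step,
they can be compared code point by code point.
</P>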


<sect2 id="surrogate"><heading>Surrogate Pair</heading>

<P>
The first version of Unicode had only a 16bit code space,
though 16 bits are obviously insufficient to contain all
characters in the world.
<footnote>
  There are a few projects such as
  <url id="http://www.mojikyo.gr.jp/" name="Mojikyo">
  (about 90000 characters),
  <url id="http://www.tron.org/index-e.html" name="TRON project">
  (about 130000 characters),
  and so on to develop a CCS which contains
  sufficient characters for professional usage in CJK world.
</footnote>
Thus the surrogate pair was introduced in Unicode 2.0 to expand the
number of characters while keeping compatibility with the former
16bit Unicode.
</P>

<P>
However, the surrogate pair breaks the principle that all characters
are expressed with the same number of bits.  This makes Unicode
programming more difficult.
</P>

<P>
Fortunately, Debian and other UNIX-like systems will use UTF-8
(not UTF-16) as the usual encoding for UCS.  Thus, we will not need
to handle UTF-16 and surrogate pairs very often.
</P>

<sect2 id="646problem"><heading>ISO 646-* Problem</heading>

<P>
You will need a codeset converter between your local encodings
(for example, ISO 8859-* or ISO 2022-*) and Unicode.
For example, the Shift-JIS encoding
<footnote>
  The standard encoding for the Japanese versions of Macintosh and MS Windows.
</footnote>
contains
JISX 0201 Roman (the Japanese version of ISO 646), not ASCII.
JISX 0201 Roman encodes the yen currency mark at <tt>0x5c</tt>,
where backslash is encoded in ASCII.
</P>

<P>
Then, into which Unicode character should your converter convert
<tt>0x5c</tt> in Shift-JIS: <tt>u+005c</tt> (backslash) or <tt>u+00a5</tt>
(yen currency mark)?
You may say the yen currency mark is the right solution.
However, backslash (and hence the yen mark) is widely used as an
escape character.  For example, 'new line' is expressed as
'backslash - <tt>n</tt>' in a C string literal, and Japanese people use
'yen currency mark - <tt>n</tt>'.  You may say that program sources
must be written in ASCII and that the mistake is trying
to convert program source at all.  However, there is a lot of
source code and similar text written in the Shift-JIS encoding.
</P>
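<P>
Since neither answer is right for every text, a converter can only
leave the decision to its caller.  The following sketch shows such a
policy switch for the single byte <tt>0x5c</tt>.  The names
<tt>enum sjis_5c_policy</tt> and <tt>sjis_ascii_range_to_ucs()</tt> are
hypothetical, and the function handles only bytes in 0x00 - 0x7f; it
is not a Shift-JIS converter.
</P>

```c
#include <stdint.h>

/* How the caller wants 0x5c in Shift-JIS text to be converted. */
enum sjis_5c_policy {
    SJIS_5C_AS_BACKSLASH,  /* keep the code:  0x5c becomes u+005c */
    SJIS_5C_AS_YEN         /* keep the shape: 0x5c becomes u+00a5 */
};

/* Converts one byte in 0x00 - 0x7f of Shift-JIS text into a UCS code
   point.  Only 0x5c, the position discussed above, is treated
   specially; every other byte keeps its ASCII code.
   (Hypothetical helper name, for illustration only.) */
uint32_t sjis_ascii_range_to_ucs(unsigned char byte,
                                 enum sjis_5c_policy policy)
{
    if (byte == 0x5c && policy == SJIS_5C_AS_YEN)
        return 0x00a5;   /* yen currency mark */
    return byte;         /* same code as in ASCII */
}
```

<P>
A converter for program sources would probably choose
<tt>SJIS_5C_AS_BACKSLASH</tt>, while a converter for price lists would
probably choose <tt>SJIS_5C_AS_YEN</tt>.
</P>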

<P>
Windows now supports Unicode, and the glyph
at <tt>u+005c</tt> in the Japanese version of Windows is the yen currency
mark.  As you know, backslash (yen currency mark in Japan) is vitally
important for Windows, because it is used to separate directory names.
Fortunately, EUC-JP, which is widely used on UNIX in Japan,
includes ASCII, not the Japanese version of ISO 646.  So there
is no problem with EUC-JP, because it is clear that <tt>0x5c</tt> is backslash.
</P>

<P>
Thus local codesets should not use character sets incompatible
with ASCII, such as ISO 646-*.
</P>

<P>
<url id="http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html"
name="Problems and Solutions for Unicode and User/Vendor Defined
Characters"> discusses this problem.
</P>

<sect id="othercodes"><heading>Other Character Sets and Encodings</heading>

<P>
Besides the ISO 2022-compliant coded character sets and encodings
described in <ref id="iso2022set"> and <ref id="iso2022enc">,
there are many popular encodings which do not fall under
an international standard (i.e., they are neither ISO 2022-compliant
nor Unicode).  Internationalized software should
support these encodings (again, you don't need to be aware of
encodings if you use LOCALE and <tt>wchar_t</tt> technology).
Some organizations are developing systems which go further than the
limitations of the current international standards, though these
systems may not be very widespread so far.
</P>

<sect1 id="othercodes-big5"><heading>Big5</heading>

<P>
<strong>Big5</strong> is a de-facto standard encoding for 
Taiwan (1984) and is upward-compatible with ASCII. 
It is also a CCS.
</P>

<P>
In Big5, <tt>0x21</tt> - <tt>0x7e</tt> represent ASCII characters.
A byte in <tt>0xa1</tt> - <tt>0xfe</tt> forms a pair with the following byte
(<tt>0x40</tt> - <tt>0x7e</tt> or <tt>0xa1</tt> - <tt>0xfe</tt>);
the pair represents an ideogram or another character (13461 characters in total).  
</P>

<P>
Though Taiwan has a newer, ISO 2022-compliant standard, CNS 11643,
Big5 seems to be more popular than CNS 11643.
(CNS 11643 is a CCS and there are a few ISO 2022-derived
encodings which include CNS 11643.)
</P>
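<p>
The byte ranges above can be turned into a small scanner.  The following
C sketch is only an illustration: the names <tt>big5_is_lead()</tt>,
<tt>big5_is_trail()</tt>, and <tt>big5_count_chars()</tt> are made up for
this example, and every byte outside the first-byte range is simply
assumed to be a one-byte character.  It shows why a Big5 string cannot be
processed byte by byte: the second byte of a pair may fall in the ASCII range.
</p>

```c
#include <stddef.h>

/* First byte of a two-byte Big5 character: 0xa1 - 0xfe. */
static int big5_is_lead(unsigned char b)
{
    return b >= 0xa1 && b <= 0xfe;
}

/* Second byte of a two-byte Big5 character: 0x40 - 0x7e or 0xa1 - 0xfe. */
static int big5_is_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7e) || (b >= 0xa1 && b <= 0xfe);
}

/*
 * Count the characters in a NUL-terminated Big5 string.
 * Returns -1 when a first byte is not followed by a valid second byte.
 * Assumption: any byte that is not a first byte is one character.
 */
long big5_count_chars(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    long n = 0;

    while (*p != '\0') {
        if (big5_is_lead(*p)) {
            if (!big5_is_trail(p[1]))   /* also catches a truncated pair */
                return -1;
            p += 2;
        } else {
            p += 1;
        }
        n++;
    }
    return n;
}
```

<p>
For the four bytes <tt>'A'</tt>, <tt>0xa4</tt>, <tt>0x40</tt>, <tt>'B'</tt>
this counts three characters; the byte <tt>0x40</tt>, which is '@' in ASCII,
is consumed as the second byte of a pair.
</p>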

<sect1 id="othercodes-uhc"><heading>UHC</heading>

<P>
<strong>UHC</strong> is an encoding which is upward-compatible
with <strong>EUC-KR</strong>.  Its two-byte characters (the first byte:
<tt>0x81</tt> - <tt>0xfe</tt>; the second byte: 
<tt>0x41</tt> - <tt>0x5a</tt>, <tt>0x61</tt> - <tt>0x7a</tt>, and
<tt>0x81</tt> - <tt>0xfe</tt>) include KSX 1001 and the remaining Hangul, so 
that UHC can
express all 11172 Hangul.
</P>

<sect1 id="othercodes-johab"><heading>Johab</heading>

<P>
<strong>Johab</strong> is an encoding whose character set is identical
to that of <strong>UHC</strong>, i.e., ASCII, KSX 1001, and all other Hangul
characters.
Johab means combination in Korean.  In Johab, the code point of a Hangul
can be calculated from the combination of its Hangul parts (Jamo).
</P>
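<p>
The arithmetic behind "all 11172 Hangul" and "calculated from Jamo" can be
sketched as follows.  Please note that this sketch uses the composition
rule of Unicode's Hangul Syllables block (base <tt>u+ac00</tt>; 19 leading
consonants, 21 vowels, and 28 trailing slots including "none") as an
analogy.  It is <em>not</em> Johab's own bit layout, which is not described
in this document.  The name <tt>hangul_compose()</tt> and the index
convention are assumptions of this example.
</p>

```c
#include <assert.h>

/* Numbers of leading consonants, vowels, and trailing slots (incl. none). */
enum { HANGUL_L = 19, HANGUL_V = 21, HANGUL_T = 28 };

/*
 * Compose a Unicode Hangul syllable from zero-based Jamo indices:
 * l (leading consonant), v (vowel), t (trailing consonant, 0 = none).
 */
unsigned long hangul_compose(int l, int v, int t)
{
    assert(l >= 0 && l < HANGUL_L);
    assert(v >= 0 && v < HANGUL_V);
    assert(t >= 0 && t < HANGUL_T);
    return 0xAC00UL
         + ((unsigned long)l * HANGUL_V + (unsigned long)v) * HANGUL_T
         + (unsigned long)t;
}
```

<p>
Since 19 x 21 x 28 = 11172, the indices cover exactly the 11172 Hangul
mentioned above, from <tt>u+ac00</tt> to <tt>u+d7a3</tt>.
</p>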

<sect1 id="othercodes-hz"><heading>HZ, aka HZ-GB-2312</heading>

<p>
<strong>HZ</strong> is an encoding described in RFC 1842 and RFC 1843.  The CCSs 
(coded character sets) of HZ are ASCII and GB2312.  It is a 7bit 
encoding.
</p>

<p>
Note that HZ is <em>not</em> upward-compatible with ASCII,
since '<tt>~{</tt>' switches to GB2312 mode, '<tt>~}</tt>' switches to
ASCII mode, and '<tt>~~</tt>' means ASCII '~'.
</p>
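<p>
The mode switching can be sketched as a two-state scanner.  The function
<tt>hz_count()</tt> below is a hypothetical helper written for this
document; it only counts ASCII and GB2312 characters and handles just the
three escape sequences listed above.  Treating any other sequence that
starts with '~' as an error is an assumption of this sketch.
</p>

```c
#include <stddef.h>

/*
 * Scan an HZ string and count ASCII and GB2312 characters.
 * Returns 0 on success and -1 on a malformed sequence.
 */
int hz_count(const char *s, long *n_ascii, long *n_gb)
{
    int gb_mode = 0;            /* 0: ASCII mode, 1: GB2312 mode */

    *n_ascii = 0;
    *n_gb = 0;
    while (*s != '\0') {
        if (!gb_mode) {
            if (s[0] == '~') {
                if (s[1] == '{') {          /* "~{": enter GB2312 mode */
                    gb_mode = 1;
                } else if (s[1] == '~') {   /* "~~": one ASCII '~' */
                    (*n_ascii)++;
                } else {
                    return -1;              /* not handled in this sketch */
                }
                s += 2;
            } else {
                (*n_ascii)++;
                s += 1;
            }
        } else {
            if (s[0] == '~' && s[1] == '}') {       /* "~}": back to ASCII */
                gb_mode = 0;
            } else if (s[0] >= 0x21 && s[0] <= 0x7e &&
                       s[1] >= 0x21 && s[1] <= 0x7e) {
                (*n_gb)++;                          /* one two-byte character */
            } else {
                return -1;
            }
            s += 2;
        }
    }
    return 0;
}
```

<p>
A program which treats the text as plain ASCII would see every GB2312
character as two unrelated printable characters, which is why HZ text
must not be handled as ASCII.
</p>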

<sect1 id="othercodes-gbk"><heading>GBK</heading>

<p>
<strong>GBK</strong> is an encoding which is upward-compatible
with CN-GB.  GBK covers ASCII, GB2312, other Unicode 1.0 ideograms,
and a bit more.  The range of two-byte characters in GBK is
<tt>0x81</tt> - <tt>0xfe</tt> for the first byte and
<tt>0x40</tt> - <tt>0x7e</tt> or <tt>0x80</tt> - <tt>0xfe</tt>
for the second byte.  21886 code points out of the 23940 (126 x 190) in the two-byte
region are defined.
</p>

<p>
GBK is one of the popular encodings in P. R. China.
</p>

<sect1 id="othercodes-gb18030"><heading>GB18030</heading>

<p>
<strong>GB 18030</strong> is an encoding which is upward-compatible
with GBK and CN-GB.  It is a recent national standard (released on 
17 March 2000) of China.  It adds four-byte characters to GBK.
Their range is:
<tt>0x81</tt> - <tt>0xfe</tt> for the first byte,
<tt>0x30</tt> - <tt>0x39</tt> for the second byte,
<tt>0x81</tt> - <tt>0xfe</tt> for the third byte, and
<tt>0x30</tt> - <tt>0x39</tt> for the fourth byte.
</p>

<p>
It includes all characters of Unicode 3.0's Unihan Extension A.
Moreover, GB 18030 supplies code space for all used and
unused code points of Unicode's plane 0 (BMP) and the 16 additional
planes.
</p>
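<p>
The size of the four-byte region follows directly from the ranges given
above.  The following sketch validates a four-byte sequence and maps it to
a linear index; the function names are hypothetical, and the index is
merely a position inside the four-byte region, not a Unicode code point.
</p>

```c
#include <stddef.h>

/* Check the four byte ranges: 0x81-0xfe, 0x30-0x39, 0x81-0xfe, 0x30-0x39. */
int gb18030_is_four_byte(const unsigned char b[4])
{
    return b[0] >= 0x81 && b[0] <= 0xfe
        && b[1] >= 0x30 && b[1] <= 0x39
        && b[2] >= 0x81 && b[2] <= 0xfe
        && b[3] >= 0x30 && b[3] <= 0x39;
}

/*
 * Linear position of a four-byte sequence inside the four-byte region
 * (0 for 81 30 81 30), or -1 if the sequence is out of range.
 */
long gb18030_four_byte_index(const unsigned char b[4])
{
    if (!gb18030_is_four_byte(b))
        return -1;
    return (((long)(b[0] - 0x81) * 10 + (b[1] - 0x30)) * 126
            + (b[2] - 0x81)) * 10 + (b[3] - 0x30);
}
```

<p>
The region holds 126 x 10 x 126 x 10 = 1587600 positions, which is more
than the 17 x 65536 = 1114112 code points of Unicode's 17 planes.
</p>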

<p>
<url id="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf"
name="A detailed explanation on GB18030"> is available.
</p>

<sect1 id="othercodes-gccs"><heading>GCCS</heading>

<p>
<strong>GCCS</strong> (Government Common Character Set) is a coded
character set standardized by Hong Kong (HKSAR: Hong Kong Special
Administrative Region).  It includes 3049 characters.
It is defined as an <em>additional character set
for Big5</em>.  Characters in GCCS are coded in the User-Defined Area
of Big5 (just like the Private Use Area of UCS).
</p>

<sect1 id="othercodes-hkscs"><heading>HKSCS</heading>

<p>
<strong>HKSCS</strong> (Hong Kong Supplementary Character Set)
is an expansion and amendment of GCCS.
It includes 4702 characters.
</p>

<p>
In addition to code points in the User-Defined Area of Big5,
HKSCS defines code points in the Private Use Area of Unicode.
</p>

<sect1 id="othercodes-shiftjis"><heading>Shift-JIS</heading>

<p>
<strong>Shift-JIS</strong> is one of the popular encodings in Japan.
Its CCSs are JISX 0201 Roman, JISX 0201 Kana, and JISX 0208.
</p>

<p>
JISX 0201 Roman is the Japanese version of ISO 646.  It defines
the yen currency mark at <tt>0x5c</tt>, where ASCII has the backslash.
A byte in <tt>0xa1</tt> - <tt>0xdf</tt> is a one-byte character of
JISX 0201 Kana.  A two-byte character (the first byte:
<tt>0x81</tt> - <tt>0x9f</tt> or <tt>0xe0</tt> - <tt>0xef</tt>;
the second byte: <tt>0x40</tt> - <tt>0x7e</tt> or <tt>0x80</tt> -
<tt>0xfc</tt>) belongs to JISX 0208.
</p>
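<p>
These ranges can be written down as a byte classifier.  The enum and
function names below are invented for this illustration, and bytes which
fit none of the ranges above are simply reported as "other".
</p>

```c
/* Role of a byte that starts a character in Shift-JIS. */
enum sjis_kind {
    SJIS_ROMAN,     /* 0x00 - 0x7f: JISX 0201 Roman, one byte        */
    SJIS_KANA,      /* 0xa1 - 0xdf: JISX 0201 Kana, one byte         */
    SJIS_LEAD,      /* 0x81 - 0x9f, 0xe0 - 0xef: first of two bytes  */
    SJIS_OTHER      /* anything else: not covered by this sketch     */
};

enum sjis_kind sjis_classify(unsigned char b)
{
    if (b <= 0x7f)
        return SJIS_ROMAN;
    if (b >= 0xa1 && b <= 0xdf)
        return SJIS_KANA;
    if ((b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xef))
        return SJIS_LEAD;
    return SJIS_OTHER;
}

/* Second byte of a two-byte character: 0x40 - 0x7e or 0x80 - 0xfc. */
int sjis_is_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7e) || (b >= 0x80 && b <= 0xfc);
}
```

<p>
Note that <tt>sjis_is_trail(0x5c)</tt> is true: the second byte of a
two-byte character can be <tt>0x5c</tt>, so software which looks for
the backslash (or yen mark) byte by byte may misinterpret the second half
of a JISX 0208 character.
</p>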

<p>
The Japanese versions of MS DOS, MS Windows, and Macintosh use this encoding,
though it is not often used on POSIX systems.
</p>


<sect1 id="othercodes-viscii"><heading>VISCII</heading>

<P>
The Vietnamese language uses 186 characters (Latin letters with accents)
and other symbols.
This is a bit more than an ISO 8859-like encoding can hold.
</P>

<P>
<strong>VISCII</strong> is a standard for Vietnamese.  
It is upward-compatible with ASCII.  It is 8bit and stateless, 
like the ISO 8859 series.  However, it uses not only the code points
<tt>0x21</tt> - <tt>0x7e</tt> and <tt>0xa0</tt> - 
<tt>0xff</tt> but also <tt>0x02</tt>, <tt>0x05</tt>, <tt>0x06</tt>, 
<tt>0x14</tt>, <tt>0x19</tt>, <tt>0x1e</tt>, and <tt>0x80</tt> -
<tt>0x9f</tt>.  This makes VISCII non-ISO 2022-compliant.
</P>

<P>
Vietnam has a new, ISO 2022-compliant character set,
<strong>TCVN 5712 VN2</strong> (aka <strong>VSCII</strong>).
In TCVN 5712 VN2, accented characters are expressed as
combined characters.  Note that some accented characters
also have their own code points.
</P>

<sect1 id="othercodes-tron"><heading>TRON</heading>

<P>
<url id="http://www.tron.org/index-e.html" name="TRON">
is a project to develop a new operating system,
founded in 1984 as a collaboration between industry and academia
in Japan.
</P>

<P>
The most widely used member of the TRON operating system family
is ITRON, a real-time OS for embedded systems.
However, ITRON is not our interest here.
What matters is that TRON defines its own encoding.
</P>

<P>
TRON's encoding is stateful.  A state is assigned
to each language.  About 130000 characters had already been defined
as of January 2000.
</P>

<sect1 id="othercodes-mojikyo"><heading>Mojikyo</heading>

<P>
<url id="http://www.mojikyo.gr.jp/" name="Mojikyo">
is a project to develop an environment in which a user
can use many of the world's characters.  The Mojikyo
project has released an application for
MS Windows to display and input about 90000 characters.
You can download the software and TrueType, TeX, and
CID fonts, though they are not DFSG-free.
</P>



<chapt id="languages"><heading>Characters in Each Country</heading>

<P>
This chapter describes information specific to each language.
If you are developing serious DTP software or planning to support
I18N in detail, this chapter may help you.
Contributions from people speaking each language are welcome.
If you are going to write a section on your language, please include
these points:
<enumlist>
  <item>kinds and number of characters used in the language,
  <item>explanation of the coded character set(s) which is (are) standardized,
  <item>explanation of the encoding(s) which is (are) standardized,
  <item>usage and popularity of each encoding,
  <item>de-facto standard, if any, on how many columns characters occupy, 
  <item>writing direction and combined characters,
  <item>how to lay out characters (word wrapping and so on),
  <item>widely used values for the <tt>LANG</tt> environment variable,
  <item>the way to input characters from the keyboard and whether
        you want to input yes/no (and so on) in your language 
        or in English,
  <item>information needed for beautiful display, for example, 
        where to break a line, hyphenation, word wrapping, and so on, and
  <item>other topics.
</enumlist>
</P>


<P>
Writers whose languages are written in a different direction
from European languages or need combined characters
(I heard that these are used in Thai) are encouraged to explain 
how to treat such languages.
</P>



&japanese-japan;
&spanish;
&cyrillic;





<chapt id="locale"><heading>LOCALE technology</heading>

<P>
<strong>LOCALE</strong> is a basic concept introduced
into <strong>ISO C</strong> (ISO/IEC 9899:1990).  The 
standard was expanded in 1995 (ISO 9899:1990 Amendment 1:1995).  
In the LOCALE model, the behavior of some C functions depends
on the LOCALE environment.  The LOCALE environment is divided
into a few categories and each of these categories can
be set independently using <tt>setlocale()</tt>.
</P>

<P>
<strong>POSIX</strong> also defines some standards related to 
i18n.  Most of the POSIX and ISO C standards are included in the
<strong>XPG4</strong> (X/Open Portability Guide) standard and
all of them are included in the XPG5 standard.  Note that 
<strong>XPG5</strong> is included in the UNIX specifications version 2.  
Thus support of XPG5 is mandatory to obtain the Unix brand.  In other words,
every operating system branded under that specification supports XPG5.
</P>

<P>
The merits of using locale technology over hard-coding Unicode
are:
<list>
  <item>The software can be written in an encoding-independent way.
        This means that the software can support all encodings
        which the OS supports, including 7bit, 8bit, multibyte,
        stateful, and stateless encodings such as ASCII, ISO 8859-*,
        EUC-*, ISO 2022-*, Big5, VISCII, TIS 620, UTF-*, and so on.
  <item>The software provides a common, unified method to
        configure locale and encoding.  This benefits users.
        Otherwise, users would have to remember the method to enable
        UTF-8 mode for each piece of software. Some software needs a <tt>-u8</tt> 
        switch, some needs an X resource setting, some needs a
        <tt>.foobarrc</tt> file, some needs a special environment 
        variable, and some uses UTF-8 by default.  It is nonsense!
  <item>Advancement of the OS means advancement of the
        software.  Thus, you can use a new locale without recompiling
        your software.
</list>
You can read the 
<url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT" 
name="Unicode support in the Solaris Operating Environment"> whitepaper
to understand the merits of this model.
<url id="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html"
name="Bruno Haible's Unicode HOWTO">
also recommends this model.
</P>

<sect id="localecategory"><heading>Locale Categories and <tt>setlocale()</tt></heading>

<P>
In the LOCALE model, the behavior of some C functions depends
on the LOCALE environment.  The LOCALE environment is divided
into six categories and each of these categories can
be set independently using <tt>setlocale()</tt>.
</P>

<P>
The six categories are as follows:
<taglist>
  <tag><strong>LC_CTYPE</strong>
       <item>
       <p>
       Category related to encodings.
       Characters which are encoded in the LC_CTYPE-dependent encoding
       are called <strong>multibyte characters</strong>.
       Note that a multibyte character does not have to occupy
       more than one byte.
       </p>
       <p>
       LC_CTYPE-dependent functions include character testing functions
       such as <tt>islower()</tt>, multibyte character
       functions such as <tt>mblen()</tt>, and multibyte
       string functions such as <tt>mbstowcs()</tt>.
       </p>
       </item>
  <tag><strong>LC_COLLATE</strong>
       <item>
       <p>
       Category related to sorting.
       <tt>strcoll()</tt> and so on are LC_COLLATE-dependent.
       </p>
       </item>
  <tag><strong>LC_MESSAGES</strong>
       <item>
       <p>
       Category related to the language of the messages the software
       outputs.  This category is used by <prgn>gettext</prgn>.
       </p>
       </item>
  <tag><strong>LC_MONETARY</strong>
       <item>
       <p>
       Category related to the format of monetary amounts,
       for example, currency mark, comma or period, columns,
       and so on.
       In ISO C, <tt>localeconv()</tt> is the only function which is
       LC_MONETARY-dependent (XPG4 adds <tt>strfmon()</tt>).
       </p>
       </item>
  <tag><strong>LC_NUMERIC</strong>
       <item>
       <p>
       Category related to the format of general numbers,
       for example, the character used for the decimal point.
       </p>
       <p>
       Formatted I/O functions such as <tt>printf()</tt>,
       string conversion functions such as <tt>atof()</tt>,
       and so on are LC_NUMERIC-dependent.
       </p>
       </item>
  <tag><strong>LC_TIME</strong>
       <item>
       <p>
       Category related to the format of time and date,
       such as the names of months and days of the week, the order of day,
       month, and year, and so on.
       </p>
       <p>
       <tt>strftime()</tt> and so on are LC_TIME-dependent.
       </p>
       </item>
</taglist>
</p>

<p>
<tt>setlocale()</tt> is the function to set the LOCALE.
Its prototype is char *<tt>setlocale(</tt>int <em>category</em>, const char 
*<em>locale</em><tt>);</tt>.  The header file <tt>locale.h</tt>
is needed for the prototype declaration and the definitions of the
macros for category names.  An example is
<tt>setlocale(LC_TIME, "de_DE");</tt>.
</p>

<p>
For <em>category</em>, the following macros can be used:
LC_CTYPE, LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME, and
LC_ALL.  For <em>locale</em>, a specific locale name, <tt>NULL</tt>,
or <tt>""</tt> can be specified.
</p>

<p>
Giving <tt>NULL</tt> for <em>locale</em> returns the
current value of the specified locale category without changing it.  Otherwise,
<tt>setlocale()</tt> returns the newly set locale name,
or <tt>NULL</tt> on error.
</p>

<p>
Given <tt>""</tt> for <em>locale</em>, <tt>setlocale()</tt>
determines the locale name in the following manner:
<list>
  <item>First, consult the <tt>LC_ALL</tt> environment variable.
  <item>If <tt>LC_ALL</tt> is not set, consult the environment
        variable with the same name as the locale category,
        for example, <tt>LC_COLLATE</tt>.
  <item>If neither of them is set, consult the <tt>LANG</tt> 
        environment variable.
</list>
This is why a user is expected to set the <tt>LANG</tt> variable.
In other words, all a user has to do is set the <tt>LANG</tt>
variable so that all locale-compliant software works in the
desired way.
</p>
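<p>
The lookup order can be written down as a small function, which may also
be useful when you have to emulate it in software which does not call
<tt>setlocale()</tt>.  This is only a sketch: the names
<tt>pick_locale_name()</tt> and <tt>ctype_locale_from_env()</tt> are
hypothetical, an empty variable is treated like an unset one, and falling
back to <tt>"C"</tt> when nothing is set is an assumption of this example.
</p>

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Choose a locale name in the order LC_ALL, LC_<category>, LANG.
 * Each argument is the value of the corresponding environment variable,
 * or NULL when it is not set.
 */
const char *pick_locale_name(const char *lc_all,
                             const char *lc_category,
                             const char *lang)
{
    if (lc_all != NULL && lc_all[0] != '\0')
        return lc_all;
    if (lc_category != NULL && lc_category[0] != '\0')
        return lc_category;
    if (lang != NULL && lang[0] != '\0')
        return lang;
    return "C";     /* assumption: default when nothing is set */
}

/* The same lookup for the LC_CTYPE category, using the real environment. */
const char *ctype_locale_from_env(void)
{
    return pick_locale_name(getenv("LC_ALL"),
                            getenv("LC_CTYPE"),
                            getenv("LANG"));
}
```

<p>
Keeping the decision in a function without side effects makes it easy to
check the precedence rules separately from the environment.
</p>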

<p>
Thus, I strongly recommend calling <tt>setlocale(LC_ALL, "");</tt>
at the beginning of your software, if the software is to be 
international.
</p>
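<p>
A minimal sketch of this initialization follows.  The wrapper names
<tt>init_locale()</tt> and <tt>current_locale()</tt> are made up for this
example; only <tt>setlocale()</tt> itself is the standard interface.
</p>

```c
#include <locale.h>
#include <stddef.h>

/*
 * Set all locale categories from the environment.
 * If the environment names a locale which is not installed, setlocale()
 * returns NULL and leaves the locale unchanged; in that case this sketch
 * selects the "C" locale explicitly.
 */
const char *init_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}

/* Query the current setting of one category without changing it. */
const char *current_locale(int category)
{
    return setlocale(category, NULL);
}
```

<p>
A program would call <tt>init_locale()</tt> once, as the first statement
of <tt>main()</tt>, and may later call, for example,
<tt>current_locale(LC_CTYPE)</tt> to report which locale is in effect.
</p>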

<sect id="localename"><heading>Locale Names</heading>

<P>
We can specify locale names for these six locale categories.
Then, which name should we specify?
</P>

<P>
The syntax of a locale name is as follows:
<example>
  language[_territory][.codeset][@modifier]
</example>
where <em>language</em> is two lowercase letters defined
in ISO639, such as <tt>en</tt> for English, <tt>eo</tt> for
Esperanto, and <tt>zh</tt> for Chinese, and <em>territory</em>
is two uppercase letters defined in ISO3166, such as
<tt>GB</tt> for United Kingdom, <tt>KR</tt> for Republic of
Korea (South Korea), and <tt>CN</tt> for China.  There is no standard
for <em>codeset</em> and <em>modifier</em>.  GNU libc uses
<tt>ISO-8859-1</tt>, <tt>ISO-8859-13</tt>, <tt>eucJP</tt>,
<tt>SJIS</tt>, <tt>UTF8</tt>, and so on for <em>codeset</em>,
and <tt>euro</tt> for <em>modifier</em>.
</P>
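<p>
The syntax can be illustrated by a small parser which splits a locale name
into its four parts.  The structure <tt>locale_parts</tt>, the buffer
sizes, and the function names are assumptions made for this sketch;
the C library itself offers no such function, and ordinary software
should pass the name to <tt>setlocale()</tt> unmodified.
</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the parts of a locale name. */
struct locale_parts {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copy [begin, end) into dst, truncating to fit, and NUL-terminate. */
static void copy_span(char *dst, size_t dstsize,
                      const char *begin, const char *end)
{
    size_t n = (size_t)(end - begin);

    if (n >= dstsize)
        n = dstsize - 1;
    memcpy(dst, begin, n);
    dst[n] = '\0';
}

/* Split "language[_territory][.codeset][@modifier]"; absent parts are "". */
void parse_locale_name(const char *name, struct locale_parts *out)
{
    const char *end = name + strlen(name);
    const char *mark;

    memset(out, 0, sizeof *out);

    mark = strchr(name, '@');                       /* [@modifier] */
    if (mark != NULL) {
        copy_span(out->modifier, sizeof out->modifier, mark + 1, end);
        end = mark;
    }
    mark = memchr(name, '.', (size_t)(end - name)); /* [.codeset] */
    if (mark != NULL) {
        copy_span(out->codeset, sizeof out->codeset, mark + 1, end);
        end = mark;
    }
    mark = memchr(name, '_', (size_t)(end - name)); /* [_territory] */
    if (mark != NULL) {
        copy_span(out->territory, sizeof out->territory, mark + 1, end);
        end = mark;
    }
    copy_span(out->language, sizeof out->language, name, end);
}
```

<p>
The parts are split off from the right, so a name such as
<tt>de_DE.ISO-8859-1@euro</tt> yields <tt>de</tt>, <tt>DE</tt>,
<tt>ISO-8859-1</tt>, and <tt>euro</tt>.
</p>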

<P>
However, which locale names are valid depends on the system.
In other words, you have to install the <em>locale database</em> for
the locale you want to use.  Type <tt>locale -a</tt> to display all
locale names supported on the system.
</P>

<p>
Note that the locale names <tt>"C"</tt> and <tt>"POSIX"</tt> are
reserved for the default behavior.  For example,
when your software needs to parse the output of <tt>date(1)</tt>,
you had better run <tt>date(1)</tt> with <tt>LC_ALL</tt> (or <tt>LC_TIME</tt>)
set to <tt>C</tt> in its environment.  Calling <tt>setlocale(LC_TIME, "C");</tt>
affects only your own process, for example the output of <tt>strftime()</tt>.
</p>

<sect id="wchar"><heading>Multibyte Characters and Wide Characters</heading>

<p>
Now we will concentrate on LC_CTYPE, which is the most important
of the six locale categories.
</p>

<p>
Many encodings such as ASCII, ISO 8859-*, KOI8-R, EUC-*,
ISO 2022-*, TIS 620, UTF-8, and so on are widely used in the world.
It is inefficient and a cause of bugs, if not impossible, for
every piece of software to implement all these encodings.
Fortunately, we can use LOCALE technology to solve this problem.
<footnote>
  Usage of UCS-4 is the second best solution for this problem.
  Sometimes LOCALE technology cannot be used and UCS-4 is the
  best.  I will discuss this solution later.
</footnote>
</p>

<p>
<strong>Multibyte characters</strong> is the term for characters
encoded in the locale-specific encoding.  It is nothing special.
It is merely a name for our everyday encodings.  In an ISO 8859-1 locale,
ISO 8859-1 is the multibyte character encoding.  In an EUC-JP locale, EUC-JP
is the multibyte character encoding.  In a UTF-8 locale, UTF-8 is the
multibyte character encoding.
In short, the multibyte character is defined by the <tt>LC_CTYPE</tt> locale 
category.
Multibyte characters are used whenever your software exchanges
text data with anything outside your software,
for example, standard input/output, display, keyboard, file,
and so on, as you do every day.
<footnote>
 There are a few exceptions.  Compound text should be used for
 communication between X clients.  UTF-8 would be the standard
 for file names in Linux.
</footnote>
</p>

<p>
You can handle multibyte characters using the ordinary <tt>char</tt>
or <tt>unsigned char</tt> types and the ordinary character- and
string-oriented functions.  It is just what you used to do for
ASCII and 8bit encodings.
</p>

<p>
Then why do we call it by the special term <em>multibyte character</em>?
The answer is that ISO C specifies a set of functions which can handle
multibyte characters properly.  On the other hand, it is obvious that
usual C functions such as <tt>strlen()</tt> cannot handle multibyte
characters properly, because they work on bytes, not on characters.
</p>
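<p>
The difference can be seen by counting characters.  The function
<tt>mb_char_count()</tt> below is a hypothetical helper written for this
document; it walks a string with <tt>mbrlen()</tt>, which interprets the
bytes according to the current <tt>LC_CTYPE</tt> locale, whereas
<tt>strlen()</tt> always counts bytes.
</p>

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters of a multibyte string under the current LC_CTYPE
 * locale.  Returns (size_t)-1 if the string contains an invalid or
 * incomplete multibyte sequence.
 */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    size_t left = strlen(s);    /* number of bytes, not of characters */
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* start in the initial shift state */
    while (left > 0) {
        size_t len = mbrlen(s, left, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        if (len == 0)           /* terminating null character */
            break;
        s += len;
        left -= len;
        count++;
    }
    return count;
}
```

<p>
For an ASCII string both functions give the same number.  In a locale
with a multibyte encoding, provided that <tt>setlocale(LC_ALL, "");</tt>
has been called, <tt>mb_char_count()</tt> can be smaller than
<tt>strlen()</tt>.
</p>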

<p>
Then what are these functions which can handle multibyte characters
properly?  Please wait a minute. 
A multibyte character encoding may be stateful or stateless and multibyte or
non-multibyte, since the term covers every encoding that has ever been used
or will be used on the earth.  Thus it is not convenient for internal processing.
It needs a complex algorithm even for, for example, extracting a character
from a string, concatenating or splitting strings, 
or counting the number of characters in a string.
Thus, <strong>wide characters</strong> should be used for internal
processing.  The main part of the C functions which can handle
multibyte characters consists of functions for conversion between
multibyte characters and wide characters.
These functions are introduced later.  Note that you may
be able to do without these functions, since ISO C supplies
I/O functions which perform the conversion.
</p>

<p>
ISO C defines the wide character so that
<list>
  <item>all characters are expressed in a fixed width of bits, and
  <item>it is stateless, i.e., it doesn't have shift states.
</list>
</p>

<p>
There are two types for wide characters: <tt>wchar_t</tt> and
<tt>wint_t</tt>.  <tt>wchar_t</tt> is a type which can contain
one wide character, just as the <tt>char</tt> type can
contain one character.  <tt>wint_t</tt> can contain one wide
character or <tt>WEOF</tt>, the counterpart of <tt>EOF</tt>.
</p>

<p>
A string of wide characters is represented by an array of <tt>wchar_t</tt>,
just as a string of characters is represented by an array
of <tt>char</tt>.
</p>

<p>
There are functions for <tt>wchar_t</tt> which substitute for the functions
for <tt>char</tt>.
<list>
  <item><tt>strcat()</tt>, <tt>strncat()</tt> -&gt;
        <tt>wcscat()</tt>, <tt>wcsncat()</tt>
  <item><tt>strcpy()</tt>, <tt>strncpy()</tt> -&gt;
        <tt>wcscpy()</tt>, <tt>wcsncpy()</tt>
  <item><tt>strcmp()</tt>, <tt>strncmp()</tt> -&gt;
        <tt>wcscmp()</tt>, <tt>wcsncmp()</tt>
  <item><tt>strcasecmp()</tt>, <tt>strncasecmp()</tt> -&gt;
        <tt>wcscasecmp()</tt>, <tt>wcsncasecmp()</tt>
  <item><tt>strcoll()</tt>, <tt>strxfrm()</tt> -&gt;
        <tt>wcscoll()</tt>, <tt>wcsxfrm()</tt>
  <item><tt>strchr()</tt>, <tt>strrchr()</tt> -&gt;
        <tt>wcschr()</tt>, <tt>wcsrchr()</tt>
  <item><tt>strstr()</tt>, <tt>strpbrk()</tt> -&gt;
        <tt>wcsstr()</tt>, <tt>wcspbrk()</tt>
  <item><tt>strtok()</tt>, <tt>strspn()</tt>, <tt>strcspn()</tt> -&gt;
        <tt>wcstok()</tt>, <tt>wcsspn()</tt>, <tt>wcscspn()</tt>
  <item><tt>strtol()</tt>, <tt>strtoul()</tt>, <tt>strtod()</tt> -&gt;
        <tt>wcstol()</tt>, <tt>wcstoul()</tt>, <tt>wcstod()</tt>
  <item><tt>strftime()</tt> -&gt;
        <tt>wcsftime()</tt>
  <item><tt>strlen()</tt> -&gt;
        <tt>wcslen()</tt>
  <item><tt>toupper()</tt>, <tt>tolower()</tt> -&gt;
        <tt>towupper()</tt>, <tt>towlower()</tt>
  <item><tt>isalnum()</tt>, <tt>isalpha()</tt>, <tt>isblank()</tt>,
	<tt>iscntrl()</tt>, <tt>isdigit()</tt>, <tt>isgraph()</tt>,
	<tt>islower()</tt>, <tt>isprint()</tt>, <tt>ispunct()</tt>,
	<tt>isspace()</tt>, <tt>isupper()</tt>, <tt>isxdigit()</tt> -&gt;
	<tt>iswalnum()</tt>, <tt>iswalpha()</tt>, <tt>iswblank()</tt>,
	<tt>iswcntrl()</tt>, <tt>iswdigit()</tt>, <tt>iswgraph()</tt>,
	<tt>iswlower()</tt>, <tt>iswprint()</tt>, <tt>iswpunct()</tt>,
	<tt>iswspace()</tt>, <tt>iswupper()</tt>, <tt>iswxdigit()</tt>
	(<tt>isascii()</tt> doesn't have its wide character version).
  <item><tt>memset()</tt>, <tt>memcpy()</tt>, <tt>memmove()</tt>,
	<tt>memcmp()</tt>, <tt>memchr()</tt> -&gt;
	<tt>wmemset()</tt>, <tt>wmemcpy()</tt>, <tt>wmemmove()</tt>,
	<tt>wmemcmp()</tt>, <tt>wmemchr()</tt> 
</list>
There are additional functions for <tt>wchar_t</tt>.
<list>
  <item><tt>wcwidth()</tt>, <tt>wcswidth()</tt>
  <item><tt>wctrans()</tt>, <tt>towctrans()</tt>
</list>
</p>
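<p>
With wide characters, per-character operations become as simple as
per-byte operations on ASCII.  As an illustration, the following sketch
reverses a wide string character by character.  The function name
<tt>wcs_reverse()</tt> is made up for this example; <tt>wcslen()</tt> is
the standard counterpart of <tt>strlen()</tt> from the list above.
</p>

```c
#include <stddef.h>
#include <wchar.h>

/* Reverse a wide string in place, one wide character at a time. */
wchar_t *wcs_reverse(wchar_t *ws)
{
    size_t len = wcslen(ws);
    size_t i;

    for (i = 0; i < len / 2; i++) {
        wchar_t tmp = ws[i];

        ws[i] = ws[len - 1 - i];
        ws[len - 1 - i] = tmp;
    }
    return ws;
}
```

<p>
Doing the same on a <tt>char</tt> array would reverse the bytes and so
destroy every multibyte character in it.
</p>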

<p>
You cannot assume anything about the concrete value of a <tt>wchar_t</tt>,
except that <tt>0x21</tt> - <tt>0x7e</tt> are identical to ASCII.
<footnote>
 Some of you may know that GNU libc uses UCS-4 for the internal representation
 of <tt>wchar_t</tt>.  However, you should not rely on this knowledge.
 It may differ on other systems.
</footnote>
You may feel this limitation is too strict.  If you cannot work
within this limitation, you can use UCS-4 as the internal encoding.
In such a case, you can write your software to emulate
the locale-sensitive behavior using <tt>setlocale()</tt>,
<tt>nl_langinfo(CODESET)</tt>, and <tt>iconv()</tt>.  Consult
the section <ref id="iconv">.  Note that it is generally
easier to use wide characters than to implement UCS-4 or UTF-8.
</p>

<p>
You can write a wide character in the source code as <tt>L'a'</tt>
and a wide string as <tt>L"string"</tt>.  Since the encoding
of the source code is ASCII, you can only write ASCII
characters.  If you'd like to use other characters, you should
use <prgn>gettext</prgn>.
</p>

<p>
There are two ways to use wide characters:
<list>
  <item>I/O is done using multibyte characters.  Input data
        are converted into wide characters immediately after reading
        and data for output are converted from wide characters to 
	multibyte characters immediately before writing.  Conversion
	can be achieved using functions such as <tt>mbstowcs()</tt>,
	<tt>mbsrtowcs()</tt>, <tt>wcstombs()</tt>, <tt>wcsrtombs()</tt>,
	<tt>mblen()</tt>, <tt>mbrlen()</tt>, <tt>mbsinit()</tt>,
	and so on.  
	Please consult the manual pages for these functions.
  <item>Wide characters are directly used for I/O, using 
	wide character functions such as <tt>getwchar()</tt>, 
	<tt>fgetwc()</tt>, <tt>getwc()</tt>,
	<tt>ungetwc()</tt>, <tt>fgetws()</tt>, <tt>putwchar()</tt>,
	<tt>fputwc()</tt>, <tt>putwc()</tt>, and <tt>fputws()</tt>,
	formatted I/O functions for wide characters such as
	<tt>fwscanf()</tt>, <tt>wscanf()</tt>, <tt>swscanf()</tt>,
	<tt>fwprintf()</tt>, <tt>wprintf()</tt>, <tt>swprintf()</tt>,
	<tt>vfwprintf()</tt>, <tt>vwprintf()</tt>, and
	<tt>vswprintf()</tt>, and the wide character conversion specifiers
	<tt>%lc</tt>, <tt>%C</tt>, <tt>%ls</tt>, and <tt>%S</tt>
	for conventional formatted I/O functions.
	By using this approach, you don't need to handle
	multibyte characters at all.
	Please consult the manual pages for these functions.
</list>
Though the latter functions are also specified in ISO C,
they have been available in GNU libc only since version 2.2.
(Of course, all UNIX operating systems have all the functions described
here.)
</p>
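<p>
The first approach, converting at the boundary, can be sketched as
follows.  The function <tt>mb_toupper()</tt> is a hypothetical example: it
converts a multibyte string into wide characters, processes them, and
converts the result back.  The buffer sizing is an assumption of this
sketch and may be too small for stateful encodings which need long shift
sequences.
</p>

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Return a newly allocated upper-case copy of the multibyte string `in`,
 * or NULL on an invalid sequence or allocation failure.
 * The caller must free() the result.
 */
char *mb_toupper(const char *in)
{
    size_t nbytes = strlen(in);
    /* A string never has more characters than bytes. */
    wchar_t *wide = malloc((nbytes + 1) * sizeof *wide);
    char *out = NULL;
    size_t nwide, i;

    if (wide == NULL)
        return NULL;

    /* multibyte -> wide, immediately after "reading" */
    nwide = mbstowcs(wide, in, nbytes + 1);
    if (nwide != (size_t)-1) {
        /* internal processing on wide characters */
        for (i = 0; i < nwide; i++)
            wide[i] = (wchar_t)towupper((wint_t)wide[i]);

        /* wide -> multibyte, immediately before "writing" */
        {
            size_t outsize = (nwide + 1) * MB_CUR_MAX + 1;

            out = malloc(outsize);
            if (out != NULL && wcstombs(out, wide, outsize) == (size_t)-1) {
                free(out);
                out = NULL;
            }
        }
    }
    free(wide);
    return out;
}
```

<p>
The processing in the middle never looks at bytes, so the same code works
for any encoding the current locale uses.
</p>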

<p>
Note that very simple software such as <tt>echo</tt> doesn't
have to care about multibyte characters and wide characters.
Such software can input and output multibyte characters as they are.
Of course you may modify such software to use wide characters.
It may be good practice for wide character programming.
Examples of source code fragments will be discussed in
<ref id="internal">.
</p>

<p>
There is an explanation of multibyte and wide characters also 
in Ken Lunde's "CJKV Information Processing" (p25).  However,
the explanation is entirely wrong.
</p>

<sect id="locale_unicode"><heading>Unicode and LOCALE technology</heading>

<p>
UTF-8 is considered to be the encoding of the future and
more and more software is coming to support UTF-8.  Though some
software implements UTF-8 directly, I recommend
that you use LOCALE technology to support UTF-8.
</p>

<p>
How can this be achieved?  It is easy!  If you are a developer
of software which has already been written using LOCALE
technology, you don't have to do anything!
</p>

<p>
Using LOCALE technology benefits not only developers but also users.
All a user has to do is set the locale environment properly.
Otherwise, a user has to remember the method to use UTF-8 mode
for each piece of software.  Some software needs a <tt>-u8</tt> switch,
some needs an X resource setting, some needs a <tt>.foobarrc</tt>
file, some needs a special environment variable,
and some uses UTF-8 by default.  It is nonsense!
</p>

<p>
Solaris has already been developed using this model.
Please consult the
<url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT" 
name="Unicode support in the Solaris Operating Environment"> whitepaper.
</p>

<p>
However, it is likely that some upstream developers of
software for which you maintain a Debian package refuse
to use <tt>wchar_t</tt> for some reason, for example, that
they are not familiar with LOCALE programming, that they think
it is troublesome, that they are not keen on I18N, that it is much
easier to modify the software to support UTF-8 than to modify it
to use <tt>wchar_t</tt>, that the software must work even under
a non-internationalized OS such as MS-DOS, and so on.
Some developers may think that support of UTF-8 is sufficient
for I18N.
<footnote>
 In such a case, do they think of abolishing support of 7bit or
 8bit non-multibyte encodings?  If not, it may be unfair that
 speakers of 8bit languages can use both UTF-8 and conventional (local)
 encodings while speakers of languages which need multibyte encodings, combining
 characters, and so on cannot use their popular local encodings.
 I think such software cannot be called "internationalized".
</footnote>
Even in such cases, you can rewrite such software so that it 
checks the <tt>LC_*</tt> and <tt>LANG</tt> environment variables
to emulate the behavior of <tt>setlocale(LC_ALL, "");</tt>.  
You can also rewrite the software to call <tt>setlocale()</tt>,
<tt>nl_langinfo()</tt>, and <tt>iconv()</tt> so that the software
supports all encodings which the OS supports, as discussed later.
Consult
<url id="http://ffii.org/archive/mails/groff/2000/Oct/0056.html"
name="the discussion in the Groff mailing list on the support of
UTF-8 and locale-specific encodings">, mainly held by Werner
LEMBERG, an experienced developer of GNU roff, and Tomohiro KUBOTA,
the author of this document.
</p>



<sect id="iconv"><heading><tt>nl_langinfo()</tt> and <tt>iconv()</tt></heading>

<p>
Though ISO C defines extensive LOCALE-related functions,
you may want more extensive support.  You may also want
conversion between different encodings.
There are C functions which can be used for such purposes.
</p>

<p>
char *<tt>nl_langinfo(</tt>nl_item <em>item</em><tt>)</tt> is 
an XPG5 function to get LOCALE-related information.  You can 
get the following information using the following macros 
for <em>item</em>, which are defined in the <tt>langinfo.h</tt> header file:
<list>
  <item>names for days in week
	(<tt>DAY_1</tt> (Sunday), <tt>DAY_2</tt>, <tt>DAY_3</tt>,
	<tt>DAY_4</tt>, <tt>DAY_5</tt>, <tt>DAY_6</tt>, and <tt>DAY_7</tt>)
  <item>abbreviated names for days in week
	(<tt>ABDAY_1</tt> (Sun), <tt>ABDAY_2</tt>, <tt>ABDAY_3</tt>,
	<tt>ABDAY_4</tt>, <tt>ABDAY_5</tt>, <tt>ABDAY_6</tt>, and
	<tt>ABDAY_7</tt>)
  <item>names for months in year
	(<tt>MON_1</tt> (January), <tt>MON_2</tt>, <tt>MON_3</tt>,
	<tt>MON_4</tt>, <tt>MON_5</tt>, <tt>MON_6</tt>, <tt>MON_7</tt>,
	<tt>MON_8</tt>, <tt>MON_9</tt>, <tt>MON_10</tt>, <tt>MON_11</tt>,
	and <tt>MON_12</tt>)
  <item>abbreviated names for months in year
	(<tt>ABMON_1</tt> (Jan), <tt>ABMON_2</tt>, <tt>ABMON_3</tt>,
	<tt>ABMON_4</tt>, <tt>ABMON_5</tt>, <tt>ABMON_6</tt>, 
	<tt>ABMON_7</tt>, <tt>ABMON_8</tt>, <tt>ABMON_9</tt>, 
	<tt>ABMON_10</tt>, <tt>ABMON_11</tt>, and <tt>ABMON_12</tt>)
  <item>name for AM (<tt>AM_STR</tt>)
  <item>name for PM (<tt>PM_STR</tt>)
  <item>name of era (<tt>ERA</tt>)
  <item>format of date and time (<tt>D_T_FMT</tt>)
  <item>format of date and time (era-based) (<tt>ERA_D_T_FMT</tt>)
  <item>format of date (<tt>D_FMT</tt>)
  <item>format of date (era-based) (<tt>ERA_D_FMT</tt>)
  <item>format of time (24-hour format) (<tt>T_FMT</tt>)
  <item>format of time (am/pm format) (<tt>T_FMT_AMPM</tt>)
  <item>format of time (era-based) (<tt>ERA_T_FMT</tt>)
  <item>radix (<tt>RADIXCHAR</tt>)
  <item>thousands separator (<tt>THOUSEP</tt>)
  <item>alternative characters for numerics (<tt>ALT_DIGITS</tt>)
  <item>affirmative word (<tt>YESSTR</tt>)
  <item>affirmative response (<tt>YESEXPR</tt>)
  <item>negative word (<tt>NOSTR</tt>)
  <item>negative response (<tt>NOEXPR</tt>)
  <item>encoding (<tt>CODESET</tt>)
</list>
For example, you can get the names of months and use them in
your own output algorithm.  <tt>YESEXPR</tt> and
<tt>NOEXPR</tt> are convenient for software which expects a Y/N
answer from users.
</p>
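<p>
As an illustration of <tt>YESEXPR</tt>, the following sketch checks a
user's answer against the affirmative response of the current locale.
<tt>YESEXPR</tt> is a regular expression, so it is used here with the
POSIX functions <tt>regcomp()</tt> and <tt>regexec()</tt>.  The function
name <tt>is_affirmative()</tt> is hypothetical.
</p>

```c
#include <langinfo.h>
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if `answer` is an affirmative response in the current locale,
 * 0 if it is not, and -1 if the locale's pattern cannot be compiled.
 */
int is_affirmative(const char *answer)
{
    regex_t re;
    int rc;

    /* YESEXPR is an extended regular expression such as "^[yY]". */
    if (regcomp(&re, nl_langinfo(YESEXPR), REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, answer, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

<p>
After <tt>setlocale(LC_ALL, "");</tt>, the same code accepts whatever the
user's locale defines as "yes", without the software knowing the language.
</p>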

<p>
<tt>iconv_open()</tt>, <tt>iconv()</tt>, and <tt>iconv_close()</tt>
are functions which perform conversion between encodings.
Please consult their manpages.
</p>

<p>
By combining <tt>nl_langinfo()</tt> and <tt>iconv()</tt>,
you can easily turn Unicode-enabled software into locale-sensitive,
truly internationalized software.
</p>

<p>
First, add the line <tt>setlocale(LC_ALL, "");</tt> at the
beginning of the software.  If it returns non-NULL, enable the UTF-8 mode
of the software.
<example>
int conversion = FALSE;
char *locale = setlocale(LC_ALL, "");
   :
   :
(original code to determine UTF-8 mode or not)
   :
   :
/* The locale is available but UTF-8 mode was not requested:
   work in UTF-8 internally and convert at input and output. */
if (locale != NULL &amp;&amp; utf8_mode == FALSE) {
    utf8_mode = TRUE;
    conversion = TRUE;
}
</example>
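Then modify the input routine as follows:
<example>
#define INTERNALCODE "UTF-8"
if (conversion == TRUE) {
    /* convert from the locale's encoding into the internal encoding */
    char *fromcode = nl_langinfo(CODESET);
    iconv_t conv = iconv_open(INTERNALCODE, fromcode);
    if (conv == (iconv_t)-1) {
        (error handling: this conversion is not supported)
    }
    (reading and conversion...)
    iconv_close(conv);
} else {
    (original reading routine)
}
</example>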
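Finally, modify the output routine as follows:
<example>
if (conversion == TRUE) {
    /* convert from the internal encoding into the locale's encoding */
    char *tocode = nl_langinfo(CODESET);
    iconv_t conv = iconv_open(tocode, INTERNALCODE);
    if (conv == (iconv_t)-1) {
        (error handling: this conversion is not supported)
    }
    (conversion and writing...)
    iconv_close(conv);
} else {
    (original writing routine)
}
</example>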
Note that the whole input should be read at once (or any incomplete
trailing bytes should be carried over to the next conversion), since
otherwise you may split a multibyte character.
You can consult the <tt>iconv_prog.c</tt> file
in the distribution of GNU libc for the usage of <tt>iconv()</tt>.
</p>
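<p>
The "reading and conversion" step can be sketched for the simplest case,
a whole string held in memory.  The function <tt>convert_string()</tt>
is a hypothetical helper written for this document, and the encoding names
passed to it are whatever the local <tt>iconv_open()</tt> accepts.  A real
input routine would also have to handle input which arrives in pieces.
</p>

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert the NUL-terminated string `in` from encoding `fromcode` to
 * encoding `tocode`, writing at most outsize bytes (including the
 * terminating NUL) to `out`.
 * Returns the number of bytes written, not counting the NUL,
 * or (size_t)-1 on any error.
 */
size_t convert_string(const char *tocode, const char *fromcode,
                      const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() wants char **; input is not modified */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return (size_t)-1;
    outleft = outsize - 1;      /* keep room for the terminating NUL */

    cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* conversion not supported */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)       /* write any pending shift sequence */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    *outp = '\0';
    return (size_t)(outp - out);
}
```

<p>
In the input routine above, <tt>tocode</tt> would be <tt>INTERNALCODE</tt>
and <tt>fromcode</tt> the result of <tt>nl_langinfo(CODESET)</tt>; the
output routine swaps the two.
</p>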

<p>
Though <tt>nl_langinfo()</tt> is a standard function of XPG5
and GNU libc supports it, it is not very portable.  Moreover,
there is no standard for the encoding names used by
<tt>nl_langinfo()</tt> and <tt>iconv_open()</tt>.
If this is a problem, you can use Bruno Haible's 
<url id="http://clisp.cons.org/~haible/packages-libiconv.html"
name="libiconv">.  It has <tt>iconv()</tt>, <tt>iconv_open()</tt>,
and <tt>iconv_close()</tt>.  In addition, it has <tt>locale_charset()</tt>,
a replacement for <tt>nl_langinfo(CODESET)</tt>.
</p>


<sect id="locale-limit"><heading>Limit of Locale technology</heading>

<P>
The locale model has a limit: it cannot handle two locales at
the same time.  In particular, it cannot handle the relationship
between two locales at all.
</P>

<P>
For example, EUC-JP, ISO 2022-JP, and Shift-JIS are popular encodings
in Japan.  EUC-JP is the de-facto standard for UNIX systems,
ISO 2022-JP is the standard for Internet, and Shift-JIS is the
encoding for Windows and Macintosh.  Thus, Japanese people have to
handle texts in all of these encodings.  Text viewers such as <tt>jless</tt>
and <tt>lv</tt> and editors such as <tt>emacs</tt> can automatically
detect the encoding of the text they read.  You cannot write such software
using locale technology.
</P>



<chapt id="output"><heading>Output to Display</heading>

<P>
Here 'Output to Display' does not mean translation of messages using 
<prgn>gettext</prgn>.
I am concerned with whether characters are output correctly so that
we can read them.  For example, install the <package>libcanna1g</package> 
package and display
<tt>/usr/doc/libcanna1g/README.jp.gz</tt> on the console or in <prgn>xterm</prgn>
 (after ungzipping it, of course).
This text file is written in Japanese, but even Japanese
people cannot read such a row of strange characters.  If you were a
Japanese speaker, which would you prefer: an English message which can be read
with a dictionary, or a row of strange characters which is 
the result of <prgn>gettext</prgn>ization?  
<footnote>
(Yes, there <em>are</em> ways to display Japanese characters 
correctly -- <prgn>kon</prgn> (in <package>kon2</package> package)
for console and <prgn>kterm</prgn> for X, and Japanese people are 
happy with <prgn>gettext</prgn>ized Japanese messages.)
</footnote>
</P>

<P>
Problems in displaying non-English (non-ASCII) characters 
are discussed below.
</P>



<sect id="output-console"><heading>Console Software</heading>

<P>
In this section, problems in displaying characters on the
<strong>console</strong> are discussed.
<footnote>
This section is not about developing consoles themselves;
it is about developing software which runs
on a console.
</footnote>
Here, 'console' includes the bare <strong>Linux console</strong>, both the
framebuffer and the conventional version, special consoles such as 
<strong>kon2</strong>, <strong>jfbterm</strong>, <strong>chdrv</strong>, 
and so on, which are implemented by special software, and X terminal emulators
such as <strong>xterm</strong>, <strong>kterm</strong>,
<strong>hanterm</strong>, <strong>xiterm</strong>, <strong>rxvt</strong>,
<strong>xvt</strong>, <strong>gnome-terminal</strong>, 
<strong>wterm</strong>, <strong>aterm</strong>, <strong>eterm</strong>,
and so on.  Remote terminals connected via telnet or secure shell, such as 
<strong>NCSA telnet</strong> for Macintosh and <strong>Tera Term</strong>
for Windows, are also regarded as consoles.
</P>

<P>
A console has the following characteristics:
<list>
  <item>All that software has to do is to send correctly encoded text
        to standard output.  Software running on a console doesn't need to
	care about fonts and so on.
  <item>Fixed-width fonts are used.  The unit of width
	is called a 'column'.  'Doublewidth' fonts, i.e.,
	fonts whose width is 2 columns, are used for CJK ideograms, 
	Japanese Hiragana and Katakana, Korean Hangul, and related 
	symbols.  Combining characters, used for Thai and so on, can be 
	regarded as 'zero'-column characters.  
</list>
</P>

<sect1 id="output-console-code"><heading>Encoding</heading>

<P>
Software running on the console is not responsible for displaying characters.
The console itself is responsible.  There are consoles
which can display encodings other than ASCII, such as
<taglist>
 <tag>kon in kon2 package
      <item>EUC-JP, Shift-JIS, and ISO 2022-JP
 <tag>jfbterm
      <item>EUC-JP, ISO 2022-JP, and ISO 2022 (including any 94, 96,
            and 94x94 coded character sets whose fonts are available)
 <tag>kterm
      <item>EUC-JP, Shift-JIS, ISO 2022-JP, and ISO 2022 (including
            ISO8859-{1,2,3,4,5,6,7,8,9}, JISX 0201, JISX 0208, JISX 0212,
	    GB 2312, and KSC 5601)
 <tag>krxvt in rxvt-ml package
      <item>EUC-JP
 <tag>crxvt-gb in rxvt-ml package
      <item>CN-GB
 <tag>crxvt-big5 in rxvt-ml package
      <item>Big5
 <tag>cxtermb5 in cxterm-big5 package
      <item>Big5
 <tag>xcinterm-big5 in xcin package
      <item>Big5
 <tag>xcinterm-gb in xcin package
      <item>CN-GB
 <tag>xcinterm-gbk in xcin package
      <item>GBK
 <tag>xcinterm-big5hkscs in xcin package
      <item>Big5 with HKSCS
 <tag>hanterm
      <item>EUC-KR, Johab, and ISO 2022-KR
 <tag>xiterm and txiterm in xiterm+thai package
      <item>TIS 620
 <tag>xterm
      <item>UTF-8
</taglist>
However, there is no way for software running on a console to find out which
encoding the console supports.  I think it is the user's responsibility
to set the LC_CTYPE locale properly (i.e., the LC_ALL, LC_CTYPE, or LANG 
environment variable).  Provided the LC_CTYPE locale is set properly,
software can use it to learn which encoding is supported
by the console.
</P>
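
<P>
As a minimal sketch of this idea, software can ask the locale for
the name of its encoding.  <tt>console_codeset()</tt> is a
hypothetical helper name, and the spelling of the returned name
(for example "UTF-8" or "EUC-JP") depends on the system.
<example>
#include &lt;langinfo.h&gt;
#include &lt;locale.h&gt;

/* console_codeset() is a hypothetical helper.  It returns the name of
 * the encoding selected by the user's LC_CTYPE locale, which is assumed
 * to match the encoding of the console. */
const char *console_codeset(void)
{
	/* Pick up LC_ALL, LC_CTYPE, or LANG from the environment. */
	setlocale(LC_CTYPE, "");
	return nl_langinfo(CODESET);
}
</example>
</P>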

<P>
As for messages translated by <prgn>gettext</prgn>,
the software does not need to do anything special.  It works well if the
user sets the LC_CTYPE and LC_MESSAGES locales properly.
</P>

<P>
If you are handling a string in a non-ASCII encoding (using
multibyte characters, UTF-8 directly, and so on), you will have
to take care of points which you don't have to care about when
using ASCII.
<list>
  <item>8-bit cleanness.  I think everyone understands this.
  <item>Continuity of multibyte characters.  In multibyte encodings
        such as EUC-JP and UTF-8, one character may consist
        of two or more bytes.  These bytes should be output
        contiguously.  Inserting additional codes between these
	bytes can break the character.  I have seen
	software which outputs a cursor-location control code every time
	it outputs one byte.  This breaks multibyte characters.
</list>
</P>

<sect1 id="output-console-column"><heading>Number of Columns</heading>

<P>
Internationalized console software cannot assume that a character
always occupies one column.  You can get the number of columns occupied by a
character or by a string using <tt>wcwidth()</tt> and
<tt>wcswidth()</tt>, respectively.  Note that you have to use 
<tt>wchar_t</tt>-style programming, since these functions take
<tt>wchar_t</tt> parameters.
</P>
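
<P>
For software that still keeps its strings in multibyte form, a
helper along the following lines can obtain the width.  This is a
sketch only: <tt>display_width()</tt> is a hypothetical name, and the
string is assumed to be valid in the current locale.
<example>
#define _XOPEN_SOURCE 700	/* for wcswidth() */
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;wchar.h&gt;

/* display_width() is a hypothetical helper.  It returns the number of
 * columns occupied by the multibyte string s, or -1 if s is invalid in
 * the current locale or contains a non-printable character. */
int display_width(const char *s)
{
	/* The number of characters never exceeds the number of bytes. */
	size_t n = strlen(s) + 1;
	wchar_t *ws = malloc(n * sizeof(wchar_t));
	size_t len;
	int width;

	if (ws == NULL) return -1;
	len = mbstowcs(ws, s, n);
	if (len == (size_t)-1) {
		free(ws);
		return -1;
	}
	width = wcswidth(ws, len);
	free(ws);
	return width;
}
</example>
</P>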

<P>
Additional care has to be taken not to destroy multicolumn
characters.  For example, imagine your software has displayed a
double-column character at (row, column) = (1, 1).  What will happen
when your software then displays a single-column character at (row, column) 
= (1, 2) or at (1, 1)?  Will the single-column character erase
half of the double-column character?  Nobody knows the answer.
It depends on the implementation of the console.  All I can
say is that your software should avoid such cases.
</P>

<P>
If your software reads a string from the keyboard, you will have to
take even more care.  The numbers of characters, bytes, and columns
all differ.  For example, in UTF-8, the character
'a' with acute accent occupies two bytes and one column.  A
CJK ideograph occupies three bytes and two columns.
So if the user presses 'Backspace', how many backspace
codes (0x08) should the software output?  How many bytes should
the software erase from its internal buffer?
Don't be nervous; you can use <tt>wchar_t</tt>, which guarantees that
one character always occupies one <tt>wchar_t</tt>, and you can
use <tt>wcwidth()</tt> to get the number of columns.
Note that control codes such as 'backspace' (0x08) are
always column-oriented.  A backspace moves back 'one' column even if the
character at that position is a doublewidth character.
</P>
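
<P>
The backspace handling described above might be sketched like this,
assuming the line being edited is kept as an array of <tt>wchar_t</tt>.
<tt>erase_last_char()</tt> is a hypothetical name; the caller is
expected to erase the returned number of columns on the screen.
<example>
#define _XOPEN_SOURCE 700	/* for wcwidth() */
#include &lt;stddef.h&gt;
#include &lt;wchar.h&gt;

/* erase_last_char() is a hypothetical helper.  It removes the last
 * character from the wide-character buffer buf, which holds *len
 * characters, and returns the number of columns to erase on the
 * screen.  It returns 0 if the buffer is already empty. */
int erase_last_char(wchar_t *buf, size_t *len)
{
	int width;

	if (*len == 0) return 0;
	(*len)--;			/* one character is one wchar_t */
	width = wcwidth(buf[*len]);
	buf[*len] = L'\0';
	/* wcwidth() returns -1 for non-printable characters. */
	return width &gt; 0 ? width : 0;
}
</example>
</P>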


<sect id="output-x"><heading>X Clients</heading>

<P>
The way to develop X clients can differ drastically depending on
the toolkit used.  First, Xlib-style programming is
discussed, since Xlib is the foundation of all other toolkits.
Then a few toolkits are discussed.
</P>

<sect1 id="output-x-xlib"><heading>Xlib programming</heading>

<P>
X itself is already internationalized.  X11R5 introduced 
the idea of a 'fontset' for internationalized text output.
Thus all that X clients have to do is to use the 'fontset'-related
functions.
</P>

<P>
The most important part of internationalizing text display
in X clients is to use the internationalized 
<strong>XFontSet</strong>-related functions, introduced in
X11R5, instead of the conventional <strong>XFontStruct</strong>-related
functions.
</P>

<P>
The main feature of XFontSet is that it can handle multiple fonts
at the same time.  This is related to the distinction between
coded character set (CCS) and character encoding scheme (CES)
which I described in <ref id="coding-general-term">.
Some encodings in the world use multiple coded character
sets at the same time.  This is the reason we have to handle 
multiple X fonts at the same time.
<footnote>
Though UTF-8 is an encoding with single CCS, the current
version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8.
</footnote>
</P>

<P>
Another significant feature of XFontSet is that it is 
locale (LC_CTYPE)-sensitive.  This means that you have to
call <tt>setlocale()</tt> before you use XFontSet-related
functions.  Moreover, you have to specify the string you want
to draw as a multibyte string or a wide-character string.
</P>

<P>
In the conventional <tt>XFontStruct</tt> model, an X client 
opens a font using <tt>XLoadQueryFont()</tt>, draws a string
using <tt>XDrawString()</tt>, and closes the font using
<tt>XFreeFont()</tt>.  In the internationalized
<tt>XFontSet</tt> model, on the other hand, an X client opens a fontset using
<tt>XCreateFontSet()</tt>, draws a string using <tt>XmbDrawString()</tt>,
and closes the fontset using <tt>XFreeFontSet()</tt>.
The following is a concise list of substitutions.
<list>
  <item><tt>XFontStruct</tt> -&gt; <tt>XFontSet</tt>
  <item><tt>XLoadQueryFont()</tt> -&gt; <tt>XCreateFontSet()</tt>
  <item><tt>XFreeFont()</tt> -&gt; <tt>XFreeFontSet()</tt>
  <item><tt>XDrawString()</tt> and <tt>XDrawString16()</tt>
        -&gt; <tt>XmbDrawString()</tt> or <tt>XwcDrawString()</tt>
  <item><tt>XDrawImageString()</tt> and <tt>XDrawImageString16()</tt>
        -&gt; <tt>XmbDrawImageString()</tt> or 
	<tt>XwcDrawImageString()</tt>
</list>
Note that <tt>XFontStruct</tt> is usually used through a pointer, while 
<tt>XFontSet</tt> is itself a pointer type.
</P>

<P>
Some people (speakers of ISO-8859-1 languages) may think that 
<tt>XFontSet</tt>-related functions are not 8-bit clean.  
This is wrong.  <tt>XFontSet</tt>-related
functions work according to the <tt>LC_CTYPE</tt> locale.  The default 
LC_CTYPE locale uses ASCII.  Thus, if a user sets none of the <tt>LANG</tt>,
<tt>LC_CTYPE</tt>, and <tt>LC_ALL</tt> environment variables, 
<tt>XFontSet</tt>-related functions will use ASCII and therefore will not
be 8-bit clean.  The user has to set the <tt>LANG</tt>, <tt>LC_CTYPE</tt>, or 
<tt>LC_ALL</tt> environment variable properly (for example, 
<tt>LANG=en_US</tt>).
</P>

<P>
The upstream developers of X clients sometimes dislike forcing
users to set such environment variables.
<footnote>
 IMHO, all users will have to set LANG properly once UTF-8
 becomes popular.
</footnote>
In such a case,
the X client should have two ways to output text, i.e., the
<tt>XFontStruct</tt>-related conventional way and the
<tt>XFontSet</tt>-related internationalized way.  
If <tt>setlocale()</tt> returns <tt>NULL</tt>, <tt>"C"</tt>, 
or <tt>"POSIX"</tt>, use the
<tt>XFontStruct</tt> way.  Otherwise, use the <tt>XFontSet</tt> way. 
The author implemented this algorithm in a few window managers
such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0),
sawmill (0.28), and so on.
</P>
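
<P>
The decision described above could be written as a small function.
This is a sketch, not the code actually used in those window managers;
<tt>use_fontset()</tt> is a hypothetical name, and it would be called
as <tt>use_fontset(setlocale(LC_ALL, ""))</tt>.
<example>
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* use_fontset() is a hypothetical helper.  locale_name is the value
 * returned by setlocale(LC_ALL, "").  It returns 1 if the XFontSet way
 * should be used, or 0 if the client should fall back to the
 * conventional XFontStruct way. */
int use_fontset(const char *locale_name)
{
	if (locale_name == NULL) return 0;	/* setlocale() failed */
	if (strcmp(locale_name, "C") == 0) return 0;
	if (strcmp(locale_name, "POSIX") == 0) return 0;
	return 1;
}
</example>
</P>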

<P>
Window managers need more modifications related to inter-clients
communication.  This topic will be described later.
</P>

<sect1 id="output-x-aw"><heading>Athena widgets</heading>

<P>
The Athena widget set is already internationalized.
</P>

<P>***** Not written yet *****</P>

<sect1 id="output-x-gtk"><heading>Gtk and Gnome</heading>

<P>
Gtk is already internationalized.
</P>

<P>***** Not written yet *****</P>

<sect1 id="output-x-qt"><heading>Qt and KDE</heading>

<P>
Though an internationalized version of Qt had been available for a long
time, it could not become the official version of Qt.  The Qt license
of those days prohibited distribution of the internationalized
version.  However, Troll Tech at last changed their mind
and Qt's license, and now the official version of Qt is 
internationalized.
</P>

<P>***** Not written yet *****</P>

<chapt id="input"><heading>Input from Keyboard</heading>

<P>
It is obvious that a text editor needs the ability to input text 
from the keyboard; otherwise the text editor is entirely useless.
Similarly, an internationalized text editor needs the ability to input
characters used in various languages.  Other software, such
as shells, libraries such as readline, environments such as
consoles and X terminal emulators, scripting languages such as perl, 
tcl/tk, python, and ruby, and applications such as 
word processors, drawing and painting programs, file managers such as 
Midnight Commander, web browsers, mailers, and so on,
also needs the ability to input internationalized text.  Otherwise
this software is entirely useless.
</P>

<P>
There are various languages in the world.  Thus, proper input methods
vary from language to language.
<list>
  <item>Some languages such as English don't need any special input
        method.  Every character of the language can be input
	with a single key on a keyboard.  The keymap is all that a user
	has to care about.
  <item>Some other languages such as German need a simple extension.
        For example, u with umlaut can be input with the two strokes
	':' and 'u'.  A way to switch between the ordinary input mode (the key
	strokes ':' and 'u' input ':' and 'u') and the extended
	input mode (the key strokes ':' and 'u' produce u with umlaut)
	has to be supplied.  Almost all languages in the world can be
	input with this method.
  <item>Other languages such as Chinese and Japanese need a complicated
	input method, since they use thousands of characters.
	Developing a clever input method is a very difficult and
	challenging problem, and a few companies are developing Japanese
	input methods.  Typical Japanese input methods are shipped
	with tens of megabytes of conversion dictionaries.
	It is often very troublesome to set up an input method for
	these languages.
	<footnote>
	 This is a field where proprietary systems such as MS Windows 
	 and Macintosh are much easier than free systems such as
	 Debian and FreeBSD.
	</footnote>
	You also need practice to use 
	these input methods.
</list>
Different technologies are used for these languages.
The aim of this chapter is to introduce these technologies.
</P>


<sect id="input-console"><heading>Non-X Software</heading>

<P>
Ideally, it is the responsibility of consoles and X terminal emulators
to supply an input method.  This is already the case for
simple languages which don't need complicated input methods.
Thus, non-X software doesn't need to care about input methods.
</P>

<P>
There are a few Debian packages for consoles and X terminal 
emulators which supply input methods for particular languages.
<taglist>
  <tag><strong>xiterm</strong> in xiterm+thai package
       <item>Thai characters
  <tag><strong>hanterm</strong>
       <item>Korean Hangul
  <tag><strong>cxtermb5</strong> in cxterm-big5 package
       <item>Big5 traditional Chinese ideograms
  <tag><strong>cce</strong>
       <item>CN-GB simplified Chinese ideograms
</taglist>
Moreover, there is some software which supplies input methods for
an existing console environment.
<taglist>
  <tag><strong>skkfep</strong>
       <item>Japanese (needs SKK as a conversion engine)
  <tag><strong>uum</strong>
       <item>Japanese (needs Wnn as a conversion engine; not
             available as a Debian package)
  <tag><strong>canuum</strong>
       <item>Japanese (needs Canna as a conversion engine; not
             available as a Debian package)
</taglist>
However, since input methods for complex languages have historically not been
available, some non-X software has been developed
with built-in input methods.
<taglist>
   <tag><strong>jvim-canna</strong>
        <item>A text editor which can input Japanese (needs Canna
	      as a conversion engine.)
   <tag><strong>jed-canna</strong>
        <item>A text editor which can input Japanese (needs Canna
	      as a conversion engine.)
   <tag><strong>nvi-m17n-canna</strong>
        <item>A text editor which can input Japanese (needs Canna
	      as a conversion engine.)
</taglist>
</P>

<P>
You have to take care of the differences between the numbers of
<em>characters</em>, <em>columns</em>, and <em>bytes</em>.
For example, you can immediately see that <prgn>bash</prgn>
cannot handle UTF-8 input properly when you invoke <prgn>bash</prgn>
in a UTF-8 xterm and press the BackSpace key.  This is because
<prgn>readline</prgn> always erases one column on the screen
and one byte in the internal buffer for each stroke of the 'BackSpace'
key.  To solve this problem, <strong>wide characters</strong>
should be used for internal processing.  One stroke of 'BackSpace'
should erase <tt>wcwidth()</tt> columns on the screen and
one <tt>wchar_t</tt> unit in the internal buffer.
</P>


<sect id="input-x"><heading>X Software</heading>

<P>
X11R5 was the first internationalized version of the X Window System.
However, X11R5 supplied two sample implementations of international
text input, <strong>Xsi</strong> and <strong>Ximp</strong>.
The existence of two different protocols was an annoying situation.
X11R6 then adopted <strong>XIM</strong>, a new protocol
for internationalized text input, as the standard.  Internationalized
X software should support text input using XIM.
</P>

<P>
These protocols are designed on a <em>server-client</em> model.
The client calls the server when necessary.  The server
converts key strokes into internationalized text.
</P>

<P>
<strong>Kinput</strong> and <strong>kinput2</strong>
are protocols for Japanese text input which existed before X11R5.
Some software such as <prgn>kterm</prgn> supports the
kinput2 protocol.  <prgn>kinput2</prgn> is the server software.
Since the current version of <prgn>kinput2</prgn> supports the XIM protocol,
you don't need to support the kinput protocol.
</P>

<sect1 id="input-x-devel"><heading>Developing XIM clients</heading>

<P>***** Not written yet *****</P>

<P>
Developing an XIM client is a bit complicated.  You can study the
source code of <prgn>rxvt</prgn> and <prgn>xedit</prgn> as
examples.
</P>

<sect1 id="input-x-examples"><heading>Examples of XIM Software</heading>

<P>
The following are examples of software which can work as XIM clients.
<list>
  <item>X terminal emulators such as <prgn>krxvt</prgn>, <prgn>kterm</prgn>,
        and so on.
  <item>Text editors such as <prgn>xedit</prgn>, <prgn>gedit</prgn>, and
        so on.
  <item>The web browser <prgn>mozilla</prgn>.
</list>
The following are examples of software which can work as XIM servers.
<list>
  <item><prgn>kinput2</prgn> and <prgn>skkinput</prgn> for Japanese.
</list>
</P>

<sect id="input-emacs"><heading>Emacsen</heading>

<P>
<strong>GNU Emacs</strong> and <strong>XEmacs</strong> take
an entirely different model for international input.
</P>

<P>
They supply their own input methods for various languages.
Instead of relying on the console or XIM, they use these built-in input
methods.  An input method can be selected with the
<tt>M-x set-input-method</tt> command.  The selected input
method can be switched on and off with the <tt>M-x toggle-input-method</tt>
command.
</P>

<P>
GNU Emacs supplies input methods for
British, Catalan, 
Chinese (array30, 4corner, b5-quick, cns-quick, cns-tsangchi,
ctlau, ctlaub, ecdict, etzy, punct, punct-b5, py, py-b5,
py-punct, py-punct-b5, qj, qj-b5, sw, tonepy, ziranma, zozy),
Czech, Danish, Devanagari, Esperanto,
Ethiopic, Finnish, French, German, Greek, Hebrew, Icelandic,
IPA, Irish, Italian, Japanese (egg-wnn, skk), 
Korean (hangul, hangul3, hanja, hanja3), 
Lao, Norwegian, Portuguese, Romanian, Scandinavian,
Slovak, Spanish, Swedish, Thai, Tibetan, Turkish, Vietnamese,
Latin-{1,2,3,4,5},
Cyrillic (beylorussian, jcuken, jis-russian, macedonian,
serbian, transit, transit-bulgarian, ukrainian, yawerty),
and so on.
</P>











<chapt id="internal"><heading>Internal Processing and File I/O</heading>

<P>
There are many text-processing programs, such as
<prgn>grep</prgn>,
<prgn>groff</prgn>,
<prgn>head</prgn>,
<prgn>sort</prgn>,
<prgn>wc</prgn>,
<prgn>uniq</prgn>,
<prgn>nl</prgn>,
<prgn>expand</prgn>,
and so on.
There are also many scripting languages which are often used for
text processing, such as
<prgn>sed</prgn>,
<prgn>awk</prgn>,
<prgn>perl</prgn>,
<prgn>python</prgn>,
<prgn>ruby</prgn>,
and so on.
All of this software needs to be internationalized.
</P>

<P>
From a user's point of view, software can use any internal encoding
as long as I/O is done correctly, because a user cannot tell
which internal encoding the software uses.
</P>

<P>
There are two candidates for the internal encoding.  One is
<strong>wide character</strong> and the other is <strong>UCS-4</strong>.
You can also use a Mule-type encoding, where a unit consists of a pair
of numbers: one to express the CCS and one to express the character.
</P>

<P>
I recommend using <em>wide characters</em>, for the reasons I already
explained in <ref id="locale">: wide characters are
encoding-independent and can support various encodings in the
world including UTF-8, they give users a common, unified way
to choose encodings, and so on.
</P>

<P>
Here are a few examples of how to handle <tt>wchar_t</tt>.
</P>


<sect id="internal-stream"><heading>Stream I/O of Characters</heading>

<P>
The following program is a small example of stream I/O of wide characters.
<example>
#include &lt;stdio.h&gt;
#include &lt;wchar.h&gt;
#include &lt;locale.h&gt;
int main(void)
{
	wint_t c;

	setlocale(LC_ALL, "");
	while(1) {
		c = getwchar();
		if (c == WEOF) break;
		putwchar(c);
	}
	return 0;
}
</example>
I think you can easily imagine the corresponding version using <tt>char</tt>.
Since this program does not do any character manipulation, an
ordinary <tt>char</tt> version would work just as well.
</P>

<P>
There are a few points to note.  First, never forget to call
<tt>setlocale()</tt>.  Second, <tt>putwchar()</tt>,
<tt>getwchar()</tt>, and <tt>WEOF</tt> are the replacements for 
<tt>putchar()</tt>, <tt>getchar()</tt>, and <tt>EOF</tt>, respectively.
Third, use <tt>wint_t</tt> instead of <tt>int</tt> for the return value
of <tt>getwchar()</tt>.
</P>


<sect id="internal-wc"><heading>Character Classification</heading>

<P>
Here is an example of character classification using <tt>wchar_t</tt>.
First, this is the non-internationalized version.
<example>
/*
 *  wc.c
 *
 *  Word Counter
 *
 */

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;ctype.h&gt;

int main(int argc, char **argv)
{
	int n, d=0, c=0, w=0, l=0;

	while ((n=getchar()) != EOF) {
		c++;
		if (isdigit(n)) d++;
		/* words are approximated by counting separators */
		if (strchr(" \t\n", n)) w++;
		if (n == '\n') l++;
	}

	printf("%d characters, %d digits, %d words, and %d lines\n",
		c, d, w, l);
	return 0;
}
</example>
Here is the internationalized version.
<example>
/*
 *  wc-i.c
 *
 *  Word Counter (internationalized version)
 *
 */

#include &lt;stdio.h&gt;
#include &lt;locale.h&gt;
#include &lt;wchar.h&gt;
#include &lt;wctype.h&gt;

int main(int argc, char **argv)
{
	int d=0, c=0, w=0, l=0;
	wint_t n;

	setlocale(LC_ALL, "");

	/* getwchar() signals the end of input with WEOF, not EOF */
	while ((n=getwchar()) != WEOF) {
		c++;
		if (iswdigit(n)) d++;
		/* words are approximated by counting separators */
		if (wcschr(L" \t\n", n)) w++;
		if (n == L'\n') l++;
	}

	printf("%d characters, %d digits, %d words, and %d lines\n",
		c, d, w, l);
	return 0;
}
</example>
</P>

<P>
This example shows that <tt>iswdigit()</tt> is used instead of
<tt>isdigit()</tt> and <tt>wcschr()</tt> instead of <tt>strchr()</tt>.
In addition, <tt>L"string"</tt> and <tt>L'char'</tt>
are used to write wide character strings and wide characters.
</P>

<sect id="internal-length"><heading>Length of String</heading>

<P>
The following is a sample program to obtain the length of the
input string.  Note that it does not distinguish the number of bytes,
the number of characters, and the number of columns.
<example>
/* length.c
 *
 * a sample program to obtain the length of the input string
 * NOT INTERNATIONALIZED
 */

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int main(int argc, char **argv)
{
	int len;

	if (argc &lt; 2) {
		printf("Usage: %s [string]\n", argv[0]);
		return 0;
	}
	
	printf("Your string is: \"%s\".\n", argv[1]);

	len = strlen(argv[1]);
	printf("Length of your string is: %d bytes.\n", len);
	printf("Length of your string is: %d characters.\n", len);
	printf("Width of your string is: %d columns.\n", len);
	return 0;
}
</example>
</P>

<P>
The following is an internationalized version of the program
using wide characters.
<example>
/* length-i.c
 *
 * a sample program to obtain the length of the input string
 * INTERNATIONALIZED
 */

#define _XOPEN_SOURCE 700	/* for wcswidth() */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;locale.h&gt;
#include &lt;wchar.h&gt;

int main(int argc, char **argv)
{
	size_t len, n;
	wchar_t *wp;

	/* All software using locale should write this line */
	setlocale(LC_ALL, "");

	if (argc &lt; 2) {
		printf("Usage: %s [string]\n", argv[0]);
		return 0;
	}
	
	printf("Your string is: \"%s\".\n", argv[1]);

	/* The concept of 'byte' is universal. */
	len = strlen(argv[1]);
	printf("Length of your string is: %d bytes.\n", (int)len);

	/* To obtain number of characters, it is the easiest way */
	/* to convert the string into wide string.  The number of */
	/* characters is equal to the number of wide characters. */
	/* It does not exceed the number of bytes, so len + 1 wide */
	/* characters (including the terminator) are always enough. */
	n = len + 1;
	wp = malloc(n * sizeof(wchar_t));
	if (wp == NULL) return 1;
	len = mbstowcs(wp, argv[1], n);
	if (len == (size_t)-1) {
		fprintf(stderr, "Invalid multibyte string.\n");
		free(wp);
		return 1;
	}
	printf("Length of your string is: %d characters.\n", (int)len);

	printf("Width of your string is: %d columns.\n", wcswidth(wp, len));

	free(wp);
	return 0;
}
</example>
</P>

<P>
This program counts multibyte characters correctly.
Of course, the user has to set the LANG variable properly.
</P>

<P>
For example, on UTF-8 xterm...
<example>
$ export LANG=ko_KR.UTF-8
$ ./length-i (a Hangul character)
Your string is: "(the character)".
Length of your string is: 3 bytes.
Length of your string is: 1 characters.
Width of your string is: 2 columns.
</example>
</P>



<sect id="internal-extract"><heading>Extraction of Characters</heading>

<P>
The following program extracts all characters contained in the given
string.
<example>
/* extract.c
 *
 * a sample program to extract each character contained in the string
 * not internationalized
 */

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int main(int argc, char **argv)
{
	char *p;
	int c;

	if (argc &lt; 2) {
		printf("Usage: %s [string]\n", argv[0]);
		return 0;
	}
	
	printf("Your string is: \"%s\".\n", argv[1]);

	c = 0;
	for (p=argv[1] ; *p ; p++) {
		printf("Character #%d is \"%c\".\n", ++c, *p);
	}
	return 0;
}
</example>
Using wide characters, the program can be rewritten as follows.
<example>
/* extract-i.c
 *
 * a sample program to extract each character contained in the string
 * INTERNATIONALIZED
 */

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;locale.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;wchar.h&gt;

int main(int argc, char **argv)
{
	wchar_t *wp;
	char *p;
	size_t c, n, len;

	/* Don't forget. */
	setlocale(LC_ALL, "");

	if (argc &lt; 2) {
		printf("Usage: %s [string]\n", argv[0]);
		return 0;
	}
	
	printf("Your string is: \"%s\".\n", argv[1]);

	/* To obtain each character of the string, it is easy to convert */
	/* the string into a wide string and re-convert each of the wide */
	/* characters into a multibyte character. */
	/* n counts wide characters (not bytes), including the terminator. */
	n = strlen(argv[1]) + 1;
	wp = malloc(n * sizeof(wchar_t));
	/* MB_CUR_MAX depends on the locale; use it after setlocale(). */
	p = malloc(MB_CUR_MAX + 1);
	if (wp == NULL || p == NULL) return 1;
	len = mbstowcs(wp, argv[1], n);
	if (len == (size_t)-1) {
		fprintf(stderr, "Invalid multibyte string.\n");
		return 1;
	}
	for (c=0; c&lt;len; c++) {
		/* re-convert from wide character to multibyte character */
		int x;
		x = wctomb(p, wp[c]);
		if (x &lt; 0) continue;
		/* One multibyte character may be two or more bytes. */
		/* Thus "%s" is used instead of "%c". */
		p[x] = 0;
		printf("Character #%d is \"%s\" (%d byte(s)) \n", (int)c + 1, p, x);
	}

	free(p);
	free(wp);
	return 0;
}
</example>
</P>

<P>
Note that this program doesn't work well if the multibyte encoding
is stateful.
</P>
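
<P>
When only the characters themselves are needed, one possible
approach is to decode the string sequentially with the restartable
function <tt>mbrtowc()</tt>, which keeps the shift state in an
<tt>mbstate_t</tt> object between calls.  The following is a minimal
sketch; <tt>count_chars()</tt> is a hypothetical name.
<example>
#include &lt;string.h&gt;
#include &lt;wchar.h&gt;

/* count_chars() is a hypothetical helper.  It counts the characters in
 * the multibyte string s by decoding it from the beginning.  The
 * mbstate_t object carries the shift state from one call to the next.
 * It returns -1 if s contains an invalid or truncated character. */
long count_chars(const char *s)
{
	mbstate_t state;
	size_t left = strlen(s);
	long count = 0;

	memset(&amp;state, 0, sizeof state);	/* initial shift state */
	while (left &gt; 0) {
		wchar_t wc;
		size_t used = mbrtowc(&amp;wc, s, left, &amp;state);

		if (used == (size_t)-1 || used == (size_t)-2)
			return -1;
		if (used == 0)
			break;		/* reached a NUL character */
		s += used;
		left -= used;
		count++;
	}
	return count;
}
</example>
</P>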









<chapt id="internet"><heading>The Internet</heading>

<P>
The Internet is a worldwide network of computers.
Thus the text data exchanged via the Internet must be
internationalized.
</P>

<P>
The concept of internationalization did not exist
at the dawn of the Internet, since it was developed in the US.
Later protocols used on the Internet were designed to be
upward compatible with the existing ones.
</P>

<P>
One of the key technologies for internationalizing
Internet data exchange is <strong>MIME</strong>.
</P>

<sect id="mailnews"><heading>Mail/News</heading>

<P>
Internet mail uses the SMTP (RFC 821) and ESMTP (RFC 1869) protocols.
SMTP is a 7-bit protocol, while ESMTP can carry 8-bit data.
</P>

<P>
Original SMTP can only send ASCII characters.  Thus 
non-ASCII characters (ISO 8859-*, Asian characters, and so on)
have to be converted into ASCII characters.
</P>

<P>
MIME (RFC 2045, 2046, 2047, 2048, and 2049) deals with this problem.
</P>

<P>
First, RFC 2045 defines three new headers.
<list>
 <item>MIME-Version:
 <item>Content-Type:
 <item>Content-Transfer-Encoding:
</list>
Currently <tt>MIME-Version</tt> is always 1.0, and thus every MIME mail has
a header like this:
<example>
MIME-Version: 1.0
</example>
<tt>Content-Type</tt> describes the type of content.
For example, an ordinary mail with Japanese text has a header like this:
<example>
Content-Type: text/plain; charset="iso-2022-jp"
</example>
Available types are described in RFC 2046.
<tt>Content-Transfer-Encoding</tt> describes how the
content is encoded for transfer.  Available values are <tt>BINARY</tt>,
<tt>7bit</tt>, <tt>8bit</tt>, <tt>BASE64</tt>, and <tt>QUOTED-PRINTABLE</tt>.
Since SMTP cannot handle 8-bit data, <tt>8bit</tt> and <tt>BINARY</tt>
cannot be used with it.  ESMTP can use them.
Base64 and quoted-printable are ways to convert 8-bit data into 7-bit data,
and 8-bit data have to be converted using one of them to be sent by SMTP.
</P>
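
<P>
To illustrate how base64 turns 8-bit data into 7-bit data, here is
a minimal sketch of an encoder.  <tt>base64_encode()</tt> is a
hypothetical name, and the folding of the output into lines, which
mail requires, is left out.
<example>
#include &lt;stddef.h&gt;

/* base64_encode() is a hypothetical helper.  It encodes len bytes from
 * in into out, which must have room for 4 * ((len + 2) / 3) + 1 bytes.
 * It returns the length of the encoded string. */
size_t base64_encode(const unsigned char *in, size_t len, char *out)
{
	static const char tbl[] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	size_t i, o = 0;

	for (i = 0; i &lt; len; i += 3) {
		/* Pack up to three 8-bit bytes into one 24-bit group. */
		unsigned long v = (unsigned long)in[i] &lt;&lt; 16;
		if (i + 1 &lt; len) v |= (unsigned long)in[i + 1] &lt;&lt; 8;
		if (i + 2 &lt; len) v |= in[i + 2];

		/* Emit four 6-bit values; pad the last group with '='. */
		out[o++] = tbl[(v &gt;&gt; 18) &amp; 0x3F];
		out[o++] = tbl[(v &gt;&gt; 12) &amp; 0x3F];
		out[o++] = (i + 1 &lt; len) ? tbl[(v &gt;&gt; 6) &amp; 0x3F] : '=';
		out[o++] = (i + 2 &lt; len) ? tbl[v &amp; 0x3F] : '=';
	}
	out[o] = '\0';
	return o;
}
</example>
Every output character is a letter, a digit, '+', '/', or '=', so
the result can pass through a 7-bit channel unchanged.
</P>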

<P>
RFC 2046 describes the media types and subtypes for the
<tt>Content-Type</tt> header.  Available types include
<tt>text</tt>, <tt>image</tt>, <tt>audio</tt>, <tt>video</tt>,
and <tt>application</tt>.  Here we are interested in <tt>text</tt>,
because we are discussing i18n.
Subtypes of <tt>text</tt> are <tt>plain</tt>, <tt>enriched</tt>,
<tt>html</tt>, and so on.  A <tt>charset</tt> parameter can also be
added to specify the encoding.
<tt>US-ASCII</tt>, <tt>ISO-8859-1</tt>, 
<tt>ISO-8859-2</tt>, ..., <tt>ISO-8859-10</tt> are defined by
RFC 2046 for <tt>charset</tt>.  This list can be extended by writing
a new RFC.
<list>
 <item>RFC 1468 <tt>ISO-2022-JP</tt>
 <item>RFC 1554 <tt>ISO-2022-JP-2</tt> 
 <item>RFC 1557 <tt>ISO-2022-KR</tt>
 <item>RFC 1922 <tt>ISO-2022-CN</tt>
 <item>RFC 1922 <tt>ISO-2022-CN-EXT</tt>
 <item>RFC 1842 <tt>HZ-GB-2312</tt>
 <item>RFC 1641 <tt>UNICODE-1-1</tt>
 <item>RFC 1642 <tt>UNICODE-1-1-UTF-7</tt>
 <item>RFC 1815 <tt>ISO-10646-1</tt>
</list>
</P>

<P>
RFC 2045 and 2046 determine the way to write non-ASCII characters
in the main text of mail.  On the other hand, RFC 2047 describes
'encoded words' which is the way to write non-ASCII characters in the header.
It is like that:
<tt>=?</tt><var>encoding</var><tt>?</tt><var>conversion algorithm</var><tt>?</tt><var>data</var><tt>?=</tt>,
where <var>encoding</var> is selected from the list of <tt>charset</tt>
of <tt>Content-Type</tt> header, <var>algorithm</var> is <tt>Q</tt>
or <tt>q</tt> for quoted-printable or <tt>B</tt> or <tt>b</tt> for
base64, and <var>data</var> is encoded data whose length is less than
76 bytes.  If the <var>data</var> is longer than 75 bytes, 
it must be divided into multiple encoded words.
For example,
<example>
Subject: =?ISO-2022-JP?B?GyRCNEE7eiROJTUlViU4JSclLyVIGyhC?=
</example>
reads 'a subject written in Kanji' in Japanese (ISO-2022-JP,
encoded by base64).  Of course, a human cannot read it as it is.
</P>


<sect id="www"><heading>WWW</heading>

<P>
WWW is a system in which HTML documents (mainly, along with files in
other formats) are transferred using the HTTP protocol.
</P>

<P>
HTTP/1.1 is defined by RFC 2068, which was later obsoleted by RFC 2616.
HTTP uses headers like those of mail, and the <tt>Content-Type</tt> header
is used to describe the type of the contents.
Though a <tt>charset</tt> parameter can be given in this
header, it is rarely used.
</P>

<P>
RFC 1866 states that the default encoding for HTML is
ISO-8859-1.  However, many web pages are written in,
for example, Japanese and Korean, using (of course) encodings
different from ISO-8859-1.
Sometimes the HTML document contains:
<example>
&lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp"&gt;
</example>
which declares that the page is written in ISO-2022-JP.
However, there are many pages without any declaration of encoding.
</P>

<P>
Web browsers have to deal with this situation.
Ideally, web browsers should be able to handle every
encoding registered for MIME.
However, many web browsers can only deal with ASCII
or ISO-8859-1.  Such web browsers are of no use at all
to people whose languages need other encodings.
</P>

<P>
A URL should be written in ASCII characters,
though non-ASCII characters can be expressed
using a <tt>%</tt><var>nn</var> sequence, where <var>nn</var>
is a hexadecimal value.  This is because there is
no way to specify the encoding.  Western European people
would treat it as ISO-8859-1, while Japanese people
would treat it as EUC-JP or SHIFT-JIS.
</P>

















<chapt id="library"><heading>Libraries and Components</heading>


<P>
We sometimes use libraries and components which are not
very popular.  We may have to pay special attention to the
internationalization of these libraries and components.
</P>

<P>
On the other hand, we can use libraries and components
to improve internationalization.  This chapter
introduces such libraries and components.
</P>

<sect id="gettext"><heading>Gettext and Translation</heading>

<P>
GNU Gettext is a tool to internationalize the messages that a piece of
software outputs, according to the locale setting of <tt>LC_MESSAGES</tt>.
<prgn>gettext</prgn>ized software contains messages written in
various languages (depending on the available translators), and 
a user can choose among them using environment variables.
GNU gettext is a part of the Debian system.
</P>

<P>
Install the <package>gettext</package> package and read its info pages for details.
</P>

<P>
Don't use non-ASCII characters for '<tt>msgid</tt>'.
Be careful, because you may tend to use ISO-8859-1 characters.
For example, '&copy;' (copyright mark; you may not be able to
read the copyright mark NOW in THIS document) is a non-ASCII character
(0xa9 in ISO-8859-1).
If you use such characters, translators may find it difficult to edit
catalog files because of the conflict between the encodings used for
<tt>msgid</tt> and for <tt>msgstr</tt>.
</P>

<P>
Be sure the message can be displayed in the assumed environment.
In other words, you have to read the chapter 'Output to Display' 
in this document and internationalize the output mechanism
of your software prior to <prgn>gettext</prgn>ization.
<em>EVEN FOR NON-ENGLISH-SPEAKING PEOPLE, ENGLISH MESSAGES ARE
PREFERABLE TO MEANINGLESS BROKEN MESSAGES.</em>
</P>

<P>
The second and later bytes of a multibyte character, or
any byte of a non-ASCII character in a stateful encoding,
can be 0x5c (the same as backslash in ASCII) or 0x22
(the same as double quote in ASCII).
Such bytes have to be properly escaped, because
the present version of GNU gettext ignores the 
'charset' subitem of the '<tt>Content-Type</tt>' item when it reads '<tt>msgstr</tt>'.
</P>

<P>
A <prgn>gettext</prgn>ed message must not be used in multiple contexts.
This is because a word may have different meanings in different contexts.
For example, in English a verb expresses an order or a command if it
appears at the beginning of a sentence.  However, different languages
have different grammar.  If a verb is <prgn>gettext</prgn>ed and is used
both in an ordinary sentence and in an imperative sentence,
one cannot translate it with a single string.
</P>


<P>
If a sentence is <prgn>gettext</prgn>ed, never divide the sentence.
If a sentence is divided in the original source code,
join the pieces so that a single string contains the full
sentence.  
This is because the order of words in a sentence
differs among languages.
For example, a routine
<example>
printf("There ");
switch(num_of_files) {
case 0:
        printf("are no files ");
        break;
case 1:
        printf("is 1 file ");
        break;
default:
        printf("are %d files ", num_of_files);
        break;
}
printf("in %s directory.\n", dir_name);
</example>
has to be rewritten as follows:
<example>
switch(num_of_files) {
case 0:
        printf("There are no files in %s directory.\n", dir_name);
        break;
case 1:
        printf("There is 1 file in %s directory.\n", dir_name);
        break;
default:
        printf("There are %d files in %s directory.\n", num_of_files, dir_name);
        break;
}
</example>
before it is <prgn>gettext</prgn>ized.
</P>

<P>
Software with <prgn>gettext</prgn>ed messages should not depend on
the length of the messages.  The messages may get longer
in a different language.
</P>

<P>
When two or more '%' directives for formatted output functions
such as <tt>printf()</tt> appear in a message,
a translation may need to change the order of these directives.
In such a case, the translator can specify
the order.
See the section 'Special Comments preceding Keywords'
in the info page of <prgn>gettext</prgn> for details.
</P>

<P>
There are now projects to translate the messages of various software.
One example is the
<url id="http://www.iro.umontreal.ca/~pinard/po/HTML/" 
name="Translation Project">.
</P>



<sect1 id="gettextize"><heading>Gettext-ization of A Software</heading>

<P>
First, the software has to have the following lines.
<example>
int main(int argc, char **argv)
{
        ...
        setlocale (LC_ALL, "");   /* This is not for gettext but 
                                     all i18n software should have
                                     this line. */
        bindtextdomain (PACKAGE, LOCALEDIR);
        textdomain (PACKAGE);
        ...
}
</example>
where <var>PACKAGE</var> is the name of the catalog file and 
<var>LOCALEDIR</var> is <tt>"/usr/share/locale"</tt> for Debian.
<var>PACKAGE</var> and <var>LOCALEDIR</var> should be defined 
in a header file or <tt>Makefile</tt>.
</P>

<P>
It is convenient to prepare the following header file.
<example>
#include &lt;libintl.h&gt;
#define _(String) gettext((String))
</example>
Messages in source files should then be written as
<tt>_("message")</tt>, instead of <tt>"message"</tt>.
</P>

<P>
Next, catalog files have to be prepared.
</P>

<P>
First, a template for the catalog file is prepared
using <prgn>xgettext</prgn>.
By default a template file named <tt>messages.po</tt> is
created.
<footnote>
I HAVE TO WRITE EXPLANATION.
</footnote>
</P>



<sect1 id="gettext-translate"><heading>Translation</heading>

<P>
Though <prgn>gettext</prgn>ization of a piece of software is a one-time
task, translation is continuing work, because you have to 
translate new (or modified) messages when (or before) a new 
version of the software is released.
</P>


<sect id="readline"><heading>Readline Library</heading>

<P>***** Not written yet *****</P>

<P>
The Readline library needs to be internationalized.
</P>

<sect id="ncurses"><heading>Ncurses Library</heading>

<P>***** Not written yet *****</P>

<P>
<strong>Ncurses</strong> is a free implementation of the curses library.
Though this library is now maintained by the Free Software Foundation,
it is not covered by the GNU General Public License. 
</P>

<P>
The Ncurses library needs to be internationalized.
</P>








<chapt id="otherlanguage"><heading>Software Written in Languages Other than C/C++</heading>

<P>
Though C and C++ were, are, and will be the main languages for 
software development on UNIX-like platforms, other languages,
especially scripting languages, are often used.
</P>

<P>
Generally, languages other than C/C++ have less support for I18N
than C/C++.  However, nowadays languages other than C/C++ are
coming to support locales and Unicode.
</P>

<sect id="fortran"><heading>Fortran</heading>

<P>***** Not written yet *****</P>

<sect id="pascal"><heading>Pascal</heading>

<P>***** Not written yet *****</P>

<sect id="perl"><heading>Perl</heading>

<P>
Perl is one of the most important languages.  Indeed,
the Debian system treats Perl as essential.
</P>

<P>
Perl 5.6 can handle UTF-8 characters.  The declaration 
<tt>use utf8;</tt> enables this.  For example,
<tt>length()</tt> will then return the number of characters,
not the number of bytes.
</P>

<P>
However, it does not work well for me... why?
</P>

<P>***** Not written yet *****</P>

<sect id="python"><heading>Python</heading>

<P>***** Not written yet *****</P>

<sect id="ruby"><heading>Ruby</heading>

<P>***** Not written yet *****</P>

<sect id="tcltk"><heading>Tcl/Tk</heading>

<P>***** Not written yet *****</P>

<P>
Tcl/Tk is already internationalized.  It is locale-sensitive.
It automatically uses the proper font for various characters.
Though it uses UTF-8 as its internal encoding, users of Tcl/Tk
don't have to be aware of it.  This is because Tcl/Tk converts
encodings.
</P>

<sect id="java"><heading>Java</heading>

<p>
Full internationalization follows naturally from
Java's "Write Once, Run Anywhere" principle.
To achieve this, Java uses Unicode as the internal code
for <tt>char</tt> and <tt>String</tt>.  It is important
that Unicode is the <em>internal</em> code.  Java obeys
the current locale, and the encoding is automatically
converted for I/O.  Thus, <em>users</em> of applications written
in Java don't need to be aware of Unicode.
</p>

<p>
Then how about <em>developers</em>?  They also don't need
to be aware of the internal encoding.  Character processing
such as counting the number of characters in a string works well.
Moreover, you don't have to worry about display or input.
</p>

<p>
However, you may want to handle specific encodings for,
for example, MIME encoding/decoding.  For such purposes,
I/O can be done by specifying an external encoding.
Check the <tt>InputStreamReader</tt> and <tt>OutputStreamWriter</tt>
classes.  You can also convert between the internal encoding
and specified encodings by 
<tt>String.getBytes(</tt><em>encoding</em><tt>)</tt> and
<tt>new String(byte []</tt> <em>bytes</em><tt>, </tt><em>encoding</em><tt>)</tt>.
</p>




<sect id="shellscript"><heading>Shell Script</heading>

<P>***** Not written yet *****</P>

<sect id="lisp"><heading>Lisp</heading>

<P>***** Not written yet *****</P>











<chapt id="examples"><heading>Examples of I18N</heading>

<P>
Programmers who have internationalized software, written
an L10N patch, and so on are encouraged to contribute
to this chapter.
</P>



&twm;
&minicom;
&user-ja;
&fontset;








<chapt id="reference"><heading>References</heading>

<P>
General
<list>
 <item>
   <url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT" 
   name="Unicode support in the Solaris Operating Environment">
   shows what is needed for software developers to support UTF-8.
 <item>
   <url id="http://www.unix-systems.org/version2/whatsnew/login_mse.html"
   name="The Open Group's summary of ISO C Amendment 1">
   is a detailed explanation on locale and wide character technologies.
 <item>
   <url id="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
   name="Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux">
   is a detailed explanation on UTF-8 and Unicode.  
 <item>
   <url id="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html"
   name="Bruno Haible's Unicode HOWTO">
 <item>
   <url id="http://surfchem0.riken.go.jp/~kubota/mojibake/"
   name="What is MOJIBAKE"> shows what occurs when character handling
   is improper.  Mojibake is a Japanese word which almost all computer
   users (not only Linux/BSD/Unix but also Windows/Macintosh) know.
 <item>
   <url id="http://i44www.info.uni-karlsruhe.de/~drepper/conf96/paper.html" 
   name="i18n in GNU Project">
 <item>
   <url id="http://cns-web.bu.edu/pub/djohnson/web_files/i18n/i18n.html" 
   name="Concept of C/UNIX i18n">
 <item>
   Ken Lunde, "CJKV Information Processing", ISBN 1-56592-224-7, 
   O'Reilly, 1999
 <item>
   Mikiko NISHIKIMI, Naoto TAKAHASHI, Satoru TOMURA, Ken'ichi HANDA,
   Seiji KUWARI, Shin'ichi MUKAIGAWA, and Tomoko YOSHIDA,
   "MARUCHIRINGARU KANKYOU NO JITSUGEN - X Window/Wnn/Mule/WWW BURAUZA
   DENO TAKOKUGO KANKYO" or "Realization of Multilingual Environment
   - Multilingual Environment in X Window/Wnn/Mule/WWW Browser"
   (in Japanese), ISBN4-88735-020-1, TOPPAN, 1996
 <item>
   Yoshihiro KIYOKANE and Youichi SUEHIRO,
   "KOKUSAIKA PUROGURAMINGU - I18N HANDOBUKKU" or "Internationalization
   Programming - I18N Handbook" (in Japanese), ISBN4-320-02904-6, 
   KYORITSU, 1998
 <item>
   Syuuji SADO and Tomoko YOSHIDA,
   "Linux/FreeBSD NIHONGO KANKYOU NO KOUCHIKU TO KATSUYOU" or
   "Construction and Utilization of Linux/FreeBSD Japanese Environment"
   (in Japanese), ISBN4-7973-0480-4, SOFTBANK, 1997
 <item>
   Kouichi YASUOKA and Motoko YASUOKA
   "MOJI KOODO NO SEKAI" or "The World of Character Codes" (in Japanese),
   ISBN4-501-53060-X, Tokyo Denki University Press Center, 1999
</list>
</P>

<P>
Characters (general)
<list>
 <item>
   <url id="http://www.kudpc.kyoto-u.ac.jp/~yasuoka/CJK.html"
   name="Character Tables">
   Graphic images for various character sets in the world.
 <item>
   <url id="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"
   name="Ken Lunde's CJK info">
   information on CJK (Chinese, Japanese, and Korean) character 
   set standards, written by the writer of "CJKV Information Processing"
   published by O'Reilly.
 <item>
   <url id="http://www.isi.edu/in-notes/iana/assignments/character-sets"
   name="IANA character set registry">
   Note that both coded character sets (for example, KS_C_5601-1987, 
   MIBenum 36) and encodings (for example, ISO-2022-KR, MIBenum: 37) 
   are registered.  How confusing!
 <item>
   <url id="http://www.itscj.ipsj.or.jp/ISO-IR/"
   name="International Register of Coded Character Sets">
   A complete list of registered CCS, with ISO 2022 escape sequences.
   PDF files for these CCS are also available.
</list>
Characters (ISO 8859)
<list>
 <item>
   <url id="http://czyborra.com/charsets/iso8859.html"
   name="ISO 8859 Alphabet Soup">
 <item>
   <url id="http://park.kiev.ua/multiling/ml-docs/iso-8859.html"
   name="ISO 8859 Character Sets">
</list>
Characters (ISO 2022)
<list>
 <item>
   <url id="http://www.ewos.be/tg-cs/gconcept.htm">
 <item>
   <url id="http://www.ecma.ch/stand/ECMA-035.HTM">
</list>
Characters (ISO 10646 and Unicode)
<list>
 <item><url id="http://www.unicode.org/" name="Unicode Consortium">
 <item>
   <url id="http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html"
   name="Problems and Solutions for Unicode and User/Vendor Defined
   Characters">
</list>
</P>

<P>
Software
<list>
 <item>
   <url id="http://www.wg.omron.co.jp/~shin/Arena-CJK-doc/"
   name="Arena-i18n">
   Multilingual web browser.
 <item>
   <url id="http://www.mozilla.org/" name="Mozilla">
   is also a multilingual web browser.
 <item>
   <url id="http://www.m17n.org/mule/" name="Mule">
   Multilingual editor whose function is included in GNU Emacs 20
   and XEmacs 20.
   Mule is the most advanced m17n software to my knowledge.
 <item>
   <url id="http://www3.justnet.ne.jp/~nmasu/linux/jfbterm/indexn.html"
   name="JFBTERM"> (in Japanese) is a multilingual terminal for
   Linux framebuffer console.  Supported encodings are ISO 2022, EUC-JP,
   CN-GB, and EUC-KR.  Supported CCS are ISO 8859-{1,2,3,4,5,6,7,8,9,10},
   JISX 0201, JISX 0208, GB 2312, and KSX 1001.
 <item>
   <url id="http://turbolinux.com.cn/TLDN/chinese/project/unicon/"
   name="UNICON Project"> intends to implement display and input of
   CJK (Chinese/Japanese/Korean) characters on the framebuffer under 
   Linux. 
 <item>
   <url id="http://programmer.lib.sjtu.edu.cn/cce/cce.html"
   name="CCE - Chinese Console Environment"> enables CN-GB Chinese
   to be displayed on Linux and FreeBSD console.  It also supplies
   input methods for Chinese.
 <item>
   <url id="http://dickey.his.com/xterm/"
   name="Xterm"> is a part of XFree86 distribution.  It can display
   UTF-8 encoding including doublewidth characters and combining
   characters.
 <item>
   <url id="http://www.rxvt.org/"
   name="Rxvt"> can display multibyte encodings such as EUC-JP,
   Shift-JIS, CN-GB, and Big-5.
 <item>
   <url id="http://clisp.cons.org/~haible/packages-libiconv.html"
   name="libiconv - character set conversion library"> provides
   <tt>iconv()</tt> implementation for systems which don't have one.
   It supports various encodings like ASCII, ISO 8859-*, KOI8-*,
   EUC-*, ISO 2022-*, Big5, Shift-JIS, TIS 620, UTF-*, UCS-*,
   CP*, Mac*, and so on.  This library also has <tt>locale_charset()</tt>,
   a replacement of <tt>nl_langinfo(CODESET)</tt>.
 <item>
   <url id="http://clisp.cons.org/~haible/packages-libutf8.html"
   name="libutf8 - a Unicode/UTF-8 locale plugin"> provides
   UTF-8 locale support for systems which don't have UTF-8 locales.
 <item>
   <url id="http://www.pango.org/" name="Pango"> is a project to
   develop a portable high-quality text rendering engine.
</list>
</P>

<P>
Projects and Organizations
<list>
 <item>
   <url id="http://www.li18nux.net/" 
   name="Linux Internationalization Initiative">, or Li18nux,
   focuses on the i18n of a core set of APIs and components of Linux
   distributions.  The results will be proposed to LSB.
 <item>
   <url id="http://www.li18nux.net/root/LI18NUX2000/LI18NUX-2000.html" 
   name="LI18NUX 2000 Globalization Specification"> is the first
   fruit of Li18nux.
 <item>
   <url id="http://citrus.bsdclub.org/"
   name="Citrus Project"> is a project to implement
   locale/iconv for BSD series OSes so that these OSes conform to
   ISO C / SUSV2.
 <item>
   <url id="http://www.iro.umontreal.ca/~pinard/po/HTML/" 
   name="Translation Project">
 <item>
  <url id="http://www.mojikyo.gr.jp/" name="Mojikyo">
 <item>
  <url id="http://www.tron.org/index-e.html" name="TRON project">
</list>
</P>





</book>
</debiandoc>
