Diff of /manuals/trunk/intro-i18n/intro-i18n.sgml


revision 1037 by kubota, Mon Nov 13 07:27:21 2000 UTC revision 1038 by kubota, Wed Nov 15 10:00:55 2000 UTC
# Line 60  This document describes the basic ideas Line 60  This document describes the basic ideas
60  programmers and package maintainers of Debian GNU/Linux and  programmers and package maintainers of Debian GNU/Linux and
61  other UNIX-like platforms.  other UNIX-like platforms.
62  The aim of this document is to offer an introduction to  The aim of this document is to offer an introduction to
63  basic concepts, character codes, and points of which care  basic concepts, character codes, and points of which care
64  should be taken when one writes an I18N-ed software or  should be taken when one writes an I18N-ed software or
65  an I18N patch for an existing software.  This document  an I18N patch for an existing software.  Much know-how and
66  also tries to introduce the real state and existing  many case studies exist on the internationalization of
67  problems for each language and country.  software.  This document also tries to introduce the
68    actual state and existing problems for each language and country.
69  </P>  </P>
70    
71  <P>  <P>
# Line 425  and so on. Line 426  and so on.
426    
427  <sect id="coding-general"><heading>General Discussions</heading>  <sect id="coding-general"><heading>General Discussions</heading>
428    
429  <sect1 id="codeset"><heading>Basic Terminology</heading>  <sect1 id="coding-general-term"><heading>Basic Terminology</heading>
430    
431  <P>  <P>
432  I begin this chapter by defining a few very important words.  I begin this chapter by defining a few very important words.
# Line 1750  these functions have became newly availa Line 1751  these functions have became newly availa
1751  here.)  here.)
1752  </p>  </p>
1753    
1754    <p>
1755    Note that very simple software such as <tt>echo</tt> does not
1756    have to care about multibyte characters and wide characters.
1757    Such software can input and output multibyte characters as is.
1758    </p>
1759    
1760    
1761  <sect id="locale_unicode"><heading>Unicode and LOCALE technology</heading>  <sect id="locale_unicode"><heading>Unicode and LOCALE technology</heading>
1762    
# Line 1934  in the distribution of GNU libc for usag Line 1941  in the distribution of GNU libc for usag
1941  </p>  </p>
1942    
1943    
 <p>  
 <strong>****** I HAVE REWRITTEN THIS DOCUMENT UNTIL HERE ******</strong>  
 </p>  
   
1944    
1945    
1946  <chapt id="output"><heading>Output to Display</heading>  <chapt id="output"><heading>Output to Display</heading>
1947    
1948  <P>  <P>
1949  Here 'Output to Display' does not mean I18N of messages using  Here 'Output to Display' does not mean translation of messages using
1950  <prgn>gettext</prgn>.  <prgn>gettext</prgn>.
1951  I am concerned with whether characters are output correctly so that  I am concerned with whether characters are output correctly so that
1952  we can read them.  For example, install <package>libcanna1g</package>  we can read them.  For example, install <package>libcanna1g</package>
# Line 1955  people can not read such a row of strang Line 1958  people can not read such a row of strang
1958  prefer if you were a Japanese speaker, an English message which can be read  prefer if you were a Japanese speaker, an English message which can be read
1959  with a dictionary or such a row of strange characters which is  with a dictionary or such a row of strange characters which is
1960  a result of <prgn>gettext</prgn>ization?  a result of <prgn>gettext</prgn>ization?
1961  (Yes, there <em>is</em> a way to display  <footnote>
1962  Japanese characters correctly -- <prgn>kon</prgn> (in <package>kon2</package>  (Yes, there <em>are</em> ways to display Japanese characters
1963  package)  for console and <prgn>kterm</prgn> for X, and  correctly -- <prgn>kon</prgn> (in <package>kon2</package> package)
1964  Japanese people are happy with <prgn>gettext</prgn>ized Japanese messages.)  for console and <prgn>kterm</prgn> for X, and Japanese people are
1965    happy with <prgn>gettext</prgn>ized Japanese messages.)
1966    </footnote>
1967  </P>  </P>
1968    
1969  <P>  <P>
1970  Problems on displaying non-English characters are discussed below.  Problems on displaying non-English (non-ASCII) characters
1971  Since the mother tongue of the author is Japanese, the content may  are discussed below.
 be biased to Japanese.  
1972  </P>  </P>
1973    
1974    
# Line 1972  be biased to Japanese. Line 1976  be biased to Japanese.
1976  <sect id="output-console"><heading>Console Softwares</heading>  <sect id="output-console"><heading>Console Softwares</heading>
1977    
1978  <P>  <P>
1979  Softwares running on the console are not responsible for displaying.  This section discusses problems in displaying characters on a
1980  The console itself is responsible.  There are terminal emulators  <strong>console</strong>.
1981  which can display non-English languages such as <prgn>kterm</prgn>  <footnote>
1982  (EUC-JP, Shift JIS Japanese, ISO 2022 international),  This section does not cover the development of consoles themselves;
1983  <prgn>krxvt</prgn>, <prgn>grxvt</prgn>, and <prgn>crxvt</prgn>  it covers the development of software which runs
1984  (Japanese, Greek, and Chinese, included  on a console.
1985  in <package>rxvt-ml</package> package), <prgn>cxterm</prgn>  </footnote>
1986  (Chinese, Korean, and Japanese, non-free), <prgn>hanterm</prgn>  Here, 'console' includes the bare <strong>Linux console</strong> (both the
1987  (Korean)  framebuffer and conventional versions), special consoles such as
1988  and so on and softwares with which non-English characters can be  <strong>kon2</strong>, <strong>jfbterm</strong>, <strong>chdrv</strong>,
1989  displayed on console such as <package>kon2</package> (Japanese)  and so on implemented by special software, and X terminal emulators
1990  and <package>jfbterm</package> (EUC-based character codes and  such as <strong>xterm</strong>, <strong>kterm</strong>,
1991  ISO2022-based character codes).  <strong>hanterm</strong>, <strong>xiterm</strong>, <strong>rxvt</strong>,
1992    <strong>xvt</strong>, <strong>gnome-terminal</strong>,
1993    <strong>wterm</strong>, <strong>aterm</strong>, <strong>eterm</strong>,
1994    and so on.  Remote environments via telnet and secure shell, such as
1995    <strong>NCSA telnet</strong> for Macintosh and <strong>Tera Term</strong>
1996    for Windows, are also regarded as consoles.
1997  </P>  </P>
1998    
1999  <P>  <P>
2000  All what a software on console/terminal-emulators  The features of a console are:
2001  has to do is that output a correct code to the console.  <list>
2002      <item>All that software has to do is to send correct character
2003            codes to standard output.  Software on a console does not need to
2004            care about fonts and so on.
2005      <item>Fonts with fixed sizes are used.  The unit of the width
2006            of the font is called a 'column'.  'Doublewidth' fonts, i.e.,
2007            fonts whose width is 2 columns, are used for CJK ideograms,
2008            Japanese Hiragana and Katakana, Korean Hangul, and related
2009            symbols.  Combining characters used for Thai and so on can be
2010            regarded as 'zero'-column characters.
2011    </list>
2012  </P>  </P>
2013    
2014  <P>  <sect1 id="output-console-code"><heading>Character Code</heading>
 At first, it is important not to destroy string data.  
 Sometimes it can be done only by 8bit-clean-ize.  
 '8bit-clean' means that the software does not destroy the  
 most significant bit (MSB) of data the software treats.  
 </P>  
2015    
2016  <P>  <P>
2017  Next, be careful for a software which sends control codes such  Software running on the console is not responsible for displaying characters.
2018  as location every time it output 1 byte.  Such codes destroy  The console itself is responsible.  There are consoles
2019  the continuity of multibyte character.  which can display character codes other than ASCII, such as
2020    <taglist>
2021     <tag>kon2
2022          <item>EUC-JP, Shift-JIS, and ISO-2022-JP
2023     <tag>jfbterm
2024          <item>EUC-JP, ISO-2022-JP, and ISO-2022 (including any 7bit and
2025                8bit character sets whose fonts are available)
2026     <tag>kterm
2027          <item>EUC-JP, Shift-JIS, ISO-2022-JP, and ISO-2022 (including
2028                ISO8859-{1,2,3,4,5,6,7,8,9}, JISX0201, JISX0208, JISX0212,
2029                GB2312, and KSC5601)
2030     <tag>krxvt
2031          <item>EUC-JP
2032     <tag>crxvt-gb
2033          <item>CN-GB
2034     <tag>crxvt-big5
2035          <item>Big5
2036     <tag>hanterm
2037          <item>EUC-KR, Johab, and ISO-2022-KR
2038     <tag>xiterm+thai
2039          <item>TIS620
2040     <tag>xterm
2041          <item>UTF-8
2042    </taglist>
2043    However, there is no way for software on a console to know which
2044    character code is available.  I think it is the user's responsibility
2045    to set the LC_CTYPE locale properly (i.e., the LC_ALL, LC_CTYPE, or LANG
2046    environment variable).  Provided the LC_CTYPE locale is set properly,
2047    software can use it to learn which character code is supported
2048    by the console.
2049  </P>  </P>
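One possible way for a program to learn the character code implied by the LC_CTYPE locale is the POSIX function nl_langinfo(CODESET). The text above does not name this function, so treat its use here as an assumption about the target platform; the helper name console_codeset() is hypothetical. A minimal sketch:

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: return the codeset name of the LC_CTYPE locale
   selected by LC_ALL, LC_CTYPE, or LANG (for example "EUC-JP" or "UTF-8").
   If the environment names no usable locale, the "C" locale stays in
   effect and its codeset name is returned instead. */
const char *console_codeset(void)
{
    setlocale(LC_CTYPE, "");      /* adopt the user's LC_CTYPE setting */
    return nl_langinfo(CODESET);  /* name of the encoding it implies */
}
```

The returned name only reflects what the user declared; it is still the user's job to make the declaration match what the console can actually display.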
2050    
2051  <P>  <P>
2052  Be also careful for destruction of multicolumn characters.  As for messages translated by <prgn>gettext</prgn>,
2053  For example, when a string exceeds the width of the console,  the software does not need to do anything special.  It works well if the
2054  the string is divided at the end of the line.  Terminal emulators  user has properly set the LC_CTYPE and LC_MESSAGES locales.
 should have a faculty to avoid such a 'excess of line width' type  
 destruction of character but so far no terminal emulators  
 have such a faculty.  (Only one exception --- shell mode of Emacs.  
 However, unfortunately shell mode of Emacs is a dumb terminal and  
 many softwares cannot be run on it.)  Thus each software on  
 console should be careful.  
2055  </P>  </P>
2056    
2057  <P>  <P>
2058  There is another reason to destroy multicolumn characters.  If you are handling a string in a non-ASCII encoding (using
2059  When a message is overwritten on another string, a part  multibyte characters, UTF-8 directly, and so on), you will have
2060  of a character which is a part of a previous string can be  to take care of points which you can ignore when you are
2061  left not overwritten.  This may be more troublesome than many  using ASCII.
2062  people would think because multicolumn character can be  <list>
2063  written at every columns, not only at the multiple of the    <item>8-bit cleanness.  I think everyone understands this.
2064  width of the character.    <item>Continuity of multibyte characters.  In multibyte character
2065            codes such as EUC-JP and UTF-8, one character may consist
2066            of two or more bytes.  These bytes should be output
2067            contiguously.  Insertion of additional codes between the
2068            consecutive bytes can break the character.  I have seen
2069            software which outputs a location control code every time
2070            it outputs one byte.  This breaks multibyte characters.
2071    </list>
2072  </P>  </P>
2073    
2074    <sect1 id="output-console-column"><heading>Number of Columns</heading>
2075    
2076  <P>  <P>
2077  These destruction of continuity of multibyte characters may  Internationalized console software cannot assume that a character
2078  be a cause of the destruction of the whole line following  always occupies one column.  You can get the number of columns of a
2079  the character.  Whether this can occur depends on the internal  character or of a string using <tt>wcwidth()</tt> and
2080  implementation of console program.  This can occur if the  <tt>wcswidth()</tt>, respectively.  Note that you have to use
2081  terminal emulator does not treat columns, bytes and characters  <tt>wchar_t</tt>-style programming since these functions take
2082  properly separately.  The shell mode of Emacs is the only example  <tt>wchar_t</tt>-based parameters.
 doing that but there are no chance to overwrite character on  
 the shell mode of Emacs, because it is a dumb terminal.  
2083  </P>  </P>
2084    
2085  <P>  <P>
2086  There are no standards for number of columns a character occupies.  Additional care has to be taken not to destroy multicolumn
2087  This can be a large problem for softwares with <tt>ncurses</tt>.  characters.  For example, imagine your software displayed a
2088  There is no 'right' way to solve this.  Each software has to  double-column character at (row, column) = (1, 1).  What will occur
2089  have an information for each character set.  Consult section  when your software then displays a single-column character at (row, column)
2090  <ref id="languages">  = (1, 2) or at (1, 1)?  Does the single-column character erase
2091  for each language.  Take care of the distinction between number  half of the double-column character?  Nobody knows the answer.
2092  of columns, bytes, and characters.  For subset of EUC-JP  It depends on the implementation of the console.  All I can
2093  (ASCII alphabets and JIS X 0208 kanji), number of bytes and columns  say is that your software should avoid such cases.
 are equal (1-byte character occupy 1 column and 2-byte character  
 occupy 2 columns).  Note that cursor-moving control characters  
 such as 'BS' (0x08) moves cursor one COLUMN, not one CHARACTER.  
2094  </P>  </P>
2095    
2096  <P>  <P>
2097  Another important point is that the string has to be converted  If your software reads a string from the keyboard, you will have to
2098  into a character code which the console can understand.  So far there  take more care.  The numbers of characters, bytes, and columns
2099  are no consoles which understand Unicode.  all differ.  For example, in the UTF-8 character code, the character
2100    'a' with acute accent occupies two bytes and one column.  One
2101    CJK ideograph occupies three bytes and two columns.
2102    So, if the user types 'Backspace', how many backspace
2103    codes (0x08) should the software output?  How many bytes should
2104    the software erase from the internal buffer?
2105    Don't be nervous; you can use <tt>wchar_t</tt>, which assures that
2106    one character always occupies one <tt>wchar_t</tt>, and you can
2107    use <tt>wcwidth()</tt> to know the number of columns.
2108    Note that control codes such as 'backspace' (0x08) are
2109    always column-oriented.  Backspace moves back 'one' column even if the
2110    character at the position is a doublewidth character.
2111  </P>  </P>
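To make the byte-versus-character distinction concrete, here is a minimal sketch of how many bytes a Backspace should remove from an internal buffer. It assumes the buffer is known to hold valid UTF-8, which is narrower than the <tt>wchar_t</tt> approach recommended above (that approach works for any locale encoding). The function name utf8_last_char_len() is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: return the number of bytes occupied by the last
   character in a UTF-8 buffer of length len, or 0 for an empty buffer.
   It walks backwards over continuation bytes (bit pattern 10xxxxxx)
   until it reaches the lead byte or an ASCII byte. */
size_t utf8_last_char_len(const char *buf, size_t len)
{
    size_t n = 0;

    while (n < len) {
        unsigned char c = (unsigned char)buf[len - 1 - n];
        n++;
        if ((c & 0xC0) != 0x80)  /* lead byte or ASCII: character starts here */
            break;
    }
    return n;
}
```

The result says only how many bytes to drop from the buffer. The number of backspace codes (0x08) to send is a separate quantity: it is the column width of the erased character, which <tt>wcwidth()</tt> reports.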
2112    
2113    
   
2114  <sect id="output-x"><heading>X Clients</heading>  <sect id="output-x"><heading>X Clients</heading>
2115    
2116  <P>  <P>
# Line 2070  functions. Line 2121  functions.
2121  </P>  </P>
2122    
2123  <P>  <P>
2124  An X font is related to a specific <em>character set</em>.  The  The most important part of internationalizing display
2125  conventional font-related functions can use one font at the same time.  for X clients is to use the internationalized
2126  However, text is expressed in a specific <em>character code</em> and  <strong>XFontSet</strong>-related functions, introduced in
2127  some character codes need multiple character sets.  Chinese, Japanese,  X11R5, instead of the conventional <strong>XFontStruct</strong>-related
2128  and Korean are languages which need multiple character sets.  functions.
 These languages cannot be displayed using the conventional  
 font-related functions.  
2129  </P>  </P>
2130    
2131  <P>  <P>
2132  'fontset' is an idea that multiple fonts are selected and construct  The main feature of XFontSet is that it can handle multiple fonts
2133  a set of fonts.  Using fontset enables to display international texts.  at the same time.  This is related to the distinction between
2134    coded character set (CCS) and character encoding scheme (CES)
2135    which I wrote at the section of <ref id="coding-general-term">.
2136    Some character codes in the world use multiple coded character
2137    sets at the same time.  This is the reason we have to handle
2138    multiple X fonts at the same time.
2139    <footnote>
2140    Though UTF-8 is a character code with single CCS, the current
2141    version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8.
2142    </footnote>
2143  </P>  </P>
2144    
2145  <P>  <P>
2146  Here is a list of structure and functions of conventional 'Font'-related  Another significant feature of XFontSet is that it is
2147  and internationalized 'FontSet'-related.  Consult manpages for detail.  locale (LC_CTYPE)-sensitive.  This means that you have to
2148  <example>  call <tt>setlocale()</tt> before you use XFontSet-related
2149  Font              | FontSet  functions.  Moreover, you have to specify the string you want
2150  ==================+====================  to draw as a multibyte character string or a wide character string.
 XFontStruct       | XFontSet  
 ------------------+--------------------  
 XLoadFont()       | XCreateFontSet()  
 ------------------+--------------------  
 XUnloadFont()     | XFreeFontSet()  
 ------------------+--------------------  
 XQueryFont()      | XFontsOfFontSet()  
 ------------------+--------------------  
 XDrawString() and | XmbDrawString() or  
 XDrawString16()   | XwcDrawString()  
 ------------------+--------------------  
 XDrawText() and   | XmbDrawText() or  
 XDrawText16()     | XwcDrawText()  
 ------------------+--------------------  
 </example>  
2151  </P>  </P>
2152    
2153  <P>  <P>
2154  If a software uses the left-hand functions it have to be rewritten  In the conventional <tt>XFontStruct</tt> model, an X client
2155  using the corresponding right-hand functions in the table.  Note that  opens a font using <tt>XLoadQueryFont()</tt>, draws a string
2156  this table is not perfect but only for an example.  Since these  using <tt>XDrawString()</tt>, and closes the font using
2157  right-hand functions use wide characters and multibyte characters  <tt>XFreeFont()</tt>.  On the other hand, in the internationalized
2158  in C, <tt>setlocale()</tt> has to be called in advance.  <tt>XFontSet</tt> model, an X client opens a font set using
2159    <tt>XCreateFontSet()</tt>, draws a string using <tt>XmbDrawString()</tt>,
2160    and closes the font set using <tt>XFreeFontSet()</tt>.
2161    The following is a concise list of substitutions.
2162    <list>
2163      <item><tt>XFontStruct</tt> -&gt; <tt>XFontSet</tt>
2164      <item><tt>XLoadQueryFont()</tt> -&gt; <tt>XCreateFontSet()</tt>
2165      <item>both of <tt>XDrawString()</tt> and <tt>XDrawString16()</tt>
2166            -&gt; either of <tt>XmbDrawString()</tt> or <tt>XwcDrawString()</tt>
2167      <item>both of <tt>XDrawImageString()</tt> and <tt>XDrawImageString16()</tt>
2168            -&gt; either of <tt>XmbDrawImageString()</tt> or
2169            <tt>XwcDrawImageString()</tt>
2170    </list>
2171    Note that <tt>XFontStruct</tt> is usually used as a pointer, while
2172    <tt>XFontSet</tt> itself is a pointer.
2173  </P>  </P>
2174    
2175  <P>  <P>
2176  Some people (ISO-8859-1-language speakers) may think that  Some people (ISO-8859-1-language speakers) may think that
2177  XFontSet is not 8-bit clean.  This is wrong.  XFontSet-related  <tt>XFontSet</tt>-related functions are not 8-bit clean.
2178  functions work according to LC_CTYPE locale.  The default LC_CTYPE  This is wrong.  <tt>XFontSet</tt>-related
2179  locale uses ASCII.  Thus, if a user doesn't set <tt>LANG</tt>,  functions work according to <tt>LC_CTYPE</tt> locale.  The default
2180    LC_CTYPE locale uses ASCII.  Thus, if a user doesn't set <tt>LANG</tt>,
2181  <tt>LC_CTYPE</tt>, nor <tt>LC_ALL</tt> environmental variable,  <tt>LC_CTYPE</tt>, nor <tt>LC_ALL</tt> environmental variable,
2182  XFontSet will use ASCII, i.e., not 8-bit clean.  The user  <tt>XFontSet</tt>-related functions will use ASCII, i.e., not 8-bit
2183  has to set <tt>LANG</tt>, <tt>LC_CTYPE</tt>, or <tt>LC_ALL</tt>  clean.  The user has to set <tt>LANG</tt>, <tt>LC_CTYPE</tt>, or
2184  environmental variable properly (for example, <tt>LANG=en_US</tt>).  <tt>LC_ALL</tt> environmental variable properly (for example,
2185    <tt>LANG=en_US</tt>).
2186  </P>  </P>
2187    
2188  <P>  <P>
2189  The upstream developers of X clients sometimes hate to enforce  The upstream developers of X clients sometimes hate to enforce
2190  users to set such environmental variables.  In such a case,  users to set such environmental variables.
2191    <footnote>
2192     IMO, all users will have to set LANG properly when UTF-8
2193     becomes popular.
2194    </footnote>
2195    In such a case,
2196  the X clients should have two ways to output text, i.e.,  the X clients should have two ways to output text, i.e.,
2197  XFontStruct-related conventional way and XFontSet-related  <tt>XFontStruct</tt>-related conventional way and
2198  internationalized way.  If <tt>setlocale()</tt> returns  <tt>XFontSet</tt>-related internationalized way.
2199  <tt>NULL</tt>, <tt>"C"</tt>, or <tt>"POSIX"</tt>, use  If <tt>setlocale()</tt> returns <tt>NULL</tt>, <tt>"C"</tt>,
2200  XFontStruct way.  Otherwise use XFontSet way.  This algorithm  or <tt>"POSIX"</tt>, use
2201  is adopted by <package>Blackbox</package> (0.60.1 or later).  <tt>XFontStruct</tt> way.  Otherwise use <tt>XFontSet</tt> way.
2202    The author implemented this algorithm in a few window managers
2203    such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0),
2204    sawmill (0.28), and so on.
2205  </P>  </P>
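The selection rule described above can be sketched as a small pure function. This is only an illustration of the rule as stated in the text, not the code used in any of the window managers named; the enum and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for the two drawing paths. */
enum text_path { USE_FONTSTRUCT, USE_FONTSET };

/* Decide which path to use from the value returned by setlocale().
   NULL, "C", and "POSIX" select the conventional XFontStruct path;
   any other locale name selects the XFontSet path. */
enum text_path choose_text_path(const char *locale_name)
{
    if (locale_name == NULL ||
        strcmp(locale_name, "C") == 0 ||
        strcmp(locale_name, "POSIX") == 0)
        return USE_FONTSTRUCT;
    return USE_FONTSET;
}
```

An X client would typically call it once at startup, for example as choose_text_path(setlocale(LC_CTYPE, "")), and then create either a font or a font set accordingly.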
2206    
2207  <P>  <P>
2208  The same problem exists for softwares using toolkits such as  Window managers need more modifications related to inter-client
2209  athena, GTK+, Qt, and so on.  communication.  This topic will be described later.
2210  </P>  </P>
2211    
2212    
2213    
2214    
2215    
2216    <p>
2217    <strong>****** I HAVE REWRITTEN THIS DOCUMENT UNTIL HERE ******</strong>
2218    </p>
2219    
2220    
2221    
2222    
2223    
2224    
2225    
2226    
2227    
2228    
2229    
2230    
2231    
2232    
