<!-- /manuals/trunk/intro-i18n/japanese-japan.sgml, revision 6992,
     Sat Nov 28 19:37:16 2009 UTC, by osamu.
     Log message: Retired kubota@debian.org; deprecated network-administrator. -->

<sect id="japanese"><heading>Japanese language / used in Japan</heading>


<P>
This section was written by
Tomohiro KUBOTA <email>kubota@debian.org</email> (no longer reachable).
</P>

<P>
Japanese is the only official language used in Japan.
The people of the Okinawa islands and the Ainu ethnic group in the Hokkaido region
each have their own language, though these are spoken by only a small number
of people and have no script of their own.
</P>

<P>
Japan is the only region where the Japanese language is widely used.
</P>


<sect1 id="japanese-character"><heading>Characters used in Japanese</heading>

<P>
Three kinds of characters are used in Japan:
Hiragana, Katakana, and Kanji.
Arabic numerals (the same as in European languages) are
widely used in Japanese, though we also have Kanji numerals.
Though the Latin alphabet is not a part of the Japanese script,
it is widely used for proper nouns such as company names.
</P>

<P>
Hiragana and Katakana are phonograms derived from Kanji.
Hiragana and Katakana characters have a one-to-one correspondence
with each other, like the upper and lower cases of the Latin alphabet.
However, <tt>toupper()</tt> and <tt>tolower()</tt> should not
convert between Hiragana and Katakana.
Hiragana contains about 100 characters, and so does Katakana.
(FYI: about 50 regular characters, 20 characters with the voiced
consonant symbol, 5 characters with the semi-voiced consonant symbol,
and 9 small characters.)
</P>
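
<P>
The point about case conversion can be illustrated with a minimal Python
sketch. It assumes the Unicode layout, in which each Katakana letter sits
0x60 code points after the corresponding Hiragana letter; the function name
<tt>hiragana_to_katakana</tt> is made up for this example. Case conversion
leaves kana untouched, so the correspondence has to be applied explicitly
when it is really wanted.
</P>

```python
HIRAGANA_START = 0x3041  # U+3041 HIRAGANA LETTER SMALL A
HIRAGANA_END = 0x3096    # U+3096 HIRAGANA LETTER SMALL KE
KANA_OFFSET = 0x60       # distance to the matching Katakana letter in Unicode


def hiragana_to_katakana(text):
    """Explicitly map each Hiragana letter to its Katakana counterpart."""
    out = []
    for ch in text:
        cp = ord(ch)
        if HIRAGANA_START <= cp <= HIRAGANA_END:
            out.append(chr(cp + KANA_OFFSET))
        else:
            out.append(ch)  # everything else is passed through unchanged
    return "".join(out)


word = "ひらがな"
print(word.upper() == word)        # True: upper() does not touch kana
print(hiragana_to_katakana(word))  # ヒラガナ
```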

<P>
Kanji are ideograms imported from China roughly one to two thousand
years ago.
Nobody knows the total number of Kanji, and almost all adult Japanese
people know several thousand Kanji characters.
Though Kanji originate from Chinese characters, their shapes have
changed from the original ancient Chinese forms.
Almost all Kanji have several readings, depending on the
word in which the Kanji appears.
</P>


<sect1 id="japanese-sets"><heading>Character Sets</heading>

<P>
JIS (Japanese Industrial Standards) are the national standards that define
the coded character sets (CCS) and encodings used in Japan.
The major coded character sets in Japan are:
<list>
<item>JIS X 0201-1976 Roman characters (almost the same as ASCII, but 0x5c
is the Yen sign instead of backslash and 0x7e is an overline instead of tilde),
<item>JIS X 0201-1976 Kana (about 60 KATAKANA characters),
<item>JIS X 0208-1997 1st and 2nd levels (about 7000 characters
including symbols, numeric characters, Latin, Cyrillic and
Greek alphabets, Japanese HIRAGANA, KATAKANA, and KANJI),
<item>JIS X 0212 (about 6000 characters including KANJI, which are not
included in JIS X 0208), and
<item>JIS X 0213:2000 (aka JIS 3rd and 4th levels).
</list>
</P>

<P>
<strong>JIS X 0201 Roman</strong> is the Japanese version of ISO 646.
Though JIS X 0201 is included in the SHIFT-JIS encoding (explained later) and
widely used on Windows/Macintosh, its use is not encouraged on UNIX.
</P>

<P>
<strong>JIS X 0201 Kana</strong> defines about 60 KATAKANA characters.
It was widely used by old 8-bit computers.
Indeed, the SHIFT-JIS encoding was designed to be upward-compatible
with the 8-bit encoding of JIS X 0201 Roman and JIS X 0201 Kana.
Note that this CCS is not included in the ISO-2022-JP encoding, which is
used for e-mail and so on.
</P>

<P>
<strong>JIS X 0212</strong> is not widely used, probably because it cannot be
included in SHIFT-JIS, the standard encoding for the Japanese versions
of Windows and Macintosh. Moreover, this CCS may become obsolete
once JIS X 0213 becomes popular, since JIS X 0213 contains many
of the characters that are included in JIS X 0212.
However, the advantage of JIS X 0212 over JIS X 0213 is that
all characters in JIS X 0212 are included in the current
Unicode (version 3.0.1), while not all characters in JIS X 0213
are.
</P>

<P>
<strong>JIS X 0208</strong> (aka JIS C 6226) is the main standard
for Japanese characters.
Strictly speaking, it was originally defined in 1978 and
revised in 1983, 1990, and 1997.
Though the 1997 version has 77 more characters than the original 1978 version
and the shapes of more than 200 characters were changed,
most software does not have to care about the differences between them.
However, be aware that the ISO-2022-JP encoding (explained below)
contains both JIS X 0208-1978 and JIS X 0208-1983.
The 1978 version is called 'old JIS' and the later one is called 'new JIS'.
Characters in JIS X 0208 are divided into two levels, 1st and 2nd.
Old 8-bit computers rarely implemented the 2nd level.
</P>

<P>
Use of the numeric characters and Latin letters in JIS X 0208 is
not encouraged, because these characters are also included in ASCII
and JIS X 0201 Roman, one of which is included in every encoding.
When converted into Unicode, these characters are mapped to the
'fullwidth forms'.
</P>
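
<P>
As a rough illustration of the fullwidth forms, the following Python sketch
encodes a fullwidth letter as a two-byte JIS X 0208 character in EUC-JP and
then folds fullwidth text back to plain ASCII with Unicode NFKC normalization.
Folding with NFKC is one possible approach chosen for this example, not
something the standards require, and <tt>fold_fullwidth</tt> is a hypothetical
helper name.
</P>

```python
import unicodedata


def fold_fullwidth(text):
    """Fold fullwidth letters and digits to their ASCII equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)


wide = "ＡＢＣ１２３"  # fullwidth forms, i.e. the JIS X 0208 Latin letters and digits
print(unicodedata.name(wide[0]))   # FULLWIDTH LATIN CAPITAL LETTER A
print(wide[0].encode("euc_jp"))    # two bytes, because it is a JIS X 0208 character
print(fold_fullwidth(wide))        # ABC123
```

<P>
Note that NFKC also folds other compatibility characters (for example
halfwidth Katakana), so a program that wants to touch only letters and digits
would need a narrower mapping.
</P>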

<P>
All of these coded character sets (except for JIS X 0213) are
included in Unicode 3.0.1. Some JIS X 0213 characters are not
included in Unicode 3.0.1.
</P>

<P>
There are a few different tables for conversion between the non-letter
characters in JIS X 0208 and Unicode. This is a problem because
it may break 'round-trip compatibility'.
<url id="http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html"
name="Problems and Solutions for Unicode and User/Vendor Defined Characters">
discusses this problem in detail.
</P>


<sect1 id="japanese-encodings"><heading>Encodings</heading>

<P>
Three encodings are widely used in Japan.
<list>
<item>ISO-2022-JP (aka JIS code or JUNET code)
<list>
<item>stateful
<item>a subset of the 7-bit version of ISO-2022, in which ASCII,
JIS X 0201-1976 Roman, JIS X 0208-1978,
and JIS X 0208-1983 are supported.
<item>7-bit, which means that the most significant bit (MSB) of each
byte is always zero.
<item>used for e-mail and net-news and preferred for HTML.
<item>defined in RFC 1468.
</list>
<item>EUC-JP (Japanese version of Extended UNIX Code)
<list>
<item>stateless
<item>an implementation of EUC where G0, G1, G2, and G3 are
ASCII, JIS X 0208, JIS X 0201 Kana, and JIS X 0212
respectively. Many implementations cannot
use JIS X 0201 Kana and JIS X 0212.
<item>8-bit
<item>the preferred encoding for UNIX. For example, almost all Japanese
message catalogs for gettext are written in EUC-JP.
<item>Japanese characters are mapped into <tt>0xa0</tt> - <tt>0xff</tt>.
This is important
for programmers because they need not worry about
fake '\' or '/' bytes (which can be treated specially in
various contexts) appearing in the Japanese code.
</list>
<item>SHIFT-JIS (aka Microsoft Kanji Code)
<list>
<item>stateless
<item>NOT a subset of ISO-2022
<item>8-bit
<item>JIS X 0201 Roman, JIS X 0201 Kana, and JIS X 0208
can be expressed, but JIS X 0212 cannot.
<item>the standard encoding for Windows/Macintosh. This makes
SHIFT-JIS the most popular encoding in Japan. Though MS
is considering a transition to UNICODE, it is doubtful
whether it can be done successfully.
</list>
</list>
</P>
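
<P>
The remark about fake '\' bytes can be checked with a short Python sketch.
It encodes one Kanji, picked here only for illustration, in both SHIFT-JIS
and EUC-JP and looks for a byte equal to the ASCII backslash.
<tt>contains_ascii_byte</tt> is a hypothetical helper name.
</P>

```python
def contains_ascii_byte(encoded, ascii_char):
    """Return True if the byte value of ascii_char occurs anywhere in encoded."""
    return ascii_char.encode("ascii") in encoded


word = "表"  # a common Kanji, chosen only for illustration
sjis = word.encode("shift_jis")
euc = word.encode("euc_jp")

print(sjis.hex(), contains_ascii_byte(sjis, "\\"))  # the second byte is 0x5c
print(euc.hex(), contains_ascii_byte(euc, "\\"))    # both bytes have the MSB set
```

<P>
A program that scans SHIFT-JIS text byte by byte for path separators or
escape characters can therefore be fooled, while the same scan over EUC-JP
text is safe.
</P>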

<P>
<strong>ISO-2022-JP</strong> is a subset of the 7-bit version of ISO 2022,
where only G0 is used and G0 is assumed to be invoked into GL.
The character sets included in ISO-2022-JP are:
<list>
<item>ASCII (ESC 0x28 0x42),
<item>JIS X 0201-1976 Roman (ESC 0x28 0x4a),
<item>JIS X 0208-1978 (old JIS) (ESC 0x24 0x40), and
<item>JIS X 0208-1983 (new JIS) (ESC 0x24 0x42).
</list>
Note that JIS X 0208-1978 and JIS X 0208-1983 are almost identical,
and that ASCII and JIS X 0201-1976 Roman are also almost identical.
A line (a stream of bytes between 'newline' control codes) must
start in the ASCII state and end in the ASCII state.
See <ref id="iso-2022"> for details.
</P>
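
<P>
The stateful behaviour can be seen in a small Python sketch. It encodes a
mixed line with Python's <tt>iso2022_jp</tt> codec and then splits the bytes
into runs at the four escape sequences listed above.
<tt>charset_runs</tt> is a hypothetical helper written for this example; it
is not a complete ISO 2022 parser.
</P>

```python
# Escape sequences of ISO-2022-JP, as listed in the text.
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201-1976 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}


def charset_runs(data):
    """Split ISO-2022-JP bytes into (character set, payload) runs."""
    runs = []
    current = "ASCII"  # every line starts in the ASCII state
    start = 0
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in DESIGNATIONS:
            if i > start:
                runs.append((current, data[start:i]))
            current = DESIGNATIONS[seq]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        runs.append((current, data[start:]))
    return runs


encoded = "abc あ xyz".encode("iso2022_jp")
print(encoded)
for name, payload in charset_runs(encoded):
    print(name, payload)
```

<P>
Every byte of the result is below 0x80, and the encoder switches back to
ASCII before the trailing text, as the rule about line ends requires.
</P>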

<P>
<strong>ISO-2022-JP-2</strong> (RFC 1554) is a subset of the 7-bit version
of ISO 2022 and a superset of ISO-2022-JP. The difference between ISO-2022-JP
and ISO-2022-JP-2 is that ISO-2022-JP-2 has more coded character sets
than ISO-2022-JP. The character sets included in ISO-2022-JP-2 are:
<list>
<item>ASCII (ESC 0x28 0x42),
<item>JIS X 0201-1976 Roman (ESC 0x28 0x4a),
<item>JIS X 0208-1978 (old JIS) (ESC 0x24 0x40),
<item>JIS X 0208-1983 (new JIS) (ESC 0x24 0x42),
<item>GB2312-80 (simplified Chinese) (ESC 0x24 0x41),
<item>KS C 5601 (Korean) (ESC 0x24 0x28 0x43),
<item>JIS X 0212-1990 (ESC 0x24 0x28 0x44),
<item>ISO 8859-1 (Latin-1) (ESC 0x2e 0x41), and
<item>ISO 8859-7 (Greek) (ESC 0x2e 0x46).
</list>
Though JIS X 0212-1990 may sometimes be used, ISO-2022-JP-2
is rarely used.
</P>

<P>
<strong>ISO-2022-INT-1</strong> is a superset of ISO-2022-JP-2 which adds
CNS 11643-1986-1 and CNS 11643-1986-2 (traditional Chinese).
</P>

<P>
<strong>EUC-JP</strong> is a version of EUC, where
G0 is ASCII, G1 is JIS X 0208, G2 is JIS X 0201 Kana, and
G3 is JIS X 0212. G2 and G3 are sometimes not implemented.
This is the most popular encoding for Linux/Unix.
See <ref id="euc"> for details.
</P>

<P>
<strong>SHIFT-JIS</strong> is designed to be a superset of
the encodings for old 8-bit computers, which include JIS X 0201
Roman and JIS X 0201 Kana. <tt>0x20</tt> - <tt>0x7f</tt>
is JIS X 0201 Roman and <tt>0xa0</tt> - <tt>0xdf</tt> is
JIS X 0201 Kana. <tt>0x80</tt> - <tt>0x9f</tt> and <tt>0xe0</tt>
- <tt>0xff</tt> are the first bytes of double-byte characters.
The second byte is <tt>0x40</tt> - <tt>0x7e</tt> or
<tt>0x80</tt> - <tt>0xfc</tt>. This code space is used for JIS X 0208.
</P>
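
<P>
Because SHIFT-JIS is stateless, a byte stream can be cut into characters by
looking at each first byte. The Python sketch below does this. As an
assumption that goes slightly beyond the ranges quoted above, it treats only
0x81 - 0x9f and 0xe0 - 0xfc as first bytes, which is the narrower range
commonly used in practice; <tt>split_sjis</tt> is a hypothetical helper name.
</P>

```python
def is_sjis_lead_byte(b):
    """True if b starts a double-byte character (assumed ranges, see the text)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_sjis(data):
    """Cut SHIFT-JIS bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        step = 2 if is_sjis_lead_byte(data[i]) else 1
        chars.append(data[i:i + step])
        i += step
    return chars


# 'a' (JIS X 0201 Roman), 0xb1 (JIS X 0201 Kana), 0x95 0x5c (JIS X 0208)
data = b"a\xb1\x95\x5c"
for piece in split_sjis(data):
    print(piece.hex(), piece.decode("shift_jis"))
```

<P>
Note how the byte 0x5c is consumed as the second byte of a Kanji and is not
mistaken for a separate character.
</P>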

<P>
UNICODE is not popular in Japan at all, probably because
conversion from these codes into Unicode is somewhat difficult.
However, MS Windows uses Unicode in limited areas, for example,
as the internal code for file names.
I guess more and more software will come to support
Unicode in the future.
</P>

<P>
You can convert files between these encodings using
the <package>nkf</package> or <package>kcc</package> package.
With the options <tt>-j</tt>, <tt>-s</tt>, and <tt>-e</tt>,
<prgn>nkf</prgn> converts a file into ISO-2022-JP (aka JIS),
SHIFT-JIS (aka MS-KANJI), and EUC-JP, respectively. Note that
the difference between JIS X 0201 Roman and ASCII is ignored.
Though <prgn>nkf</prgn> can guess the encoding of
the input file, you can also specify the encoding with a command option.
This is because there is no algorithm that can completely distinguish
EUC-JP from SHIFT-JIS, though <prgn>nkf</prgn> usually guesses
correctly. <prgn>tcs</prgn> can also convert between these encodings,
though without guessing the input encoding.
Conversion between these encodings can be done with a simple
algorithm, since all of them are based on the same character sets.
Conversion between these encodings and Unicode, in contrast, needs a table.
</P>
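
<P>
The 'simple algorithm' can be sketched in Python as pure arithmetic on the
two bytes of a JIS X 0208 code, with no conversion table. This is an
illustration under stated assumptions, not how <prgn>nkf</prgn> is
implemented: the function names are hypothetical, and
<tt>euc_to_sjis</tt> handles only ASCII and JIS X 0208 (G0 and G1), not
JIS X 0201 Kana or JIS X 0212.
</P>

```python
def jis_to_euc(j1, j2):
    """JIS X 0208 code (two bytes, 0x21-0x7e each) to EUC-JP: set the MSB."""
    return bytes([j1 | 0x80, j2 | 0x80])


def jis_to_sjis(j1, j2):
    """JIS X 0208 code (two bytes, 0x21-0x7e each) to SHIFT-JIS."""
    row = j1 - 0x21            # 0..93
    cell = j2 - 0x21           # 0..93
    s1 = (row >> 1) + 0x81     # two JIS rows share one SHIFT-JIS first byte
    if s1 > 0x9F:
        s1 += 0x40             # skip the JIS X 0201 Kana area
    if row % 2 == 0:
        s2 = cell + 0x40
        if s2 >= 0x7F:
            s2 += 1            # 0x7f is never used as a second byte
    else:
        s2 = cell + 0x9F
    return bytes([s1, s2])


def euc_to_sjis(data):
    """Convert EUC-JP bytes (ASCII and JIS X 0208 only) to SHIFT-JIS."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b >= 0xA1 and i + 1 < len(data):
            out += jis_to_sjis(b & 0x7F, data[i + 1] & 0x7F)
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


text = "日本語 text あ表ソ"
print(euc_to_sjis(text.encode("euc_jp")) == text.encode("shift_jis"))
```

<P>
The last line compares the arithmetic result with Python's own table-driven
codecs as a cross-check.
</P>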


<sect1 id="japanese-how"><heading>How These Encodings Are Used --- Information for Programmers</heading>

<P>
Since EUC-JP is widely used on UNIX,
EUC-JP should be supported. Exceptions are shown below.
Of course, hard-coding knowledge of EUC-JP is not
encouraged. If you can implement your software without such knowledge by using
<tt>wchar_t</tt> and so on, you should do so.
<list>
<item>The body of mail and news messages must be written in ISO-2022-JP.
<item>The de-facto standard for ICQ is SHIFT-JIS.
<item>WWW browsers must recognize all encodings.
<item>Software which communicates with Windows/Macintosh should use
SHIFT-JIS.
<item>SHIFT-JIS is widely used for BBS. (A BBS is a service like Compuserve.)
<item>File names on Joliet-format CD-ROMs used for Windows are written
in Unicode.
</list>
</P>



<sect1 id="japanese-columns"><heading>Columns</heading>

<P>
On consoles which can display Japanese characters
(kon, jfbterm, kterm, krxvt, and so on), characters in JIS X 0201
(Roman and Kana) occupy 1 column
and characters in JIS X 0208, JIS X 0212, and JIS X 0213 occupy 2 columns.
</P>



<sect1 id="japanese-direction"><heading>Writing Direction and Combined Characters</heading>

<P>
Japanese can be written vertically. Each line runs
downward and successive lines proceed from right to left. This direction
is the traditional style. For example, most Japanese books, magazines,
and newspapers, except those in the field of natural science
(or ones containing many Latin words or equations), are written
vertically. Thus a word processor is strongly recommended
to support this direction. DTP systems which don't support this
direction are almost useless.
</P>

<P>
Japanese can also be written in the same direction as Latin-script
languages. Japanese books and magazines on science and technology
are written in this direction. For most ordinary software it is enough
to support this direction only.
</P>

<P>
A few Japanese characters need different glyphs for the vertical
direction. These are the ones you would expect: parentheses and the
'long syllable' symbol, whose shape is like a dash in English or the
mathematical 'minus' sign. The symbols equivalent to
period and comma also have different styles for the horizontal and vertical
directions.
</P>

<P>
In Japan, Arabic numerals are widely used, as in European
languages, though we also have Kanji (ideogram) numerals.
Latin characters
can also appear in Japanese text. If a run of 1 - 3 (or 4)
Arabic or Latin characters appears in Japanese vertical text, these characters
can be crowded into one column. If more characters appear (large numbers
or long words), the paper is rotated 90 degrees anticlockwise and
the characters are written in the European way. Sometimes Latin characters
which appear in vertical text are written in the same way as Japanese
characters, i.e., upright, one below another. This is not a very strong
custom. Arabic and Latin characters can always be written in either the
normal or the rotated way in vertical text.
<footnote>
I HAVE TO SHOW EXAMPLE USING GRAPHICS.
</footnote>
A DTP system should support all of them.
</P>

<P>
A version of Japanized TeX (developed by ASCII, a publishing company
in Japan) supports the vertical direction. It can also
handle a page containing both vertical and horizontal text.
</P>


<sect1 id="japanese-layout"><heading>Layout of Characters</heading>

<P>
In Japanese, unlike European languages, words are not separated by spaces and
a line can be broken anywhere, with a few exceptions.
Thus hyphenation is not needed for Japanese.
</P>

<P>
Characters like opening parentheses cannot come at the end
of a line. Characters like closing parentheses and
sentence-separating marks such as period and comma cannot
come at the beginning of a line. This rule and its processing are
called 'kinsoku' in Japanese.
</P>
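
<P>
A very small kinsoku sketch in Python follows. It breaks text every
<tt>width</tt> characters and moves a break earlier whenever it would put a
forbidden character at the start or end of a line. The two character lists
are short samples chosen for this example, not a complete rule set, and
<tt>wrap_kinsoku</tt> is a hypothetical name. Real implementations also
offer other strategies, such as letting the forbidden character hang past
the margin.
</P>

```python
NO_LINE_START = "）」』、。，．"  # closing brackets and punctuation (sample set)
NO_LINE_END = "（「『"            # opening brackets (sample set)


def wrap_kinsoku(text, width):
    """Break text into lines of at most width characters, obeying kinsoku."""
    lines = []
    i = 0
    n = len(text)
    while i < n:
        end = min(i + width, n)
        # Move the break earlier while it violates a rule,
        # but always keep at least one character on the line.
        while (end > i + 1 and end < n
               and (text[end] in NO_LINE_START or text[end - 1] in NO_LINE_END)):
            end -= 1
        lines.append(text[i:end])
        i = end
    return lines


for line in wrap_kinsoku("これは「例」です。", 4):
    print(line)
```

<P>
With a width of 4, a naive break would leave the opening bracket at the end
of the first line; the sketch ends the first line one character earlier
instead.
</P>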

<P>
In European languages, a line break is equivalent to a space.
In Japanese, a line break should be ignored.
For example, when rendering an HTML file, a line-break character
in the HTML source should not be converted into whitespace.
</P>
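
<P>
One possible way to apply this when joining source lines is sketched below
in Python. The rule used here, dropping the line break only when the
characters on both sides are wide (East Asian Width 'W' or 'F') and
inserting a space otherwise, is an assumption made for this example;
<tt>join_source_lines</tt> is a hypothetical helper name.
</P>

```python
import unicodedata


def is_wide(ch):
    """True for characters classed as wide or fullwidth in Unicode."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_source_lines(source):
    """Join lines: no space between two wide characters, one space otherwise."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return ""
    out = lines[0]
    for ln in lines[1:]:
        if is_wide(out[-1]) and is_wide(ln[0]):
            out += ln          # Japanese on both sides: ignore the break
        else:
            out += " " + ln    # otherwise the break acts as a space
    return out


print(join_source_lines("日本語の\n文章"))
print(join_source_lines("hello\nworld"))
```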




<sect1 id="japanese-lang"><heading>LANG variable</heading>

<P>
Different values of <tt>LANG</tt> are used for different encodings.
</P>

<P>
The following values are used for EUC-JP.
<list>
<item><tt>LANG=ja_JP.eucJP</tt> (the major one for Linux and *BSD)
<item><tt>LANG=ja_JP.ujis</tt> (used to be the major one for Linux)
<item><tt>LANG=ja_JP</tt> (because EUC-JP is the de-facto standard for UNIX;
not recommended)
<item><tt>LANG=ja</tt> (because EUC-JP is the de-facto standard for UNIX;
not recommended)
</list>
</P>

<P>
<tt>LANG=ja_JP.jis</tt> is used for ISO-2022-JP (aka JIS code or JUNET code).
</P>

<P>
<tt>LANG=ja_JP.sjis</tt> is used for SHIFT-JIS (aka Microsoft Kanji Code).
</P>
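
<P>
The mapping from <tt>LANG</tt> values to encodings can be written down as a
small lookup, sketched here in Python. The table contains only the values
listed above; the right-hand sides are the names of Python's own codecs, and
<tt>codec_for_lang</tt> is a hypothetical helper, not part of any locale
library.
</P>

```python
# Codeset suffixes from the text, mapped to Python codec names.
CODESET_TO_CODEC = {
    "eucjp": "euc_jp",
    "ujis": "euc_jp",
    "jis": "iso2022_jp",
    "sjis": "shift_jis",
}


def codec_for_lang(lang):
    """Return the Python codec implied by a Japanese LANG value, or None."""
    locale_name, _, codeset = lang.partition(".")
    if codeset:
        return CODESET_TO_CODEC.get(codeset.lower())
    if locale_name in ("ja", "ja_JP"):
        return "euc_jp"  # EUC-JP is the de-facto default on UNIX
    return None


for value in ("ja_JP.eucJP", "ja_JP.ujis", "ja_JP", "ja", "ja_JP.jis", "ja_JP.sjis"):
    print(value, codec_for_lang(value))
```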

<P>
Setting LANG is not sufficient for a Japanese user who has just installed
Linux to get even a minimal Japanese environment. There are several
books on setting up a Japanese environment on Linux/BSD, and
Linux magazines often carry feature articles on how to set one up.
Nowadays many Japanized Linux distributions,
optimized so that much of the basic software can display and
accept Japanese input, are popular.
Debian GNU/Linux has the <package>user-ja</package> (for potato) and
<package>language-env</package> (for woody and later versions)
packages to set up a basic Japanese environment.
</P>


<sect1 id="japanese-input"><heading>Input from Keyboard</heading>

<P>
Since Japanese characters cannot be input directly from a keyboard,
software is needed to convert ASCII characters into Japanese.
<prgn>WNN</prgn>, <prgn>Canna</prgn>, and <prgn>SKK</prgn> are popular
free software for inputting Japanese. Though
<prgn>T-Code</prgn> is also available, it is difficult to use.
Since these adopt a server/client model and implement their own protocols,
we cannot input Japanese with <package>wnn</package>,
<package>canna</package>, or <package>skk</package>
(and the packages they depend on) alone.
</P>

<P>
In the X Window System environment, the <package>kinput2-*</package> and
<package>skkinput</package> packages
connect these protocols to XIM, which is the standard input
protocol for X. Kinput2 also has its own protocol, and
<package>kterm</package> and others can act as clients of the kinput2 protocol.
The kinput2 protocol was developed before international standards such
as XIM (or Ximp or Xsi) became available.
</P>

<P>
On the console, there is no standard and each piece of software has to support
the wnn and/or canna protocol itself. Examples are <package>jvim-canna</package>,
<package>xemacs21-mule-canna</package>, and emacs20 with
<package>emacs-dl-canna</package> or <package>emacs-dl-wnn</package>.
Thus the way of operating differs from one program to another.
<package>skkfep</package> provides a general way to input Japanese
on the console.
</P>

<P>
The way to input Japanese is explained next.
</P>

<P>
Since almost all Hiragana and Katakana characters represent a pair of a consonant
and a vowel with one character, we can input one Hiragana or
one Katakana with two Latin letters. A few Hiragana and Katakana
need one or three letters.
</P>
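
<P>
The first step of input, turning Latin letters into Hiragana, can be
sketched in Python with a longest-match table lookup. The table below is a
tiny illustrative subset and <tt>romaji_to_hiragana</tt> is a hypothetical
name; real input methods use a full table and extra rules, for example for
doubled consonants.
</P>

```python
# A tiny sample table: one-, two-, and three-letter spellings.
ROMAJI_TABLE = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "na": "な", "ni": "に", "ha": "は", "ho": "ほ",
    "n": "ん", "ji": "じ", "go": "ご",
    "kya": "きゃ", "kyo": "きょ",
}


def romaji_to_hiragana(text):
    """Convert Latin letters to Hiragana, trying the longest spelling first."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in ROMAJI_TABLE:
                out.append(ROMAJI_TABLE[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # unknown letter: pass it through unchanged
            i += 1
    return "".join(out)


for word in ("kanji", "nihongo", "kyou"):
    print(word, romaji_to_hiragana(word))
```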

<P>
Kanji is obtained by conversion from Hiragana.
Many Japanese words are written with two or more Kanji,
and almost all recent conversion software can convert such words in one step.
(Old software can convert only one Kanji at a time. You need patience
to work this way.)
Software with a good grammar/context analyzer and a large dictionary
can convert longer phrases or even a whole sentence at once.
However, we usually have to select one Kanji or word from
the candidates the software shows, because Japanese has
many homophones. For example, 61 Kanji whose reading is 'KAN'
and 6 words whose reading is 'KOUKOU' are registered in
the dictionary of <tt>canna</tt>.
(Today, 2 Oct 1999, I saw a TV advertisement for a Japanese word processor
which claims that the software can correctly convert an input into
'a cafe which opened today', not 'a cafe which rotated today'.
Though the Japanese word 'KAITEN' means both 'open (a shop)' and 'rotate',
the software knows it is more usual for a cafe to open than to rotate.)
</P>

<P>
The conversion from Hiragana to Kanji needs a large dictionary which
contains the Kanji spellings and readings of the major Japanese words,
along with their conjugations or inflections. Thus proprietary software tends to
convert efficiently. It usually has dictionaries larger than
a few megabytes. Some recent proprietary software
even analyzes the topic or meaning of the input Hiragana sentence
and chooses the most appropriate homophone, though it often chooses
the wrong one.
</P>

<P>
Nowadays several proprietary conversion programs for Linux, such as ATOK, WNN6,
and VJE, are sold in Japan.
</P>

<P>
Since inputting Japanese characters is complex and hard work for users,
we don't want to input Y (for YES) or N (for NO) in Japanese.
We prefer learning such basic English words to inputting Japanese
words by invoking the conversion software, typing the Latin-alphabet
spelling of the Japanese, converting it into Hiragana, converting
it into Kanji, choosing the correct Kanji, confirming the
choice, and closing the conversion software every time we need to
input yes or no or similar words.
</P>


<sect1 id="japanese-more"><heading>More Detailed Discussions</heading>

<sect2 id="japanese-width"><heading>Width of Characters</heading>

<P>
Unlike in European languages, Japanese characters should be
written in a fixed width. Exceptions arise when two symbols
such as parentheses, periods, and commas occur in succession. Kerning
should be done in such cases if the software is a word processor.
A text editor need not do so.
</P>

<sect2 id="japanese-ruby"><heading>Ruby</heading>

<P>
<strong>Ruby</strong> is small (usually 1/2 in length and 1/4 in
area, or a bit smaller)
characters written above (in horizontal text) or to the right
(in vertical text) of the main text. It is usually used to show
the reading of a difficult Kanji.
</P>

<P>
Japanized TeX can produce ruby with an extra macro. Word processors should
have a Ruby facility.
</P>


<sect2 id="japanese-case"><heading>Upper And Lower Cases</heading>

<P>
Japanese characters do not have upper and lower case, although
there are two sets of phonograms, Hiragana and Katakana.
</P>

<P>
Thus <tt>tolower()</tt> and <tt>toupper()</tt> should not convert
between Hiragana and Katakana.
</P>

<P>
Hiragana is used for ordinary text. Katakana is used mainly to
express foreign or imported words, for example, KONPYU-TA
for computer, MAIKUROSOFUTO for Microsoft, and AINSYUTAIN for Einstein.
</P>


<sect2 id="japanese-sort"><heading>Sorting</heading>

<P>
Phonograms (Hiragana and Katakana) have a sorting order.
The order is the same as that defined in JIS X 0208, with a few exceptions.
</P>

<P>
Sorting ideograms (Kanji) is difficult. They should be sorted
by their readings, but almost all Kanji have several readings depending
on the context. So if you want to sort Japanese text, you will need
a dictionary of all Japanese Kanji words. Moreover, a few
Japanese words written in Kanji have different readings for
exactly the same sequence of Kanji; this occurs especially with personal
names.
So address-book databases usually have two 'name' columns,
one for the Kanji spelling and the other for Hiragana.
</P>
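
<P>
The two-column approach can be sketched in Python as follows. The records
and the field names <tt>kanji</tt> and <tt>reading</tt> are made up for this
example. Sorting the Hiragana readings by plain code point only approximates
the proper order, in line with the 'few exceptions' mentioned above.
</P>

```python
# Hypothetical address-book records: Kanji spelling plus Hiragana reading.
people = [
    {"kanji": "山田", "reading": "やまだ"},
    {"kanji": "阿部", "reading": "あべ"},
    {"kanji": "佐藤", "reading": "さとう"},
]


def sort_by_reading(records):
    """Sort records by the Hiragana reading instead of the Kanji spelling."""
    return sorted(records, key=lambda r: r["reading"])


def sort_by_kanji(records):
    """Sort by the code points of the Kanji, which ignores pronunciation."""
    return sorted(records, key=lambda r: r["kanji"])


print([r["kanji"] for r in sort_by_reading(people)])
print([r["kanji"] for r in sort_by_kanji(people)])
```

<P>
The two orders differ: only the first one matches what a Japanese reader
expects from an address book.
</P>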

<P>
I know of no software, free or proprietary, which can sort Japanese words
perfectly.
</P>


<sect2 id="japanese-romaji"><heading>Ro-ma ji (Alphabetic expression of Japanese)</heading>

<P>
We have a phonetic alphabetic representation of Japanese, Ro-ma ji.
It has an almost one-to-one correspondence with the Japanese phonograms.
It can be used to display Japanese text on the Linux console and
so on. Since Japanese has many homophones, text in this representation
can be hard to read.
</P>

<P>
There are several variants of Ro-ma ji.
</P>

<P>
The first distinguishing point is the handling of long syllables.
For example, a long 'E' syllable is written as:
<list>
<item>'E' with a circumflex (caret),
<item>'E' with a macron (upper bar),
<item>plain 'E', with the long syllable mark ignored,
<item>'EE',
<item>and so on.
</list>
</P>

<P>
The second distinguishing point is some special pairs
of consonant and vowel.
For example, the Hiragana character for the combination of 'T' and 'I' is
pronounced like 'CHI'. Such alternative spellings include:
<list>
<item>TI or CHI, as described above,
<item>TU or TSU,
<item>SI or SHI,
<item>HU or FU,
<item>WO or O,
<item>TYA or CHA, and
<item>N or M.
</list>
</P>
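
<P>
Software that compares or searches Ro-ma ji text may want to treat these
variants as equal. The Python sketch below folds some of the spellings
listed above into one form before comparing. It covers only the
unambiguous pairs (CHI, TSU, SHI, FU, CHA); the WO/O and N/M pairs depend
on context and are left out. The function names are hypothetical.
</P>

```python
# (variant spelling, folded spelling); only pairs taken from the list above.
VARIANT_PAIRS = [
    ("cha", "tya"),
    ("chi", "ti"),
    ("tsu", "tu"),
    ("shi", "si"),
    ("fu", "hu"),
]


def fold_romaji(text):
    """Fold the listed spelling variants into a single form."""
    text = text.lower()
    for variant, folded in VARIANT_PAIRS:
        text = text.replace(variant, folded)
    return text


def same_romaji(a, b):
    """True if a and b differ only in the spelling variants handled here."""
    return fold_romaji(a) == fold_romaji(b)


print(same_romaji("TSUCHI", "TUTI"))
print(same_romaji("ocha", "otya"))
print(same_romaji("fune", "hune"))
```

<P>
Folding in this direction is a deliberate choice: replacing in the opposite
direction with plain substring replacement would corrupt spellings such as
'chu', which contains 'hu'.
</P>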