Reading and Writing Chinese Characters and Pinyin on the Web Using Unicode

Background

While working on my page on the Chinese calendar, I needed to put Chinese characters and pinyin on the web. The most common way to write Chinese characters on the web is to use Guobiao (GB) encoding. To put pinyin on the web, you can use one of the many special pinyin fonts, or use numbers to indicate the tones, as in Guo2biao3. I have instead decided to use Unicode rather than Guobiao encoding on my web pages. This has many advantages, and I believe that it will eventually become the standard. Unfortunately, there are some problems at the moment.

For XHTML 1.0, I set <?xml version="1.0" encoding="UTF-8"?>, and for HTML 4.0 I set <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> rather than charset=gb2312. Fortunately, the fonts in the language packs from Microsoft (MS Song - Simplified/Serif, MS Hei - Simplified/Gothic and MingLiU - Traditional) and the Office 2000 fonts (Simsun - Simplified and PMingLiU - Traditional) have both GB and Unicode encoding tables associated with them.
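As a rough illustration of the HTML 4.0 declaration above, here is a small Python sketch that builds a page whose declared charset and actual byte encoding agree. The helper name build_page and the page skeleton are assumptions made for this example; only the meta tag itself comes from the text.

```python
import html

# The meta tag quoted in the text above.
META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'

def build_page(body: str) -> bytes:
    """Return a minimal HTML page as UTF-8 bytes (hypothetical helper)."""
    markup = (
        "<html>\n<head>\n" + META + "\n<title>Test</title>\n</head>\n"
        "<body>\n<p>" + html.escape(body) + "</p>\n</body>\n</html>\n"
    )
    # The declared charset and the encoding used here must be the same.
    return markup.encode("utf-8")

page = build_page("立春 ǎ")
```

If the bytes were written with one encoding while the meta tag declared another, a browser that trusts the declaration would show the wrong characters.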

If the “Install On Demand” option is checked at Tools | Internet Options | Advanced, then you can simply select Chinese at View | Encoding and the fonts and code pages will be downloaded and installed automatically. Or you can go to Windows Update. Just select “Chinese (Simplified) Language Support” or “Chinese (Traditional) Language Support”.

If you use Netscape, you can search for the files ie3lpktw.exe for Traditional Chinese or ie3lpkcn.exe for Simplified Chinese. (The name contains 3L, a three followed by the letter L, not thirty-one.)

Versions 2.76 or higher of Times New Roman, Arial and Courier New contain all the pinyin vowels. They are available from the TrueType core fonts for the Web section of the Microsoft Typography site. The fonts in Microsoft's Simplified Chinese Language Pack also have them, but they display all accented letters as if they were followed by spaces. The reason for this is that the widths of the accented vowels are aligned with the width of the hanzi.

If you use Internet Explorer and have installed support for Chinese, it should be automatic. It will use a Chinese font (like MS Song) for the Chinese characters and a Latin font (like Times New Roman) for the rest. If your Latin font supports pinyin you're fine!

Part of the reason why IE can do this is that it “cheats”. It doesn't consider Unicode as a codepage, but uses the fonts specified in the language settings. In Netscape, I go to Edit | Preferences | Fonts. There's an item for Unicode and I can select a suitable font. But in Internet Explorer, when I go to Tools | Options | Fonts, there's no item for Unicode. They have “Latin based” and “Chinese simplified” and so on. So instead of specifying one font for Unicode, I have to set each language separately. And if I want a language that IE hasn't heard about, or want to use symbols from Unicode, I may be in trouble.

Netscape can only use characters from a single encoding to display a Web page, and does not implement any alternative encoding that you select from the View menu if the page has a charset specified in a meta tag. It does not build Unicode from its constituent codepages but treats it just like other codepages. That's why Netscape can't use Times New Roman for the Latin text and pinyin and MS Song for the Chinese characters the way IE does.

You will have to go to Edit | Preferences | Fonts and select an appropriate font. If you have a Chinese Unicode font like Arial Unicode MS or Bitstream Cyberbit, you're OK. Just select that for Unicode. If not, choose a Chinese GB font like MS Song for Unicode. Unfortunately, this is not a good solution. The Latin characters in that font are not very pretty and the font leaves an extra space after the pinyin characters with tone marks, as explained above.

For more help on configuration, you can take a look at the page on Setting up Windows Internet Explorer 5, 5.5 and 6 for Multilingual and Unicode Support or Setting up Windows Netscape Browsers for Multilingual and Unicode Support, part of Alan Wood's Unicode Resources.

Testing Your Browser

Here's a word in GB (b´º) and Unicode (立春) and a 3rd tone in pinyin (ǎ) for testing purposes. If you can't read the first one, but the second says “beginning of spring”, you're OK for the Chinese characters. If the first one says “beginning of spring” and the second something weird, then you're not OK. If you're using MS Song to view the pinyin, there will be an extra space after the characters with tone marks, as explained above.

Most people should be able to see the pinyin, i.e., an “a” with a pointed upside down hat.
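The GB versus Unicode test above comes down to which bytes are on the page and how they are decoded. The following Python sketch, using only the standard codecs, shows the same word in both encodings and what happens when GB bytes are read with the wrong one. The variable names are made up for this example.

```python
word = "立春"

gb_bytes = word.encode("gb2312")    # two bytes per hanzi
utf8_bytes = word.encode("utf-8")   # three bytes per hanzi for these characters

# GB bytes read as Latin-1 turn into accented-letter noise.
as_latin1 = gb_bytes.decode("latin-1")
print(as_latin1)

# GB bytes are not valid UTF-8; with errors="replace" the invalid
# bytes are shown as U+FFFD replacement characters.
as_utf8 = gb_bytes.decode("utf-8", errors="replace")
print(as_utf8)

# Decoding with the encoding that was actually used restores the word.
print(gb_bytes.decode("gb2312"), utf8_bytes.decode("utf-8"))
```

This is why the GB sample looks like garbage on a page declared as UTF-8, while the Unicode sample reads correctly.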

And if you don't know what “beginning of spring” means, you may want to read my paper about The Mathematics of the Chinese Calendar!

Writing Pinyin

When writing in pinyin, the 2nd and 4th tones are easy. Their marks are just the acute and grave accents, and the accented vowels are part of any standard font. But the 1st and 3rd tones, and tone marks over the u with umlaut, are harder. The mark for the 1st tone is called the macron and the mark for the 3rd tone is called the caron.
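To make the four marks concrete, here is a minimal Python sketch that turns numbered pinyin such as Guo2biao3 into pinyin with tone marks by attaching Unicode combining marks and composing the result. The function names are hypothetical, and the placement rule used (a or e if present, otherwise the o in ou, otherwise the last vowel) is the commonly taught convention rather than something stated on this page.

```python
import re
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030C", 4: "\u0300"}

def mark_syllable(syllable: str) -> str:
    """Turn one numbered syllable such as 'biao3' into 'biǎo'."""
    if not syllable or syllable[-1] not in "12345":
        return syllable
    letters = unicodedata.normalize("NFC", syllable[:-1])
    tone = int(syllable[-1])
    if tone not in TONE_MARKS:
        return letters  # tone 5 is the neutral tone: no mark
    lower = letters.lower()
    pos = -1
    for target in ("a", "e"):
        pos = lower.find(target)
        if pos != -1:
            break
    if pos == -1:
        pos = lower.find("ou")  # the mark goes on the 'o' of 'ou'
    if pos == -1:
        vowel_positions = [i for i, ch in enumerate(lower) if ch in "aeiouü"]
        pos = vowel_positions[-1] if vowel_positions else -1
    if pos == -1:
        return letters
    marked = letters[:pos + 1] + TONE_MARKS[tone] + letters[pos + 1:]
    # NFC composes letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", marked)

def add_tone_marks(text: str) -> str:
    """Convert every numbered syllable in text (assumes ü is typed as ü)."""
    return re.sub(r"[A-Za-züÜ]+[1-5]", lambda m: mark_syllable(m.group()), text)

print(add_tone_marks("Guo2biao3"))
```

Composing to precomposed characters matters here, because fonts of the kind discussed above are more likely to contain the precomposed vowels than to position a separate combining mark correctly.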

Warning: Do not try to use the breve (ă), a with round upside down hat, for the third tone; it doesn't look right (the upside down hat should be pointed) and the character is not part of MS Song. So if you are using Netscape with MS Song as your Unicode font, you will not see it.
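The two characters are easy to confuse on screen, so it can help to check them by name. The Python sketch below prints the Unicode name of each and whether it can be encoded in GB2312. Using the GB2312 character set as a stand-in for what a GB font contains is my assumption; a font's coverage and a character set are not the same thing.

```python
import unicodedata

def in_gb2312(ch: str) -> bool:
    """Return True if the character can be encoded in GB2312."""
    try:
        ch.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False

caron_a = "\u01ce"   # ǎ, the correct mark for the 3rd tone
breve_a = "\u0103"   # ă, the look-alike to avoid

for ch in (caron_a, breve_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), in_gb2312(ch))
```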

I will demonstrate two methods for inputting pinyin.

You can also use a character map, like Character Agent, ListFont or International Character Code Map.

You may also find it convenient to use some of the keyboard utilities listed on Alan Wood's Unicode Resources.

Writing Chinese Characters

There are many ways to do this, but if you just need to write a few words, a simple solution is to use MS Word and the MS IME. Here are some links about the MS IME.

Save as plain text and choose the appropriate encoding, either UTF-8, GB2312 or Big5. Make sure that no characters appear in red! Then open the text files in your HTML editor. Depending on which one you use, you may or may not see the Chinese characters. I use Dreamweaver, and it works fine there. If I open the text files in Notepad, the Chinese characters appear in the UTF-8 file, but not in the GB or Big5 files.
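The same check can be done outside Word. As a sketch, the Python below lists the characters in a text that the chosen encoding cannot represent (presumably the ones Word would flag in red, though that reading is my assumption) and refuses to save until there are none. The names unencodable and save_text are invented for this example.

```python
def unencodable(text, encoding):
    """Return the distinct characters of text that encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad

def save_text(path, text, encoding):
    """Write text to path in the given encoding, or raise if it would lose characters."""
    problems = unencodable(text, encoding)
    if problems:
        raise ValueError(f"cannot save as {encoding}: {problems}")
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)

print(unencodable("立春", "gb2312"))   # an empty list means it is safe to save
```

A call such as save_text("lichun.txt", "立春", "utf-8") would then write the file; the file name is only an example.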

If you're looking for a good Unicode editor, you may want to check out EmEditor. You can use the MS IME with it! They give out academic licenses for free!

Combining Simplified and Traditional Characters

With Unicode, this is a no-brainer. It is one of the main reasons why I use Unicode.
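A quick way to see why is to try encoding a text that mixes the two. The Python sketch below uses the simplified and traditional forms of one character as sample data; the helper can_encode is a name made up for this example.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

simplified = "语"
traditional = "語"
mixed = simplified + traditional

for encoding in ("gb2312", "big5", "utf-8"):
    print(encoding, can_encode(mixed, encoding))
```

GB2312 handles only the simplified form and Big5 only the traditional one, so of the three only UTF-8 can hold the mixed text.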

Character Code Conversion

You can do character code conversion with Chinese Encoding Converter at Erik E. Peterson's On-line Chinese Tools.
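For small jobs, the same kind of conversion can be sketched in a few lines of Python: decode the bytes with the encoding they were written in, then encode them again in the target encoding. I don't know how the online tool works internally; the function name convert_encoding is an assumption for this example.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from the source encoding to the target encoding."""
    # Decoding first means an unrepresentable character raises an error
    # instead of being silently corrupted.
    return data.decode(source).encode(target)

gb_data = "立春".encode("gb2312")
utf8_data = convert_encoding(gb_data, "gb2312", "utf-8")
print(utf8_data.decode("utf-8"))
```

Note that this changes only the byte encoding. It does not turn simplified characters into traditional ones or the other way round.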

Finding Unicode Codes

I often need to know the Unicode code for Chinese characters, either for TeX or HTML. You can input the characters in MS Word and copy them into Chinese Character Dictionary - Unicode Version at Erik E. Peterson's On-line Chinese Tools. You have to select the box for showing Unicode Value in the results and select UTF-8, and not Unicode, for the input. The other version, Chinese Character Dictionary, will not work, since it does not have the UTF-8 option. To convert to octal, you can use Conversion Table - Decimal, Hexadecimal, Octal, Binary.
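If Python is at hand, the lookup and the base conversion can be done in one step. This is a minimal sketch; the function name describe and the choice of output fields are mine.

```python
def describe(ch: str) -> dict:
    """Return the code point of one character in several notations."""
    cp = ord(ch)
    return {
        "char": ch,
        "unicode": f"U+{cp:04X}",       # hexadecimal code point
        "decimal": cp,
        "octal": f"{cp:o}",
        "html": f"&#x{cp:X};",          # HTML numeric character reference
        "utf8": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
    }

for ch in "立春ǎ":
    print(describe(ch))
```

The html field can be pasted directly into a page, which is handy when the editor cannot save the characters themselves.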

You can also use Convert characters to Unicode at pinyin.info.

Sources for Fonts

Belorussian translation

Belorussian translation provided by Movavi.

Links



Helmer Aslaksen
Department of Mathematics
National University of Singapore
helmer.aslaksen@gmail.com
