Programming | Blog: timdream

A new take on Asian IME for English audiences

East Asian input methods are not hard to understand, but for English speakers, a concrete example can do better than the general explanation on Wikipedia. This post is my attempt at that for English-speaking audiences.

It is written for my own amusement, but I hope you will like it. My recommendation is to read it once without tapping on the footnotes, then skim through it again with the footnotes.

Spoken languages are humans conveying ideas by making sounds. A human can only make a limited number of sounds, due to anatomy. Not every sound humans can make is used for a given spoken language.

There are around 44 sounds in spoken English. Linguists call them phonemes. English is written in the Latin alphabet, which has 26 letters. To represent 44 sounds, combinations of letters are used. Linguists call these letters and letter combinations graphemes. Each English phoneme can be represented by several graphemes (too many for some, without a spelling reform). Humans are taught to pick the right grapheme combinations to write down the exact words they intend to speak. It’s called spelling.

Now, imagine there is a language¹ written using a different script. Instead of Latin letters, it would traditionally be written with graphemes composed of distinct shapes of drawings². Linguists call these shapes logograms (referred to as monograms in the rest of this post). This particular imaginary script comes with tens of thousands of these monograms, which have the same relationship to the phonemes of the imaginary language as graphemes do in English: just as humans are taught to spell English words correctly, they are taught to pick the right monogram combinations to write down the exact words they intend to speak.

Most humans are born with ten fingers. Modern computer keyboards come with around 78 keys, designed for the ten fingers to operate. This is enough for the Latin alphabet, given that it has only 26 letters, but far from enough for the monograms of the imaginary language. Something would have to be done about that.

Thankfully, as with any human language, the number of sounds in the imaginary language would still be of a manageable magnitude. Before the introduction of modern computers, local linguists would have already identified these phonemes. They would have gone further and invented a set of symbols for these sounds. These symbols, called phonetic symbols, would “spell” a phoneme with just one to four symbols³. Unlike in English, since the symbols were constructed and not naturally developed, each symbol combination would represent only one phoneme, and each phoneme would be written by only one symbol combination, systematically.

Aside: Other linguists disagreed and used Latin letters to “spell” the phonemes of the same language⁴. The principle is the same, though.

Aside: Local linguists of another imaginary language picked a different route and decided to invent symbols to directly represent each sound. It’s “less” systematic, but it gets the job done too⁵.

Since the number of symbols is limited, they could then be arranged on a modern computer keyboard. Computers would then be loaded with a program⁶ that allows humans to pick the right monogram for each symbol combination, as they type.

Aside: For spelling the language with Latin letters, it is even easier: you don’t even have to arrange a different set of symbols on the keyboard.

When computers were dumb with limited capacity, these programs would only be implemented with a simple mapping table, mapping symbol combinations to monograms. This would be quite cumbersome, because words that people type are often the same monogram combinations, and humans really hate to repeat themselves.
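
As a rough illustration of such a mapping table, here is a minimal JavaScript sketch. The table name, the function name, and the handful of Bopomofo entries are made up for this example; a real table would hold thousands of entries, and nothing here is taken from an actual input method.

```javascript
// Hypothetical, tiny mapping table: one Bopomofo symbol combination
// (a syllable with its tone mark) -> the candidate characters it may stand for.
const SYLLABLE_TABLE = new Map([
  ['ㄇㄚ', ['媽']],
  ['ㄇㄚˊ', ['麻']],
  ['ㄇㄚˇ', ['馬', '碼']],
  ['ㄇㄚˋ', ['罵']],
  ['ㄊㄞˊ', ['台', '臺', '抬']],
  ['ㄨㄢ', ['灣', '彎']],
]);

// Return the candidates for one symbol combination,
// or an empty list when the table has no entry for it.
function lookupCandidates(syllable) {
  return SYLLABLE_TABLE.get(syllable) || [];
}

console.log(lookupCandidates('ㄇㄚˇ'));
```

With a table like this, the human still has to pick one candidate for every single symbol combination, which is exactly the repetition described above.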

Aside: A different school of programs for the same purpose would instead map monograms to visual symbols by decomposing their shapes, rather than the sounds they represent. Their mapping table would map visual symbol combinations to monograms⁷. This is very helpful for typing a monogram without knowing its sound, or regardless of the spoken language being written. Some argue it is easier to type too, given that arbitrarily many symbol combinations can be designed for the program, compared with a fixed number of phonemes.

As computers became more powerful, a new class of programs would have been developed. Instead of mapping symbol combinations to monograms (the shapes that make up the words), these programs would map multiple symbol combinations to words⁸. They would need a bigger mapping table for sure, and the table would also require constant curation, because of the endless evolution of human thoughts and their new words (in comparison, new monograms are rarely added).
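
To make the word-level idea concrete, here is a minimal JavaScript sketch. It assumes a tiny hypothetical phrase table and a greedy longest-match strategy. Both are simplifications chosen for illustration: real programs of this class typically rank candidates with statistical models instead of always taking the longest match.

```javascript
// Hypothetical phrase table: symbol combinations joined by spaces -> a word.
// Single-syllable entries act as the fallback when no longer word matches.
const PHRASE_TABLE = new Map([
  ['ㄊㄞˊ ㄨㄢ', '台灣'],
  ['ㄊㄞˊ ㄅㄟˇ', '台北'],
  ['ㄊㄞˊ', '台'],
  ['ㄨㄢ', '灣'],
  ['ㄅㄟˇ', '北'],
  ['ㄖㄣˊ', '人'],
]);

// Convert a list of typed syllables by repeatedly taking the longest
// run of syllables that has an entry in the phrase table.
function convert(syllables) {
  let output = '';
  let i = 0;
  while (i < syllables.length) {
    let matched = false;
    for (let len = syllables.length - i; len >= 1; len--) {
      const key = syllables.slice(i, i + len).join(' ');
      if (PHRASE_TABLE.has(key)) {
        output += PHRASE_TABLE.get(key);
        i += len;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Unknown syllable: keep it as typed and move on.
      output += syllables[i];
      i += 1;
    }
  }
  return output;
}

console.log(convert(['ㄊㄞˊ', 'ㄨㄢ', 'ㄖㄣˊ']));
```

Here the first two syllables are matched together as one word, so the human no longer has to pick each monogram separately.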

Thankfully, computers are also powerful enough to manage these tasks. Maintaining and developing these bigger mapping tables is also helped by the fact that computers have since been connected across the planet⁹ (and its lower orbits¹⁰, to be exact), so it would not be hard for computers to find a large body of text written in the imaginary language (a “text corpus”, as linguists and computer scientists call it) waiting to be extracted and processed¹¹.

Thus, through the ingenuity of these humble programs built upon linguistic knowledge, our imaginary language would have been allowed to thrive in the Information Age, expressed in monograms the same way it would have been written down for thousands of years, and perhaps for thousands of years to come.

If you like this post, you might like my not-to-be-updated JSZhuyin and its interactive tutorial. I imagine you already frequent many YouTube videos on linguistics and are a fan of the movie Arrival, like me.

¹ Mandarin Chinese is the imaginary language in question.
² Mandarin Chinese is traditionally written in Chinese characters, a set of monograms shared among East Asian languages. Among these languages, the usage of Chinese characters has only survived in Chinese, Japanese, and Korean, abbreviated as CJK in the information processing field.
³ This is how the Bopomofo phonetic symbol system spells Mandarin sounds. It was invented in the early 20th century. Hangul, invented in 1443 and thus predating modern linguistics, works more or less the same way for spelling Korean.
⁴ The Pinyin system is designed to spell Mandarin with Latin letters.
⁵ Japanese Kana is one such system.
⁶ These programs are input method programs, or IMEs, the subject of our discussion here.
⁷ An example of this kind of IME is the Cangjie input method, which codes Chinese characters with 24 invented “radical” symbols.
⁸ These newer IMEs are often dubbed “smart” or “intelligent” IMEs. As mentioned in the next paragraph, all IMEs are “smart” nowadays.
⁹ The Internet and the World Wide Web, in case you haven’t heard of them.
¹⁰ There is Internet access on the International Space Station, usable by astronauts. One got sued for it (and vindicated).
¹¹ This study of distilling human language using computers is called Natural Language Processing.

Forget about /etc/hosts and use proxy.pac

I hate to touch /etc/hosts myself when a project asks me to do so as part of its development setup. Tools like Vagrant use zeroconf/Bonjour to dispatch a .local hostname, but no one seems to remember the ancient, almighty Proxy auto-config.

Proxy auto-config (PAC) allows you to set up hostname and URL mapping in JavaScript. It’s supported on every OS and browser. To use it to set up your own development hostname, you first need to write a PAC file:

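function FindProxyForURL(url, host) {
    // Send requests for the development hostname to 127.0.0.1, port 3000.
    if (host === 'calypso.localhost') {
        return 'PROXY 127.0.0.1:3000';
    }
    // Every other hostname connects directly, without a proxy.
    return 'DIRECT';
}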

The example above points calypso.localhost to 127.0.0.1, as required by the Calypso project. Next, save this file as a private Gist, and get the raw, private, HTTPS URL.

You can find the place to put that URL in the proxy settings section of your OS; for example, this is how to do it on macOS. Now, open any browser, and the hostname should correctly connect to the IP address specified. (Firefox comes with its own proxy settings that allow you to specify a PAC file, so if you would like the setting to affect only one browser, use Firefox.)

This is the simplest way to use a PAC file. The Wikipedia article lists several other use cases for PAC files.
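
If you work on more than one project, the same idea extends to a small lookup table. The following is only a sketch under stated assumptions: the DEV_HOSTS name and the second hostname and port are hypothetical, and only calypso.localhost on port 3000 comes from the example above. It sticks to old-style JavaScript (var, string concatenation) on the assumption that some PAC engines may not support newer syntax.

```javascript
// Hypothetical table of development hostnames and the local ports they map to.
// Only the calypso.localhost entry comes from the post; the other is made up.
var DEV_HOSTS = {
    'calypso.localhost': 3000,
    'another-project.localhost': 8080
};

function FindProxyForURL(url, host) {
    // Only hostnames listed in the table go to the local machine.
    // hasOwnProperty avoids matching inherited keys such as "toString".
    if (Object.prototype.hasOwnProperty.call(DEV_HOSTS, host)) {
        return 'PROXY 127.0.0.1:' + DEV_HOSTS[host];
    }
    // Everything else connects directly.
    return 'DIRECT';
}
```

Adding a new development hostname is then a one-line change to the table, and the function itself stays untouched.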

Important considerations

Use your own PAC file. I didn't share my PAC file hosted on Gist because you should never trust me, or anyone else, for that. A PAC file has the potential to redirect all your network traffic.
Turn off PAC auto-discovery. It relies on the WPAD protocol, whose draft specification has expired, to discover a PAC file served on your local subnet, but implementations hardly know the boundary of your local domain (they could by checking the Public Suffix List, but that list changes almost monthly). Thankfully, the good folks at the WPADblock initiative have secured a few public hostnames that could otherwise be exploited.

I hope you find this trick useful.

