Tom explores the history and necessity of a standardized set of characters for computers.
Featuring Tom Merritt.
A special thanks to all our supporters–without you, none of this would be possible.
Thanks to Kevin MacLeod of Incompetech.com for the theme music.
Thanks to Garrett Weinzierl for the logo!
Thanks to our mods, Kylde, Jack_Shid, KAPT_Kipper, and scottierowland on the subreddit
Send us email to firstname.lastname@example.org
Mommy? Where do emojis come from?
Have you ever thought about how we decide which characters go into a font?
Is there a secret world cabal ruling over letters? Are they called the letter people?
Let’s help you know a little more about Unicode
Unicode is a standard for the representation and handling of text. Pretty simple. Any system that respects the Unicode standard supports the same set of characters.
It’s maintained by the Unicode Consortium and synchronized with the International Organization for Standardization’s Universal Coded Character Set (ISO/IEC 10646).
The Unicode Consortium is a nonprofit organization whose nine full members include Adobe, Apple, Facebook, Google, IBM, Microsoft, Netflix, SAP and the Ministry of Endowments and Religious Affairs of Oman.
Voting members include companies with an interest in text-processing standards including the Bangladesh Computer Council, Emojipedia, Monotype Imaging, Tamil Virtual Academy, and the University of California, Berkeley.
As of Unicode 13.0 there are 143,859 characters covering 154 modern and historic scripts, symbol sets and emoji.
The reason you need this is that computers just use zeros and ones, so programs have to decide how to represent characters. If I write a very important message about the hit musical group Abba, where I define A as 1 and B as 0 but YOU define A as 0 and B as 1, then I’ll write Abba but you’ll read BaaB and have no idea what I’m talking about.
So you need a standard.
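Here’s the Abba mix-up as a tiny Python sketch. The two tables and the helper names are made up for illustration; no real system uses a one-bit alphabet like this.

```python
# Two hypothetical one-bit "encodings" that disagree about A and B.
MY_TABLE = {"A": "1", "B": "0"}
YOUR_TABLE = {"A": "0", "B": "1"}


def encode(text, table):
    """Turn letters into bits using the sender's table."""
    return "".join(table[ch] for ch in text.upper())


def decode(bits, table):
    """Turn bits back into letters using the reader's table."""
    reverse = {bit: ch for ch, bit in table.items()}
    return "".join(reverse[b] for b in bits)


bits = encode("ABBA", MY_TABLE)
print(bits)                      # 1001
print(decode(bits, MY_TABLE))    # ABBA
print(decode(bits, YOUR_TABLE))  # BAAB
```

Same bits, different table, different message. A shared table is all a character standard really is.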
The Unicode standard itself is made up of a bunch of charts for visual reference, a set of standards for encoding characters, reference files and a bunch of other stuff like rules for how to display text when it’s left to right vs right to left, and even bidirectional, like Arabic and Hebrew text that contains both.
One of the earliest and most widely used character sets was ASCII. ASCII could be its own episode but here’s the short version. It stands for American Standard Code for Information Interchange. It was first published in 1963 and revised many times until 1986. It defines 128 characters. That’s plenty if you’re just dealing with the English language alphabet, which needs 26 upper case characters, 26 lower case characters, some punctuation marks and a slate of control characters for printers.
But if you want ALL languages you’re going to need a lot more. Hence Unicode.
But ASCII is not gone.
In 1991, it was incorporated into Unicode as the first 128 code points, with the same numeric code in both sets. The UTF-8 encoding, which arrived a little later, stores those 128 code points as the same single bytes ASCII uses. That made UTF-8 backwards and forwards-compatible with ASCII.
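Here’s a quick way to see that compatibility, as a small Python sketch. The sample word is just an assumption for illustration: plain ASCII text comes out as the exact same bytes whether you encode it as ASCII or as UTF-8.

```python
text = "Abba"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The numeric codes are the same in ASCII and in Unicode...
print([ord(ch) for ch in text])   # [65, 98, 98, 97]

# ...and the bytes on disk are the same in ASCII and UTF-8.
print(ascii_bytes == utf8_bytes)  # True
```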
In fact, the first 256 code points were also made identical to the ISO/IEC 8859-1 standard. The aim of making it easy to adopt has led to a few quirks. For instance some characters that look essentially identical, like the P in the Roman alphabet and the Er (Р) in the Cyrillic alphabet, which looks like a P but sounds like an R, have separate encodings. These kinds of decisions were made to allow easy conversion from legacy encodings to Unicode and back.
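To see that quirk for yourself, here is a short Python sketch using the standard `unicodedata` module. It looks up the Latin P and the Cyrillic letter Er, which looks just like a P, and also checks one ISO/IEC 8859-1 character (é, picked arbitrarily) against its Unicode code point.

```python
import unicodedata

latin_p = "P"           # Latin capital P
cyrillic_er = "\u0420"  # Cyrillic capital Er, which looks the same

for ch in (latin_p, cyrillic_er):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0050 LATIN CAPITAL LETTER P
# U+0420 CYRILLIC CAPITAL LETTER ER

# Look-alikes, but not the same character as far as Unicode is concerned.
print(latin_p == cyrillic_er)  # False

# A Latin-1 (ISO/IEC 8859-1) byte value matches the Unicode code point.
e_acute = "\u00e9"
print(e_acute.encode("latin-1")[0] == ord(e_acute))  # True
```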
So how did Unicode get started?
By the mid 1980s computer companies had extended ASCII for their own use, especially because 8-bit and 16-bit computers gave them lots of room to do so. IBM, Microsoft, HP, Apple, and Adobe started using different characters beyond the first 128 for things like accented letters, symbols, Greek letters et cetera. And the advance of computing around the world meant software for other languages had to invent its own character sets.
So in 1987, Joe Becker from Xerox and Lee Collins and Mark Davis from Apple started looking into creating one universal character set that everybody could use and let software share text easily.
They investigated comparisons of fixed-width and mixed-width text access; the total system storage requirements with two-byte text; and preliminary character counts for all world alphabets.
Becker also sought input from Xerox’s Dave Opstad and Peter Fenwick and published a draft proposal in August 1988 tentatively called Unicode, explaining that the name suggested “a unique, unified, universal encoding.”
Becker described Unicode as “wide-body” ASCII meant to encompass living languages. Future use was prioritized over preserving the past. He suggested that languages not in modern use were better served by separate registries rather than enlarging the public list. This usage standard has kept Unicode from including some newly-invented scripts like Klingon, which is instead defined in the private ConScript Unicode Registry.
The Unicode team expanded in 1989 to include people from the Research Libraries Group and Sun Microsystems and in 1990 from Microsoft and NeXT.
A final review draft was finished by the end of 1990.
The Unicode Consortium was incorporated in California on January 3, 1991 and the first volume of the standard was published in October 1991.
Earlier we mentioned UTF-8 and you may have seen references to that out in the wild or even others like UTF-16.
These are different character encodings of the standard called Unicode Transformation Formats. The encodings use different numbers of bytes to represent code points that can be combined into what are called graphemes and grapheme clusters, basically what you think of as a character. This is important for complex character sets like Chinese or Korea’s Hangul. The standard can represent letters, digits, diacritical marks, punctuation marks and technical symbols. And yes, emoji.
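A rough Python sketch of the "code points combine into one character" idea, using the standard `unicodedata` module. The sample characters are my own picks, and Python’s `len` counts code points, not graphemes, which is exactly why the numbers differ.

```python
import unicodedata

# "e" plus a combining acute accent: two code points, one visible character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# One Hangul syllable can also be spelled as three combining jamo.
hangul = "\ud55c"  # the syllable "han"
jamo = unicodedata.normalize("NFD", hangul)
print([f"U+{ord(ch):04X}" for ch in jamo])  # ['U+1112', 'U+1161', 'U+11AB']
```

Either spelling shows up on screen as a single character, which is why text software has to think in grapheme clusters rather than raw code points.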
The most commonly used encodings are UTF-8, UTF-16 and UTF-32.
Though you also may see UCS-2. That’s an older encoding not fully supported by Unicode because it doesn’t have all the characters. UTF-16 is basically UCS-2 extended to include all the characters. The difference is UCS-2 only uses 2 bytes for each character so it uses less space. UTF-16 uses 2 bytes for all the UCS-2 characters but 4 bytes for the rest.
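Those 4-byte characters in UTF-16 are stored as two 2-byte units called a surrogate pair. Here is a minimal sketch of that arithmetic in Python. The function name `utf16_surrogates` is made up for this example, and the result is checked against Python’s built-in encoder.

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


grinning_face = "\U0001F600"
high, low = utf16_surrogates(ord(grinning_face))
print(f"{high:04X} {low:04X}")                  # D83D DE00
print(grinning_face.encode("utf-16-be").hex())  # d83dde00
```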
UTF-32 uses 4 bytes for any code point and takes significantly more space.
But the most widely encountered is UTF-8. It uses one byte for the first 128 code points and up to 4 bytes for the rest. That means it overall uses less space and is used for more than 95% of websites.
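To compare the sizes side by side, here is a small Python sketch. The four sample characters are arbitrary picks, and it uses the little-endian codec names so Python does not add a byte order mark to the count.

```python
# One sample each from ASCII, Latin-1, Hangul and emoji.
SAMPLES = ["A", "\u00e9", "\ud55c", "\U0001F600"]


def byte_lengths(ch):
    """Return the size of ch in UTF-8, UTF-16 and UTF-32, in bytes."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+D55C (3, 2, 4)
# U+1F600 (4, 4, 4)
```

For mostly-English text UTF-8 is the smallest of the three, though a Hangul syllable actually takes one byte more in UTF-8 than in UTF-16.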
How did emoji happen?
Well again we could do a whole episode on emoji and their predecessor emoticons.
But the short version is people tried to create a standard graphical way to represent the winky face of a semicolon and a close parenthesis. Microsoft created the famous Wingdings font, which had some smileys and sad faces. Zapf Dingbats was another one, and it was also included in the Unicode standard. Multiple organizations created all kinds of attempts but the roots of modern emoji are in the sets of pictures included on Japanese cell phone platforms in the late 1990s. They added the innovation of including the pictures in their character encoding sets along with text instead of separately. So you could add a smiley at the end of your written text.
Emojipedia believes that the SkyWalker DP-211SW from J-Phone in 1997 may be the first phone with modern emoji. This set included the now famous pile of poo.
The most influential set of emoji was created by Shigetaka Kurita in 1999 for NTT DoCoMo.
In the mid-2000s employees at Google and then Apple requested Unicode include a uniform emoji set.
In August 2007, Mark Davis and his colleagues Kat Momoi and Markus Scherer wrote the first draft for consideration by the Unicode Technical Committee (UTC) to introduce emojis into the Unicode standard.
UTC had previously determined emojis to be, like Klingon, out of the scope of Unicode. But, historically, they changed their minds.
The effort widened to include ARIB extended characters used in broadcasting in Japan as well as consultation from multiple national standards bodies worldwide. A set of 722 emojis was finally agreed upon and released in 2010 as part of Unicode 6.0.
Unicode defines emoji characters but the particular representation, as with fonts, can vary between providers. Apple’s emoji look different from Google’s, and so on.
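What Unicode does pin down is the code point and the name. A quick Python sketch with the standard `unicodedata` module, using the pile of poo as the example:

```python
import unicodedata

poo = "\U0001F4A9"

# The code point and official name are the same on every platform...
print(f"U+{ord(poo):04X}", unicodedata.name(poo))  # U+1F4A9 PILE OF POO

# ...and so are the bytes that get sent. Only the drawing differs.
print(poo.encode("utf-8").hex())  # f09f92a9
```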
So there you go. The work is never done. The Unicode Technical Committee meets quarterly to decide whether new characters will be encoded. Proposals are accepted from any organization or individual whether they are members or not. Various subcommittees exist to recommend proposals to the full UTC. Technical decisions relating to the Unicode Standard are made by the Unicode Technical Committee.
So now I hope you know why text shows up the way it does and who decides whether you get a smiley face or not.
In other words, I hope you know a little more about Unicode.