Emoji Encoding, � Unicode, & Internationalization

Why does '👩🏿‍🎤'.length = 7? Is JavaScript UTF-8 or UTF-16? What happens under the hood when you set <meta charset="UTF-8">? Have you ever wondered how emoji and complex scripting languages are encoded to work correctly across browsers and devices - for billions of people around the world? Or how new emoji are introduced and approved? Have you ever seen one of these □ � “special” glyph characters and wanted more information on why they might appear and how to avoid them in the future? Let’s talk about Unicode encoding in JavaScript and across the World Wide Web! We’ll go over best practices and common pitfalls, and provide resources to learn more - even where to go if you want to submit a new emoji proposal! :)

FAQ

UTF-8 is commonly used in web development for encoding web pages. It ensures that text appears correctly across different platforms and devices, handling various character sets efficiently.

Unicode assigns a unique code point to every character, no matter the platform, program, or language, ensuring consistent representation across different systems. It supports over a million code points which cover a comprehensive range of characters and symbols.

The meta charset UTF-8 line in HTML specifies that the character encoding for the webpage is UTF-8. This is crucial for correctly displaying text that includes special characters and symbols from multiple languages.

In JavaScript, emojis might be encoded as multiple code units, especially those outside the Basic Multilingual Plane (BMP). This can make the length of a string containing emojis appear longer than the number of visible characters.

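Developers can use JavaScript's string normalization features, like the normalize() method, to make canonically equivalent sequences compare equal, for example a precomposed "é" and an "e" followed by a combining accent. Normalization does not merge an emoji sequence into a single code unit, so counting user-perceived characters requires iterating by code point or by grapheme cluster (for instance with Intl.Segmenter) rather than relying on length.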
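As a small illustration of what normalize() does, the sketch below compares two spellings of the same accented letter. The variable names are chosen for this example only and are not from the talk.

```javascript
// Two ways to write "é": one precomposed code point, or "e" plus a combining accent.
const composed = '\u00E9';
const decomposed = 'e\u0301';

// They render the same but are different strings until normalized.
const equalBefore = composed === decomposed;

// NFC composes where possible; NFD decomposes where possible.
const equalAfter = composed.normalize('NFC') === decomposed.normalize('NFC');
const nfcLength = decomposed.normalize('NFC').length;
const nfdLength = composed.normalize('NFD').length;

console.log({ equalBefore, equalAfter, nfcLength, nfdLength });
```

Normalizing both sides to the same form before comparing or storing text is one common way to avoid mismatches between visually identical strings.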

ASCII was initially designed to encode 128 English characters, which was insufficient for languages with diacritics and other symbols. This limitation led to the development of extended ASCII and eventually Unicode for broader language support.

The zero-width joiner (ZWJ) is used in emoji sequences to combine multiple emojis into a single glyph. For instance, family emojis are created by combining individual human emojis with a ZWJ, resulting in a single, composite representation.

Surrogate pairs are used in UTF-16 encoding to represent characters outside the Basic Multilingual Plane. These characters require more than 16 bits and are encoded using two 16-bit code units.
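The surrogate pair arithmetic can be sketched in a few lines. This is a minimal example using the grinning face emoji (U+1F600) as an assumed sample code point; it is not taken from the talk.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs two code units.
const codePoint = 0x1F600;

// Standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits.
const offset = codePoint - 0x10000;
const high = 0xD800 + (offset >> 10);   // high (lead) surrogate: top 10 bits
const low = 0xDC00 + (offset & 0x3FF);  // low (trail) surrogate: bottom 10 bits

// Building the string from the pair matches building it from the code point.
const fromPair = String.fromCharCode(high, low);
const fromCodePoint = String.fromCodePoint(codePoint);

console.log(high.toString(16), low.toString(16), fromPair === fromCodePoint);
```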

UTF-8 is recommended for the web due to its efficiency in encoding a vast range of characters using 1 to 4 bytes, compatibility across different systems, and its ability to handle any Unicode character, making it ideal for international environments.

Naomi Meyer
34 min
18 Jun, 2021

Video Summary and Transcription

This Talk explores the UTF-8 encoding and its relationship with emojis. It discusses the history of encoding, the birth of Unicode, and the importance of considering global usage when building software products. The Talk also covers JavaScript's encoding issues with Unicode and the use of the String.prototype.normalize method. It highlights the addition of emoji support in Unicode, the variation and proposal process for emojis, and the importance of transparency in emoji encoding. The Talk concludes with the significance of diverse emojis, the recommendation of UTF-8 for web development, and the need to understand encoding and decoding in app architecture.

1. Introduction to UTF-8 Encoding and Emojis

Short description:

I'm Naomi Meyer, a software development engineer at Adobe, and today I'll be talking about the UTF-8 encoding and how it relates to emojis. We'll also touch on the Unicode Consortium, the history of encoding, and the importance of considering global usage when building software products. Let's start by understanding how characters are interpreted by computers, and then delve into the first encodings, specifically ASCII.

Hi, thanks for that great introduction. I'm Naomi Meyer, and I work as a software development engineer at Adobe, where I do localization and internationalization engineering for creative products like Adobe Fonts and Adobe Portfolio. So here's where you can find me online, and if there's anything that you feel passionate about or have strong opinions on, please let me know. I'd love to continue this conversation online.

So I'm sure most of us have seen this many times in the head tag of our HTML markup, and we all know to add this meta charset UTF-8 line. But I've been fascinated lately by the underlying details of what this UTF-8 thing really means and does. So I'm excited to share some more details about what it is and why I think it's so cool. Also, this connects with this JavaScript funniness that we can see here: these emojis each look like a single character, but as strings in JavaScript they have a length a lot longer than expected. And feel free to try this yourself in your dev tools. I know I wanted to test it out when I first saw something like this. And part of my goal today is to talk about why this is, why familyEmoji.length is equal to 11, and to provide some more details about the underlying encoding happening here and what we can do to handle it correctly.
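To try the length example outside the slides, here is a minimal sketch. It assumes a runtime that supports Intl.Segmenter (recent browsers and Node.js), and apart from familyEmoji the variable names are illustrative.

```javascript
// The family emoji is four person emoji joined by three zero-width joiners (U+200D).
const familyEmoji = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

// .length counts UTF-16 code units: 4 emoji x 2 units + 3 joiners x 1 unit.
const codeUnits = familyEmoji.length;

// Spreading a string iterates by code point: 4 emoji + 3 joiners.
const codePoints = [...familyEmoji].length;

// Intl.Segmenter groups code points into user-perceived characters (graphemes).
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = [...segmenter.segment(familyEmoji)].length;

console.log({ codeUnits, codePoints, graphemes });
```

The three numbers differ because each one counts a different unit: code units, code points, and grapheme clusters.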

So speaking of emojis, I personally find them both delightful and intriguing from an engineering perspective, a linguistic perspective, a creative design perspective, a cultural sociological international perspective, and so much more. Our agenda for today is to start with a bit of encoding history to understand more of where we came from. Then we'll get into the Unicode Consortium and the UTF-8 algorithm. Then we'll talk about how this allows us to encode emojis and different languages across platforms, devices and operating systems. Overall, I think it's so important for us to keep these big ideas in mind when we build software products that are being used globally. This is kind of a timeline, a broad timeline, of what we'll touch on today. We've got a lot to cover in these 20 minutes. Let's jump right in with encoding to start.

Of course, when we're on our computers and we type an emoji, a letter, a character in any language, these are ultimately interpreted by the machine as zeros and ones. Let's get into a bit about how that works. In order to understand how it's working today, let's go back to the 1960s and the first encodings. Back in the 60s, there were these big computers that filled a whole room. This is a picture of one from NASA. Engineers back then came up with a system. That system is called ASCII, the American Standard Code for Information Interchange. This image is from the first version that was published in 1963. ASCII was developed from telegraph code. It was originally built for more convenient sorting of lists, alphabetically, in ascending or descending order.

2. ASCII and the Birth of Unicode

Short description:

In 1963, ASCII encoded 128 English-only characters into 7-bit integers. Over time, bugs arose as different systems added non-English characters with diacritics and accents in incompatible ways. The internet exacerbated the problem of conflicting encodings, leading to errors and question marks. In 1991, Unicode version 1 was introduced as a universal encoding standard to address this issue. Today, Unicode's mission is to enable everyone to use their own language on devices. Unicode version 1.0, published in October 1991, had 7,161 characters. Understanding Unicode requires a shift in thinking about abstract characters and code points.

In 1963, ASCII encoded 128 English-only characters into 7-bit integers. So, I think ASCII is pretty cool because the way it was built makes sense.

So first, we take a character, like the letter A, and we assign an ASCII decimal number, 65, which in binary is equal to one, five zeros, one, in the original 7-bit system. And then after 65 goes 66, which is B, and we continue alphabetically, all the way to Z, which is ASCII decimal 90. Then to go from uppercase to lowercase characters, we only change one bit: lowercase a comes 32 later, which is ASCII decimal number 97. And that continues alphabetically. So I think it's a cool system. And shout out to Tom Scott for this video that he has, which explains it really clearly.
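The letter-to-number mapping described above can be checked directly in JavaScript. This is a small sketch, and toLowerAscii is a hypothetical helper name used only for this illustration.

```javascript
// 'A' is ASCII decimal 65, which is 1000001 in 7-bit binary.
const upperA = 'A'.charCodeAt(0);
const upperABits = upperA.toString(2).padStart(7, '0');

// Lowercase letters sit 32 positions later, so only one bit (value 32) differs.
const lowerA = 'a'.charCodeAt(0);
const lowerABits = lowerA.toString(2).padStart(7, '0');

// Setting that bit converts an uppercase ASCII letter to lowercase.
// Hypothetical helper: only meaningful for the letters A to Z.
function toLowerAscii(letter) {
  return String.fromCharCode(letter.charCodeAt(0) | 0b0100000);
}

console.log(upperA, upperABits, lowerA, lowerABits, toLowerAscii('Z'));
```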

So ASCII makes sense, and it kind of worked in America in the 60s. But over time, there were lots of bugs. And those bugs came because non-English characters, like those pictured here, include diacritics and additional accents that had to be added. So ASCII originally worked with seven bits. But then computers moved to eight bits, and we went from 128 characters to 256 characters. And different countries and different language systems added more characters with those extra 128. And different languages, like Japanese, for example, did their own thing entirely. They had a separate multi-byte encoding system, you know, and Japanese, Russian, all these languages had a different encoding system. And that was fine when they worked independently. But then came the internet. And with the World Wide Web, the internet kind of broke computers because there was this problem with no universal encoding system. When two different non-compatible encoding systems encountered one another, we got these types of errors like we see here with a lot of question marks and a lot of bugs.
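This kind of mismatch can be reproduced in a few lines. The sketch below assumes the standard TextEncoder and TextDecoder globals (available in browsers and Node.js) and uses Latin-1 as a stand-in for one of the older single-byte encodings. The sample letter is an assumption of this example.

```javascript
// Encode "é" (U+00E9) as UTF-8: two bytes, 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode('\u00E9');

// Misread each byte as its own Latin-1 character: one letter becomes two.
const misreadAsLatin1 = String.fromCharCode(...utf8Bytes);

// The reverse mistake: 0xE9 is "é" in Latin-1, but alone it is not valid UTF-8,
// so a UTF-8 decoder substitutes U+FFFD, the replacement character.
const latin1Bytes = new Uint8Array([0xE9]);
const misreadAsUtf8 = new TextDecoder('utf-8').decode(latin1Bytes);

console.log(Array.from(utf8Bytes), misreadAsLatin1, misreadAsUtf8);
```

The bytes are never damaged in either direction. What changes is which encoding the reader assumes, which is why declaring the encoding explicitly matters.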

So, in 1991, with the World Wide Web, we got Unicode version 1. And Unicode was designed to be a universal encoding standard to solve this problem of conflicting character encodings. So, Unicode today is a non-profit whose mission is that everyone in the world should be able to use their own language on phones and computers. Unicode version 1.0 was published in October 1991 and had 7,161 characters. To understand and think about Unicode, you kind of have to make a mental shift in your assumptions about language and characters. So, there's three kind of big ideas to think about. Props to Dmitri Pavlutin, who has this great article I recommend called What Every JavaScript Developer Should Know About Unicode. So, the first idea to keep in mind is abstract characters. So, instead of thinking about letters in an alphabet, it's good to think about abstract characters, and Unicode deals with characters in these abstract terms. Second, we have code points.
