Why does '👩🏿‍🎤'.length = 7? Is JavaScript UTF-8 or UTF-16? What happens under the hood when you set <meta charset="UTF-8">? Have you ever wondered how emoji and complex-script languages are encoded to work correctly across browsers and devices - for billions of people around the world? Or how new emoji are introduced and approved? Have you ever seen one of these □ � "special" glyph characters and wanted more information on why they appear and how to avoid them in the future? Let's talk about Unicode encoding in JavaScript and across the World Wide Web! We'll go over best practices and common pitfalls, and provide resources to learn more - even where to go if you want to submit a new emoji proposal! :)
Emoji Encoding, � Unicode, & Internationalization
From:

JSNation Live 2020
Transcription
Hi, I'm going to share with you some ideas about emoji, encoding, Unicode, and internationalization. I'm Naomi Meyer and I work as a software development engineer at Adobe, where I do localization and internationalization engineering for Creative Cloud products like Adobe Fonts and Adobe Portfolio. So here's where you can find me online. And if there's anything that you feel passionate about or have strong opinions on, please let me know; I'd love to continue this conversation online. So, I'm sure most of us have seen this many times in the head tag of our HTML markup. And we all know to add this meta charset UTF-8 line. But I've been fascinated lately by the underlying details of what this UTF-8 thing really means and does. So I'm excited to share some more details about what it is and why I think it's so cool. Also, this connects with this JavaScript funniness that we can see here with these seemingly single-character emojis, which as strings in JavaScript have a length a lot longer than expected. And feel free to try this yourself in your dev tools. I know I wanted to test it out when I first saw something like this. And part of my goal today is to talk about why this is, why, you know, family emoji dot length is equal to 11, and to provide some more details about the underlying encoding happening here and what we can do to handle it correctly. So speaking of emojis, I personally find them both delightful and intriguing from an engineering perspective, a linguistic perspective, a creative design perspective, a cultural, sociological, international perspective, and so much more. So our agenda for today is to start with a bit of encoding history to understand more of where we came from. Then we'll get into the Unicode Consortium and the UTF-8 algorithm. Then we'll talk about how this allows us to encode emojis and different languages across platforms, devices, and operating systems.
So overall, I think it's so important for us to keep these big ideas in mind when we build software products that are being used globally. This is kind of a broad timeline of what we'll touch on today. So we've got a lot to cover in these 20 minutes. So let's jump right in with encoding to start. Of course, when we're on our computers and we type an emoji, a letter, a character in any language, these are ultimately interpreted by the machine as zeros and ones. So let's get into a bit about how that works. So in order to understand how it's working today, let's go back in the past to the 1960s and the first encodings. So back in the 60s, you know, there were these big computers that filled the whole room. This is a picture of one from NASA. And engineers back then came up with a system, and that system is called ASCII, the American Standard Code for Information Interchange. And this image is from the first version that was published in 1963. ASCII was developed from telegraph code. So it was originally built for more convenient sorting of lists, you know, alphabetically by ascending or descending characters. And in 1963 ASCII encoded 128 English-only characters as seven-bit integers. So I think ASCII is pretty cool because it makes sense how it was built. So first we take a character like the letter A. And we assign it ASCII decimal number 65, which in binary is equal to 1000001 (a one, five zeros, and a one) in the original seven-bit system. And then after 65 goes 66, which is B. And we continue alphabetically all the way to Z, which is ASCII decimal 90. Then to go from uppercase to lowercase characters, we only change one bit: lowercase a comes 32 later, which is ASCII decimal number 97. And that continues alphabetically. So I think it's a cool system. And, you know, shout out to Tom Scott for this video that he has, which explains it really clearly. So ASCII makes sense. And, you know, it kind of worked in America in the 60s. But over time, there were lots of bugs.
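The ASCII numbers mentioned above can be checked in a browser console or Node. This is only a small sketch; the helper names `codeOf` and `toSevenBits` are made up for illustration and are not from the talk.

```javascript
// Hypothetical helpers: look up a character's code and show it as 7 binary digits.
const codeOf = (ch) => ch.charCodeAt(0);
const toSevenBits = (n) => n.toString(2).padStart(7, "0");

const upperA = codeOf("A");
const lowerA = codeOf("a");

console.log(upperA, toSevenBits(upperA)); // 65 1000001
console.log(codeOf("B"), codeOf("Z"));    // 66 90
console.log(lowerA, toSevenBits(lowerA)); // 97 1100001

// Uppercase and lowercase differ only in the bit worth 32.
console.log(lowerA - upperA);                   // 32
console.log(String.fromCharCode(upperA | 32));  // a
```

Setting or clearing that single bit is enough to switch case for the 26 basic Latin letters, which is the design detail the talk is pointing at.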
And those bugs came because non-English characters, like those pictured here, include diacritics and additional accents that had to be added. So ASCII originally worked with seven bits, but then computers moved to eight bits. And we went from 128 characters to 256 characters. And different countries and different language systems added more characters with those extra 128. And different languages like Japanese, for example, did their own thing entirely. They had a separate multi-byte encoding system. You know, Japanese, Russian, all these languages had a different encoding system. And that was fine when they worked independently. But then came the internet. And with the World Wide Web, the internet kind of broke computers because there was this problem of no universal encoding system. When two different non-compatible encoding systems encountered one another, we got these types of errors like we see here, with a lot of question marks and a lot of bugs. So in 1991, with the World Wide Web, we got Unicode version one. And Unicode was designed to be a universal encoding standard to solve this problem of conflicting character encodings. So the Unicode Consortium today is a nonprofit whose mission is that everyone in the world should be able to use their own language on phones and computers. Unicode version 1.0 was published in October 1991 and had 7161 characters. So to understand and think about Unicode, you kind of have to make a mental shift in your assumptions about language and characters. So there are three kind of big ideas to think about. And props to Dmitri Pavlutin, who has this great article I recommend called What Every JavaScript Developer Should Know About Unicode. So the first idea to keep in mind is abstract characters. So instead of thinking about letters in an alphabet, it's good to think about abstract characters, and Unicode deals with characters in these abstract terms. Second, we have code points. A code point is a number assigned to a single character.
So this is an example of a code point, where we have U+0041, an atomic unit of information. So in a code point, we have 0041, which is a hexadecimal number, and then we have the prefix of U+, where U stands for Unicode, because each code point number is given a meaning by the Unicode standard. And Unicode character sets map each abstract character in the world to a unique number. So U+0041, we look it up in Unicode, we get Latin capital letter A. And currently the Unicode standard defines over a million code points. And they all have a one-to-one mapping, so that ensures that there's no collision between alphabets of different languages. The third point to keep in mind is a plane. So basically, long story short, Unicode divides over a million code points into 17 planes or groupings. These planes are represented here. So the first plane, plane zero, is the Basic Multilingual Plane, also known as the BMP. And that's the unification of all the prior character sets, so that includes ASCII and Chinese, Japanese, and Korean characters. And this is what the BMP looks like. And I think it's fascinating to see the breakdown of different scripts included. You know, you can see the East Asian scripts, and the Chinese, Japanese, and Korean characters account for a lot of the code points. So the BMP is four hexadecimal digits. And then outside the BMP, plane one is five hexadecimal digits, and plane 16 is six hexadecimal digits. And everything outside of the BMP is called the astral planes or the supplementary planes. So how does this relate to our head tag with UTF-8 that we're all, you know, used to seeing? So we know about code points: abstract characters get numbers, like U+0041 for A. And then there are code units, or physical bits, because, you know, computers at the memory level don't use code points or abstract characters. They need a physical way to represent Unicode code points.
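To make the code point, plane, and code unit ideas above concrete, here is a rough sketch. The helpers `toUPlus` and `planeOf` are illustrative names, not standard APIs; `codePointAt` and `TextEncoder` are standard, and `TextEncoder` always produces UTF-8.

```javascript
// Format a code point in U+XXXX notation (at least four hex digits).
const toUPlus = (cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
// Each of the 17 planes holds 0x10000 (65,536) code points; plane 0 is the BMP.
const planeOf = (cp) => Math.floor(cp / 0x10000);

const utf8 = new TextEncoder();

// "A", a CJK character (U+4E2D), and the grinning face emoji (U+1F600).
for (const ch of ["A", "\u4E2D", "\u{1F600}"]) {
  const cp = ch.codePointAt(0);
  console.log(
    toUPlus(cp),
    "plane " + planeOf(cp),
    "UTF-8 bytes: " + utf8.encode(ch).length,
    "UTF-16 code units: " + ch.length
  );
}
// U+0041 plane 0 UTF-8 bytes: 1 UTF-16 code units: 1
// U+4E2D plane 0 UTF-8 bytes: 3 UTF-16 code units: 1
// U+1F600 plane 1 UTF-8 bytes: 4 UTF-16 code units: 2

console.log(planeOf(0x10ffff)); // 16, the last plane
```

The same abstract character keeps one code point, while the number of physical code units depends on which encoding is used.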
So computers translate from Unicode code points into physical bits using a character encoding, which handles the transformation from code point into physical bits. And Unicode has this popular family of character encoding algorithms called the Unicode Transformation Format, or UTF, which does this job for us. And popular encodings of UTF are UTF-8, UTF-16, and UTF-32. So UTF is really cool. It's reversible, so the conversions between all of them are algorithmically based. That's hard to say. They're fast and support lossless round tripping. UTF-8 is most common on the web. UTF-16 is used by Java and Windows. And both UTF-8 and UTF-32 are used by Linux and various Unix systems. But JavaScript is weird. So there are these two great articles I highly recommend from Mathias Bynens that go more in depth on JavaScript and Unicode. But basically, long story short, the ECMAScript engine itself exposes characters according to UCS-2, not UTF-16. So the JavaScript engine's encoding can have these strange negative effects when JavaScript works with Unicode. And this is really important to keep in mind because, you know, as developers, we do a lot of string manipulation. We do sorting, filtering, searching, lookups, a lot of business logic involved with strings. So it's great to understand how these strings are encoded to reduce bugs. This slide looks weird. Let's move to the next one. So a good example of how JavaScript is kind of strange with strings is the combining mark. So in Danish, you have the letter A with a little circle over it, which can be written in Unicode as two separate characters, where the circle over it is a combining mark that is designed to modify the preceding character. So if we console log these two Unicode code points, we get the A with the circle over it. This is one example with a combining mark. Another example we have here is cafe with the accented E.
And this one is different because if we define, you know, this variable as drink with cafe with the E, we can see that drink.length, while it looks like it should be four characters, actually comes out in JavaScript as five. And then if we split this string into an array, we can see that the final element of the array is that final accent. And so a lot of confusion can arise when we kind of ignore the concept of code unit sequences. So this is really important to keep in mind to avoid string bugs like we see here in cafe with an accent. So there's a great solution that we can use, and that's String.prototype.normalize, which is a method that returns the Unicode normalization form of the string. And there's the link to the MDN docs about it. I think it's fascinating. So if we take that same example with cafe with an accent and we normalize it, we see that the length is now four. And if we split it into an array, we don't get that accent as the final element. So this is a great tool to avoid localization bugs and to really think deeply about how we're building our apps so that they can be internationalized. So remember that timeline. There's another important element in the timeline, and that's in 2010, in Unicode version 6.0, when Unicode added support for emoji. So let's move on to talk about emoji. So remember the Unicode planes. Outside of the BMP are the astral planes, and a code point in the astral or supplementary planes is more than four hexadecimal digits. Most emoji are outside of the BMP, so they're in the astral planes. These emoji are five hexadecimal digits, and a code point can require up to 21 bits to save in memory. So UTF-16 requires two code units of 16 bits in order to handle that. So with this example of the smiley face, which is the Unicode code point U+1F600, we can see that we require two code units. So they're split into surrogate pairs, where we have a high surrogate code unit and a low surrogate code unit.
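The cafe example and the normalize call described above might look roughly like this. It is a sketch that assumes the string on the slide was typed in its decomposed form, a plain e followed by U+0301 COMBINING ACUTE ACCENT; the variable name `drink` is the one used in the talk, the rest are illustrative.

```javascript
// "cafe" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
const drink = "cafe\u0301";
console.log(drink.length);                      // 5
console.log(drink.split("").pop() === "\u0301"); // true: last element is the bare accent

// NFC composes "e" + U+0301 into the single code point U+00E9.
const normalized = drink.normalize("NFC");
console.log(normalized.length);                 // 4
console.log(normalized === "caf\u00E9");        // true
console.log(drink === normalized);              // false: same look, different code units

// The Danish example works the same way: "a" + U+030A COMBINING RING ABOVE.
const aRing = "a\u030A";
console.log(aRing.length, aRing.normalize("NFC").length); // 2 1
```

Normalizing both sides before comparing, searching, or counting is one way to keep visually identical strings from behaving differently.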
And this is the math to get the surrogate pair, where we input the astral code point of the smiley face and we return the two surrogates, high and low. So because of this, if we take smiley face and split it, we can see that there are two characters there that are garbled. And smiley face as a string dot length is two, which is not what you would think. And this is true also for longer emoji. Let me just quickly share a fun example. If we open up the console, and you know, you can do this in your browser if you want. Let me close all these warnings. So if we do simple family, right, this is family, and on Mac you can do Control-Command-Space to open up your emoji picker. We can see that family dot length is 11. And if we try family dot split, there are those 11 characters split into an array. And then if we try a different one, let's do, like, singer with dark skin dot length, we see seven. And if we do singer dot split, again, we see seven. And you can try this with lots of emoji, and the lengths are surprising, which is fun. Let me pull up that slide again. So the question is, why are these seven, why are they 11? And the answer is multifaceted. But one of the, let me close the inspector here. There we go. Okay. So one of the reasons is the zero width joiner. So the zero width joiner is not an emoji itself, but it's used in emoji sequences as glue to attach different emoji code points together and create compound emoji sequences. Another important code point to keep in mind is the variation selector, which takes a plain text representation of an emoji, like this heart that we see here, and stylizes it, so that plain text characters are displayed with their colorful emoji representation. So we take the plain heart plus the variation selector, which is U+FE0F, to create the styled heart. Another emoji variant is the emoji modifier.
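The surrogate pair math and the surprising lengths above can be reproduced with a short sketch. The function name `toSurrogatePair` is an illustrative choice; the formula itself is the standard UTF-16 one. The family emoji is written with escapes here, assuming the man, woman, girl, boy sequence.

```javascript
// Compute the UTF-16 surrogate pair for an astral code point (above U+FFFF).
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
console.log("\u{1F600}".length);                             // 2

// Family: man, woman, girl, boy glued together with U+200D ZERO WIDTH JOINER.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(family.length);      // 11 (4 emoji x 2 code units + 3 joiners)
console.log([...family].length); // 7 code points

// Plain heart versus heart + U+FE0F VARIATION SELECTOR-16.
console.log("\u2764".length, "\u2764\uFE0F".length); // 1 2
```

Note that `split("")` cuts between code units, which is why the pieces look garbled, while spreading a string (`[...family]`) iterates by code point.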
And this is the skin tone modifier, and it's based on the Fitzpatrick scale, which is actually used in dermatology to describe how skin responds to UV light exposure. So it's not really a race or ethnicity scale here. It's a dermatology scale. And this skin tone modifier was added in Unicode 8.0. So if we again open up that singer example, we can see that the singer is a combination: we take the woman emoji, which again, with the high and low surrogate pair, has a length of two, plus the skin tone modifier, plus the zero width joiner, plus the microphone emoji. We get the total singer emoji equal to all of those combined, which I just think is so cool and awesome about emoji. So let's end with emoji today. So emoji are like fonts, and they have variation across platforms. Here we see, in the browser, on Apple, Google, Facebook, all these different emoji and how they can be represented differently, just like fonts in the browser. And again with the hugging face, party face, and face with tears of joy emoji, which is one of my favorites. And then I think this is funny, because the cell phone emoji or the mobile phone emoji looks like the phone of the provider across different fonts. And the hamburger emoji, this was a big internet controversy because the placement of the cheese varied, whether it was on the meat or on the bun, and people were very upset about it. And then the gun emoji, this one actually has sparked a lot of discussion among legal experts on whether or not emoji can be admissible as evidence in a court trial. And the different providers actually changed it from looking like sort of a realistic handgun to instead looking like a toy water gun to try and avoid this controversy. So, with that in mind, I think it's really important to know that anyone can propose a new emoji for the next version of Unicode. You can go to this link, and it is a process, but it's possible.
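The singer breakdown described above can be written out piece by piece. This sketch assumes the darkest skin tone modifier (U+1F3FF), matching the "singer with dark skin" demo; the variable names are illustrative.

```javascript
const woman = "\u{1F469}";        // WOMAN, a surrogate pair: 2 code units
const darkSkinTone = "\u{1F3FF}"; // EMOJI MODIFIER FITZPATRICK TYPE-6: 2 code units
const zwj = "\u200D";             // ZERO WIDTH JOINER: 1 code unit
const microphone = "\u{1F3A4}";   // MICROPHONE: 2 code units

const singer = woman + darkSkinTone + zwj + microphone;
console.log(singer.length); // 7

// List the code points that make up the sequence.
const codePoints = [...singer].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);
console.log(codePoints.join(" ")); // U+1F469 U+1F3FF U+200D U+1F3A4

// Swapping the last component gives a different sequence of the same length,
// here with ARTIST PALETTE (U+1F3A8) instead of the microphone.
const artist = woman + darkSkinTone + zwj + "\u{1F3A8}";
console.log(artist.length); // 7
```

Whether a given combination shows up as a single glyph depends on the font having a design for that sequence; otherwise the parts are typically drawn side by side.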
So for example, these four emoji that we see here were all sort of externally proposed to Unicode, each with a goal. So the blood drop emoji we see here is featured in the documentary, beyond the emoji, which I recommend. And the goal of this one was to break the stigma of menstruation. The woman with the hijab headscarf emoji, there was someone who said, hey, I wear a headscarf, I don't have any emojis that look like me, I want to get one into Unicode, and they made that happen. And Jennifer 8. Lee, who's a really incredible emoji leader, she is from a Chinese background and she wanted a dumpling that was part of her culture, so she worked hard to get a dumpling included in emoji. So, you know, I've been to the Unicode conference, and while they're doing great work from an engineering standpoint and the encoding is fascinating, the Unicode Consortium, the people who are determining emojis and making changes for future versions of Unicode, it's pretty male and pale. So I think it's really important to have more transparency about this process, so that we can get more emoji out there and encode them into bits, because it's not just a matter of having kind of the right icon to describe what you ate for lunch. It's about digital acknowledgement of culture, about who gets to be represented in our future digital language, and how those representations take into account different things like ethnicity, religion, and more. So, I highly encourage people to learn more about Unicode, learn more about how they can propose a new emoji, and get involved in the guts of the encoding process. So that's all I've got. Thank you so much. Awesome. Thanks so much. Hey, how's it going? Hi, doing well. Good, good, good. As I was saying to some of my co-chefs, what an informative presentation that was. I'm looking forward to having a chat with my friends tomorrow and randomly bringing up the question as to why the family emoji is 11 characters long.
And they're going to be like, Oh, Chris, you're so smart. I'm like, Yep, I am. But yeah, we do have quite a number of questions from the audience that I'll get into. So first, the first question is from Nick's. Chris wants to know. So because some emojis are compounds, can you replace elements of one to produce something different? That's a really fun question. You know what, I would love to try that. I don't see why not. It makes sense. Explore, give it a try. I think it would have to be a valid sequence. It would need a design in the font face that you've selected. So if you're on a Mac, you know, there would need to be an Apple design for that emoji. And if there is, then you can probably change them using the different Unicode characters if you want, instead of using the emoji picker. And then this question was from Rin and Amir. They wanted to know what is your favorite emoji and why. I love emoji because there are so many different ways that you can use them. And I often will use a wide variety. And I try to use the less popular emojis. But I think my all-time favorite is probably the face with tears of joy, just because it's happy and fun. Do you have one, Chris? Oh, man, like mine. Mine is usually the best one is just this, you know, the side just saying like, OK, that's mine. That's mine, because pretty much it's always a long thread and I just can't get through it. And I'm just like, oh, yeah, right. Get right back. Yeah. I like the thumbs up one, too. Exactly. Exactly. People understand. I'm not busy. I'm just too lazy to respond. We have a question from Cynthia. And the question is, so if someone is building a Twitter-like application and we were limiting the characters in the message using string length to calculate the character count, how could you make sure that an emoji counts as one character instead of two or sometimes seven? Yes. So this is really interesting.
I was thinking about doing a Twitter demo because Twitter counts each individual emoji as two characters. So you can try it out yourself. You know, on Twitter, you can have a string of all A's. And then if you add an emoji at the end, you'll see that one emoji is two characters. So that's something to keep in mind. And another thing I find fascinating about Twitter emojis is that certain hashtags include an emoji. So, for example, I've been using that hashtag Black Lives Matter a lot lately. And at the end, Twitter includes the sort of fist emoji. And so I don't know the exact number of characters, but you'd think it would just be the ABC characters and not the emoji when you're counting for character count. But the emoji is included as two characters in the hashtags that have emojis, if that makes sense. And then a question I also had as well, because I was quite fascinated by your talk, is I wanted to find out what role diversity of culture and traditions play in the design of emojis. A big role. That's something that I think is really important with emoji, to try and keep them representative of the actual world of people who are using them. You know, these little tiny pictures exist on every phone, every computer, all across the world, all across different languages, so I think to be more accessible, they should represent the actual people using them. And that's why, you know, using different gender modifiers or different skin tone modifiers is really important to include, and to handle that encoding, because then we can, you know, make emojis as diverse as the people using the emojis. And I think it's really useful for everyone to learn more about Unicode and the sort of quiet consortium that's working behind the scenes to add new emoji, because things like, you know, the woman with the hijab or the rickshaw or the dumpling emoji, these are representative of different cultures that maybe the original Unicode Consortium didn't include.
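Returning to the character-count question above: one possible way to count what a user perceives as one character is to count grapheme clusters with the standard `Intl.Segmenter` API. This is only a sketch of one option, not Twitter's own counting rule, and it assumes a runtime that ships `Intl.Segmenter` (recent browsers and Node 16 or later). The function name `countGraphemes` is made up for illustration.

```javascript
// Count user-perceived characters (grapheme clusters) rather than UTF-16 code units.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const singer = "\u{1F469}\u{1F3FF}\u200D\u{1F3A4}";
const drink = "cafe\u0301";

// code units, code points, graphemes
console.log(family.length, [...family].length, countGraphemes(family)); // 11 7 1
console.log(singer.length, [...singer].length, countGraphemes(singer)); // 7 4 1
console.log(drink.length, [...drink].length, countGraphemes(drink));    // 5 5 4
```

A length limit built on `countGraphemes` treats each compound emoji or accented letter as one unit; whatever stores the message still needs to budget for the underlying code units or bytes.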
And I think it's really important to get more of those out there. For sure. For sure. And then I have a question from Drum. Drum wants to know, so can we use UTF-16 or UTF-32 on the web? Yeah, that's a good question. So, it's recommended to use UTF-8 online on the web. And it's important to remember that the web is encoded in UTF-8. So, if you have, you know, a database or a backend process that's using UTF-32, or might be using ASCII or a different encoding altogether, keep that in mind, because when, you know, a string gets encoded and decoded, there could be conflicts if they're different systems. So, the benefit of UTF is that it works whether you're using UTF-32 or 16 or 8. But if you're using a different system, like we talked about, you know, JavaScript under the hood has some UCS-2, that's where the problem arises. So, always know at each point in your full app architecture what the encoding and decoding is, so that you can remove any of those bugs. Yeah, yeah, for sure, for sure. And I feel like this talk, just based off of, one, the fact that emojis are something that we all utilize, and two, the fact that a lot of them represent us as individuals quite accurately as well, I feel like this talk is one that could go on and on and on and on, even beyond. But guys, Naomi is not going, she's going to be moved over to her own Zoom room. So, there are so many questions still piling in that we just don't have time to finish with here. But yeah, I guess for myself, you know, I just had one final question, and I promise this is only food related because we are doing a pseudo chef show. But I guess myself and the other co-chefs wanted to find out, what is your comfort food choice? That's a great question. And my family is Jewish and we have matzo ball soup, which is like the Jewish chicken noodle soup. And that's definitely my comfort food. And I was thinking that could actually be another emoji, a matzo ball soup emoji.
There you go. You know, ideas, man, ideas. Now you know. But Naomi, thank you very much for an informative presentation. You're definitely going to make me, you know, I'm going to be much smarter in front of my friends because of it for at least the next week or so, because I'm just going to be relaying everything you just mentioned. And then some. And yeah, please. She'll be going to the Zoom room. So feel free to keep the questions coming, you know, peeps. And yeah, Naomi, till we speak again. Great. Thank you so much, Chris. Thank you. Bye bye.