JS Character Encodings

Character encodings can be confusing for every developer, providing pitfalls even for the most experienced ones, so a lot of the time we want to end up with something that “just works” without an in-depth understanding of the involved concepts. In this talk, Anna will give an overview of what they are, what the JavaScript language provides to interact with them, and how to avoid the most common mistakes in Node.js and on the Web.

FAQ

Strings are sequences of characters, like text, while sequences of bytes represent the digital encoding of these characters. In computing, these bytes can represent anything and are not exclusively tied to textual data.

UTF-8 is backwards compatible with ASCII because the first 128 code points (0 to 127) are encoded as exactly the same single bytes as in ASCII. Pure ASCII data is therefore already valid UTF-8 and can be read as UTF-8 without modification.

JavaScript engines can use various methods to store strings, often defaulting to UTF-16 encoding. However, engines are optimized to use different encodings based on the content of the string to conserve memory and enhance performance.

Using different encodings can impact how data is processed and stored. For instance, encoding mismatches can lead to data loss or misinterpretation, especially when interacting with external systems or networks.

The TextDecoder API in JavaScript allows decoding of text from binary data using different character encodings. It supports options like 'fatal' and 'stream' to handle errors and stream decoding, providing flexibility in data handling.
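As a minimal sketch of the `fatal` option, assuming a runtime where `TextDecoder` is available as a global (modern browsers and recent Node.js versions), the following decodes a deliberately malformed byte sequence twice. The byte values are made up for illustration.

```javascript
// 0xC3 announces a two-byte UTF-8 sequence, but 0x28 is not a valid
// continuation byte, so this input is malformed on purpose.
const invalidUtf8 = new Uint8Array([0xc3, 0x28]);

// Default behaviour: the malformed part is replaced with U+FFFD.
const lenient = new TextDecoder('utf-8').decode(invalidUtf8);

// With fatal: true, the same input throws a TypeError instead.
let strictError = null;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(invalidUtf8);
} catch (err) {
  strictError = err;
}

console.log(JSON.stringify(lenient));          // "\ufffd(" (replacement character, then "(")
console.log(strictError instanceof TypeError); // true
```

Whether silent replacement or a thrown error is preferable depends on the application; the sketch only shows that the choice exists.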

Understanding character encoding is crucial because it ensures that data is interpreted correctly across different systems and platforms. This is essential for maintaining data integrity and compatibility in global applications.

A common issue in Node.js is the misalignment of character data across data chunks when using certain encoding methods. This can lead to characters being incorrectly decoded if the data chunks do not align with character boundaries.
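The chunk-boundary problem can be sketched as follows, assuming global `TextEncoder` and `TextDecoder`. The two-chunk split is artificial and chosen so that a multi-byte character straddles the boundary; real streams split wherever the underlying transport happens to.

```javascript
// "caf\u00e9" is 5 bytes in UTF-8: the final "é" takes two bytes (0xC3 0xA9).
const bytes = new TextEncoder().encode('caf\u00e9');

// Artificial split in the middle of the two-byte character.
const chunks = [bytes.slice(0, 4), bytes.slice(4)];

// Naive approach: decode every chunk on its own. Each half of the
// character is invalid by itself and becomes U+FFFD.
const naive = chunks.map((chunk) => new TextDecoder().decode(chunk)).join('');

// Streaming approach: one decoder keeps the incomplete bytes between calls.
const decoder = new TextDecoder();
let streamed = '';
for (const chunk of chunks) {
  streamed += decoder.decode(chunk, { stream: true });
}
streamed += decoder.decode(); // flush anything still buffered

console.log(JSON.stringify(naive)); // "caf\ufffd\ufffd"
console.log(streamed);              // café
```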

Anna Henningsen
33 min
14 Apr, 2023

Video Summary and Transcription

Character encodings are important for converting characters into bytes. UTF-8 is the most commonly used encoding. JavaScript engines choose how to store strings internally, while the language lets you interact with strings as if they were stored using UTF-16. There are bugs in Node.js related to character encoding and string manipulation. It is important to be cautious when working with character encodings and to choose the appropriate method for string manipulation.

1. Introduction to Character Encodings

Short description:

I work at MongoDB on the Developer Tools team. So let's jump in. Why are character encodings important? Your program is typically run by an operating system that has no idea what a string is. The solution is to assign numbers to characters and convert them into bytes. Strings and sequences of bytes are different things. Historically, people came up with ways to assign numbers to characters, like ASCII and character encodings for different languages.

I work at MongoDB on the Developer Tools team, so the Shell and the GUI and the VSCode extension for the database, but this talk has absolutely nothing to do with that. So let's jump in.

So about a month ago or so I saw this tweet, which got somewhat popular on Twitter, and you know... Some people are laughing, you get the joke. Obviously, the easiest way to get the length of a string in JavaScript is to do object spread on it, then call Object.keys on that object, and then use Array.prototype.reduce to sum up the length of that array. So we all know what the joke is. But let's take a step back.
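The tweet's actual code is not part of this transcript. The following is a hypothetical reconstruction from the spoken description only, and the function name `jokeLength` is made up for illustration.

```javascript
// Hypothetical reconstruction of the joke, not the tweet's actual code.
function jokeLength(str) {
  // Spreading a string into an object yields one key per index:
  // { ...'abc' } is { 0: 'a', 1: 'b', 2: 'c' }.
  const keys = Object.keys({ ...str });
  // Count the keys with reduce instead of simply reading str.length.
  return keys.reduce((count) => count + 1, 0);
}

console.log(jokeLength('hello'));                   // 5
console.log(jokeLength('hello') === 'hello'.length); // true
```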

Why are character encodings sometimes something that we care about or have to deal with? The typical situation that you're in is you're a software developer and you're writing software. You're writing a program. That program does not exist in isolation. There is something else out there, literally anything but your program like the file system, network, other programs, other computers, anything like that. And obviously you want your software to be able to communicate with them. The default way to communicate anything is to use strings. You can put basically anything in a string. Any data you have you can serialize into a string. So it would be nice if we could talk with these other programs using strings. Unfortunately, that's not how it works.

Your program is typically run by an operating system that has no idea what a string is. If it's a JavaScript program, which is going to be the case for many of you, a JavaScript string is something that the JavaScript engine understands, but your operating system has no idea what to do with that. You can't just pass it directly to the operating system. That also means you can't pass it to other things. So the solution that people came up with is, you have your string, and for each character in that string you assign that character a number, and then you come up with some clever way to convert these numbers into a sequence of bytes. And this feels like a very basic discussion to have, but I think it's important to have that distinction in mind.
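The two steps described above can be sketched like this, assuming global `TextEncoder` and `TextDecoder` and picking UTF-8 as the "clever way" of turning numbers into bytes. The sample text is arbitrary.

```javascript
const text = 'A\u20ac'; // "A" followed by the euro sign

// Step 1: every character gets a number.
const numbers = Array.from(text, (ch) => ch.codePointAt(0));

// Step 2: the numbers become bytes. TextEncoder always produces UTF-8
// and performs both steps in one call.
const encoded = Array.from(new TextEncoder().encode(text));

// Decoding runs the same steps in reverse.
const decoded = new TextDecoder().decode(new Uint8Array(encoded));

console.log(numbers);          // [ 65, 8364 ]
console.log(encoded);          // [ 65, 226, 130, 172 ]
console.log(decoded === text); // true
```

Two characters become two numbers but four bytes, which is one way to see that a string and its byte sequence are not the same thing.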

When I say strings, I mean sequences of characters, like text. This intermediate representation, which for the most part you don't care about, I'm going to refer to as code points, because that is the language that Unicode uses for this, and then your output is a sequence of bytes. Obviously when you're decoding you go through these steps in reverse. If you take anything away from this talk, it's that strings and sequences of bytes are different things.

Historically, how people have approached that: back in the 70s, when Americans had not yet discovered that there is something besides America in the world, you came up with a standardized way to assign numbers to characters, ASCII, and those were the numbers from 0 to 127, and that's enough space for the lowercase and uppercase English alphabets and some special characters and, you know, who needs more than that? Then the next iteration, which got a little bit more popular around the 90s I would say, is, you know, you discover that there are other languages out there besides English, and you say, okay, well, ASCII is 128 characters, so 7 bits, bytes usually have 8 bits, so we have another 128 characters available. And the solution that people came up with was, you know, you're probably either going to have Greek text, or Slavic text, or Arabic text, you're probably not going to mix these. So, for each of these you create a character encoding.

2. Character Encodings and JavaScript

Short description:

And so with these ISO-8859 character encodings, there are like 16 different character encodings, and in each of them the additional characters that are not ASCII have a different meaning. Unicode solves the problem by allowing as many code points as we want. UTF-8 is the most commonly used encoding, and it is backwards compatible with ASCII. UTF-16, on the other hand, uses two bytes per character but can require four bytes for certain characters. JavaScript lets you interact with strings as if they were stored using UTF-16.

And so with these ISO-8859 character encodings, there are like 16 different character encodings, and in each of them the additional characters that are not ASCII have a different meaning. But you can't mix, like you can't have a single byte sequence that can represent both, say, Greek and Arabic text, and sometimes you might want that. So something that got popular towards the end of the 90s is Unicode.

And so Unicode essentially solves that problem by saying, yeah, we're not going to stick to single-byte encodings, we're just going to have as many code points as we want. There is a limitation, like around one million code points currently, but, I mean, we're not close to hitting that. I don't think we're going to get that many emojis, so I think that's OK. What is sometimes relevant for JavaScript is that the first 256 code points match one of these prior encodings, namely ISO-8859-1. That doesn't mean by itself that it is compatible with ASCII, because that's only the code points, not the actual transformation to byte sequences. But then you have multiple encodings to do that, and the one that we all know and use every day is UTF-8, and this one is backwards compatible with ASCII because, you know, the first 128 byte values match ASCII exactly, and it uses all the other bytes to, you know, represent other characters that don't fit into that range.
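The difference between "the code points match" and "the bytes match" can be illustrated with a small sketch, assuming a global `TextEncoder` (which always encodes to UTF-8). The two sample characters are arbitrary choices.

```javascript
const encoder = new TextEncoder(); // always UTF-8

// An ASCII character: code point 0x41, and UTF-8 uses that same single byte.
const asciiBytes = Array.from(encoder.encode('A'));

// U+00E9 (e with acute accent) lies within the first 256 code points, so its
// number matches the ISO-8859-1 numbering, but UTF-8 needs two bytes for it.
const eAcute = '\u00e9';
const codePoint = eAcute.codePointAt(0);
const utf8Bytes = Array.from(encoder.encode(eAcute));

console.log(asciiBytes); // [ 65 ]
console.log(codePoint);  // 233
console.log(utf8Bytes);  // [ 195, 169 ]
```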

And then there's UTF-16, which JavaScript people might also care about from time to time, where the idea is closer to, you know, two bytes per character. This made a lot of sense when Unicode was first introduced because back then, you know, nobody expected that there might be more than 65,000 characters to care about. So, you know, two bytes was a very natural choice for that. But with things like emoji being introduced, we've stepped outside that range. So some things have to be represented by pairs of two-byte units, so four bytes in total.

So people sometimes say that JavaScript uses UTF-16, and like, well, there might be something to that. So I have here the output of the Unicode command line utility. If you've never used that, it is a very neat tool for finding out information about individual characters or looking up characters based on their code points, all that kind of stuff. Whoever wrote that, I am very thankful. There is an example of what this looks like in UTF-16. I've highlighted that. And then, what happens when you use Node to print out the length of a string that only contains this single hamster face character? It says two, even though it's one character. And then you can dig further and you see that, like, this one character compares equal to a string comprised of two escape sequences. And these escape sequences happen to match exactly how UTF-16 serializes things.

And so you might say, well, JavaScript uses UTF-16. I'm done. The reality is that UTF-16 is a character encoding. It's a way of transforming sequences of characters into sequences of bytes. There is no sequence of bytes in here. This is not an encoding thing. It just happens to have some similarities. So in some ways, JavaScript lets you interact with strings as if they were stored using UTF-16.
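The slide itself is not reproduced in the transcript. As a rough sketch of the hamster face experiment, assuming the character in question is U+1F439 and that `TextEncoder` is available as a global, the checks could look like this:

```javascript
// Build the character from its code point to avoid source-encoding issues.
const hamster = String.fromCodePoint(0x1f439);

// length counts 16-bit code units, not characters.
const units = hamster.length;

// The string compares equal to a pair of \u escape sequences (a surrogate pair).
const equalsSurrogates = hamster === '\ud83d\udc39';

// Iterating a string goes by code points, so spreading yields one element.
const codePoints = [...hamster].length;

// Actual bytes only appear once the string is encoded, here as UTF-8.
const utf8 = Array.from(new TextEncoder().encode(hamster));

console.log(units);            // 2
console.log(equalsSurrogates); // true
console.log(codePoints);       // 1
console.log(utf8);             // [ 240, 159, 144, 185 ]
```

The first three checks never touch a byte sequence, which matches the point made above: the UTF-16-like behaviour is a property of how the language exposes strings, not an encoding step.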
