Why Characters Need Encoding
Computers store everything as numbers. When you type the letter "A" on your keyboard, the computer does not store the shape of the letter — it stores a number that represents it. The system that maps characters to numbers is called a character encoding. Getting this mapping wrong is how you end up with garbled text, question marks where accents should be, or the infamous diamond-with-a-question-mark symbol (the Unicode replacement character, U+FFFD) that haunts poorly configured websites.
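In Python, for instance, the built-in ord and chr functions expose this character-to-number mapping directly:

```python
# Characters are stored as numbers; ord() and chr() expose the mapping.
print(ord("A"))   # 65 -- the number the computer stores for "A"
print(chr(65))    # A  -- and the same number turned back into a character
```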
Understanding encoding is not just an academic exercise. If you have ever pasted text from one application into another and watched special characters turn into gibberish, you have experienced an encoding mismatch firsthand. Knowing the basics helps you avoid these problems and fix them when they appear.
ASCII: Where It All Started
The American Standard Code for Information Interchange, or ASCII, was published in 1963. It assigns numbers 0 through 127 to 128 characters, covering the English alphabet (uppercase and lowercase), digits 0 through 9, punctuation marks, and a set of control characters like tab, newline, and carriage return.
ASCII was elegant in its simplicity. Seven bits were enough to represent every character, and it worked perfectly for English text. The problem was obvious: the world has far more than 128 characters. Languages like French, German, and Spanish need accented letters. Chinese, Japanese, and Korean use thousands of distinct characters. Arabic and Hebrew are written right to left. ASCII had no room for any of this.
The Code Page Era
To handle characters beyond ASCII, vendors created extended character sets known as code pages. Single-byte code pages used the eighth bit (values 128 through 255) to add 128 more characters: ISO 8859-1 (Latin-1) covered Western European languages, and ISO 8859-5 handled Cyrillic. Japanese needed far more room than one extra bit could provide, so encodings like Shift_JIS and EUC-JP used multi-byte sequences instead. Windows had its own single-byte variations, like Windows-1252.
The fundamental problem with code pages was that the same byte value meant different characters in different encodings. Byte 0xC0 might be "A with grave accent" in Latin-1 but a completely different character in a Cyrillic encoding. If you opened a file without knowing which code page was used, you got mojibake — scrambled, unreadable text.
This was not a minor inconvenience. It created real barriers to international communication, data exchange, and software development. Every application had to track which encoding each piece of text used, and mistakes were frequent.
Unicode: One Number for Every Character
The Unicode Consortium set out to solve this problem permanently by assigning a unique number — called a code point — to every character in every writing system. The first version in 1991 covered roughly 7,000 characters. Today, Unicode 15.1 defines over 149,000 characters spanning 161 scripts, plus thousands of symbols, technical marks, and emoji.
Unicode code points are written in the format U+XXXX, where XXXX is a hexadecimal number. U+0041 is the Latin capital letter A. U+4E16 is the Chinese character meaning "world." U+1F600 is the grinning face emoji. Every character has exactly one code point, no matter which platform or application you use.
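Since a code point is just an integer, the U+XXXX notation maps directly onto hexadecimal literals. A quick sketch of the three examples above:

```python
# Code points are integers, conventionally written in hex as U+XXXX.
print(chr(0x0041))   # A  -- U+0041, Latin capital letter A
print(chr(0x4E16))   # 世 -- U+4E16, Chinese character meaning "world"
print(chr(0x1F600))  # 😀 -- U+1F600, grinning face emoji

# Going the other way, from character to U+XXXX notation:
print(f"U+{ord('A'):04X}")  # U+0041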
But Unicode is a character set, not an encoding. It tells you what number represents each character. It does not specify how those numbers should be stored as bytes in a file. That is where encoding formats like UTF-8, UTF-16, and UTF-32 come in.
UTF-8: The Encoding That Won
UTF-8, designed by Ken Thompson and Rob Pike in 1992, encodes Unicode code points using one to four bytes. Its brilliance lies in backward compatibility: ASCII characters (U+0000 through U+007F) use exactly one byte, and that byte is identical to its ASCII value. This means any valid ASCII file is also a valid UTF-8 file with no conversion needed.
Characters beyond ASCII use multi-byte sequences. Latin accented characters and many symbols use two bytes. Characters from most Asian scripts use three bytes. Emoji and rare historical scripts use four bytes. The first byte of a multi-byte sequence tells the decoder how many bytes to expect, making the format self-synchronizing — you can jump to any point in a UTF-8 stream and find the start of the next character without reading from the beginning.
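The variable byte lengths are easy to observe by encoding one character from each range (the four sample characters here are illustrative):

```python
# UTF-8 uses one to four bytes per code point, depending on its range.
for ch in ["A", "é", "世", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex(" "))
# A 1 41           -- identical to its ASCII byte
# é 2 c3 a9        -- Latin accented character: two bytes
# 世 3 e4 b8 96    -- CJK character: three bytes
# 😀 4 f0 9f 98 80 -- emoji: four bytes
```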
UTF-8 has become the dominant encoding on the internet. As of 2024, over 98 percent of all web pages use UTF-8. It is the default encoding in HTML5, JSON, XML, and most modern programming languages. When you create a new file today, UTF-8 is almost certainly the right choice.
UTF-16 and UTF-32
UTF-16 uses two bytes for characters in the Basic Multilingual Plane (code points up to U+FFFF) and four bytes for characters beyond that range. Java and JavaScript use UTF-16 internally, which is why string length in these languages sometimes gives unexpected results for emoji — a single emoji might count as two "characters" because it requires a surrogate pair.
UTF-32 uses exactly four bytes for every character. This makes random access simple — the 50th character is always at byte offset 200 — but wastes significant space for text that is primarily ASCII or Latin-based. UTF-32 is rarely used for file storage or transmission.
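The surrogate-pair behavior and the UTF-32 size cost can both be demonstrated by encoding to the byte-order-explicit variants (which add no BOM):

```python
# An emoji above U+FFFF needs a surrogate pair in UTF-16: two 16-bit units.
emoji = "😀"  # U+1F600
units = len(emoji.encode("utf-16-be")) // 2
print(units)  # 2 -- this is why the emoji "counts as two" in Java/JavaScript

# UTF-32 spends exactly four bytes on every character, even plain ASCII.
print(len("hello".encode("utf-32-be")))  # 20 -- 5 characters x 4 bytes
```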
Why Encoding Still Matters
Even in a world converging on UTF-8, encoding problems persist. Legacy systems still produce files in older encodings. CSV exports from some spreadsheet applications default to the system locale encoding rather than UTF-8. Database columns might be configured with a specific encoding that does not match the application layer. Email headers can specify one encoding while the body uses another.
The Byte Order Mark (BOM) adds another layer of complexity. Some applications prepend the bytes EF BB BF to UTF-8 files. While the Unicode standard permits this, it causes problems with Unix tools, JSON parsers, and shell scripts that do not expect invisible bytes at the start of a file.
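Python illustrates both sides of the problem: a plain UTF-8 decode keeps the BOM as an invisible character, while the utf-8-sig codec strips it:

```python
# A UTF-8 BOM is the byte sequence EF BB BF at the start of the data.
data = b"\xef\xbb\xbfhello"

print(repr(data.decode("utf-8")))      # '\ufeffhello' -- BOM survives, invisibly
print(repr(data.decode("utf-8-sig")))  # 'hello'       -- BOM stripped
```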
When working with text, always verify which encoding you are using. Specify UTF-8 explicitly when creating files, opening database connections, or sending HTTP responses. When receiving text from external sources, check the declared encoding before processing.
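A minimal sketch of what "specify UTF-8 explicitly" looks like in practice — the file name here is illustrative:

```python
# Name the encoding explicitly instead of relying on the platform default,
# which varies by operating system and locale.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("naïve café, 世界 😀\n")

with open("notes.txt", "r", encoding="utf-8") as f:
    print(f.read())
```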
Practical Takeaways
Use UTF-8 for everything new. There is no reason to choose any other encoding for new projects, files, or databases. If you encounter garbled text, the most likely cause is an encoding mismatch — the file was written in one encoding and read in another. Tools that convert between encodings can fix existing files, but the long-term solution is always to standardize on UTF-8.
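Fixing an existing file follows the same decode/encode pattern that tools like iconv automate: decode the bytes using the legacy encoding, then re-encode as UTF-8. A sketch, assuming the legacy data came from a Windows-1252 export:

```python
# Re-encoding legacy bytes to UTF-8: decode with the old encoding first.
legacy = "Smørrebrød".encode("windows-1252")  # simulated legacy export
fixed = legacy.decode("windows-1252").encode("utf-8")
print(fixed.decode("utf-8"))  # Smørrebrød -- intact after conversion
```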
Character encoding is invisible infrastructure. When it works, nobody notices. When it breaks, text becomes unreadable. Understanding how it works puts you in a position to prevent problems before they happen and diagnose them quickly when they do.