Why Text Encoding Exists
Computers transmit data as bytes, but not all byte values are safe in every context. URLs cannot contain spaces. HTML interprets angle brackets as tags. Email systems may corrupt binary data. Text encoding schemes solve these problems by transforming unsafe characters into safe representations that can be decoded back to the original.
Understanding these encodings is not optional for web developers. Incorrect encoding causes broken links, garbled text, security vulnerabilities (XSS attacks), and silent data corruption that surfaces only in production.
Base64 Encoding
What It Does
Base64 represents binary data as text drawn from an alphabet of 64 ASCII characters (A-Z, a-z, 0-9, +, /), with = used only for padding. Every three bytes of input become four Base64 characters, making the encoded output about 33% larger than the original.
When to Use It
Base64 is essential when you need to embed binary data in a text-only context. Common use cases include:
- Embedding images directly in HTML or CSS using data URIs
- Sending binary attachments in email (MIME encoding)
- Storing binary data in JSON, which only supports text values
- Passing binary data through APIs that expect text
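As one illustration of the JSON case, the sketch below wraps a few raw bytes in a JSON document and recovers them on the other side. The field names `filename` and `data` and the sample bytes are assumptions made for this example, not part of any particular API.

```python
import base64
import json

# Hypothetical binary payload (the eight-byte signature that starts a PNG file).
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# JSON has no byte type, so encode to Base64 and store the ASCII text.
document = json.dumps({
    "filename": "logo.png",
    "data": base64.b64encode(raw).decode("ascii"),
})

# The receiver reverses the steps to get the original bytes back.
restored = base64.b64decode(json.loads(document)["data"])
print(restored == raw)  # True
```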
How It Works
Base64 takes input bytes in groups of three (24 bits), splits them into four 6-bit groups, and maps each group to one of 64 characters. If the input length is not a multiple of three, padding characters (=) are added to complete the final group.
For example, the text "Hi" (two bytes) becomes "SGk=" in Base64. The "=" is padding because two bytes do not fill a complete three-byte group.
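The grouping described above can be sketched by hand and checked against Python's standard `base64` module. The function name `base64_encode_sketch` is hypothetical and the code is written for readability, not speed; in practice the standard library call is the one to use.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def base64_encode_sketch(data: bytes) -> str:
    """Encode bytes as padded Base64, three input bytes at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pad the chunk with zero bytes so it forms one 24-bit number.
        n = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        # Split the 24 bits into four 6-bit indexes into the alphabet.
        indexes = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        # Only len(chunk) + 1 characters carry data; the rest become "=".
        keep = len(chunk) + 1
        out.append("".join(ALPHABET[j] for j in indexes[:keep]) + "=" * (4 - keep))
    return "".join(out)

print(base64_encode_sketch(b"Hi"))            # SGk=
print(base64.b64encode(b"Hi").decode("ascii"))  # SGk=
```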
Common Pitfalls
Base64 is encoding, not encryption. It provides zero security — anyone can decode it instantly. Never use Base64 to "hide" sensitive data.
Base64 increases data size by approximately 33%. For large files, this overhead adds up. If you are Base64-encoding images for a web page, consider whether serving the image as a separate file would be more efficient.
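The overhead follows directly from the three-bytes-to-four-characters rule. A small sketch, assuming padded output with no line breaks (the helper name `base64_length` is made up for this example):

```python
import base64

def base64_length(n_bytes: int) -> int:
    """Length of padded Base64 output: four characters per started 3-byte group."""
    return 4 * ((n_bytes + 2) // 3)

payload = bytes(300_000)  # 300,000 zero bytes as a stand-in for a file
encoded = base64.b64encode(payload)

print(len(encoded))                 # 400000
print(base64_length(len(payload)))  # 400000
```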
URL-safe Base64 replaces "+" with "-" and "/" with "_" to avoid conflicts with URL syntax. Use this variant when Base64 data appears in URLs or filenames.
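The difference between the two alphabets is easiest to see with input that happens to produce both "+" and "/". The two sample bytes below were picked for that reason only.

```python
import base64

raw = b"\xfb\xff"  # chosen because it encodes to both "+" and "/"

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(standard)  # +/8=
print(url_safe)  # -_8=

# Each variant must be decoded with the matching function.
print(base64.urlsafe_b64decode(url_safe) == raw)  # True
```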
URL Encoding (Percent Encoding)
What It Does
URL encoding replaces unsafe characters with a percent sign followed by two hexadecimal digits for each byte of the character, normally its UTF-8 bytes. A space becomes %20, an ampersand becomes %26, and a forward slash becomes %2F. A non-ASCII character such as é takes two bytes in UTF-8 and so becomes %C3%A9.
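A quick way to see these substitutions is Python's `urllib.parse.quote`. Passing `safe=""` is a choice made for this sketch so that the forward slash is encoded too; by default `quote` leaves "/" alone.

```python
from urllib.parse import quote, unquote

print(quote(" ", safe=""))  # %20
print(quote("&", safe=""))  # %26
print(quote("/", safe=""))  # %2F

# Non-ASCII characters are encoded byte by byte from their UTF-8 form.
print(quote("é", safe=""))  # %C3%A9

# Decoding restores the original text.
print(unquote("hello%20world"))  # hello world
```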
When to Use It
Any data placed into a URL must be properly encoded. This includes:
- Query parameter values: `?search=hello%20world`
- Path segments containing special characters
- Form data submitted via GET requests
- Any user input that becomes part of a URL
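A minimal sketch of the first two cases, assuming a hypothetical `https://example.com` site. `urlencode` writes spaces as "+" by default, so `quote_via=quote` is passed here to get the %20 form shown in the list.

```python
from urllib.parse import quote, urlencode

# Query parameters: encode each key and value, then join with "&".
params = {"search": "hello world", "tag": "a&b"}
query = urlencode(params, quote_via=quote)
print(query)  # search=hello%20world&tag=a%26b

# Path segment: encode "/" as well so it is not read as a separator.
segment = quote("reports/2024 Q1", safe="")
print(segment)  # reports%2F2024%20Q1

url = "https://example.com/files/" + segment + "?" + query
```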
Reserved vs. Unreserved Characters
URL syntax reserves certain characters for structural purposes. Ampersands (&) separate query parameters. Equals signs (=) separate keys from values. Question marks (?) begin the query string. These characters must be encoded when they appear as data rather than structure.
Unreserved characters — letters, digits, hyphens, underscores, periods, and tildes — never need encoding.
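The sketch below shows what goes wrong when reserved characters are left unencoded inside a value, using Python's `parse_qs` as a stand-in for whatever parses the query string on the server. The parameter name `note` is made up for the example.

```python
from urllib.parse import parse_qs, quote

value = "a=b&c"  # data that happens to contain reserved characters

# Unencoded, the "&" is read as a parameter separator and the value is cut short.
broken = parse_qs("note=" + value)
print(broken)  # {'note': ['a=b']}

# Encoded, the reserved characters survive as data.
intact = parse_qs("note=" + quote(value, safe=""))
print(intact)  # {'note': ['a=b&c']}

# Unreserved characters pass through quote() unchanged.
print(quote("AZaz09-_.~", safe=""))  # AZaz09-_.~
```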
Double Encoding
A common mistake is encoding data that is already encoded, turning %20 into %2520. This happens when encoding functions are applied more than once, or when a framework automatically encodes data that you have already manually encoded. The result is URLs that look wrong and break when decoded.
Always know which layer of your application is responsible for encoding, and encode exactly once.
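Double encoding is easy to reproduce: the percent sign is itself unsafe, so a second pass turns each "%" into "%25". A small sketch:

```python
from urllib.parse import quote, unquote

once = quote("a b", safe="")
twice = quote(once, safe="")  # the mistake: encoding an already encoded value

print(once)   # a%20b
print(twice)  # a%2520b

# A single decode of the double-encoded value does not recover the original.
print(unquote(twice))           # a%20b
print(unquote(unquote(twice)))  # a b
```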
HTML Entities
What It Does
HTML entity encoding replaces characters that have special meaning in HTML with named or numeric references. The less-than sign (<) becomes `&lt;`, the greater-than sign (>) becomes `&gt;`, the ampersand (&) becomes `&amp;`, and double quotes (") become `&quot;`.
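In Python, `html.escape` from the standard library performs these replacements. With its default `quote=True` it also encodes the single quote, as a numeric reference.

```python
import html

print(html.escape("<"))  # &lt;
print(html.escape(">"))  # &gt;
print(html.escape("&"))  # &amp;
print(html.escape('"'))  # &quot;
print(html.escape("'"))  # &#x27;

# html.unescape reverses the transformation.
print(html.unescape("&lt;b&gt;"))  # <b>
```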
When to Use It
HTML encoding is critical whenever untrusted text is inserted into an HTML document. Without encoding, user-supplied text containing angle brackets could be interpreted as HTML tags, leading to cross-site scripting (XSS) attacks.
Common contexts requiring HTML encoding:
- Displaying user-generated content on a web page
- Inserting dynamic values into HTML attributes
- Showing code snippets in documentation or tutorials
- Any text that originates outside your application
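As a rough sketch of the first two cases, the hypothetical `render_comment` helper below escapes untrusted values before placing them in element content and in a double-quoted attribute. It assumes the attribute is always quoted; escaping alone does not make an unquoted attribute safe.

```python
import html

def render_comment(author: str, text: str) -> str:
    """Build an HTML fragment from untrusted input (illustrative helper)."""
    safe_author = html.escape(author)  # also escapes quotes for attribute use
    safe_text = html.escape(text)
    return f'<p title="{safe_author}">{safe_text}</p>'

fragment = render_comment(
    'x" onmouseover="alert(1)',
    "<script>alert(1)</script>",
)
print(fragment)
# <p title="x&quot; onmouseover=&quot;alert(1)">&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

The browser displays the attacker's text literally instead of running it, because no raw angle bracket or quote reaches the markup.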
Named vs. Numeric Entities
HTML supports both named entities (`&amp;`, `&lt;`, `&copy;`) and numeric entities (`&#38;`, `&#60;`, `&#169;`). Named entities are more readable but limited to a predefined set. Numeric entities can represent any Unicode character using decimal (`&#8364;` for €) or hexadecimal (`&#x20AC;` for €) notation.
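The equivalence of the three forms can be checked with `html.unescape`, and a numeric reference can be built from a character's code point. The helper name `to_numeric_entity` is an assumption for this sketch.

```python
import html

# Named, decimal, and hexadecimal references all decode to the same character.
print(html.unescape("&euro;"))    # €
print(html.unescape("&#8364;"))   # €
print(html.unescape("&#x20AC;"))  # €

def to_numeric_entity(ch: str, hexadecimal: bool = False) -> str:
    """Return the numeric character reference for a single character."""
    code_point = ord(ch)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"

print(to_numeric_entity("€"))                    # &#8364;
print(to_numeric_entity("€", hexadecimal=True))  # &#x20AC;
print(to_numeric_entity("©"))                    # &#169;
```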
Encoding vs. Escaping
The terms are often used interchangeably, but there is a subtle difference. Encoding transforms data for transmission in a specific format. Escaping prevents special characters from being interpreted as syntax. In practice, HTML entity encoding serves both purposes — it makes text safe for HTML contexts and preserves the original characters for display.
Combining Encodings
Real-world data often passes through multiple encoding layers. A user submits a search query containing an ampersand. The browser URL-encodes it for the HTTP request. The server decodes it and includes it in an HTML response using HTML entity encoding. If the response includes a JSON API call, the data might be JSON-escaped as well.
Each encoding layer must be applied and removed in the correct order. Mixing up the sequence — HTML-encoding before URL-encoding, or forgetting to decode one layer — produces garbled output that is difficult to debug.
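The search example can be traced step by step. This is a simplified sketch that plays both browser and server in one script; the parameter name `q` and the surrounding markup are assumptions.

```python
import html
import json
from urllib.parse import parse_qs, quote

user_input = "fish & chips"

# 1. Browser side: URL-encode the value for the request.
query_string = "q=" + quote(user_input, safe="")
print(query_string)  # q=fish%20%26%20chips

# 2. Server side: decode the URL layer to get raw text back.
decoded = parse_qs(query_string)["q"][0]

# 3. Encode for each output context separately, starting from the raw text.
html_fragment = "<p>Results for " + html.escape(decoded) + "</p>"
json_body = json.dumps({"query": decoded})
print(html_fragment)  # <p>Results for fish &amp; chips</p>
print(json_body)      # {"query": "fish & chips"}

# Wrong order: HTML-encoding before URL-encoding leaves "&amp;" in the raw
# text on the server, which would later be shown to the user literally.
wrong = parse_qs("q=" + quote(html.escape(user_input), safe=""))["q"][0]
print(wrong)  # fish &amp; chips
```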
Security Implications
Encoding is a first line of defense against injection attacks. SQL injection, XSS, and command injection all exploit situations where data is interpreted as code. Proper encoding ensures that data remains data, regardless of what characters it contains.
Context matters. HTML encoding protects against XSS in HTML contexts but not in JavaScript contexts. URL encoding protects URLs but not HTML attributes. Always encode for the specific context where the data will be used.
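To make the point concrete, here is the same string encoded for three different contexts with Python's standard library. The sample value is arbitrary. Note that the JSON form leaves the angle brackets and ampersand untouched, so on its own it offers no protection if the result is placed into HTML.

```python
import html
import json
from urllib.parse import quote

value = "Tom & Jerry's <show>"

for_html = html.escape(value)     # safe as HTML text or a quoted attribute
for_url = quote(value, safe="")   # safe as a URL component
for_json = json.dumps(value)      # a valid JSON string literal

print(for_html)  # Tom &amp; Jerry&#x27;s &lt;show&gt;
print(for_url)   # Tom%20%26%20Jerry%27s%20%3Cshow%3E
print(for_json)  # "Tom & Jerry's <show>"
```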
Practical Workflow
When working with text that crosses context boundaries, follow this checklist:
1. Identify the target context (URL, HTML, JSON, SQL)
2. Determine which characters are special in that context
3. Apply the appropriate encoding once, at the boundary
4. Decode only when transitioning back to a raw text context
5. Never trust that data is "already encoded": verify or re-encode at the boundary
Having reliable encoding and decoding tools readily available saves debugging time and prevents security vulnerabilities. Bookmark them, learn the common patterns, and apply them consistently.