The hierarchy in one sentence
Unicode is the standard that assigns a number to every character. UTF-8 is an encoding scheme that represents those numbers as bytes. ASCII is an older standard covering only 128 characters; those characters are the first 128 code points of Unicode, and UTF-8 encodes them with the same single bytes that ASCII uses.
ASCII: the original standard
ASCII (American Standard Code for Information Interchange) was defined in 1963. It maps 128 characters to numbers 0–127 — the English alphabet, digits, punctuation, and control characters. Every "A" is the number 65. Every "a" is 97.
The limitation: 128 characters is only enough for English. No accented letters (é, ü, ñ), no Chinese, Arabic, Hebrew, or any other writing system. No emoji.
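As a small illustration of the mapping and its limit, the following Python sketch checks the two values quoted above and shows what happens when a non-ASCII character is forced into ASCII. The variable names are illustrative only.

```python
# Every ASCII character maps to a number in the range 0-127.
a_upper = ord("A")  # 65
a_lower = ord("a")  # 97

# Plain English text encodes to exactly one byte per character.
ascii_bytes = "Hello".encode("ascii")
print(list(ascii_bytes))  # [72, 101, 108, 108, 111]

# A character outside the 128-character range cannot be encoded as ASCII.
try:
    "caf\u00e9".encode("ascii")  # "café"
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(ascii_failed)  # True
```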
Unicode: the universal standard
Unicode was created in 1991 to assign a unique number, called a code point, to every character in every writing system. As of version 15.1 (2023), the standard defines over 149,000 characters across 161 scripts, including emoji.
Examples:
- A → U+0041
- é → U+00E9
- 中 → U+4E2D
- 😀 → U+1F600
Unicode defines the characters — it does not define how to store them as bytes in a file or transmit them over a network. That is what encoding schemes like UTF-8 are for.
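The code points listed above can be checked with Python's built-in ord and chr. This is a minimal sketch; the helper name code_point is made up for the example.

```python
def code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# Escapes are used so the example does not depend on the file's own encoding.
examples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, é, 中, 😀

for ch in examples:
    print(ch, code_point(ch))

# chr() goes the other way: from a number back to the character.
round_trip = chr(0x4E2D)  # 中
```

Note that nothing here involves bytes yet: a code point is just an integer, which is why a separate encoding step is needed before text can be stored or transmitted.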
UTF-8: the dominant encoding
UTF-8 is a variable-width encoding for Unicode. It represents each character using 1 to 4 bytes:
- ASCII characters (U+0000 to U+007F): 1 byte — identical to ASCII representation
- Most Latin-script accented letters, Greek, Cyrillic: 2 bytes
- Most East Asian characters: 3 bytes
- Emoji, rare characters: 4 bytes
The genius of UTF-8: it is fully backward-compatible with ASCII. Any file that only contains ASCII characters is a valid UTF-8 file. This is why UTF-8 became the universal encoding for the web — it works for every language while adding no overhead for English text.
UTF-8 is the encoding for virtually all modern web content: HTML pages, JSON, CSS, most source code files, and most databases. If in doubt, use UTF-8.
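The byte widths and the ASCII compatibility described in this section can be observed directly. The sketch below assumes one sample character per width class; the dictionary keys are labels chosen for the example.

```python
samples = {
    "ascii": "A",            # U+0041
    "latin": "\u00e9",       # é, U+00E9
    "cjk": "\u4e2d",         # 中, U+4E2D
    "emoji": "\U0001F600",   # 😀, U+1F600
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)  # {'ascii': 1, 'latin': 2, 'cjk': 3, 'emoji': 4}

# The actual bytes, in hex, for the non-ASCII samples.
print(samples["latin"].encode("utf-8").hex())  # c3a9
print(samples["cjk"].encode("utf-8").hex())    # e4b8ad
print(samples["emoji"].encode("utf-8").hex())  # f09f9880

# Backward compatibility: ASCII-only text produces identical bytes either way.
same_bytes = "Hello, world".encode("ascii") == "Hello, world".encode("utf-8")
print(same_bytes)  # True
```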
UTF-16 and UTF-32
Other Unicode encodings exist:
- UTF-16: uses 2 or 4 bytes per character. Used internally by Windows, Java, and JavaScript strings. Compact for East Asian text, wasteful for ASCII.
- UTF-32: always 4 bytes per character. Simple but wastes space. Rarely used in practice.
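To make the size trade-offs concrete, here is a rough comparison of the three encodings on a few short strings. The helper encoded_sizes is hypothetical, and the little-endian codec names are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of `text` under each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

# ASCII text: UTF-8 is the most compact.
print(encoded_sizes("hello"))
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}

# Two CJK characters (中文): UTF-16 is the most compact.
print(encoded_sizes("\u4e2d\u6587"))
# {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}

# One emoji (😀): all three need 4 bytes.
print(encoded_sizes("\U0001F600"))
# {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```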
Why encoding errors happen: the mismatch
An encoding error occurs when text written in one encoding is read in another. Common scenarios:
- CSV file saved as Latin-1 (ISO-8859-1) opened in a tool expecting UTF-8: accented characters like é (byte E9 in Latin-1) are not valid UTF-8 on their own, so they show up as the replacement character (�) or cause a decoding error
- MySQL table set to latin1 but receiving UTF-8 data: non-ASCII characters are stored incorrectly
- Email with no Content-Type charset declaration: mail clients guess the encoding and sometimes guess wrong
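The first scenario can be reproduced in a few lines. This sketch simulates a Latin-1 file in memory instead of reading a real CSV, and the variable names are illustrative.

```python
# Bytes as they would sit in a file saved as Latin-1: "café" ends in 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")
print(latin1_bytes)  # b'caf\xe9'

# A strict UTF-8 decoder rejects the lone 0xE9 byte.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decoder substitutes U+FFFD, the replacement character.
lenient = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the file was really written in recovers the text.
correct = latin1_bytes.decode("latin-1")

print(strict_failed, repr(lenient), correct)
```

Whether a given tool raises an error or silently shows the replacement character depends on the tool, but in both cases the original byte is not interpreted as é.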
Mojibake: the encoding error you have seen
"Mojibake" (文字化け) is the Japanese term for garbled text from encoding mismatch. The classic example: é encoded as UTF-8 (0xC3 0xA9) read as Windows-1252 produces Ã©.
If you see Ã© where é should be, or Â where nothing should be, the text was encoded as UTF-8 but decoded as Latin-1 or Windows-1252.
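The UTF-8-read-as-Windows-1252 mismatch can be reproduced, and in simple cases reversed, in Python. This is a sketch under the assumption that the garbled text went through exactly one wrong decode; text damaged several times, or containing bytes that Windows-1252 leaves undefined, may not be recoverable this way.

```python
original = "\u00e9"  # é

# Encode correctly as UTF-8, then decode with the wrong encoding.
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)  # Ã©

# A non-breaking space (U+00A0) is 0xC2 0xA0 in UTF-8, which is where
# the stray "Â" in front of spaces comes from.
stray = "\u00a0".encode("utf-8").decode("cp1252")
print(repr(stray))  # 'Â\xa0'

# Reversing the two steps restores the original character.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired)  # é
```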
How to fix encoding problems
- In HTML: always include <meta charset="UTF-8"> in the <head>
- In MySQL: set the database, table, and column charset to utf8mb4 (not utf8, which is MySQL's broken 3-byte implementation that can't store emoji)
- For CSV files: specify UTF-8 when opening in Excel (Import → File Origin → UTF-8)
- In Python: use open(file, encoding='utf-8') explicitly
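For the Python item, a minimal round trip might look like the following. The file name sample.txt and the temporary directory are assumptions made so the example cleans up after itself.

```python
import os
import tempfile

text = "caf\u00e9 \u4e2d \U0001F600"  # café 中 😀

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name

    # Name the encoding on write...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and on read, instead of relying on the platform default.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # On disk the file holds the UTF-8 bytes: 5 + 1 + 3 + 1 + 4 = 14 bytes.
    size_on_disk = os.path.getsize(path)

print(round_trip == text, size_on_disk)  # True 14
```

Passing encoding explicitly matters because the default used by open can vary with the platform and locale, so code that works on one machine may produce mojibake on another.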
Summary
- ASCII: 128 characters, English only, the original standard
- Unicode: 149,000+ characters from all writing systems — assigns numbers, not bytes
- UTF-8: the dominant encoding for Unicode — variable width, backward-compatible with ASCII, use this by default
- Encoding errors are caused by reading text with the wrong encoding — the fix is declaring the correct encoding explicitly everywhere