
UTF-8, ASCII, Unicode: What's the Difference and Why Does It Matter?

The mysterious box characters, question marks, and garbled accented letters come from encoding mismatches. Here's what each term means and how to fix the problems they cause.

The hierarchy in one sentence

Unicode is the standard that assigns a number to every character. UTF-8 is an encoding scheme that represents those numbers as bytes. ASCII is an older standard that covers only 128 characters; those same 128 characters are the first 128 code points of Unicode and are encoded as the same single bytes in UTF-8.

ASCII: the original standard

ASCII (American Standard Code for Information Interchange) was defined in 1963. It maps 128 characters to numbers 0–127 — the English alphabet, digits, punctuation, and control characters. Every "A" is the number 65. Every "a" is 97.
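
To make the mapping concrete, here is a small Python sketch. The helper name is_ascii is hypothetical and used only for illustration; ord() and chr() are Python built-ins.

```python
# ord() returns the number behind a character; chr() goes the other way.
print(ord("A"))  # 65
print(ord("a"))  # 97
print(chr(65))   # A


def is_ascii(text: str) -> bool:
    """Return True if every character falls in the ASCII range 0-127."""
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("hello"))      # True
print(is_ascii("caf\u00e9"))  # False, because é is number 233
```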

The limitation: 128 characters is only enough for English. No accented letters (é, ü, ñ), no Chinese, Arabic, Hebrew, or any other writing system. No emoji.

Unicode: the universal standard

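Unicode 1.0 was published in 1991 with the goal of assigning a unique number, called a code point, to every character in every writing system. As of version 15.1 (released in 2023), the standard defines over 149,000 characters across 161 scripts, emoji included. Newer versions keep adding characters, so check the Unicode Consortium's site for the latest counts.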

Examples:

  • A → U+0041
  • é → U+00E9
  • 中 → U+4E2D
  • 😀 → U+1F600
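
The code points listed above can be checked from Python. This is a minimal sketch; the function name code_point is an assumption made for the example.

```python
def code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    # Prints each character next to its code point, e.g. "A U+0041".
    print(ch, code_point(ch))
```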

Unicode defines the characters — it does not define how to store them as bytes in a file or transmit them over a network. That is what encoding schemes like UTF-8 are for.

UTF-8: the dominant encoding

UTF-8 is a variable-width encoding for Unicode. It represents each character using 1 to 4 bytes:

  • ASCII characters (U+0000 to U+007F): 1 byte — identical to ASCII representation
  • Most Latin-script accented letters, Greek, Cyrillic: 2 bytes
  • Most East Asian characters: 3 bytes
  • Emoji, rare characters: 4 bytes
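
One way to see these widths is to encode a character from each group and count the bytes. The helper utf8_width is a hypothetical name used only for this sketch.

```python
def utf8_width(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


print(utf8_width("A"))           # 1
print(utf8_width("\u00e9"))      # 2 (é)
print(utf8_width("\u4e2d"))      # 3 (中)
print(utf8_width("\U0001F600"))  # 4 (😀)
```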

The genius of UTF-8: it is fully backward-compatible with ASCII. Any file that only contains ASCII characters is a valid UTF-8 file. This is why UTF-8 became the universal encoding for the web — it works for every language while adding no overhead for English text.
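
A quick way to convince yourself of the backward compatibility is to encode the same ASCII-only text both ways and compare the bytes. The variable names below are illustrative only.

```python
message = "Hello, world!"

ascii_bytes = message.encode("ascii")
utf8_bytes = message.encode("utf-8")

# For ASCII-only text the two encodings produce identical bytes.
print(ascii_bytes == utf8_bytes)  # True

# Bytes written by an ASCII-only program decode cleanly as UTF-8.
print(ascii_bytes.decode("utf-8"))  # Hello, world!
```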

UTF-8 is the encoding for virtually all modern web content: HTML pages, JSON, CSS, most source code files, and most databases. If in doubt, use UTF-8.

UTF-16 and UTF-32

Other Unicode encodings exist:

  • UTF-16: uses 2 or 4 bytes per character. Used internally by Windows, Java, and JavaScript strings. Compact for East Asian text, wasteful for ASCII.
  • UTF-32: always 4 bytes per character. Simple but wastes space. Rarely used in practice.
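
The size trade-offs can be compared directly. This sketch uses Python's "utf-16-le" and "utf-32-le" codecs so that no byte order mark is added to the count; encoded_sizes is a hypothetical helper name.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


print(encoded_sizes("hello"))
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}

print(encoded_sizes("\u4e2d\u6587"))  # 中文
# {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```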

Why encoding errors happen: the mismatch

An encoding error occurs when text written in one encoding is read in another. Common scenarios:

  • CSV file saved as Latin-1 (ISO-8859-1) opened in a tool expecting UTF-8: accented characters like é (the single byte 0xE9 in Latin-1) are not valid UTF-8 on their own, so they show up as the replacement character � or cause a decode error
  • MySQL table set to latin1 but receiving UTF-8 data: characters above ASCII are stored incorrectly
  • Email with no Content-Type charset declaration: mail clients guess the encoding and sometimes guess wrong
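
The first scenario can be reproduced in a few lines of Python. This is a sketch of the failure mode, not of any particular tool's behavior; the sample word and variable names are assumptions.

```python
# "café" saved as Latin-1: the é becomes the single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")
print(latin1_bytes)  # b'caf\xe9'

# A strict UTF-8 reader rejects the stray 0xE9 byte.
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
print(strict_decode_failed)  # True

# A lenient reader substitutes U+FFFD, the replacement character.
lenient = latin1_bytes.decode("utf-8", errors="replace")
print(lenient == "caf\ufffd")  # True
```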

Mojibake: the encoding error you have seen

"Mojibake" (文字化け) is the Japanese term for garbled text from encoding mismatch. The classic example: é encoded as UTF-8 (0xC3 0xA9) read as Windows-1252 produces Ã©.

If you see Ã© where é should be, or Â where nothing should be, the text was encoded as UTF-8 but decoded as Latin-1 or Windows-1252.
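
The sketch below reproduces this mojibake in Python and then reverses it. The function fix_mojibake is a hypothetical helper, and it assumes the damage was exactly one UTF-8 to Windows-1252 mix-up. It returns the input unchanged when that assumption does not hold, and it cannot recover text where bytes were already dropped or replaced.

```python
# Reproduce the error: UTF-8 bytes decoded with the wrong codec.
garbled = "\u00e9".encode("utf-8").decode("cp1252")
print(garbled)  # Ã©


def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8 text that was decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not look like this kind of mojibake; leave it alone.
        return text


print(fix_mojibake("caf\u00c3\u00a9"))  # café
print(fix_mojibake("hello"))            # hello
```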

How to fix encoding problems

  • In HTML: always include <meta charset="UTF-8"> in the <head>
  • In MySQL: set the database, table, and column charset to utf8mb4 (not utf8, which in MySQL has historically been an alias for utf8mb3, a 3-byte-per-character subset that can't store emoji)
  • For CSV files: specify UTF-8 when opening in Excel (Import → File Origin → UTF-8)
  • In Python: use open(file, encoding='utf-8') explicitly
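
For the Python case, a minimal round trip might look like the following. The file name sample.txt and the sample text are assumptions; the point is that both the write and the read name the encoding instead of relying on the platform default.

```python
import tempfile
from pathlib import Path

text = "caf\u00e9 \u4e2d \U0001F600"  # café 中 😀

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"

    # Write with an explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back with the same explicit encoding.
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == text)  # True
```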

Summary

  • ASCII: 128 characters, English only, the original standard
  • Unicode: 149,000+ characters from all writing systems — assigns numbers, not bytes
  • UTF-8: the dominant encoding for Unicode — variable width, backward-compatible with ASCII, use this by default
  • Encoding errors are caused by reading text with the wrong encoding — the fix is declaring the correct encoding explicitly everywhere

Need to encode text for safe transport? The Base64 encoder and URL encoder handle the most common encoding scenarios.
