File Encoding Guide: ASCII, UTF-8, UTF-16 Principles & Differences

🔤 What is File Encoding

File encoding is the fundamental mechanism by which computers store and process text characters. Simply put, encoding is a rule system that converts human-readable characters into numbers (binary) that computers can understand.

Imagine a computer as a giant storage cabinet where each compartment can only hold a 0 or a 1. When we want to store the letter "A" or the Chinese character "中", we need a set of rules that determines which numbers represent these characters.

💡 The Essence of Encoding

Encoding = Character ↔ Number Mapping Relationship

Characters: Human-readable symbols (A, 中, @, 😊)
Encoding values: Corresponding numeric codes
Binary: The actual 0s and 1s stored by computers
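The character, number, and binary layers can be seen with Python's built-in ord, chr, and format. This is a minimal sketch; the helper name describe is made up for illustration. The number shown is the abstract character code, and the bytes actually stored depend on the encoding chosen.

```python
def describe(ch: str) -> tuple[int, str]:
    """Return the numeric code of a character and that number in binary."""
    code = ord(ch)              # character -> number
    bits = format(code, "08b")  # number -> binary digits, padded to at least 8
    return code, bits

code, bits = describe("A")
print(code, bits)   # 65 01000001
print(chr(code))    # A  (number -> character)
```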

🔢 ASCII Encoding Explained

ASCII Encoding Principles

ASCII (American Standard Code for Information Interchange) is one of the earliest widely adopted character encoding standards. It uses 7-bit binary numbers to represent characters, which allows for 128 different characters.

ASCII Encoding Example for Character 'A'

Character: A

ASCII Code: 65

Binary Representation:

01000001

Storage Method: Occupies 1 byte (8 bits) in computer memory, with 7 effective bits

ASCII Encoding Characteristics:

Each character occupies 1 byte (8 bits)
Value range: 0-127
Can only represent English letters, digits, basic punctuation, and control characters
Cannot represent accented Latin letters (é, ñ), Chinese, Japanese, or any other non-English characters
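These characteristics can be checked with Python's built-in "ascii" codec. The sketch below assumes a hypothetical helper, ascii_bits, that returns the stored bits for each character, or None when the text falls outside the 0-127 range.

```python
def ascii_bits(text: str):
    """Return one 8-bit string per character, or None if not ASCII."""
    try:
        data = text.encode("ascii")  # fails for any code above 127
    except UnicodeEncodeError:
        return None
    return [format(b, "08b") for b in data]

print(ascii_bits("A"))   # ['01000001']
print(ascii_bits("中"))  # None
```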

🌐 UTF-8 Encoding Mechanism

UTF-8 Variable-Length Encoding

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. It is backward compatible with ASCII (any valid ASCII text is also valid UTF-8) and can represent every Unicode character.

UTF-8 Encoding Example for Chinese Character '中'

Character: 中

Unicode Code Point: U+4E2D (Decimal: 20013)

UTF-8 Encoding:

Binary: 11100100 10111000 10101101

Hex: E4 B8 AD

Storage Analysis:

Occupies 3 bytes (24 bits)
1st byte: 11100100 - Identifies the start of a 3-byte character
2nd byte: 10111000 - Continuation byte
3rd byte: 10101101 - Continuation byte
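The byte breakdown above can be reproduced in Python. The sketch also strips the marker bits (1110 from the lead byte, 10 from each continuation byte) to show that the remaining payload bits reassemble into the code point. It assumes Python 3.8 or later for the separator argument of bytes.hex.

```python
ch = "中"
encoded = ch.encode("utf-8")
bit_groups = [format(b, "08b") for b in encoded]

print(encoded.hex(" ").upper())  # E4 B8 AD
print(bit_groups)                # ['11100100', '10111000', '10101101']

# Drop the 1110 prefix of the lead byte and the 10 prefix of each continuation byte.
payload = bit_groups[0][4:] + bit_groups[1][2:] + bit_groups[2][2:]
print(payload)                   # 0100111000101101
print(hex(int(payload, 2)))      # 0x4e2d
```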

UTF-8 Encoding Rules

Character Range	Byte Count	Binary Format	Examples
U+0000 - U+007F	1 byte	0xxxxxxx	A (ASCII compatible)
U+0080 - U+07FF	2 bytes	110xxxxx 10xxxxxx	é, ñ
U+0800 - U+FFFF	3 bytes	1110xxxx 10xxxxxx 10xxxxxx	中, 日, 한
U+10000 - U+10FFFF	4 bytes	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	😊, 𝕏
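The table translates fairly directly into bit operations. The following hand-written encoder is an illustrative sketch only (the function name utf8_encode is made up for this example); in practice str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the UTF-8 range table."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 11110xxx + 3 continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(0x4E2D).hex())  # e4b8ad
```

Each branch keeps the low 6 bits for a continuation byte and shifts the rest toward the lead byte, which is why the payload widths are 7, 11, 16, and 21 bits.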

🔄 UTF-16 Encoding Principles

UTF-16's Fixed and Variable Length Combination

UTF-16 uses one 16-bit code unit (2 bytes) for characters in the Basic Multilingual Plane (U+0000 to U+FFFF). For characters beyond that plane, it uses a surrogate pair of two code units (4 bytes).

UTF-16 Encoding Example for Chinese Character '中'

Character: 中

Unicode Code Point: U+4E2D

UTF-16 Encoding:

Binary: 01001110 00101101

Hex: 4E 2D (big-endian byte order; little-endian UTF-16 stores the same character as 2D 4E)

Storage Analysis:

Occupies 2 bytes (16 bits)
For characters in the Basic Multilingual Plane, the code unit equals the Unicode code point value
Saves 1 byte compared to UTF-8
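The difference between a single code unit and a surrogate pair can be sketched as follows. The function name utf16_units is hypothetical; it assumes its input is a valid code point that is not itself a surrogate value.

```python
def utf16_units(code_point: int) -> list[int]:
    """Return the 16-bit code unit(s) UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]               # BMP: the code point is the code unit
    offset = code_point - 0x10000         # 20 bits remain after subtracting
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # low 10 bits -> low surrogate
    return [high, low]

print([f"{u:04X}" for u in utf16_units(ord("中"))])   # ['4E2D']
print([f"{u:04X}" for u in utf16_units(ord("😊"))])  # ['D83D', 'DE0A']
```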

⚖️ Encoding Comparison Analysis

Storage Space Comparison

Character Type	Example	ASCII	UTF-8	UTF-16
English Letters	A	1 byte	1 byte	2 bytes
Chinese Characters	中	Not supported	3 bytes	2 bytes
Emoji	😊	Not supported	4 bytes	4 bytes
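The byte counts in this table can be checked with a short Python sketch. The helper name byte_counts is made up for illustration. It uses the "utf-16-le" codec because Python's plain "utf-16" codec prepends a 2-byte byte order mark, which would inflate the counts.

```python
def byte_counts(ch: str) -> dict:
    """Bytes needed for one character in each encoding (None = not supported)."""
    counts = {}
    for name in ("ascii", "utf-8", "utf-16-le"):
        try:
            counts[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            counts[name] = None
    return counts

for ch in ("A", "中", "😊"):
    print(ch, byte_counts(ch))
```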

Encoding Characteristics Summary

🎯 Encoding Selection Recommendations

UTF-8: First choice for web pages, APIs, and cross-platform applications
UTF-16: Commonly used as the internal string representation in Windows APIs, Java, and .NET applications
ASCII: Only suitable for pure English environments

🛠️ Practical Applications

Common Encoding Issues

🚨 Causes of Garbled Text

Encoding and decoding using different character sets
Incorrect encoding settings when saving files
Web pages not properly declaring character encoding
Encoding information lost or changed during data transmission
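The first cause, encoding with one character set and decoding with another, is easy to reproduce. In this sketch the UTF-8 bytes of "中" are decoded as Latin-1, chosen here only as an example of a wrong decoder, which turns the three bytes into three unrelated characters.

```python
original = "中"
raw = original.encode("utf-8")       # 3 bytes: E4 B8 AD
garbled = raw.decode("latin-1")      # wrong decoder: one character per byte

print(len(garbled))                  # 3
print(garbled == original)           # False

# As long as the bytes were not altered, reversing the wrong step recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)          # True
```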

Solutions

Consistently use UTF-8 encoding
Properly declare the charset in HTML, for example with <meta charset="utf-8">
Specify encoding when connecting to databases
Use professional encoding conversion tools
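For files, using UTF-8 consistently usually means passing the encoding explicitly instead of relying on a platform default. The sketch below assumes a hypothetical helper, roundtrip, that writes text to a temporary file and reads it back with the same encoding.

```python
import os
import tempfile

def roundtrip(text: str, encoding: str = "utf-8") -> str:
    """Write text to a temporary file and read it back with the same encoding."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:   # explicit, not the OS default
            f.write(text)
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)

print(roundtrip("A 中 😊") == "A 中 😊")  # True
```

Reading the same file with a different encoding argument is what produces garbled text, so the writer and the reader need to agree.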
