🔤 What is File Encoding
File encoding is the fundamental mechanism by which computers store and process text characters. Simply put, encoding is a rule system that converts human-readable characters into numbers (binary) that computers can understand.
Imagine a computer as a giant storage cabinet where each compartment can hold only a 0 or a 1. To store the letter "A" or the Chinese character "中", we need a set of rules that determines which sequence of 0s and 1s represents each character.
💡 The Essence of Encoding
Encoding = Character ↔ Number Mapping Relationship
- Characters: Human-readable symbols (A, 中, @, 😊)
- Encoding values: Corresponding numeric codes
- Binary: The actual 0s and 1s stored by computers
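The character-to-number-to-binary mapping can be observed directly in Python. This is a minimal sketch; `describe` is a hypothetical helper name, and the numbers shown are Unicode code points, which coincide with ASCII codes for values 0-127.

```python
def describe(ch):
    """Return (code point, binary digits) for a single character."""
    code = ord(ch)                   # character -> number
    return code, format(code, "b")   # number -> binary digits

for ch in ["A", "中", "@", "😊"]:
    code, bits = describe(ch)
    print(f"U+{code:04X}", code, bits)

# The mapping works in both directions: chr() turns a number back into a character.
print(chr(65) == "A")  # True
```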
🔢 ASCII Encoding Explained
ASCII Encoding Principles
ASCII (American Standard Code for Information Interchange) is one of the earliest widely adopted character encoding standards. It uses 7-bit binary numbers to represent characters, so it can represent 128 different characters.
ASCII Encoding Example for Character 'A'
Character: A
ASCII Code: 65
Binary Representation: 01000001
Storage Method: Occupies 1 byte (8 bits) in computer memory, with 7 effective bits
ASCII Encoding Characteristics:
- Each character is normally stored in 1 byte (8 bits), with the highest bit set to 0
- Value range: 0-127
- Can only represent English letters, digits, basic punctuation, and control characters
- Cannot represent Chinese, Japanese, or other non-Latin characters
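The characteristics above can be checked with Python's built-in `ascii` codec. This is a small sketch; `to_ascii_bits` is a hypothetical helper name introduced for illustration.

```python
def to_ascii_bits(ch):
    """Encode one character as ASCII and return its byte as 8 binary digits."""
    data = ch.encode("ascii")      # raises UnicodeEncodeError outside 0-127
    return format(data[0], "08b")  # pad to the full 8-bit byte

print(to_ascii_bits("A"))  # 01000001

try:
    to_ascii_bits("中")
except UnicodeEncodeError as exc:
    print("not representable in ASCII:", exc.reason)
```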
🌐 UTF-8 Encoding Mechanism
UTF-8 Variable-Length Encoding
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. It is backward compatible with ASCII (every ASCII file is also valid UTF-8) and can represent every Unicode code point.
UTF-8 Encoding Example for Chinese Character '中'
Character: 中
Unicode Code Point: U+4E2D (Decimal: 20013)
UTF-8 Encoding: E4 B8 AD (binary: 11100100 10111000 10101101)
Storage Analysis:
- Occupies 3 bytes (24 bits)
- 1st byte: 11100100 - Identifies the start of a 3-byte character
- 2nd byte: 10111000 - Continuation byte
- 3rd byte: 10101101 - Continuation byte
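To see how these three bytes carry the code point U+4E2D, the marker bits can be stripped and the payload bits reassembled. The sketch below handles only the 3-byte case; `decode_three_byte` is a hypothetical name, and real code would simply call `bytes.decode("utf-8")`.

```python
def decode_three_byte(b1, b2, b3):
    """Rebuild a code point from a 3-byte UTF-8 sequence."""
    if b1 >> 4 != 0b1110:
        raise ValueError("lead byte must look like 1110xxxx")
    if b2 >> 6 != 0b10 or b3 >> 6 != 0b10:
        raise ValueError("continuation bytes must look like 10xxxxxx")
    # 4 payload bits from the lead byte, 6 from each continuation byte
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

raw = "中".encode("utf-8")
print(raw.hex(" "))                  # e4 b8 ad
print(hex(decode_three_byte(*raw)))  # 0x4e2d
```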
UTF-8 Encoding Rules
Character Range | Byte Count | Binary Format | Examples |
---|---|---|---|
U+0000 - U+007F | 1 byte | 0xxxxxxx | A (ASCII compatible) |
U+0080 - U+07FF | 2 bytes | 110xxxxx 10xxxxxx | é, ñ |
U+0800 - U+FFFF | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx | 中, 日, 한 |
U+10000 - U+10FFFF | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 😊, 𝕏 |
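The table translates almost line for line into code. The following is an illustrative sketch, not a replacement for `str.encode("utf-8")`; `utf8_encode` is a hypothetical function name, and it rejects surrogate code points (U+D800 to U+DFFF), which are not valid in UTF-8.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes, following the table above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp <= 0x7F:                     # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                    # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in "Aé中😊":
    # Each result should match Python's built-in encoder.
    print(utf8_encode(ord(ch)).hex(" "), utf8_encode(ord(ch)) == ch.encode("utf-8"))
```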
🔄 UTF-16 Encoding Principles
UTF-16's Fixed and Variable Length Combination
UTF-16 uses one 16-bit code unit (2 bytes) for each character in the Basic Multilingual Plane (U+0000 to U+FFFF). Characters beyond that plane are encoded as a surrogate pair: two 16-bit code units, 4 bytes in total.
UTF-16 Encoding Example for Chinese Character '中'
Character: 中
Unicode Code Point: U+4E2D
UTF-16 Encoding: 4E 2D (binary: 01001110 00101101, shown in big-endian byte order)
Storage Analysis:
- Occupies 2 bytes (16 bits)
- The 16-bit value equals the Unicode code point (true for every character in the Basic Multilingual Plane)
- Saves 1 byte compared to UTF-8
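The surrogate pair mechanism for characters above U+FFFF can be sketched as follows. `utf16_units` is a hypothetical helper name; it returns 16-bit code units rather than bytes, so byte order (big-endian or little-endian) is left out of the picture, and it assumes the input is a valid non-surrogate code point.

```python
def utf16_units(cp):
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if cp < 0x10000:
        return [cp]                    # BMP: the code unit equals the code point
    v = cp - 0x10000                   # 20 bits remain after subtracting the offset
    high = 0xD800 | (v >> 10)          # top 10 bits -> high (lead) surrogate
    low = 0xDC00 | (v & 0x3FF)         # bottom 10 bits -> low (trail) surrogate
    return [high, low]

print([hex(u) for u in utf16_units(ord("中"))])   # ['0x4e2d']
print([hex(u) for u in utf16_units(ord("😊"))])   # ['0xd83d', '0xde0a']
```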
⚖️ Encoding Comparison Analysis
Storage Space Comparison
Character Type | Example | ASCII | UTF-8 | UTF-16 |
---|---|---|---|---|
English Letters | A | 1 byte | 1 byte | 2 bytes |
Chinese Characters | 中 | Not supported | 3 bytes | 2 bytes |
Emoji | 😊 | Not supported | 4 bytes | 4 bytes |
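The sizes in this table can be reproduced with Python's built-in codecs. In this sketch, `byte_sizes` is a hypothetical helper; it uses the `utf-16-le` codec because the plain `utf-16` codec prepends a 2-byte byte order mark, which would inflate the count.

```python
def byte_sizes(ch):
    """Return the encoded size of ch in bytes per encoding (None if unsupported)."""
    sizes = {}
    for name in ("ascii", "utf-8", "utf-16-le"):
        try:
            sizes[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None         # the encoding cannot represent this character
    return sizes

for ch in ["A", "中", "😊"]:
    print(f"U+{ord(ch):04X}", byte_sizes(ch))
```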
Encoding Characteristics Summary
🎯 Encoding Selection Recommendations
- UTF-8: First choice for web pages, APIs, and cross-platform applications
- UTF-16: Used for in-memory strings by Windows APIs, Java, and .NET
- ASCII: Only suitable for pure English environments
🛠️ Practical Applications
Common Encoding Issues
🚨 Causes of Garbled Text
- Decoding text with a different encoding than the one used to encode it
- Saving files with the wrong encoding setting
- Web pages that do not declare their character encoding correctly
- Encoding information lost or changed while data is transmitted between systems
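The first cause is easy to reproduce. The sketch below encodes text as UTF-8 and decodes it with the wrong codec; Latin-1 is chosen here only as an illustrative wrong codec, and `misdecode` and `repair` are hypothetical helper names. It also shows that a mismatch is sometimes reversible, while decoding with `errors="replace"` destroys information.

```python
def misdecode(text, wrong="latin-1"):
    """Encode as UTF-8, then decode with the wrong codec to produce garbled text."""
    return text.encode("utf-8").decode(wrong)

def repair(garbled, wrong="latin-1"):
    """Undo misdecode() by reversing the wrong step, then decoding as UTF-8."""
    return garbled.encode(wrong).decode("utf-8")

original = "中文"
garbled = misdecode(original)
print(ascii(garbled))                # escaped form of the garbled string
print(repair(garbled) == original)   # True

# Lossy case: invalid bytes become U+FFFD and the original cannot be recovered.
lossy = original.encode("utf-8").decode("ascii", errors="replace")
print(ascii(lossy))
```

Repair of this kind works only when the wrong codec maps every byte to some character, as Latin-1 does; once replacement characters appear, the original bytes are gone.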
Solutions
- Consistently use UTF-8 encoding
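- Declare the charset in HTML, for example with <meta charset="utf-8"> near the top of the page
- Specify the encoding explicitly when connecting to databases
- Use a dedicated conversion tool (such as iconv) to re-encode existing files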
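In code, "consistently use UTF-8" mostly means naming the encoding on every read and write instead of relying on the platform default. The sketch below re-encodes a file to UTF-8; `convert_encoding` and the file names are hypothetical, and GBK is used only as an example of a legacy source encoding.

```python
import tempfile
from pathlib import Path

def convert_encoding(src, dst, src_encoding, dst_encoding="utf-8"):
    """Re-encode a text file, naming both encodings explicitly."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)

with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"     # hypothetical file saved as GBK
    fixed = Path(tmp) / "fixed.txt"
    legacy.write_bytes("中文".encode("gbk"))
    convert_encoding(legacy, fixed, src_encoding="gbk")
    print(fixed.read_bytes().hex(" "))    # e4 b8 ad e6 96 87
```

The source encoding has to be known or determined beforehand; a conversion that guesses wrong produces the garbled text described above.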