Unicode is the universal character encoding standard. It aims to assign a unique code point (numeric identifier) to every character in the world's writing systems. Unicode 15.1 (released September 2023) defines 149,813 characters across 161 scripts, covering Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Chinese (CJK Unified Ideographs), Japanese (Hiragana, Katakana, Kanji), Korean (Hangul), and Thai, along with emoji, mathematical symbols, musical notation, and many historic scripts.
Unicode code points are written as U+ followed by four to six hexadecimal digits: U+XXXX for Basic Multilingual Plane characters (U+0000 to U+FFFF), and five or six digits for supplementary plane characters (U+10000 to U+10FFFF). For example, the Latin letter 'A' is U+0041, the Chinese character for 'water' is U+6C34, the Arabic letter 'alef' is U+0627, and the grinning face emoji is U+1F600. In programming, Unicode characters are represented using escape sequences: JavaScript and Java use \uXXXX (with surrogate pairs for supplementary characters; JavaScript also accepts \u{XXXXX}), Python uses \uXXXX and \UXXXXXXXX, and HTML uses numeric character references like &#65; (decimal) or &#x41; (hexadecimal).
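The notation above can be reproduced in a few lines of Python. This is a minimal sketch, not the converter's own implementation: the function names to_code_points and to_surrogate_pair are hypothetical, chosen only for illustration. It formats each character as U+XXXX and splits a supplementary code point into the UTF-16 surrogate pair that a \uXXXX\uXXXX escape in JavaScript or Java would need.

```python
def to_code_points(text: str) -> list[str]:
    """Return each character's code point in U+XXXX notation (hypothetical helper)."""
    # :04X pads to at least four hex digits; supplementary characters get five or six.
    return [f"U+{ord(ch):04X}" for ch in text]


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# 'A', 'water', 'alef', and the grinning face, written with Python escapes.
sample = "A\u6c34\u0627\U0001F600"
print(to_code_points(sample))

high, low = to_surrogate_pair(0x1F600)
print(f"\\u{high:04X}\\u{low:04X}")
```

Python strings are sequences of code points, so ord() returns the full value even for the emoji. Languages whose strings are built on UTF-16 code units see that character as two units, which is why the surrogate split matters there.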
Understanding Unicode code points is essential for debugging encoding issues in software development. Mojibake (garbled text like "Ã©" instead of "é") occurs when text encoded in one character set (like UTF-8) is decoded using a different character set (like Latin-1). In that example, the two UTF-8 bytes of "é" (0xC3 0xA9) are each read as a separate Latin-1 character. By converting text to code points, you can identify exactly which characters are present and diagnose whether the issue is in encoding, decoding, font rendering, or data transmission. This converter supports all 17 Unicode planes and handles surrogate pairs for supplementary characters, making it a comprehensive tool for Unicode analysis and debugging.
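The UTF-8 versus Latin-1 mismatch described above can be reproduced and inspected directly. The following is an illustrative sketch under the assumption that the damage came from exactly one wrong decode step; the helper name code_points is hypothetical and not part of this converter.

```python
def code_points(text: str) -> str:
    """Render a string as space-separated U+XXXX code points (hypothetical helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


original = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Simulate the bug: bytes written as UTF-8 are read back as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

print(code_points(original))  # one code point
print(code_points(garbled))   # two code points, one per UTF-8 byte

# If no bytes were lost, reversing the wrong step recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

Seeing U+00C3 followed by U+00A9 where a single U+00E9 was expected is a strong hint that UTF-8 bytes were decoded as Latin-1 somewhere in the pipeline. The round-trip repair only works when the garbled string still contains every original byte.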