Unicode is the international standard whose goal is to assign a single unique integer, called a code point, to every character needed by every written human language. It is the explicit aim of Unicode to replace traditional character encodings such as those defined by the ISO 8859 standard, which are used in various countries of the world but are largely incompatible with each other.
The California-based Unicode Consortium first published "The Unicode Standard" in 1991, and continues to develop standards based on that original work. Unicode was developed in conjunction with the International Organization for Standardization and it shares its character repertoire with ISO 10646. Unicode and ISO 10646 are equivalent as character encodings, but The Unicode Standard contains much more information for implementers, covering, in depth, topics such as bitwise encoding, collation, and rendering, and enumerating a multitude of character properties, including those needed for BiDi support. The two standards also have slightly different terminology, although efforts are being made to reconcile the differences.
Unicode reserves 1114112 (a little more than 2^20) code points, and currently assigns characters to more than 70000 of those code points. The first 256 code points precisely match those of ISO 8859-1, the most popular 8-bit character encoding in the "Western world"; as a result, the first 128 characters are also identical to ASCII.
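The figures in this paragraph can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative only, and it relies on Python's built-in "latin-1" and "ascii" codecs.

```python
# Size of the code space: 1114112 = 17 * 65536, so valid code points
# run from 0 to 0x10FFFF.
TOTAL_CODE_POINTS = 17 * 65536
MAX_CODE_POINT = TOTAL_CODE_POINTS - 1

# Decoding every possible byte value as ISO 8859-1 should yield
# exactly the code points 0..255, in order.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# Decoding the 7-bit values as ASCII should yield code points 0..127.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches = all(ord(ch) == i for i, ch in enumerate(ascii_text))

print(TOTAL_CODE_POINTS, hex(MAX_CODE_POINT), latin1_matches, ascii_matches)
```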
The Unicode code space for characters is divided into 17 "planes", each containing 65536 code points. The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode CJK characters.
Two more planes are used for "graphic" characters. Plane 1, the Supplementary Multilingual Plane (SMP), is mostly used for historic scripts, e.g. Egyptian hieroglyphs (not yet encoded), but is also used for music symbols. Plane 2, the Supplementary Ideographic Plane (SIP), is used for about 40000 rare Chinese characters that are mostly historic, although there are some modern ones.
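Because each plane holds 65536 (0x10000) code points, the plane of a character is simply its code point divided by 65536. A minimal sketch follows; the helper names plane_of and is_bmp are hypothetical, chosen for illustration.

```python
PLANE_SIZE = 0x10000  # 65536 code points per plane


def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) // PLANE_SIZE


def is_bmp(ch: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane."""
    return plane_of(ch) == 0


# A Latin letter, a CJK ideograph, a musical symbol, and a rare ideograph.
samples = ["A", "\u8449", "\U0001D11E", "\U00020000"]
for ch in samples:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```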
There is much controversy among Chinese language specialists about the desirability and technical merit of the "Han unification" process used to map multiple Chinese and Japanese character sets into a single set of unified glyphs.
The cap of ~2^20 code points exists in order to maintain compatibility with the UTF-16 encoding, which can only address that range (see below). The 10% utilisation of the Unicode code space suggests that this ~20 bit limit is unlikely to be reached in the near future.
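The limit follows from how UTF-16 represents code points beyond the first 65536: it uses a pair of 16-bit "surrogate" values, each carrying 10 bits, which gives 2^20 additional code points. The following is a minimal sketch of that arithmetic; the function names are hypothetical and not taken from any library.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000               # fits in 20 bits
    high = 0xD800 + (offset >> 10)      # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 65536 directly addressable code points plus 2^20 reachable via pairs.
ADDRESSABLE = 0x10000 + 1024 * 1024

high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X}", ADDRESSABLE)
```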
So far, we have said only that Unicode assigns a unique number to each character used by humans. How these numbers are stored in text processing is another matter; problems arise because most of the world's software was written to deal with 8-bit character encodings only, and Unicode support has been added only slowly in recent years.
The internal logic of most legacy software would typically use 8 bits for each character, making it impossible to use more than 256 code points without special processing. Several mechanisms have therefore been suggested to implement Unicode; which one is chosen depends on available storage space, source code compatibility, and interoperability with other systems.
strcmp for comparisons and trivial sorting) still work, because they operate on 8-bit values. (By contrast, to support the 16- or 32-bit encodings mentioned above, large parts of older software would have to be rewritten.) Third, for most texts that use relatively few non-ASCII characters (that is, texts in most Western languages), the encoding is very space-efficient because it will require only slightly more than 8 bits per character.
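The properties described here can be illustrated in Python. This sketch assumes the encoding in question is UTF-8; that is an assumption, since the text above does not name it. The sample sentence and word list are made up for illustration.

```python
# Assumption: the 8-bit-compatible encoding discussed above is UTF-8.
text = "Die Straße ist schön"  # German text with two non-ASCII letters

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

# Space efficiency: only the non-ASCII letters cost an extra byte.
bits_per_char_utf8 = 8 * len(utf8) / len(text)

# strcmp-style routines treat a zero byte as end of string. UTF-8 never
# produces one for ordinary text, while 16-bit encodings do.
utf8_has_zero_byte = 0 in utf8
utf16_has_zero_byte = 0 in utf16

# Byte-wise comparison of UTF-8 gives the same order as comparing
# code points, so trivial sorting keeps working.
words = ["zebra", "Äpfel", "apple", "葉", "éclair"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(len(utf8), len(utf16), len(utf32), bits_per_char_utf8)
print(by_code_point == by_utf8_bytes)
```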
The Unicode standard also includes a number of related items, such as character properties, text normalisation forms, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).
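Python's standard unicodedata module exposes some of this data, which makes the three items concrete. The sketch below is minimal and the sample characters are chosen only for illustration.

```python
import unicodedata

# Character properties: official name and general category.
delta_name = unicodedata.name("\u0394")      # Greek capital letter Delta
delta_category = unicodedata.category("\u0394")

# Normalisation: "e" followed by a combining acute accent and the single
# precomposed character are canonically equivalent.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
round_trip = unicodedata.normalize("NFD", composed)

# Bidirectional classes used when working out display order:
# "L" = left-to-right, "R" = right-to-left, "AL" = Arabic letter.
bidi_latin = unicodedata.bidirectional("A")
bidi_hebrew = unicodedata.bidirectional("\u05E7")  # Hebrew letter Qof
bidi_arabic = unicodedata.bidirectional("\u0645")  # Arabic letter Meem

print(delta_name, delta_category)
print(len(decomposed), len(composed))
print(bidi_latin, bidi_hebrew, bidi_arabic)
```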
Recent web browsers display web pages using Unicode if an appropriate font is installed.
Although syntax rules may affect the order in which characters are allowed to appear, both HTML 4.0 and XML 1.0 documents are, by definition, composed of characters from the entire range of Unicode code points, minus only a handful of disallowed control characters and the permanently unassigned code points D800-DFFF and FFFE-FFFF. These characters appear either directly as bytes according to the document's encoding, if the encoding supports them, or as numeric character references based on the character's Unicode code point, as long as the document's encoding supports the digits and symbols required to write the references (all encodings approved for use on the Internet do). For example, the references &#916;, &#1049;, &#1511;, &#1605;, &#3671;, &#12353;, &#21494;, &#33865; and &#45307; (or the same numeric values expressed in hexadecimal, with &#x as the prefix) display in your browser as Δ, Й, ק, م, ๗, ぁ, 叶, 葉 and 냻. If you have the proper fonts, these symbols look like the Greek capital letter "Delta", the Cyrillic capital letter "Short I", the Hebrew letter "Qof", the Arabic letter "Meem", the Thai numeral 7, the Japanese Hiragana "A", the simplified Chinese "Leaf", the traditional Chinese "Leaf", and a Korean Han-geul syllable "Nyrh", respectively.
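The mapping between characters and numeric character references is mechanical: the number in the reference is the character's code point, in decimal or, after the &#x prefix, in hexadecimal. The sketch below is minimal and its two helper functions are hypothetical; html.unescape is the Python standard library's decoder for such references.

```python
import html


def to_decimal_reference(ch: str) -> str:
    """Write a character as a decimal numeric character reference."""
    return f"&#{ord(ch)};"


def to_hex_reference(ch: str) -> str:
    """Write a character as a hexadecimal numeric character reference."""
    return f"&#x{ord(ch):X};"


# The nine sample characters from the paragraph above, written as
# escapes so that this listing itself stays pure ASCII.
samples = "\u0394\u0419\u05E7\u0645\u0E57\u3041\u53F6\u8449\uB0FB"

decimal_refs = [to_decimal_reference(ch) for ch in samples]
hex_refs = [to_hex_reference(ch) for ch in samples]

# Decoding either form gives back the original characters.
decoded_from_decimal = html.unescape("".join(decimal_refs))
decoded_from_hex = html.unescape("".join(hex_refs))

print(" ".join(decimal_refs))
print(" ".join(hex_refs))
```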
Free software fonts exist for most (all?) of the characters in the BMP. They may be downloaded freely from the Internet.
Retail fonts that use the Unicode encoding are increasingly common, since first TrueType and now OpenType use Unicode.
The fact that a font uses a Unicode encoding says nothing about how much of Unicode the font supports. There are thousands of Unicode-encoded fonts on the market, but probably fewer than half a dozen that attempt to support most of Unicode. Most fonts focus on a particular script.