Unicode is an industry standard designed to allow text and symbols from all of the writing systems of the world to be consistently represented and manipulated by computers. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a character repertoire, an encoding methodology and set of standard character encodings, a set of code charts for visual reference, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and rules for normalization, decomposition, collation and rendering.
The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments. Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming language and modern operating systems.
Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO 8859 standard, which are widely used in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem: they allow bilingual computer processing (usually using Latin characters and the local script) but not multilingual computer processing (computer processing of arbitrary languages mixed with each other).
Unicode, in intent, encodes the underlying characters (graphemes and grapheme-like units) rather than the variant glyphs (renderings) of those characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).
In text processing, Unicode provides a unique code point (a number, not a glyph) for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font or style) to other software, such as a web browser or word processor. This simple aim is complicated, however, by concessions that Unicode's designers made in the hope of encouraging more rapid adoption of Unicode.
The first 256 code points were made identical to the content of ISO 8859-1 so that converting existing Western text would be trivial. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings, and therefore to allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "fullwidth forms" section of code points encompasses a full Latin alphabet that is separate from the main Latin alphabet section. In Chinese, Japanese and Korean (CJK) fonts, these characters are rendered at the same width as CJK ideographs rather than at half the width. For other examples, see Duplicate characters in Unicode.
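Both concessions can be checked from Python's standard library. The following is only a minimal sketch: the variable names are illustrative, and the use of NFKC normalization to fold a fullwidth letter back onto its ordinary counterpart is an assumption made for the demonstration, not something the paragraph above prescribes.

```python
import unicodedata

# Decode every possible byte value as ISO 8859-1 ("latin-1") and check that
# byte N always becomes the character with code point N.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
latin1_matches = all(ord(ch) == i for i, ch in enumerate(decoded))
print(latin1_matches)  # True

# U+FF21 is the fullwidth form of "A"; it is a distinct code point from U+0041.
fullwidth_a = "\uFF21"
plain_a = "A"
print(unicodedata.name(fullwidth_a))  # FULLWIDTH LATIN CAPITAL LETTER A
print(fullwidth_a == plain_a)         # False

# Compatibility normalization (NFKC) maps the fullwidth letter to the plain one.
folded = unicodedata.normalize("NFKC", fullwidth_a)
print(folded == plain_a)              # True
```

Because the two letters are different code points, a plain string comparison treats them as different characters; software that wants them to match has to fold them explicitly.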
Also, while Unicode allows for combining characters, it also contains precomposed versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler and allow applications to use Unicode as an internal text format without having to implement combining characters. For example, é can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute), but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute).
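The two representations of é can be compared with Python's standard unicodedata module. This is a small sketch with illustrative variable names; it assumes the normalization forms NFC (compose) and NFD (decompose) as the way to convert between the two spellings.

```python
import unicodedata

decomposed = "\u0065\u0301"  # U+0065 followed by U+0301 (combining acute)
precomposed = "\u00E9"       # U+00E9, the single precomposed character

# The two strings look the same on screen but are different code point sequences.
print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)          # False

# NFC composes to the precomposed form; NFD decomposes to base + combining mark.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True
```

An application that compares or searches text would typically normalize both sides to the same form first, so that the two spellings are treated as equal.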
The Unicode standard also includes a number of related items, such as character properties, text normalization forms and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).
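Some of these character properties can be inspected through Python's unicodedata module, which is built from the Unicode character database. The helper name describe below is hypothetical and the selection of properties is just an illustration.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode character properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),         # e.g. Lu = uppercase letter
        "bidi_class": unicodedata.bidirectional(ch),  # e.g. L, R, AL
        "combining_class": unicodedata.combining(ch), # 0 for non-combining characters
    }

# A Latin capital, a Latin small letter, a Hebrew letter, an Arabic letter,
# and the combining acute accent.
for ch in ("A", "a", "\u05D0", "\u0627", "\u0301"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

The bidirectional class is what a rendering engine consults when laying out mixed text: Latin letters report L (left-to-right), Hebrew letters R, and Arabic letters AL.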
The Unicode Consortium, based in California, develops the Unicode standard. Any company or individual willing to pay the membership dues may join this organization. Members include virtually all of the main computer software and hardware companies with any interest in text-processing standards, such as Apple, Microsoft, IBM, Xerox, HP, Adobe Systems and many others.
The Consortium first published The Unicode Standard in 1991 (version 1.0, ISBN 0-201-56788-1), and continues to develop standards based on that original work. Unicode was developed in conjunction with the International Organization for Standardization, and it shares its character repertoire with ISO/IEC 10646: the Universal Character Set. Unicode and ISO/IEC 10646 function equivalently as character encodings, but The Unicode Standard contains much more information for implementers, covering in depth topics such as bitwise encoding, collation and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting bidirectional text. The two standards do use slightly different terminology.
When writing about a Unicode character, it is normal to write "U+" followed by a hexadecimal number indicating the character's code point. For code points in the Basic Multilingual Plane (BMP), four digits are used; for code points outside the BMP, five or six digits are used, as required. Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits in order to indicate a code unit, not a code point.
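The current notation is easy to produce in code. Here is a minimal Python sketch; the function name format_code_point is hypothetical, and it relies on zero-padding the hexadecimal value to at least four digits.

```python
def format_code_point(ch):
    """Format a single character in U+ notation.

    Pads to at least four hexadecimal digits, so BMP code points get exactly
    four digits and code points outside the BMP get five or six.
    """
    return f"U+{ord(ch):04X}"

print(format_code_point("A"))           # U+0041
print(format_code_point("\u00E9"))      # U+00E9
print(format_code_point("\U0001D11E"))  # U+1D11E
print(format_code_point("\U0010FFFF"))  # U+10FFFF
```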
Unicode revision history
October, 1991 Unicode 1.0 ISBN 0-201-56788-1.
June, 1992 Unicode 1.0.1 ISBN 0-201-60845-6.
June, 1993 Unicode 1.1 The previous two publications, plus Unicode Technical Report #4: The Unicode Standard, Version 1.1, by Mark Davis.
July, 1996 Unicode 2.0 ISBN 0-201-48345-9.
May, 1998 Unicode 2.1
May, 1998 Unicode 2.1.2 The previous three publications, plus Unicode Technical Report #8: The Unicode Standard, Version 2.1, by Lisa Moore.
September, 1999 Unicode 3.0 Covered 16-bit UCS Basic Multilingual Plane (BMP) from ISO 10646-1:2000. ISBN 0-201-61633-5.
March, 2001 Unicode 3.1 Added Supplementary Planes from ISO 10646-2, providing supplementary characters
March, 2002 Unicode 3.2
April, 2003 Unicode 4.0 ISBN 0-321-18578-1.
March, 2004 Unicode 4.0.1
March, 2005 Unicode 4.1
July, 2006 Unicode 5.0 (The Unicode Character Database, or UCD, was published on July 18; the book, The Unicode Standard, Version 5.0, was released on November 9, 2006; ISBN 0-321-48091-0.)
2007-02-06 00:48:16
answered by navmac 2