Does it depend on which UTF is used? I tried to look for an answer, but it is a little confusing.

2007-02-11 17:46:43 · 3 answers · asked by Anonymous in Computers & Internet Programming & Design

3 answers

I think you might be a bit confused as to how Unicode works, exactly.

In Unicode's original 16-bit design, each character is 2 bytes long (as opposed to ASCII, which is one byte). A byte is 8 bits, so two bytes is 16 bits.

In an 8-bit extended ASCII 'alphabet' there are 256 different possible characters (ASCII proper defines only 128). The normal English alphabet takes up 26 for lowercase and 26 more for uppercase = 52. That leaves roughly 200 more characters to work with. (The rest are used for some accented Latin and Greek characters as well as punctuation, numbers, and a few other symbols.)
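A quick Python sketch of that arithmetic (the 204 is just what remains of an 8-bit set after the 52 letters):

```python
import string

# Letter counts in the basic English alphabet.
print(len(string.ascii_lowercase))  # 26
print(len(string.ascii_uppercase))  # 26

# Codes left over in an 8-bit (256-value) character set.
print(256 - 52)  # 204
```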

The Unicode 'alphabet', in its 16-bit form, allows for up to 65,536 characters (the Basic Multilingual Plane). This covers the entire ASCII subset as well as characters from just about every other written language in common use. The Arial font that is included with Windows XP contains characters for English, Latin, Greek, Farsi, Arabic, Korean, Hindi, Chinese (both simplified and traditional), Japanese, Hebrew, Cyrillic, and Armenian... and there's room left over.

To answer your question: since those characters all use 16 bits apiece, character number 65,536 (code point 65,535, because numbering starts at zero) also fits in 16 bits. (Open up the Windows calculator, switch to scientific mode, select the 'Dec' radio button, punch in 65535, then hit the 'Bin' radio button. Count how many digits are in the resulting number.)
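The same exercise in a couple of lines of Python, for anyone without the calculator handy:

```python
# Dec -> Bin, as in the calculator exercise.
n = 65535
print(bin(n))          # 0b1111111111111111
print(n.bit_length())  # 16 -- sixteen binary digits
```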

2007-02-11 18:06:03 · answer #1 · answered by Jack Schitt 3 · 2 0

You need 16 bits to represent most Unicode characters. Strictly speaking, Unicode needs 21 bits for the full set (code points range from 0 to 1,114,111), but the ones above 65K are exceedingly uncommon.
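You can check that 21-bit figure directly; a quick Python sketch:

```python
# The highest Unicode code point is U+10FFFF = 1,114,111.
max_cp = 0x10FFFF
print(max_cp)               # 1114111
print(max_cp.bit_length())  # 21 -- bits needed for the full range
```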

UTF-16 takes 16 bits to represent each of those first 65K characters; UTF-8 takes anywhere from 8 bits (for the ASCII chars) to 24 bits for them. Both can take 32 bits for the rare ones outside that range. There's not really a UTF-24, though. UTF-32 always takes 32 bits for every character, which is generally overkill.
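A short Python sketch showing the byte counts for one character from each range (the sample characters are just illustrative picks):

```python
# Bytes needed per character in each encoding.
# 'A' = ASCII, 'é' = Latin-1 range, '中' = rest of the BMP, '😀' = above 65K.
for ch in ("A", "é", "中", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes
          len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 bytes (no BOM)
          len(ch.encode("utf-32-le")))  # always 4 bytes
```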

2007-02-12 05:41:14 · answer #2 · answered by ey 2 · 1 0

Binary values 0 - 65535 are represented by 16 bits.

2007-02-12 01:58:52 · answer #3 · answered by Anonymous · 1 0
