Archive for June 2nd, 2008

Unicode sans the Uni

When you pick up your first book about programming, you probably have a guess about what you’re going to learn: how to add 1 and 1 together in code, or how to display the text “Hello, World!” on the screen. And you will. You’ll learn that your average text string consists of a few bytes, where every character takes up one byte, or 8 bits. Heck, you may even memorize that the character ‘0’ is 48 in decimal and that a space is 0x20 in hexadecimal. Not to mention you’ll be forced to recognize \r\n as 13d, 10d or 0x0D, 0x0A.
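For the curious, those numbers are easy to check. This is just a small sketch in Python (my pick for illustration, any language would do), and the variable names are made up:

```python
# Each ASCII character maps to one small integer that fits in a single byte.
text = "Hello, World!\r\n"
raw = text.encode("ascii")

print(len(text), len(raw))  # same count: one byte per character
print(ord("0"))             # decimal value of the character '0'
print(hex(ord(" ")))        # hexadecimal value of a space
print(list(raw[-2:]))       # the two bytes behind "\r\n"
```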

Eventually, however, you’ll come across a fancy word called internationalization, or “i18n”. If you need to support a foreign language that doesn’t use our a-z alphabet, you will need to look this up. If you started surfing around the Internet in the Microsoft Windows 95 age, you probably got several hits on websites explaining code pages. A code page is a handy tool that lets the OS know that the string you’re going to display shouldn’t use the normal ASCII representation for those bytes of yours, but some other odd character set. Code pages still exist, but most of the world has stopped using them in favor of something else.
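To make the code page idea concrete, here is a rough sketch in Python that decodes the very same byte under three Windows code pages. The byte value 0xE9 and the three code pages are arbitrary picks of mine, nothing more:

```python
# One byte, three code pages, three different characters.
raw = bytes([0xE9])

as_western = raw.decode("cp1252")   # Windows Western European
as_cyrillic = raw.decode("cp1251")  # Windows Cyrillic
as_greek = raw.decode("cp1253")     # Windows Greek

print(as_western, as_cyrillic, as_greek)
```

The byte never changes. Only the table used to look it up does, which is exactly why a string without its code page attached is a guessing game.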

That something else is Unicode. And this is where the fun starts, because you’ll go mad when you see this. Well, I did. Unicode is a wonderful idea which incorporates all the characters in the world in one big, oddly structured table. The problem with this table is that there are more characters than you can squeeze into 8 bits, and this leads to the great debate that makes you pull out your hair and change professions to something where you don’t have to use your brain.

Unicode is one big mess across programming languages and operating systems. Windows uses 16-bit wchar_t’s and separate “W” versions of its Windows API functions; when you compile Linux programs with that same wchar_t you get 32-bit characters; Linux itself uses UTF-8 by default; Mac OS X uses “canonically decomposed UTF-8”; and Java uses its own 16-bit Unicode strings by default. And we thought we were just talking about one standard called Unicode…
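Two of those differences can be poked at from a script. The sketch below is Python, using the standard ctypes and unicodedata modules. It assumes ctypes reports the size of the platform’s C wchar_t, and that “canonically decomposed” corresponds to Unicode normalization form NFD. The sample character é is my own choice:

```python
import ctypes
import unicodedata

# Size of the platform's C wchar_t as seen through ctypes:
# typically 2 bytes on Windows and 4 bytes on Linux.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print("wchar_t is", wchar_size * 8, "bits here")

# The same visible character as one precomposed code point (NFC) and as
# "e" followed by a combining acute accent (NFD, the decomposed form).
composed = unicodedata.normalize("NFC", "\u00e9")
decomposed = unicodedata.normalize("NFD", "\u00e9")

print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'
print(composed == decomposed)      # False, although both display as the same character
```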

Now I wonder if you’re surprised when you read about this April Fools’ prank.

Anyway, I’m not sure what my point was… Oh right. Normally, when you have a UTF-8 string and want to turn it into a wide string, you’ll find a POSIX function called mbstowcs() (multibyte string to wide-character string). You use it in combination with setlocale(), which tells the system that your multibyte strings are currently UTF-8, treated as ‘just another code page’. On Windows, however, you’ll first read that you need code page 65001 for UTF-8, and after testing you’ll find this kind of information on MSDN in small print: “If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.” So on Windows you need to use the function MultiByteToWideChar() instead, which is of course Windows-only.
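If you only want to see what such a conversion produces, without fighting either API, a scripting language will do. The following is a sketch in Python using its built-in codecs, not mbstowcs() or MultiByteToWideChar() themselves. The sample text and the variable names are made up for illustration:

```python
import sys

utf8_bytes = "na\u00efve \u20ac".encode("utf-8")  # stand-in for UTF-8 input

# Step 1: decode the multibyte UTF-8 input into code points.
text = utf8_bytes.decode("utf-8")

# Step 2: lay the code points out as fixed-size units in native byte order.
suffix = "le" if sys.byteorder == "little" else "be"
wide16 = text.encode("utf-16-" + suffix)  # 16-bit units, like a Windows wchar_t
wide32 = text.encode("utf-32-" + suffix)  # 32-bit units, like a Linux wchar_t

print(len(utf8_bytes), len(wide16), len(wide32))  # 10 14 28
```

Seven characters go in, and the byte count depends entirely on which “wide” you asked for.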

So what’s the big deal about all these types of Unicode? Well, currently you need 4 bytes per character if you want every character in the Unicode table at a fixed width. However, you can stick to 2 bytes if you just want the characters that are most used in the world. And of course someone got really upset, wanted to retain ASCII compatibility, and made up UTF-8, where a character takes anywhere from 1 to 4 bytes.
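Here is a quick sketch in Python that measures this for a few characters. The four samples and the helper name encoded_sizes are my own, picked so that each one lands in a different UTF-8 length:

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

# A plain letter, an accented letter, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The emoji is the interesting one: it sits outside the 2-byte range, so even UTF-16 has to spend 4 bytes on it.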

The fun thing about UTF-8 is that unless you’re using weird characters, it doesn’t look any different from a normal ASCII string, and as a bonus you can still find the end of the string by looking for the zero byte.
UTF-16 and UTF-32, on the other hand, are fixed width in most cases (UTF-32 always, UTF-16 until you need a character outside the most common 65,536), and use 2 or 4 zero bytes respectively to mark the end of a string.
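The terminator difference is easy to trip over, so here is a rough sketch of it in Python. The helpers byte_strlen and wide_strlen are hypothetical names of mine that mimic what a C-style length scan would do, and the terminators are appended by hand:

```python
def byte_strlen(buf):
    """Count bytes before the first zero byte, like a C strlen would."""
    return buf.index(0)

def wide_strlen(buf, width):
    """Count code units before the first all-zero unit of `width` bytes."""
    for start in range(0, len(buf), width):
        if buf[start:start + width] == bytes(width):
            return start // width
    raise ValueError("no terminator found")

text = "Hi!"
utf8 = text.encode("utf-8") + bytes(1)       # one zero byte at the end
utf16 = text.encode("utf-16-le") + bytes(2)  # two zero bytes at the end
utf32 = text.encode("utf-32-le") + bytes(4)  # four zero bytes at the end

print(byte_strlen(utf8))      # 3
print(byte_strlen(utf16))     # 1, the high byte of 'H' is already zero
print(wide_strlen(utf16, 2))  # 3
print(wide_strlen(utf32, 4))  # 3
```

So a byte-wise scan works fine on UTF-8 but stops after the first letter of a UTF-16 string, which is why the wide encodings need their own string functions.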

I really thought I had something interesting to talk about when I started this post… I guess not…

June 2, 2008