Discussing the nuts and bolts of software development

Friday, July 20, 2007

 

How do I convert a wchar_t to a char?

How do I convert a wchar_t to a char? How do I convert a char to a wchar_t?

It depends! Welcome to the wonderful world of string conversion.

You are probably asking this because you have a std::wstring or wchar_t* and need to pass it to a function which takes a std::string or char*. You just need to know the name of the conversion function to use, right? The problem is that the way to do the conversion varies depending on the context of your code.

Before asking how to convert a wchar_t to a char, there are other questions you should be asking first, such as "What encoding is my wchar_t using?" and, more importantly, "What encoding does my function expect that char parameter to be in?"

Distinguish Storage From Encoding

chars and wchar_ts just represent storage space, and the size of that storage is defined by the compiler. A char is always 1 byte by definition in C and C++. A wchar_t is wider: Microsoft VC defines it as 2 bytes, while GCC on most Unix-like platforms defines it as 4 bytes.
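If you are not sure what your own compiler does, a quick check like the following sketch will tell you. The helper name wide_char_bits is made up for this example.

```cpp
#include <climits>
#include <cstddef>

// sizeof(char) is 1 by definition; only the width of wchar_t varies.
static_assert(sizeof(char) == 1, "char is always one byte");

// Bits of storage in one wchar_t for the compiler building this code
// (hypothetical helper, named for this example only).
constexpr std::size_t wide_char_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

On Microsoft VC this should report 16 bits, and on GCC for Linux 32 bits. Note that it only tells you the storage size, not the encoding.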

What matters is the encoding used for the data contained within those bytes. Is it ASCII or an extended-ASCII character set? If so, what code page is being used? Or maybe it's Unicode? If so, what Unicode encoding is being used?

The encoding tells you how to interpret the data, and thus how you'll need to convert it.

Encoding Within The Wide String

You need to know the encoding of the wide string, i.e. the way to interpret the data stored in each wchar_t. This is typically UTF-16 or UCS-2 when a wchar_t is 2 bytes, and UTF-32 when a wchar_t is 4 bytes.

For example, let's say you had an array of wchar_t on Windows and the first wchar_t's value in hex was 2D25. This is probably UTF-16 encoding, and represents the Georgian small letter 'hoe' (http://www.unicode.org/charts/PDF/U2D00.pdf).

But who knows? Although very unlikely, it's possible that the string was read in from a source that stuffed two one-byte UTF-8 characters into each wchar_t, in which case the value 2D25 represents '-' (2D) and '%' (25).
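To make the two readings of 2D25 concrete, here is a minimal sketch. Both function names are invented for this example, and the second one assumes the high byte was packed first.

```cpp
#include <cstdint>
#include <string>

// Reading 1: the value is a single UTF-16 code unit, so it is the code point
// itself. This holds only outside the surrogate range D800..DFFF.
std::uint32_t as_utf16_code_point(std::uint16_t unit) {
    return unit;
}

// Reading 2: the value is two packed one-byte characters, high byte first
// (the byte order is an assumption of this sketch).
std::string as_two_packed_bytes(std::uint16_t unit) {
    std::string out;
    out += static_cast<char>(unit >> 8);    // 0x2D -> '-'
    out += static_cast<char>(unit & 0xFF);  // 0x25 -> '%'
    return out;
}
```

The same 16 bits give you either U+2D25 or the two-character string "-%". Nothing in the data itself says which reading is the right one.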

The point is, in order to know what encoding your source string is in, you must understand the context of the code. How did you obtain this string? If it was from a Windows file function, then probably it is UTF-16 encoded. If it was from a 3rd-party library, then consult the 3rd party documentation just to be sure.

Encoding Within The Narrow String

Next, you need to know the encoding that the function expects the char* parameter to be in. Again, it's all about context. Consult the function documentation. Old Unix-style functions such as unlink and rmdir generally expect the char* string to be encoded in the character set of the current locale. Other functions from 3rd-party libraries might expect the char* to be a UTF-8 encoded string, etc.

_Be careful!_ It's easy to confuse US-ASCII with UTF-8 because the first 128 characters (hex values 00 to 7F) represent the same symbols. For example, value 6B represents 'k' in both US-ASCII and UTF-8. It's only once you get into higher values that they get out of sync. This is why developers often think they used the right encoding, until their product ships internationally and some important executives freak out because the é, ç and ä in their names are garbled.
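Here is a small sketch of where the two part ways. It uses ISO-8859-1 (Latin-1) as a stand-in for a typical single-byte code page, which is my assumption for the example. The function names are made up, and each encoder only handles the small range stated in its comment.

```cpp
#include <string>

// Encode a code point below U+0100 as ISO-8859-1: always exactly one byte.
std::string to_latin1(unsigned cp) {
    return std::string(1, static_cast<char>(cp));
}

// Encode a code point below U+0800 as UTF-8: one byte up to U+007F,
// two bytes from U+0080 upward.
std::string to_utf8_small(unsigned cp) {
    if (cp < 0x80) {
        return std::string(1, static_cast<char>(cp));
    }
    std::string out;
    out += static_cast<char>(0xC0 | (cp >> 6));    // lead byte
    out += static_cast<char>(0x80 | (cp & 0x3F));  // continuation byte
    return out;
}
```

For 'k' (6B) both functions return the single byte 6B. For 'é' (E9) the Latin-1 encoder returns the byte E9, while the UTF-8 encoder returns the two bytes C3 A9. Read those two bytes back as Latin-1 and you get "Ã©", which is exactly the kind of garbling the executives complain about.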

Time To Convert!

Once you know the platform, source encoding and destination encoding, you are ready to convert. There are numerous different conversion utilities on the web, so start searching! If you google "Convert UTF16 to UTF8 on Windows" for example, this can yield better results than "convert wchar_t to char".

I did 5 minutes of googling just now and was able to find a few links to get you started:

Converting from UTF16 to UTF8 and vice versa on Windows:
http://www.codeproject.com/useritems/UtfConverter.asp

Converting from UTF16 to a given Windows code page:
http://msdn2.microsoft.com/en-us/library/ms776420.aspx

Converting to a narrow string using the current locale's encoding on Unix:
http://www.scit.wlv.ac.uk/cgi-bin/mansec?3C+wcstombs
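To give an idea of what such a utility does under the hood, here is a hand-rolled sketch of the UTF-16 to UTF-8 case. It is not taken from any of the links above, and the function names are my own. It assumes the input is held in a std::u16string and replaces unpaired surrogates with U+FFFD. In production code you would more likely call a tested routine, for example WideCharToMultiByte with CP_UTF8 on Windows.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to out as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 code units to a UTF-8 byte string.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the low surrogate that must follow.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;      // stray low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Passing in the single code unit 2D25 yields the three bytes E2 B4 A5, and the surrogate pair D83D DE00 yields the four bytes F0 9F 98 80.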

Lossless and Lossy Conversions

_WARNING:_ If you are converting from one Unicode encoding to another (for example, UTF-16 to UTF-8), then this will be a lossless conversion, provided the input is well-formed. You can convert back and forth as many times as you need without losing any encoding information.

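If on the other hand you convert a Unicode string to an ASCII or code-page string, this is a lossy conversion. Any character that the target code page cannot represent is dropped or replaced, and even for the characters that survive, you will not be able to convert back without knowledge of the code page or locale used for the initial conversion.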

For example, if I send a char* code-page string over the wire to another machine, the receiving end will not be able to convert it back to a wchar_t* Unicode string without knowing what locale my machine was in when I built the narrow string in the first place.
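Here is a sketch of what the receiving end is up against. The same byte decodes to different characters depending on which code page the sender used. The enum and function names are invented for this example, and the Windows-1251 branch only covers bytes C0 to FF, where the mapping to the Cyrillic block is contiguous.

```cpp
#include <cstdint>

enum class CodePage { Latin1, Windows1251 };

// Decode one narrow byte to a Unicode code point under the given code page.
// Returns U+FFFD for bytes this sketch does not cover.
std::uint32_t decode_byte(unsigned char b, CodePage page) {
    if (b < 0x80) {
        return b;  // the ASCII range is identical in both code pages
    }
    if (page == CodePage::Latin1) {
        return b;  // ISO-8859-1 maps bytes 80..FF straight to U+0080..U+00FF
    }
    if (b >= 0xC0) {
        return 0x0410 + (b - 0xC0);  // Windows-1251: C0..FF -> U+0410..U+044F
    }
    return 0xFFFD;  // Windows-1251 bytes 80..BF are irregular; not handled here
}
```

The byte E9 comes back as U+00E9 ('é') under Latin-1 but as U+0439 ('й') under Windows-1251. Without being told the code page, the receiver has no way to pick between them.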

If there's one principle to remember when working with string conversion, it's that a good programmer is aware of the context in which he's working at all times. Take the extra minute to understand the source and destination encodings, the platform and the locale, and you will be rewarded with a lower bug count once your localized product hits international markets.
