
If wchar_t can hold most of the code points needed for a given use, a fixed-width character has some benefits for certain algorithms.

But it is fairly easy to convert wchar_t to and from UTF-8, depending on the use.
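To make the "fairly easy" part concrete, here is a minimal sketch of the encoding direction. It assumes the wide character is a full 32-bit code point (true of wchar_t on most Unix-like systems, not on Windows), so it takes a uint32_t rather than wchar_t itself. The name utf8_encode is made up for this example and is not a standard library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out.
   Returns the number of bytes written (1 to 4), or 0 if cp is not a
   valid Unicode scalar value (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Converting a whole wide string is then a loop that calls this once per element and appends the bytes to an output buffer.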

UTF-16 is not awful; it is the same as an 8-bit character set, but with code units twice as wide.



UTF-16 is fine so long as you stay in Plane 0 (the Basic Multilingual Plane). Once you have to deal with surrogate pairs, it really is awful. Once you have to deal with byte order marks, you might as well just throw in the towel.
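A rough sketch of what that extra work looks like when decoding UTF-16 from a byte stream. The function names and the big_endian flag are invented for this example; it assumes the caller has already worked out the byte order, for instance by checking for a leading U+FEFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one 16-bit code unit from two bytes in the given byte order. */
static uint16_t read_unit(const unsigned char *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/* Hypothetical helper: decode one code point from UTF-16 bytes.
   Returns the number of bytes consumed (2 or 4), or 0 if the input is
   truncated or contains an unpaired surrogate. */
size_t utf16_decode(const unsigned char *s, size_t len, int big_endian,
                    uint32_t *cp)
{
    if (len < 2)
        return 0;

    uint16_t hi = read_unit(s, big_endian);
    if (hi < 0xD800 || hi > 0xDFFF) {     /* Plane 0: one unit, done */
        *cp = hi;
        return 2;
    }
    if (hi > 0xDBFF || len < 4)           /* lone low surrogate, or cut off */
        return 0;

    uint16_t lo = read_unit(s + 2, big_endian);
    if (lo < 0xDC00 || lo > 0xDFFF)       /* high surrogate without its pair */
        return 0;

    /* Each surrogate carries 10 bits; the pair is offset by 0x10000. */
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 4;
}
```

Note that the Plane 0 path is two lines, and everything else in the function exists only for surrogates and byte order.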

UTF-8 is well-designed and has a consistent mechanism for expanding to the underlying code point; it is easy to resynchronize and for ASCII systems (like most protocols) the parsing can be dead simple.
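For comparison, a minimal sketch of UTF-8 decoding and resynchronization. The function names are made up for this example. Resynchronizing just means skipping bytes of the form 10xxxxxx, since those can only appear in the middle of a sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* A continuation byte has the form 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: from pos, skip forward to the next byte that can
   start a sequence. Returns len if there is none. */
size_t utf8_resync(const unsigned char *s, size_t len, size_t pos)
{
    while (pos < len && is_continuation(s[pos]))
        pos++;
    return pos;
}

/* Hypothetical helper: decode one code point starting at s.
   Returns the number of bytes consumed (1 to 4), or 0 if the sequence is
   malformed, truncated, overlong, a surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    if (len == 0)
        return 0;

    unsigned char b = s[0];
    size_t n;
    uint32_t v, min;

    /* The lead byte alone tells us the sequence length. */
    if (b < 0x80) { *cp = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { n = 2; v = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { n = 3; v = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { n = 4; v = b & 0x07; min = 0x10000; }
    else return 0;                        /* stray continuation or bad lead */

    if (len < n)
        return 0;

    /* Every following byte contributes six payload bits. */
    for (size_t i = 1; i < n; i++) {
        if (!is_continuation(s[i]))
            return 0;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;

    *cp = v;
    return n;
}
```

An ASCII-only parser never needs any of this: bytes below 0x80 always mean themselves, so it can scan for its delimiters and pass everything else through untouched.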

Dealing with Unicode text and glyph handling is always going to be painful because the problem is intrinsically difficult. But expanding byte strings to Unicode code points should not be as difficult as UTF-16 makes it.

Windows adopted UCS-2 before the higher code planes were designed, and it never recovered.



