
I think the problem is believing that one character set or character encoding is suitable for everything, and that it has one definition. Neither is true.

Sometimes the restriction is appropriate, but sometimes a variant without this restriction is appropriate, and sometimes Unicode is not appropriate at all. The "artificial restriction" in UTF-8 is legitimate (surrogate code points are not valid Unicode scalar values), but it should not apply to all kinds of uses; the problem is programs that enforce it where it should not be applied, because of limitations in their design.

I think that treating file names and passwords as sequences of bytes is better, and that making file names and passwords case sensitive is also better.
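As one concrete illustration of the bytes-as-filenames view: Python's os functions accept either str or bytes paths, and when given bytes they return raw byte names without assuming any particular encoding. A minimal sketch:

```python
import os

# Passing a bytes path makes os.listdir return raw byte filenames,
# sidestepping any assumption that names are valid UTF-8 (or any
# other encoding). This is the "file name is a sequence of bytes" view.
names = os.listdir(b".")
assert all(isinstance(n, bytes) for n in names)
```

This matters on POSIX systems, where the kernel stores file names as arbitrary byte strings (excluding NUL and '/'); a name that is not valid UTF-8 is still a perfectly legal file name.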

However, I think "WTF-8" specifically means that unpaired surrogates can be encoded, in case you want to convert to/from invalid UTF-16. Sometimes you might use a different variant of UTF-8 that can go beyond the Unicode range, or encode null characters without null bytes, etc. Sometimes it is better to use different Unicode encodings, or different non-Unicode encodings (which cannot necessarily be converted to Unicode; don't assume that you can or should convert them). Sometimes it is enough to care only that the text is ASCII (or some extension of ASCII, without caring which one), or to not care about character encoding at all.



