
No. Python internally would be made much faster by working on pure UTF-8. Absolutely nothing internal to the language uses the operations that code point semantics speeds up.

Since you mention the varying internal representation of strings: that’s PEP 393 <https://peps.python.org/pep-0393/>, which landed in CPython 3.3. It generally made things slower by introducing a lot of branching, reallocation and the like, though it does speed up some cases by having to touch less memory, and some methods by being able to quickly rule out possibilities (e.g. str.isascii can immediately return False for a canonical UCS-2 or UCS-4 string, since if it were ASCII it would have been of the Latin-1 kind).
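You can actually observe PEP 393’s per-string representation choice from pure Python via sys.getsizeof, since the kind chosen determines bytes per character. A sketch (the exact byte counts vary by CPython version, so it only compares relative sizes):

```python
import sys

# PEP 393 picks the narrowest fixed-width kind that fits the string:
ascii_s = "a" * 100            # fits the Latin-1 kind (1 byte/char)
bmp_s = "\u0100" * 100         # U+0100 forces the UCS-2 kind (2 bytes/char)
astral_s = "\U0001F600" * 100  # an astral char forces UCS-4 (4 bytes/char)

# Same length in code points, increasingly wide storage:
assert sys.getsizeof(bmp_s) > sys.getsizeof(ascii_s)
assert sys.getsizeof(astral_s) > sys.getsizeof(bmp_s)

# Because the representation is canonical (narrowest kind that fits),
# a UCS-2 or UCS-4 string cannot be ASCII, so isascii() needn't scan it:
assert not bmp_s.isascii()
assert ascii_s.isascii()
```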

PEP 393 was done because people were complaining about how much memory their UCS-4 encoding had been using.

Note also how PEP 393 retains code point semantics: Latin-1 (Unicode values 0–255), UCS-2 and UCS-4 are all fixed-width encodings of code point sequences. PEP 393 does also allow a string to cache a UTF-8 representation (see PyCompactUnicodeObject.{utf8, utf8_length}), choosing “UTF-8 as the recommended way of exposing strings to C code”, but I gather this isn’t used very much.
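The code point semantics those fixed-width kinds buy: len() and indexing count code points, not bytes, and indexing is O(1) whichever kind the string is stored as. A minimal illustration:

```python
# len() and s[i] operate on code points under PEP 393, no matter which
# fixed-width kind (Latin-1 / UCS-2 / UCS-4) the string is stored as.
s = "na\u00efve\U0001F600"  # "naïve" plus one astral emoji, UCS-4 kind

assert len(s) == 6                    # 6 code points...
assert s[5] == "\U0001F600"           # ...each indexable in O(1)
assert len(s.encode("utf-8")) == 10   # but 10 bytes once encoded as UTF-8
```

With an internal UTF-8 representation, that same s[5] would need either a scan from the start of the string or an auxiliary index, which is the trade-off the comment above is pointing at.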

(Related: PyPy 7.1 shifted to using UTF-8 exclusively internally, and according to https://www.pypy.org/posts/2019/03/pypy-v71-released-now-use... got a “nice speed bump” out of it.)


