this is one of those things that people point to when comparing languages, but in reality it rarely matters. with Go, you just get the number of bytes, which is the correct default thing to do:
if the language default was anything other than this, THAT WOULD BE WRONG and unexpected. I would prefer the default to be the dumb, fast thing. then if I want the slow, fancy thing, I can import some first- or third-party package.
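The tradeoff is easy to see concretely. A sketch in Rust, which makes the same byte-length default as Go's `len` while leaving the "fancy" count as an explicit opt-in:

```rust
fn main() {
    let s = "héllo"; // 'é' is 2 bytes in UTF-8

    // The cheap default: byte length, read straight off the string reference.
    assert_eq!(s.len(), 6);

    // The opt-in, slower count: decode the UTF-8 and count scalar values.
    assert_eq!(s.chars().count(), 5);
}
```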
I think to some extent it depends on the language. In the article they talk about Swift's implementation, which by default does the slow, fancy thing (but makes it easy to do the dumb, fast thing). String manipulation in Swift is almost certainly going to be used for a GUI for end users of many possible languages / locales, so it makes sense to spend the extra cycles to get the fancy version by default. If it isn't the default then you'll end up with half the apps on the App Store displaying broken text on line breaks, ellipses, wrapping, etc. on their hand-rolled UI stack.
For anyone wondering what Go does: it looks like Python 2's way[1]; strings are byte sequences with no guarantee of UTF-{anything} correctness. Go source code is specified to be UTF-8, so string literals in source code will be valid UTF-8 encoded strings, but any string from a library call or code you didn't write might contain invalid Unicode text, or mixed encodings, or anything.
That feels a bit like "pit of despair" design[2]: the default thing is unhelpful, and doing more than that requires the programmer to climb up out of it.
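For contrast, here's a sketch of a language that takes the opposite stance and validates at the string-type boundary (Rust, since Go has no such built-in check — arbitrary bytes have to be explicitly rejected or laundered before they become a string):

```rust
fn main() {
    // Bytes that are not valid UTF-8 (0xFF can never appear in UTF-8).
    let raw = vec![0x68, 0x69, 0xFF];

    // Rust refuses to treat them as a string...
    assert!(String::from_utf8(raw.clone()).is_err());

    // ...unless you explicitly launder them, which replaces the bad byte
    // with U+FFFD REPLACEMENT CHARACTER.
    let fixed = String::from_utf8_lossy(&raw);
    assert_eq!(fixed.as_ref(), "hi\u{FFFD}");
}
```

In Go terms, every string value behaves like the `raw` vector above: nothing stops invalid bytes from flowing through string-typed APIs.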
The sad thing is that returning Unicode code points is probably not going to properly do what you wanted either... sliding down the slippery slope, you'd end up needing a text layout renderer and a language model to do what you thought you wanted to do. (and then there'd be a thousand bugs and edge cases that your libraries didn't handle properly)
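A minimal demonstration of why code points still aren't "what the user sees": the same accented letter can be spelled as one precomposed code point or as a base letter plus a combining mark, and the counts disagree either way.

```rust
fn main() {
    // Two spellings of "é": precomposed U+00E9 vs. 'e' + combining acute U+0301.
    let composed = "\u{e9}";
    let decomposed = "e\u{301}";

    // Both render as one user-perceived character (one grapheme cluster),
    // but the code-point counts disagree...
    assert_eq!(composed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);

    // ...as do the byte lengths. Grapheme segmentation (e.g. via a library
    // like the unicode-segmentation crate) is yet another layer on top.
    assert_eq!(composed.len(), 2);
    assert_eq!(decomposed.len(), 3);
}
```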
Sure, but that's actually decoding the string into Unicode scalar values and then counting them, whereas the length of the string is a direct property of the string reference (it's a fat pointer [address + length]).
I don't remember exactly, but I think the size hint is set on the Chars iterator, so when it sees 17 bytes of data it knows that can't encode more than 17 Unicode scalar values, nor fewer than five. But since we ask for an exact count, that hint is unused and the actual decoding takes place.
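That hint is observable directly. A sketch, assuming the standard library's current `Chars::size_hint` behaviour (lower bound ceil(bytes/4), since a scalar value takes at most 4 bytes; upper bound the byte count, since each takes at least 1):

```rust
fn main() {
    let s = "seventeen bytes!!";
    assert_eq!(s.len(), 17);

    // Bounds derived purely from the byte length, no decoding yet:
    // at least ceil(17 / 4) = 5 scalar values, at most 17.
    assert_eq!(s.chars().size_hint(), (5, Some(17)));

    // An exact count can't use the hint; it runs the decoder to the end.
    assert_eq!(s.chars().count(), 17);
}
```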
Yes, and your point? That's the same thing that happens in Swift when you request the length of a string and it gives you the number of glyphs (1, in this case).
Rust doesn't take sides here. It exposes all the different ways you might want to calculate the "length" of a string, and lets you pick which one you mean. The non-zero-cost choices involve a multi-step specification (like `.chars().count()`), which states explicitly the calculation involved.
Asking for str.len() is a single very cheap operation: it's not only O(1) in the sense you'd learn in an algorithms course, it's actually very cheap in practice, so it's fine if an algorithm relies heavily on str.len().
In contrast, chars().count() creates an iterator and runs it to completion, counting steps. That's O(N) for a string of N bytes, and in practice quite expensive, so you should definitely cache the value if you need it repeatedly. It's possible the compiler can see what you're doing and cache it for you, but I'm far from certain, so do it explicitly.
This is an important contrast with, say, C, where strlen(str) is O(N) because there are no fat pointers, so the program has no idea how long the string is in any sense.
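The cost difference above in a sketch — the byte length is a field read, while the scalar count decodes the whole string, so the latter is worth caching:

```rust
fn main() {
    let s = "Ünïcödé tëxt".repeat(1_000);

    // O(1): read the length field of the fat pointer. Fine to call in a loop.
    let bytes = s.len();

    // O(N): decode the entire string. Do it once and keep the result if you
    // need it repeatedly. (C's strlen is O(N) for a different reason: a bare
    // char* carries no length at all, so it must scan for the NUL.)
    let scalars = s.chars().count();

    // Multibyte characters make the byte count exceed the code-point count.
    assert!(bytes > scalars);
}
```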
Yeah, but unfortunately it provides `.len()` directly. It's documented to make clear that this is the byte count and not the character count, and that humans usually work with characters, but given that this isn't even a trait implementation, I think `.as_bytes().len()` or something similar would have been better.
That's only if you want strings to be sequences of bytes. If you want strings to be sequences of code points, it is more sensible to define string length as the length of that sequence. I prefer the latter (for coded text) because it is closer to the meaning of the string: a sequence of code points is always a sequence of code points, but a sequence of bytes may not correctly encode one, and the bytes of an encoding are not in one-to-one correspondence with the code points of the string. So I see no reason to care about individual bytes per se when working with strings.
Because whenever you want to store or transmit a string, only the byte count matters (the size of the string). All the fancy Unicode stuff on top of the bytes is for the display layer to handle. The default should be grounded in the reality of the programmer.
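A small sketch of that point: what actually hits a buffer, socket, or disk is the encoded bytes, and the size that matters for allocation and framing is the byte count, not any character count.

```rust
use std::io::Write;

fn main() {
    let s = "héllo";

    // What goes on the wire / on disk is the UTF-8 bytes, nothing more.
    let mut wire: Vec<u8> = Vec::new();
    wire.write_all(s.as_bytes()).unwrap();

    // The size that matters for storage and transmission is the byte count.
    assert_eq!(wire.len(), s.len());
    assert_eq!(wire.len(), 6);
}
```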
Storing and transmitting is always going to work with low-level storage units like bytes, so your string will need to be converted to that first. But string manipulation is extremely common in programming, and I would think graphemes are the most useful unit there - i.e. as a programmer my preference would be for Swift's behaviour.
Human interaction is a more grounded reality for programmers than the dumb land of pure bytes, so even at that conceptual level the default should be smart.
And bytes are the only thing that matters for a specific type of string, conveniently named: a sequence of bytes.
https://godocs.io/builtin#len