
So many of these conversations would be easier if there were no `length()` functions, only `length_in_<whats_exactly>()` functions.


In Ruby you have " ".codepoints.size == 5 and " ".bytes.size == 17

(It also has `length`, which equals `codepoints.size`.)


JavaScript is a weird one. To count UTF-16 code units (not bytes; each code unit is two bytes) you write:

    " ".length
For the Unicode code point count you write:

    [..." "].length
And for the grapheme count (or a language-aware word/sentence count) you write:

    [...new Intl.Segmenter('en-US', { granularity: "grapheme" }).segment(" ")].length
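For a word or sentence count you swap the granularity option to "word" or "sentence". With "word" the segments also include spaces and punctuation, so you probably want to count only the segments whose isWordLike is true.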
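To put the different counts side by side, here is a small sketch. The sample string is my own pick (an emoji ZWJ sequence written with escapes so every code point is visible), and the UTF-8 byte count via TextEncoder is an extra that the snippets above don't cover. The grapheme line is guarded because Intl.Segmenter may be missing in some environments.

```javascript
// Assumption: this sample sequence is chosen for illustration only.
// U+1F926 U+1F3FC U+200D U+2642 U+FE0F
const sample = "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F";

// UTF-16 code units: what String.prototype.length reports.
const utf16CodeUnits = sample.length; // 7

// Code points: string iteration (and spread) walks by code point.
const codePoints = [...sample].length; // 5

// UTF-8 bytes: encode and measure the resulting Uint8Array.
const utf8Bytes = new TextEncoder().encode(sample).length; // 17

// Graphemes: only computed when Intl.Segmenter exists.
const graphemes = typeof Intl.Segmenter === "function"
    ? [...new Intl.Segmenter("en-US", { granularity: "grapheme" }).segment(sample)].length
    : undefined;

console.log({ utf16CodeUnits, codePoints, utf8Bytes, graphemes });
```

With an up-to-date Unicode database the grapheme count for this sequence should come out as 1, but that depends on the engine's ICU version.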


> [..." "].length

Mind you, this is inefficient due to unnecessarily constructing an array. Here’s a more efficient version, though the difference will normally be fairly slight:

    function codePointLength(str) {
        let len = 0;
        // Iterating a string yields one code point per step.
        for (const _ of str) {
            len++;
        }
        return len;
    }
Kinda sad there are no equivalents to the Array methods that work on iterators. Array.prototype.reduce.call(str[Symbol.iterator](), (a, _) => a + 1, 0) doesn’t work (it just returns the initial 0), since those methods only work on array-like types, meaning those with a length property and indexed by number, not iterators. And yes, all these Array methods are explicitly defined that way deliberately so you can use them on other array-like types.
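If you want the reduce style anyway, a tiny helper does the job. This is only a sketch: `reduceIterable` is a made-up name, not a built-in, and the sample string is just for illustration. The last part shows what the Array.prototype.reduce.call attempt actually returns.

```javascript
// Hypothetical helper (not a built-in): reduce over any iterable.
function reduceIterable(iterable, reducer, initial) {
    let acc = initial;
    for (const item of iterable) {
        acc = reducer(acc, item);
    }
    return acc;
}

// Count code points without building an intermediate array.
const sample = "a\u{1F926}b"; // 3 code points, 4 UTF-16 code units
const viaHelper = reduceIterable(sample, (n) => n + 1, 0);

// For contrast: the iterator object has no length property, so
// Array.prototype.reduce treats it as empty and returns the initial value.
const viaArrayReduce = Array.prototype.reduce.call(
    sample[Symbol.iterator](), (a, _) => a + 1, 0);

console.log(viaHelper, viaArrayReduce); // 3 0
```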

> [...new Intl.Segmenter('en-US', { granularity: "grapheme" }).segment(" ")].length

Caution: Intl.Segmenter may not be available, so be sure to have a fallback if you want to use it. Chromium shipped it 2½ years ago, Safari 2 years ago, and Firefox hasn’t shipped it yet. (No idea why and I haven’t looked. It’s not always the case: I know of other Intl things that Firefox has shipped first.)
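A fallback could look roughly like this. It is a sketch under my own assumptions: `graphemeCount` is a hypothetical name, and falling back to the code point count is just one option (it overcounts anything built from several code points, such as emoji sequences or combining marks). A real project might prefer a polyfill instead.

```javascript
// Hypothetical helper: grapheme count with a rough fallback.
function graphemeCount(str, locale = "en-US") {
    if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
        const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
        let count = 0;
        // segment() returns an iterable, so no array is needed.
        for (const _segment of segmenter.segment(str)) {
            count++;
        }
        return count;
    }
    // Fallback (assumption): count code points. This is an upper bound
    // on the grapheme count, not an exact answer.
    let count = 0;
    for (const _codePoint of str) {
        count++;
    }
    return count;
}

console.log(graphemeCount("abc")); // 3
```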


.each_codepoint.size is more efficient than .codepoints.size, as it creates a sized Enumerator that avoids needing to build an intermediate Array. For strings with only single-byte characters it reduces to returning the already-stored byte length.

Same goes for .each_byte.size, but for that you have the faster .bytesize method that avoids the intermediate Enumerator.



