
Can you elaborate on what's going on here? Something like "fd" is assuming your filenames are UTF-8, but you're actually using some other encoding?


What's happening here is that fd's `-e/--extension` flag requires that its parameter value be valid UTF-8. The only case where this doesn't work is if you want to filter by a file path whose extension is not valid UTF-8. Needless to say, I can't think of a case where you'd ever really want to do this.

But if you do, fd still supports it. You just can't use the `-e/--extension` convenience:

    $ touch $'a.\xFF'
    $ fd '.*\.(?-u:\xFF)'
    a.�
That is, `fd` requires that the regex patterns you give it be valid UTF-8, but that doesn't mean the patterns themselves are limited to only matching valid UTF-8. You can disable Unicode mode and match raw bytes via escape sequences (that's what `(?-u:\xFF)` is doing).
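
The same escape hatch is available outside of fd. A minimal sketch, assuming the regex crate's bytes API (which accepts the same `(?-u:...)` syntax):

    use regex::bytes::Regex;

    fn main() {
        // Inside (?-u:...), Unicode mode is off, so \xFF denotes the raw
        // byte 0xFF rather than the code point U+00FF.
        let re = Regex::new(r".*\.(?-u:\xFF)").unwrap();
        assert!(re.is_match(b"a.\xFF"));         // the raw byte matches
        assert!(!re.is_match("a.ÿ".as_bytes())); // U+00FF encodes as C3 BF
    }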

So as a matter of fact, the sibling comments get the analysis wrong here. `fd` doesn't assume all of its file paths are valid UTF-8. As demonstrated above, it handles non-UTF-8 paths just fine. But some conveniences, like specifying a file extension, do require valid UTF-8.


Right, because file names are not guaranteed to be UTF-8. That's the reason Rust has both str and OsStr. You may assume a Rust str is valid UTF-8 (unless you have unsafe code tricking it), but you may not assume an OsStr is valid UTF-8.

Here, invalid UTF-8 is passed via command-line arguments. If you want to support this, the correct way is to use args_os (https://doc.rust-lang.org/beta/std/env/fn.args_os.html), which gives an iterator that yields OsString.
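
A minimal sketch of the difference, using only std:

    use std::env;

    fn main() {
        // env::args() panics if any argument is not valid UTF-8;
        // env::args_os() never does, yielding OsString instead.
        for arg in env::args_os() {
            match arg.into_string() {
                Ok(s) => println!("valid UTF-8: {}", s),
                Err(os) => println!("not UTF-8: {:?}", os),
            }
        }
    }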


No, OsStr is for OSes that don't encode strings as UTF-8, e.g. Windows by default. Use Path or PathBuf for paths, they're not strings. Pretty much every OS has either some valid strings that aren't valid paths (e.g. the filename `CON` on Windows is a valid string but not valid in a path), some valid paths that aren't valid strings (e.g. UNIX allows any byte other than 0x00 in paths, and any byte other than 0x00 or `/` in filenames, with no restrictions to any text encoding), or both.
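
A sketch of what that looks like in practice on Unix, using the Unix-only OsStrExt trait (no text encoding is assumed anywhere):

    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    use std::path::Path;

    fn main() {
        // Any byte except NUL is legal in a Unix path component.
        let raw = b"a.\xFF";
        let path = Path::new(OsStr::from_bytes(raw));
        println!("{:?}", path);           // "a.\xFF"
        assert!(path.to_str().is_none()); // not representable as a str
    }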


fd is assuming the argument to -e is valid UTF-8, while filenames can contain any byte but NUL and /.


"I cook up impractical situations and then blame my tools for it"

Nobody cares that valid filenames can contain anything except the null byte and /. Tell me one valid use case for a non-UTF-8 filename.


UTF-8 is common now, but it hasn't always been. Wanting support for other encoding schemes is a valid ask (though I think the OP was needlessly rude about it).


It's backwards compatible with ASCII, right?

But yeah I suppose you would need support for all the other foreign-language encodings that came in between -- UCS-2 for example.

But basically nobody does that. GLib (which drives file reading for all GTK apps and various other apps) doesn't support anything other than UTF-8 filenames. At that point I'd consider the "migration" done and dusted.


The world is a lot more complicated & varied than you think :) I was digging around in some hard drives from 2004 just last weekend. At that time, lots of different encodings were common, especially internationally. Software archaeology is a common hobby, and it would be nice to be able to use a tool like this to search through old filesystems. "Not worth the effort" is definitely a valid response to the feature request, but that also doesn't mean there is absolutely no use for the feature.


I can definitely see a use case for supporting non-UTF-8 pathnames on disk (primarily for archaeological purposes).

In a UTF-8-path-only world, what I would do is have a mount option that says the pathnames on disk are Latin-1 (so that \xff is mapped to U+00FF, which I'm too lazy to work out the exact UTF-8 byte representation of right now), and let the people doing archaeology on that write their own tools to remap the resulting mojibake pathnames into more readable ones. Not the cleanest solution, but there are ways to support non-UTF-8 disks even with UTF-8-only pathnames.
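
The remapping itself is trivial to sketch (a hypothetical helper, not a real mount option): every Latin-1 byte value equals its Unicode code point, so the conversion can never fail:

    fn latin1_to_utf8(bytes: &[u8]) -> String {
        // `u8 as char` maps each byte to the code point with the same
        // value, which is exactly the Latin-1 interpretation.
        bytes.iter().map(|&b| b as char).collect()
    }

    fn main() {
        let s = latin1_to_utf8(b"a.\xFF");
        assert_eq!(s, "a.ÿ");                    // \xff -> U+00FF
        assert_eq!(s.as_bytes(), b"a.\xC3\xBF"); // U+00FF is C3 BF in UTF-8
    }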


Oh yeah, I can imagine the pain for drives from that era. I remember reading that sometimes you need the right "codebook" installed - what was the word? - and stuff like that.


You do not have (or write programs for) filesystems that contain loads of ancient mp3 and wma files.

It is the bane of my existence, but many programs support all the Latin-1 and other file name encodings that are incompatible with UTF-8, so users expect _your_ programs to work too.

Now if you want me to actually _display_ them all correctly, tough turds...


True. BTW, curious: is there a defined encoding for text in mp3 metadata? Or is that a pain too?


Running a shell script went badly, generating a bunch of files with random bytes in their names, rather than one file containing the random data.

You wish to find and delete them all, now that they've turned your home directory into a monstrosity.
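
A sketch of finding the offenders with nothing but Rust's std (it only lists; the deleting is up to you):

    use std::fs;

    fn main() -> std::io::Result<()> {
        // Report every entry in the current directory whose name is not
        // valid UTF-8; Debug formatting escapes the offending bytes.
        for entry in fs::read_dir(".")? {
            let name = entry?.file_name();
            if name.to_str().is_none() {
                println!("not UTF-8: {:?}", name);
            }
        }
        Ok(())
    }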


nah, eff all that. Roll back the snapshot.



