    $ touch $'a.\xFF'
    $ find -name $'*.\xFF'
    ./a.?
    $ ./fd -e $'\xFF'
    error: invalid UTF-8 was detected in one or more arguments

    Usage: fd [OPTIONS] [pattern] [path]...

    For more information, try '--help'.
    $
    $ touch III
    $ LC_ALL=tr_TR.UTF-8 find -iname 'iii'
    $ LC_ALL=tr_TR.UTF-8 ./fd 'iii'
    III
    $
Every fucking time


Ok, but like, in practice this is a pretty weird edge case. It's impractical and usually worthless to have filenames that can't be described using the characters on a keyboard.


Disagree, filesystems provide for both services and people... this is an imposition. I, a mere human, may need my tools to wrangle output generated by other software that has never once used a keyboard. Or a sensible character set; bytes are bytes.

File extensions - or their names - mean absolutely nothing with ELF. Maybe $APPLICATION decides to use the filename to store non-ASCII/Unicode parity data... because it was developed in a closet with infinite funding. Who knows. Who's to say.

Contrived, yes, but practical. Imposing isn't. The filesystem may contain more than this can handle.


My point is that it's such a weird edge case in the first place that the chances of you needing to use a tool like fd/find in this way are vanishingly small. I agree with the general issue of treating file paths as encoded strings when they are not. Go is the worst offender here because it does it at the language level, which is just egregious.

Regardless, the point is moot because `fd` handles these filenames gracefully; you just need to use a different flag [0].

[0]: https://news.ycombinator.com/item?id=43412190


No more unusual than using "find" at all, is my point.


Not at all. It's a common result of mojibake (https://en.wikipedia.org/wiki/Mojibake) after moving files between platforms.

It's also what made Python 3 very impractical when it originally came around. It wasn't fixed until several versions in, despite being a common complaint among actual users.


Which keyboard?


Any of them. File names are, in the vast majority of cases, human readable in some character encoding, even UTF-8. You would be hard-pressed to find a keyboard/character code with characters that aren't represented in Unicode, but it doesn't matter; just honor the system locale.


I think it's common for tools to assume that file names are valid Unicode, so I'm not surprised.


Common, but rather stupid. Filenames aren't even text. `fd` is written in Rust and uses std::path for paths, but its regex patterns default to matching text. That said, matching raw bytes is possible by turning off the Unicode flag: `(?-u:\x??)`, where `??` is the byte in hex, e.g. `(?-u:\xFF)` for OP's case. See "Opt out of Unicode support"[1] in the regex docs.

[1] https://docs.rs/regex/latest/regex/#opt-out-of-unicode-suppo...
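
If you want to poke at this outside of fd, here's a minimal sketch using the regex crate's `bytes` API (the filename is just an illustrative value, not anything from fd itself):

    use regex::bytes::Regex;

    fn main() {
        // With Unicode mode disabled, \xFF matches the raw byte 0xFF
        // rather than the codepoint U+00FF.
        let re = Regex::new(r"(?-u)\xFF").unwrap();
        assert!(re.is_match(b"a.\xFF"));
    }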


IMHO, the kernel should have filesystem mount options to just reject path names that are non-UTF-8, and distros should default to those when creating new filesystems on new systems.

For >99.99% of use cases, file paths are textual data, and people do expect to view them as text. And it's high time that kernels start enforcing that they act as text, because it constitutes a security vulnerability for a good deal of software while providing exceedingly low value.


So just turn off support for external media, which could have been created on other platforms, and all old file systems? Legacy platforms, like modern Windows, which still uses UCS-2 (or some half-broken variant thereof)?

While I support the UTF-8 everywhere movement with every fiber of my body, that still sounds like a hard sell for all vintage computer enthusiasts, embedded developers, and anyone else, really.


As I said in another comment, you can handle the legacy systems by giving a mount option that transcodes filenames using Latin-1. (Choosing Latin-1 because it's a trivial mapping that doesn't require lookup tables). UCS-2 is easily handled by WTF-8 (i.e., don't treat an encoded unpaired surrogate as an error).
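
The Latin-1 direction really is trivial, since each byte maps to the Unicode codepoint with the same value. A rough sketch of just that mapping (not any kernel's actual code, purely the idea being described):

    /// Interpret arbitrary bytes as Latin-1, producing valid UTF-8.
    /// Each byte 0x00..=0xFF maps to the codepoint U+0000..=U+00FF.
    fn latin1_to_utf8(bytes: &[u8]) -> String {
        bytes.iter().map(|&b| b as char).collect()
    }

    fn main() {
        // 0xFF becomes U+00FF ("ÿ"), so the result is always valid UTF-8.
        assert_eq!(latin1_to_utf8(b"a.\xFF"), "a.\u{00FF}");
    }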

The reality is that non-UTF-8 filenames already break most modern software, and it's probably more useful for the few people who need to care about it to figure out how to make their workflows work in a UTF-8-only filename world than to demand that everybody else fix their software to handle a case where there kind of isn't a fix in the first place.


What is text? Are the contents of files text? How does one determine if something is text?

(I'm the author of ripgrep, and this is my way of gently suggesting that "filenames aren't even text" isn't an especially useful model.)


Oh, I agree that "text" isn't well-defined. The best I can come up with is that "text" is a sequence of bytes that is valid when interpreted in some text encoding. I think that something designed to search filenames should clearly document how to search for all valid filenames in its help or manual, not require looking up the docs of a dependency. Filenames are paths, which are weird on every platform. 99% of the time you can search paths using some sort of text encoding, but IMO the man page should point out that non-Unicode filenames can actually be searched for. `fd`'s man page just links to the regex crate docs; it doesn't pull that material into its own man page and call it out.

As for "filenames aren't even text" not being a useful model, to me text is a `&str` or `String` or `OsString`, filenames are a `Path` or `PathBuf`. We have different types for paths & strings because they represent different things, and have different valid contents. All I mean by that is the types are different, and the types you use for text shouldn't be the same as the types you use for paths.


I'd suggest engaging with this question, which I think you ignored:

> Are the contents of files text?

It is perhaps the most pertinent of all. What is the OS interface for files? Does it tell you, "This is a UTF-8 encoded text file containing short human readable lines"? No, it does not. All you get is bytes, and if you're lucky, you can maybe infer something from the extension of the file's path (but that is only a convention).

How do you turn bytes into a `&str`? Do you think ripgrep converts an entire file to `&str` before searching it? Does ripgrep even do UTF-8 validation at all? No no no, it does not.

I'd suggest giving https://burntsushi.net/bstr/#motivation-based-on-concepts and the crate docs of https://docs.rs/bstr/latest/bstr/ a read.
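
To make that concrete, here's a tiny sketch with the bstr crate linked above; the byte string is just the example filename from upthread:

    use bstr::ByteSlice;

    fn main() {
        // A filename-like byte string that is not valid UTF-8.
        let name: &[u8] = b"a.\xFF";
        // Treat it as text without any up-front UTF-8 validation.
        assert!(name.contains_str("a."));
        // Do a lossy conversion only when you actually need to display it.
        println!("{}", name.to_str_lossy());
    }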

To be clear, there is no perfect answer here. You've got to do the best with what you've got. But the model I work with is, "treat file contents and file paths as text until you heuristically believe otherwise." But I work on Unix CLI tooling that needs to be fast. For most people, I would say, "validate file contents and file paths as text" is the right model to start with.

> but IMO it should be pointed out in the man page

Docs can always be improved, sure, but that is not what I'm trying to engage with you about. :-)


I'd say some files are text, some are not. And I agree that there's no good way to tell! I think ripgrep has a much harder job than fd, because at least fd can always know that all paths it's searching are valid paths for the OS in use.


My point is that you can apply the answer to the question "are the contents of files text?" to the question "are file paths text?"


I get it. I think you're right that they both have the same problem, but paths have a std type for handling them, while file contents don't. As long as you're on an OS, you can use std::path::Path (or PathBuf) for paths and ensure they're valid. I suppose I should have said "Paths aren't Strings" or similar; they might be text, but they might not be, and fundamentally the issue is that they're different data types. "Text" isn't universally defined.


You can't really just use `std::path::Path` though. Because it's largely opaque. How do you run a regex or a glob on a `std::path::Path`? Doing a UTF-8 check first is expensive at ripgrep's scale. So it just gets it to `&[u8]` as quickly as it can and treats it as if it were text. (These days you can use `OsStr::as_encoded_bytes`.)
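
For anyone following along, a small sketch of the `OsStr::as_encoded_bytes` route mentioned above (stable since Rust 1.74; the filename here is just an example):

    use std::path::Path;

    fn main() {
        let path = Path::new("notes.txt");
        // No UTF-8 validation: view the path as bytes and match on those.
        let bytes: &[u8] = path.as_os_str().as_encoded_bytes();
        assert!(bytes.ends_with(b".txt"));
    }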

`std::path::Path` isn't necessarily a better design. I mean, on some days, I like it. On other days, I wonder if it was a mistake because it creates so much ceremony. And in many of those cases, the ceremony is totally unwarranted.

And I'm saying this as someone who has been adjudicating Rust's standard library API since Rust 1.0.


Tools must be general. I'm not going to invest time in a new one if it can't handle arbitrary valid filesystems. But that's just me.

https://github.com/jakeogh/angryfiles


`fd` does, as pointed out in this thread in numerous places. So I don't know what your point is, and you didn't engage at all with my prompt.


Had to look up the regex crate docs, but it's possible:

    $ fd '(?-u:\xFF)'
    a.�


Can you elaborate on what's going on here? Something like "fd" is assuming your filenames are UTF-8, but you're actually using some other encoding?


What's happening here is that fd's `-e/--extension` flag requires that its parameter value be valid UTF-8. The only case where this doesn't work is if you want to filter by a file path whose extension is not valid UTF-8. Needless to say, I can't ever think of a case where you'd really want to do this.

But if you do, fd still supports it. You just can't use the `-e/--extension` convenience:

    $ touch $'a.\xFF'
    $ fd '.*\.(?-u:\xFF)'
    a.�
That is, `fd` requires that the regex patterns you give it be valid UTF-8, but that doesn't mean the patterns themselves are limited to only matching valid UTF-8. You can disable Unicode mode and match raw bytes via escape sequences (that's what `(?-u:\xFF)` is doing).

So as a matter of fact, the sibling comments get the analysis wrong here. `fd` doesn't assume all of its file paths are valid UTF-8. As demonstrated above, it handles non-UTF-8 paths just fine. But some conveniences, like specifying a file extension, do require valid UTF-8.


Right, because file names are not guaranteed to be UTF-8. That's the reason Rust has str and then OsStr. You may assume a Rust str is valid UTF-8 (unless you have unsafe code tricking it), but you may not assume an OsStr is valid UTF-8.

Here, invalid UTF-8 is passed via command line arguments. If you want to support this, the correct way is to use args_os (https://doc.rust-lang.org/beta/std/env/fn.args_os.html), which gives an iterator that yields OsString.
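
A minimal sketch of that, assuming a program that just wants to accept non-UTF-8 arguments without bailing out:

    use std::env;
    use std::ffi::OsString;

    fn main() {
        // Unlike env::args(), args_os() never panics on non-UTF-8 input;
        // it yields OsString, which can hold whatever the OS handed us.
        let args: Vec<OsString> = env::args_os().skip(1).collect();
        for arg in &args {
            // Debug-print, since Display would require valid UTF-8.
            eprintln!("got argument: {:?}", arg);
        }
    }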


No, OsStr is for OSes that don't encode strings as UTF-8, e.g. Windows by default. Use Path or PathBuf for paths; they're not strings. Pretty much every OS has either some valid strings that aren't valid paths (e.g. the filename `CON` on Windows is a valid string but not valid in a path), some valid paths that aren't valid strings (e.g. Unix allows any byte other than 0x00 in paths, and any byte other than 0x00 or `/` in filenames, with no restriction to any text encoding), or both.
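
To illustrate the Unix side of that (any byte but NUL and `/` within a filename), here's a sketch building a `Path` from raw, non-UTF-8 bytes; `OsStrExt::from_bytes` is Unix-only:

    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    use std::path::Path;

    fn main() {
        // A name containing the raw byte 0xFF (not valid UTF-8).
        let raw: &[u8] = b"a.\xFF";
        let path = Path::new(OsStr::from_bytes(raw));
        // It's a perfectly good path, but it isn't a valid &str.
        assert!(path.to_str().is_none());
        println!("{}", path.display());
    }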


fd is assuming the argument to -e is valid UTF-8, while filenames can contain any byte but NUL and /.


"I cook up impractical situations and then blame my tools for it"

Nobody cares that valid filenames can contain anything except the null byte and /. Tell me one valid use case for a non-UTF-8 filename.


UTF-8 is common now, but it hasn't always been. Wanting support for other encoding schemes is a valid ask (though, I think the OP was needlessly rude about it).


It's backwards compatible with ASCII, right?

But yeah, I suppose you would need support for all the other foreign-language encodings that came in between -- UCS-2, for example.

But basically nobody does that. GLib (which drives file handling for all GTK apps and various others) doesn't support anything other than UTF-8 filenames. At that point I'd consider the "migration" done and dusted.


The world is a lot more complicated & varied than you think :) I was digging around in some hard drives from 2004 just last weekend. At that time, lots of different encodings were common, especially internationally. Software archaeology is a common hobby; it could be nice to be able to use a tool like this to search through old filesystems. "Not worth the effort" is definitely a valid response to the feature request, but that also doesn't mean there is absolutely no use for the feature.


I can definitely see a use case for supporting non-UTF-8 pathnames on disk (primarily for archaeological purposes).

In a UTF-8-path-only world, what I would do is have a mount option that says that the pathnames on disk are Latin-1 (so that \xff is mapped to U+00FF in UTF-8, whose exact binary representation I'm too lazy to work out right now), and let the people doing archaeology on that write their own tools to remap the resulting mojibake pathnames into more readable ones. Not the cleanest solution, but there are ways to support non-UTF-8 disks even with UTF-8-only pathnames.


Oh yeah I can imagine the pain for drives from that era. I remember reading that sometimes you need the right "codebook" - what was the word - installed and stuff like that.


You do not have (or write programs for) filesystems that contain loads of ancient mp3 and wma files.

It is the bane of my existence, but many programs support all the Latin-1 and other file name encodings that are incompatible with UTF-8, so users expect _your_ programs to work too.

Now if you want me to actually _display_ them all correctly, tough turds...


True. Btw curious, is there a defined encoding for text in mp3 metadata? Or is that a pain too.


Running a shell script went badly, generating a bunch of files with random data in their names rather than one file containing random data.

You wish to find and delete them all, now that they've turned your home directory into a monstrosity.


nah, eff all that. Roll back the snapshot.


It isn't wrong; 0xFF is invalid UTF-8. Of course, if your locale is not set to UTF-8, then that is a potential problem.
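
For the record, that's easy to check in Rust; the byte 0xFF can never appear anywhere in well-formed UTF-8:

    fn main() {
        // UTF-8 validation rejects any sequence containing 0xFF.
        assert!(std::str::from_utf8(b"a.\xFF").is_err());
    }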


*nix filenames are sequences of bytes, not UTF-8 (or any other kind of) strings. If a find replacement doesn't accept valid filenames (or parts of them) as input, it's a bit unfortunate.


If all you want to do is match against a sequence of bytes, sure. But when you want to start providing features like case-insensitivity, matching against file extensions, globbing, etc, then you have to declare what a given byte sequence actually represents, and that requires an encoding.


> when you want to start providing features like case-insensitivity

fd does that for English only. See the III/iii case in my comment; iii capitalizes to İİİ in Turkish, and there's no way to have fd respect that.


> fd does that for English only.

That's false. Counter-example:

    $ touch 'Δ'
    $ fd δ
    Δ
Your Turkish example doesn't work with `fd` because `fd` doesn't support specific locales or locale specific tailoring for case insensitive matching. It only supports what Unicode calls "simple case folding." It works for things far beyond English, as demonstrated above, but definitely misses some cases specific to particular locales.
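
A small illustration of the difference using the regex crate directly (which is what fd matches with); the δ and iii examples are the ones from this thread:

    use regex::Regex;

    fn main() {
        // Unicode simple case folding: δ matches Δ case-insensitively.
        assert!(Regex::new(r"(?i)δ").unwrap().is_match("Δ"));
        // But there is no locale tailoring: the Turkish dotted İ does not
        // fold to ASCII i under simple case folding.
        assert!(!Regex::new(r"(?i)iii").unwrap().is_match("İİİ"));
    }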


Casefolding is a minefield once you extend past English. It is completely unsurprising to find problems with it in other languages.


Yes. I'm the one who implemented the case folding that `fd` uses (via its regex engine).

See: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl...

And then Unicode itself for more discussion on the topic: https://unicode.org/reports/tr18/#Simple_Loose_Matches

TR18 used to have a Level 3[1] with the kind of locale-specific custom tailoring support found in GNU's implementation of POSIX locales, but it was so fraught that it was retracted completely some years ago.

[1]: https://unicode.org/reports/tr18/#Tailored_Support


[flagged]


I don't maintain `fd`. I'm just here to fix your misrepresentations for others following along.

If you need locale specific tailoring, then use `find`. Nothing wrong with that.


[flagged]


@dang - Seems like a troll to me.



