In theory, sure, that's what we'd do in an ideal world.
In the real world, it will take millions of dollars of engineering labor just to update the hashes and fix everything that's currently broken, and millions more to actually implement something better and move everyone over to it.
This isn't worth it, GitHub needs to just revert the change and then engineer a way to keep hashes stable going forward.
"The amount of work done “out there” on hundreds or thousands of applications for a single little libcurl tweak can be enormous. The last time we bumped the ABI, we got a serious amount of harsh words and critical feedback and since then we’ve gotten many more users!"
I know it's superficial, but I think the problem would have been reduced if they had used a download URL that looked like github.com/archive.php?project=rust&version=deadbeef. It's just something that sends a signal and sets a different expectation for the same artifact.
Well, Github presents a file that looks like it comes from a file server, an old "ftp" archive or the like, so people model their expectations on that. Already-published versions and tarballs should not change in those systems.
I think everyone knows these files are generated on the fly, but it comes from old habits.
Using SHA hashes when building guarantees that the code you are building is what you think it is. How else would you verify dependencies like this? GPG signatures would have the same issue if you change the underlying bits.
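For concreteness, a minimal sketch of that pinning approach, with a hypothetical archive name and a placeholder pinned digest:

```python
import hashlib

# Hypothetical values: the archive path and the digest that was recorded
# ("pinned") when the dependency was first added are placeholders.
ARCHIVE = "rust-src.tar.gz"
EXPECTED_SHA256 = "0000...placeholder...0000"

def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

actual = sha256_of(ARCHIVE)
if actual != EXPECTED_SHA256:
    raise SystemExit(f"checksum mismatch: expected {EXPECTED_SHA256}, got {actual}")
```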
I wouldn't check the hash of the compressed archive, but of the actual files themselves. It's a bit more metadata, but it's also a lot more robust, and allows you to detect changes after unpacking as well.
It’s generally a bad idea to process (extract) a tarball of unknown provenance. Verifying the tarball is from a known source beforehand mitigates the risk of, say, a malicious tarball that exploits a tar or gzip 0‐day.
> And if you don't trust your http layer and/or Github's certificate, then you should not trust their archive anyway.
The nice thing about checksumming the tarball is that once you’ve done so, it doesn’t matter whether you trust GitHub or the HTTPS layer or not.
GitHub and its HTTPS cert provide no protection against a compromised project re‐tagging a repo with malicious source, or even deleting and re‐uploading a stable release tarball with something malicious.
The certificate guarantees the source of the file, not the trust you should put in its contents. I can upload malware as a github project release file and https doesn't change that you shouldn't download/run it.
For software distribution this actually sometimes goes the other way - debian/ubuntu uses http (no s) for their packages, because the content itself is signed by the distribution and this way you can easily cache it at multiple levels.
> I can upload malware as a github project release file and https doesn't change that you shouldn't download/run it.
If you can't trust the archive published by the owner themselves, you are already screwed; a stable hash will just give you more confidence that you are, indeed, downloading the contaminated code.
I'm not sure most people here understand how checksums/hashes work, what they protect you against, and what they don't.
Software published via GitHub isn't really "published by the owner". The owner typically doesn't control what GitHub does and doesn't always control his own GitHub account.
It isn't only that people don't know what checksums, hashes, and signatures do, it is also problematic that they blindly trust or ignore middlemen. Most supply chain "security" is security theater, almost never is something vetted end-to-end.
By checking the hash of the extracted files. The hash of the archive depends on the order in which the files were compressed, the compression settings, some metadata, etc.
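A hedged sketch of one way to do that: hash each file's relative path and contents in sorted order, so the digest is independent of member ordering, timestamps, and compression settings (symlinks and permissions are ignored to keep it short; the directory name is a placeholder):

```python
import hashlib
import os

def tree_digest(root: str) -> str:
    """Hash relative paths and file contents in sorted order, ignoring
    tar-level details such as member ordering, timestamps, ownership,
    and compression settings."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode() + b"\0")
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            h.update(b"\0")
    return h.hexdigest()

print(tree_digest("extracted-source"))  # placeholder directory name
```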
That’s expensive, complicated, exposes a greater attack surface, and requires new tooling to maintain considerably more complex metadata covering the full contents of source archives.
For the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
The solution here isn’t to change the entire open source ecosystem.
> For literally the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
Well, the norm has been that maintainers generated and distributed a source archive, and that that archive was immutable. That workflow is still perfectly fine with GitHub and is not impacted by this change.
The problem is that a bunch of maintainers stopped generating and distributing archives, and instead started relying on GitHub to automatically do that for them.
> That workflow is still perfectly fine with GitHub
It would be perfectly fine if you could prevent GitHub from linking the autogenerated archives from the releases or at least distinguish them in a way that makes it clear that they are not immutable maintainer-generated archives.
The problem was people assuming GitHub works like that, i.e. that it saves an archive of every commit, which is obviously silly if you think about it (why save it when you can regenerate it on a whim from any commit you want?).
You are speaking about release archives. GitHub's "Download as zip" feature is not the same thing as the multi-decade history of open source you are talking about.
I always thought zip archives from this feature were generated on the fly, maybe cached, because I wouldn't expect GitHub to store a zip archive for every commit of every repository.
I'm actually surprised that many important projects are relying on stable output from this feature, and that this output actually was stable.
Indeed. I remember when Canonical was heavily pushing bzr and others were big fans of Mercurial. Glad my package manager maintainers didn’t waste time writing infrastructure to handle those projects at the repository level. Nobody had to, because providing source tarballs was the norm.
Huh? What I fully believe is that downloading a source tarball over HTTPS, verifying its checksum, and extracting it will take less time than cloning the repository from Git, then verifying the checksum of all files—which you said would take 29 seconds plus 0.4s.
My point is that either spending 0.08s computing the md5 of the zip (I just measured) or 0.3s computing the hash of the repo does not matter in the slightest if you are managing software repos, as just extracting the source and preparing to build it will be an order of magnitude slower.
A) How do you catch tarballs that have extra files injected that aren't part of your manifest?
B) What does the performance of this look like? Certainly for traditional HDDs this is going to kill performance, but even for SSDs I think verifying a bunch of small files is going to be less efficient than verifying the tarball.
A wouldn't be an issue since you are checking out a git tag.
B would just be a normal git checkout, which already validates that all the objects are reachable. Git tags (and commits, for that matter) can be signed, and since the sha1 hash is signed as well, that validates that the entire tree of commits has not been tampered with. So as long as you trust git not to lie about what it is writing to disk, you have a valid checkout of that tag.
And if you do expect it to lie, why do you expect tar to not lie about what it is unpacking?
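As a rough sketch of that flow (the tag name and repo path are hypothetical): verify the tag's signature first, and only then materialize the tree on disk.

```python
import subprocess

TAG = "v1.2.3"          # hypothetical tag name
REPO = "/path/to/repo"  # hypothetical clone location

# `git verify-tag` checks the signature on the tag object and exits
# non-zero if it is missing or invalid.
subprocess.run(["git", "-C", REPO, "verify-tag", TAG], check=True)

# Only after the signature checks out do we check out that tree.
subprocess.run(["git", "-C", REPO, "checkout", "--detach", TAG], check=True)
```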
I know GitHub had asked that clones from package managers use shallow clones. It wouldn't surprise me if downloading tarballs is similarly beneficial to GitHub, because they're trivially cacheable in a CDN and thus lower the operational footprint of supporting package managers.
Well, the simplest way would be to take a checksum after decompression; that doesn't need per-file verification and only relies on files being put into the tar file in the same order.
The other method would be having a manifest file with a checksum for every file inside the tar and comparing those in-flight. It could be a simple "read from tar, compare to hash, write to disk" loop (with maybe some temp files for the bigger ones).
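A minimal sketch of the in-flight variant, assuming a hypothetical JSON manifest mapping archive paths to sha256 digests, and reading each member fully into memory for brevity:

```python
import hashlib
import json
import tarfile

# Both file names are hypothetical; the manifest is assumed to map
# paths inside the tar to their sha256 hex digests.
with open("manifest.json") as f:
    manifest = json.load(f)

with tarfile.open("source.tar.gz", "r:gz") as tar:
    for member in tar:
        if not member.isfile():
            continue
        expected = manifest.get(member.name)
        if expected is None:
            raise SystemExit(f"unexpected file in archive: {member.name}")
        data = tar.extractfile(member).read()  # in memory for brevity
        if hashlib.sha256(data).hexdigest() != expected:
            raise SystemExit(f"hash mismatch for {member.name}")
        # At this point the bytes are verified and safe to write to disk.
```

This also answers A) above: any member missing from the manifest is rejected outright.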
It’s not just about the integrity of the files you’re processing, but also the integrity of the archive itself. If you extract the tarball from a random place, there’s a larger security risk. Now granted HTTPS probably mitigates a lot of it, but cert pinning isn’t that common so MITM attacks aren’t thaaat theoretical.
You can do validation in flight during extraction. Signed file manifests are how distros like Debian have done it since forever, although in their case it's a two-step process: the packages themselves contain their own signature, and the whole directory tree also gets signed (to avoid shenanigans like an attacker putting an older, still vulnerable, but signed version into the repo).
Secondly, if your build step involves uploading data to a third party, allowing them to transform it as they see fit, and then checksumming the result, it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
It's... literally the Secure Hash Algorithm. (Yes, yes, SHA-1 was broken a while back, but SHA and derivatives were absolutely intended to provide secure collision resistance).
I think you're mixing things up here. Github didn't change the SHA-1 commit IDs in the repositories[1]. They changed the compression algorithm used for (and thus the file contents of) "git archive" output. So your tarballs have the same unpacked data but different hashes under all algorithms, secure or not.
> Secondly, if your build step involves uploading data to a third party, allowing them to transform it as they see fit, and then checksumming the result, it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
Indeed. So you take and record a SHA-256 of the archive file you are tagging such that no one can feasibly do that!
Again, what's happened here is that the links pointing to generated archive files that projects assumed were immutable turned out not to be. It's got nothing to do with security or cryptography.
[1] Which would be a whole-internet-breaking catastrophe, of course. They didn't do that and never will.
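To make that distinction concrete, here's a small sketch: the same tar stream compressed with different gzip settings and header timestamps (a stand-in for GitHub's change of compression implementation) yields different archive digests, while the unpacked data is identical. All paths and contents are made up.

```python
import gzip
import hashlib
import io
import tarfile

# Build a tiny in-memory tar with one made-up file, standing in for a
# "git archive" tarball.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"fn main() {}\n"
    info = tarfile.TarInfo(name="src/main.rs")  # hypothetical path
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
raw_tar = buf.getvalue()

# Compress the identical tar stream with different gzip settings and
# header timestamps.
a = gzip.compress(raw_tar, compresslevel=6, mtime=0)
b = gzip.compress(raw_tar, compresslevel=9, mtime=1)

print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False: archive digests differ
print(gzip.decompress(a) == gzip.decompress(b))                        # True: unpacked data is identical
```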
This is incorrect, but even if it were true, you could use whatever hash you prefer instead. Gentoo, for example, can use whatever hash you like, such as blake2, and the default Gentoo repo captures both the sha512 and blake2 digests in the manifest.
Sha1 is still used for security purposes anyways, even though it really shouldn't be!
Signing git commits still relies on sha1 for security purposes, which I think many people don't realize.
Commit signing only signs the commit object itself; other objects such as trees, blobs, and tags are not directly involved in the signature. The commit object contains the sha1 hashes of its parents and of a root tree. Since trees contain hashes of all of their items, this creates a recursive chain of hashes over the entire contents of the repo at that point in time!
So signed commits rely entirely on the security of sha1 for now!
You may have already known all of this about git signing, but I thought it might be interesting to mention.
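For anyone who wants to see it spelled out, a small sketch (the commit body below is fabricated): a commit ID is the SHA-1 of the raw commit object, and that object embeds the SHA-1 of its root tree (and of any parents), so a signature over the commit object commits to the repo's contents only as strongly as SHA-1 does.

```python
import hashlib

# A fabricated commit object body: the "tree" line carries a bare SHA-1
# hash (here the well-known empty-tree hash), and this text is all that
# the GPG signature ultimately covers.
body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1700000000 +0000\n"
    b"committer A U Thor <author@example.com> 1700000000 +0000\n"
    b"\n"
    b"example commit\n"
)

# Git hashes objects as sha1("<type> <size>\0<body>").
commit_id = hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()
print(commit_id)
```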
2) If I and the upstream are both looking at a file that was generated by Github, then the SHA may match, but that doesn't prove we weren't both owned by Github.
Perhaps what I am missing is that this isn't part of a reproducible build scenario. There's no attempt to ensure that the file Github had built is the one I would build with the same starting point.
I think the reproducible build part is about projects that depend on these outputs. The goal is ensuring you and I have both pulled exactly the same dependencies.
I could be wrong, but I believe Nix should be safe for the most part because it does a recursive hash of the stuff it cares about when extracting these archives.
didn’t realize this had happened until i logged off of my work computer & saw someone had shared this thread in a group chat.
looks like we were completely unaffected, as no one made any updates to derivations referencing GitHub sources in a way that invalidated old entries (i.e. no version bumps, new additions, etc.).