In theory, sure, that's what we'd do in an ideal world.
In the real world, it will take millions of dollars of engineering labor just to update the hashes and fix everything that's currently broken, and millions more to actually implement something better and move everyone over to it.
This isn't worth it, GitHub needs to just revert the change and then engineer a way to keep hashes stable going forward.
"The amount of work done “out there” on hundreds or thousands of applications for a single little libcurl tweak can be enormous. The last time we bumped the ABI, we got a serious amount of harsh words and critical feedback and since then we’ve gotten many more users!"
I know it's superficial, but I think the problem would have been reduced if they had used a download URL that looked like github.com/archive.php?project=rust&version=deadbeef. It's just something that sends a signal and sets a different expectation for the same artifact.
Well, Github presents a file that looks like it comes from a file server, an old "ftp" archive or the like, so people model their expectations on that. Already-published versions and tarballs should not change in those systems.
I think everyone knows these files are generated on the fly, but it comes from old habits.
Using SHA hashes when building guarantees that the code you are building is what you think it is. How else would you verify dependencies like this? GPG signatures would have the same issue if you change the underlying bits.
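For concreteness, a minimal sketch of that pinning approach, with a hypothetical archive name and a placeholder pinned digest:

```python
import hashlib

# Hypothetical values: the archive path and the digest that was recorded
# ("pinned") when the dependency was first added are placeholders.
ARCHIVE = "rust-src.tar.gz"
EXPECTED_SHA256 = "0000...placeholder...0000"

def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

actual = sha256_of(ARCHIVE)
if actual != EXPECTED_SHA256:
    raise SystemExit(f"checksum mismatch: expected {EXPECTED_SHA256}, got {actual}")
```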
I wouldn't check the hash of the compressed archive, but of the actual files themselves. It's a bit more metadata, but it's also a lot more robust, and allows you to detect changes after unpacking as well.
It’s generally a bad idea to process (extract) a tarball of unknown provenance. Verifying the tarball is from a known source beforehand mitigates the risk of, say, a malicious tarball that exploits a tar or gzip 0‐day.
> And if you don't trust your http layer and/or Github's certificate, then you should not trust their archive anyway.
The nice thing about checksumming the tarball is that once you’ve done so, it doesn’t matter whether you trust GitHub or the HTTPS layer or not.
GitHub and its HTTPS cert provide no protection against a compromised project re‐tagging a repo with malicious source, or even deleting and re‐uploading a stable release tarball with something malicious.
The certificate guarantees the source of the file, not the trust you should put in its contents. I can upload malware as a github project release file and https doesn't change that you shouldn't download/run it.
For software distribution this actually sometimes goes the other way - debian/ubuntu uses http (no s) for their packages, because the content itself is signed by the distribution and this way you can easily cache it at multiple levels.
> I can upload malware as a github project release file and https doesn't change that you shouldn't download/run it.
If you can't trust the archive published by the owner themselves, you are already screwed; a stable hash will just give you more confidence that you are, indeed, downloading the contaminated code.
I'm not sure most people here understand how checksums/hashes work, what they protect you against, and what they don't.
Software published via GitHub isn't really "published by the owner". The owner typically doesn't control what GitHub does and doesn't always control his own GitHub account.
It isn't only that people don't know what checksums, hashes, and signatures do, it is also problematic that they blindly trust or ignore middlemen. Most supply chain "security" is security theater, almost never is something vetted end-to-end.
By checking the hash of the extracted files. The hash of the archive depends on the order in which the files were compressed, the compression settings, some metadata, etc.
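A hedged sketch of one way to do that: hash each file's relative path and contents in sorted order, so the digest is independent of member ordering, timestamps, and compression settings (symlinks and permissions are ignored to keep it short; the directory name is a placeholder):

```python
import hashlib
import os

def tree_digest(root: str) -> str:
    """Hash relative paths and file contents in sorted order, ignoring
    tar-level details such as member ordering, timestamps, ownership,
    and compression settings."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode() + b"\0")
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            h.update(b"\0")
    return h.hexdigest()

print(tree_digest("extracted-source"))  # placeholder directory name
```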
That’s expensive, complicated, exposes a greater attack surface, and requires new tooling to maintain considerably more complex metadata covering the full contents of source archives.
For the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
The solution here isn’t to change the entire open source ecosystem.
> For literally the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
Well, the norm has been that maintainers generated and distributed a source archive, and that that archive was immutable. That workflow is still perfectly fine with GitHub and is not impacted by this change.
The problem is that a bunch of maintainers stopped generating and distributing archives, and instead started relying on GitHub to automatically do that for them.
> That workflow is still perfectly fine with GitHub
It would be perfectly fine if you could prevent GitHub from linking the autogenerated archives from the releases or at least distinguish them in a way that makes it clear that they are not immutable maintainer-generated archives.
The problem was people assuming GitHub works like that, i.e. that it saves an archive of every commit, which is obviously silly if you think about it (why save it when you can regenerate it on a whim from any commit you want?).
You are speaking about release archives. GitHub's "Download as zip" feature is not the same thing as the multi-decade history of open source you are talking about.
I always thought zip archives from this feature were generated on the fly, maybe cached, because I wouldn't expect GitHub to store a zip archive for every commit of every repository.
I'm actually surprised that many important projects are relying on stable output from this feature, and that this output actually was stable.
Indeed. I remember when Canonical was heavily pushing bzr and others were big fans of Mercurial. Glad my package manager maintainers didn’t waste time writing infrastructure to handle those projects at the repository level. Nobody had to, because providing source tarballs was the norm.
Huh? What I fully believe is that downloading a source tarball over HTTPS, verifying its checksum, and extracting it will take less time than cloning the repository from Git, then verifying the checksum of all files—which you said would take 29 seconds plus 0.4s.
My point is that either spending 0.08s computing the md5 of the zip (I just measured) or 0.3s computing the hash of the repo does not matter in the slightest if you are managing software repos, as just extracting the source and preparing to build it will be an order of magnitude slower.
A) How do you catch tarballs that have extra files injected that aren't part of your manifest?
B) What does the performance of this look like? Certainly for traditional HDDs this is going to kill performance, but even for SSDs I think verifying a bunch of small files is going to be less efficient than verifying the tarball.
A wouldn't be an issue since you are checking out a git tag.
B would just be a normal git checkout, which already validates that all the objects are reachable. Git tags (and commits, for that matter) can be signed, and since the sha1 hash is signed as well, that validates that the entire tree of commits has not been tampered with. So as long as you trust git not to lie about what it is writing to disk, you have a valid checkout of that tag.
And if you do expect it to lie, why do you expect tar to not lie about what it is unpacking?
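As a rough sketch of that flow (the tag name and repo path are hypothetical): verify the tag's signature first, and only then materialize the tree on disk.

```python
import subprocess

TAG = "v1.2.3"          # hypothetical tag name
REPO = "/path/to/repo"  # hypothetical clone location

# `git verify-tag` checks the signature on the tag object and exits
# non-zero if it is missing or invalid.
subprocess.run(["git", "-C", REPO, "verify-tag", TAG], check=True)

# Only after the signature checks out do we check out that tree.
subprocess.run(["git", "-C", REPO, "checkout", "--detach", TAG], check=True)
```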
I know GitHub had asked that clones from package managers use shallow clones. It wouldn't surprise me if downloading tarballs is similarly beneficial to GitHub, because they're trivially cacheable in a CDN and thus lower the operational footprint of supporting package managers.
Well, the simplest way would be to take a checksum after decompression; that doesn't need per-file verification and only relies on files being put into the tar file in the same order.
The other method would be having a manifest file with a checksum for every file inside the tar and comparing those in-flight. It could be a simple "read from tar, compare to hash, write to disk" loop (with maybe some temp files for the bigger ones).
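A minimal sketch of the in-flight variant, assuming a hypothetical JSON manifest mapping archive paths to sha256 digests, and reading each member fully into memory for brevity:

```python
import hashlib
import json
import tarfile

# Both file names are hypothetical; the manifest is assumed to map
# paths inside the tar to their sha256 hex digests.
with open("manifest.json") as f:
    manifest = json.load(f)

with tarfile.open("source.tar.gz", "r:gz") as tar:
    for member in tar:
        if not member.isfile():
            continue
        expected = manifest.get(member.name)
        if expected is None:
            raise SystemExit(f"unexpected file in archive: {member.name}")
        data = tar.extractfile(member).read()  # in memory for brevity
        if hashlib.sha256(data).hexdigest() != expected:
            raise SystemExit(f"hash mismatch for {member.name}")
        # At this point the bytes are verified and safe to write to disk.
```

This also answers A) above: any member missing from the manifest is rejected outright.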
It’s not just about the integrity of the files you’re processing, but also the integrity of the archive itself. If you extract the tarball from a random place, there’s a larger security risk. Now granted HTTPS probably mitigates a lot of it, but cert pinning isn’t that common so MITM attacks aren’t thaaat theoretical.
You can do validation in flight during extraction. Signed file manifests are how distros like Debian have done it since forever, although in their case it's a two-step process: the packages themselves contain their own signature, and the whole directory tree also gets signed (to avoid shenanigans like an attacker putting an older, still vulnerable, but signed version into the repo).
Secondly, if your build step involves uploading data to a third party, allowing them to transform it as they see fit, and then checksumming the result, it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
It's... literally the Secure Hash Algorithm. (Yes, yes, SHA-1 was broken a while back, but SHA and derivatives were absolutely intended to provide secure collision resistance).
I think you're mixing things up here. Github didn't change the SHA-1 commit IDs in the repositories[1]. They changed the compression algorithm used for (and thus the file contents of) "git archive" output. So your tarballs have the same unpacked data but different hashes under all algorithms, secure or not.
> Secondly, if your build step involves uploading data to a third party, allowing them to transform it as they see fit, and then checksumming the result, it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
Indeed. So you take and record a SHA-256 of the archive file you are tagging such that no one can feasibly do that!
Again, what's happened here is that the links pointing to generated archive files that projects assumed were immutable turned out not to be. It's got nothing to do with security or cryptography.
[1] Which would be a whole-internet-breaking catastrophe, of course. They didn't do that and never will.
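To make that distinction concrete, here's a small sketch: the same tar stream compressed with different gzip settings and header timestamps (a stand-in for GitHub's change of compression implementation) yields different archive digests, while the unpacked data is identical. All paths and contents are made up.

```python
import gzip
import hashlib
import io
import tarfile

# Build a tiny in-memory tar with one made-up file, standing in for a
# "git archive" tarball.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"fn main() {}\n"
    info = tarfile.TarInfo(name="src/main.rs")  # hypothetical path
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
raw_tar = buf.getvalue()

# Compress the identical tar stream with different gzip settings and
# header timestamps.
a = gzip.compress(raw_tar, compresslevel=6, mtime=0)
b = gzip.compress(raw_tar, compresslevel=9, mtime=1)

print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False: archive digests differ
print(gzip.decompress(a) == gzip.decompress(b))                        # True: unpacked data is identical
```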
This is incorrect, but even if it were true, you could use whatever hash you prefer instead. Gentoo, for example, can use whatever hash you like, such as blake2, and the default Gentoo repo captures both the sha512 and blake2 digests in the manifest.
Sha1 is still used for security purposes anyways, even though it really shouldn't be!
Signing git commits still relies on sha1 for security purposes, which I think many people don't realize.
Commit signing only signs the commit object itself; other objects such as trees, blobs, and tags are not directly involved in the signature. The commit object contains the sha1 hashes of its parents and of a root tree. Since trees contain hashes of all of their items, this creates a recursive chain of hashes over the entire contents of the repo at that point in time!
So signed commits rely entirely on the security of sha1 for now!
You may have already known all of this about git signing, but I thought it might be interesting to mention.
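For anyone who wants to see it spelled out, a small sketch (the commit body below is fabricated): a commit ID is the SHA-1 of the raw commit object, and that object embeds the SHA-1 of its root tree (and of any parents), so a signature over the commit object commits to the repo's contents only as strongly as SHA-1 does.

```python
import hashlib

# A fabricated commit object body: the "tree" line carries a bare SHA-1
# hash (here the well-known empty-tree hash), and this text is all that
# the GPG signature ultimately covers.
body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1700000000 +0000\n"
    b"committer A U Thor <author@example.com> 1700000000 +0000\n"
    b"\n"
    b"example commit\n"
)

# Git hashes objects as sha1("<type> <size>\0<body>").
commit_id = hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()
print(commit_id)
```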
2) If I and the upstream are both looking at a file that was generated by Github, then the SHA may match, but that doesn't prove we weren't both owned by Github.
Perhaps what I am missing is that this isn't part of a reproducible build scenario. There's no attempt to ensure that the file Github had built is the one I would build with the same starting point.
I think the reproducible build part is about projects that depend on these outputs. The goal is ensuring you and I have both pulled exactly the same dependencies.
I could be wrong, but I believe Nix should be safe for the most part because it does a recursive hash of the stuff it cares about when extracting these archives.
didn’t realize this had happened until i logged off of my work computer & saw someone had shared this thread in a group chat.
looks like we were completely unaffected, as no one made any updates to derivations referencing GitHub sources in a way that invalidated old entries (i.e. no version bumps, new additions, etc.).