While most of what you speak of re Ceph is correct, I want to strongly disagree ...

jodrellblank · 2025-09-25T14:40:48 1758811248

I defer to your experience and agree that it really depends on implementation details (and design). I've only worked on a couple of Ceph clusters built by someone else who left, around 1-2PB, 100-150 OSDs, <25 hosts, and not all the same disks in them. They started falling over because some OSDs filled up, and I had to quickly learn about upmap and rebalancing. I don't remember how full they were, but numbers around 75-85% were involved so I'm getting nervous around 75% from my experiences. We suddenly commit 20TB of backup data and that's a 2% swing. It was a regular pain in the neck, stress point, and creaking, amateurishly managed, under-invested Ceph cluster problems caused several outages and some data corruption. Just having some more free space slack in it would have spared us.[1]

That whole situation is probably easier the bigger the cluster gets; any system with three "units" that has to tolerate one failing can only have 66% usable. With a hundred "units" then 99% are usable. Too much free space is only wasting money, too full is a service down disaster, for that reason I would prefer to err towards the side of too much free rather than too little.

Other than Ceph I've only worked on systems where one disk failure needs one hotspare disk to rebuild, anything else is handled by a separate backup and DR plan. With Ceph, depending on the design it might need free space to handle a host or rack failure, and that's pretty new to me and also leads me to prefer more free space rather than less. With a hundred "units" of storage grouped into 5 failure domains then only 80% is usable, again probably better with scale and experienced design.

If I had 10,000 nodes I'd rather 10,100 nodes and better sleep than playing "how close to full can I get this thing" and constantly on edge waiting for a problem which takes down a 10,000 node cluster and all the things that needed such a big cluster. I'm probably taking some advice from Reddit threads talking about 3-node Ceph/Proxmox setups which say 66% and YouTube videos talking about Ceph at CERN - in those I think their use case is a bursty massive dump of particle accelerator data to ingest, followed by a quieter period of read-heavy analysis and reporting, so they need to keep enough free space for large swings. My company's use case was more backup data churn, lower peaks, less tidal, quite predictable, and we did run much fuller than 66%. We're now down below 50% used as we migrate away, and they're much more stable.

[1] it didn't help that we had nobody familiar with Ceph once the builder had left, and these had been running a long time and partially upgraded through different versions, and had one-of-everything; some S3 storage, some CephFS, some RBDs with XFS to use block cloning, some N+1 pools, some Erasure Coding pools, some physical hardware and some virtual machines, some Docker containerised services but not all, multiple frontends hooked together by password based SSH, and no management will to invest or pay for support/consultants, some parts running over IPv6 and some over IPv4, none with DNS names, some front-ends with redundant multiple back end links, others with only one. A well-designed, well-planned, management-supported cluster with skilled admins can likely run with finer tolerances.