cuno's comments | Hacker News

We've spent a lot of time identifying bottlenecks and fixing them, up and down the stack, with FUSE being just one of them; even the AWS SDK itself introduces its own set, which we've addressed. cunoFS can also be used with FUSE, and we find that it is roughly half the speed of non-FUSE (but thanks to our other optimisations, this is still much faster than alternatives).


This is a nice simple visual guide to chip fabrication, chiplets, and the advantage of back-side power.


Funnily enough I published something similar as part of my PhD (2010). Essentially, communication costs dominate computation costs - an addition operation is practically free compared to the time and energy of moving data within a chip to the ALU to do the addition, let alone between chips. This is the reverse of the historical VLSI situation, in which computation was slow and expensive compared to practically "fast and free" on-chip communication. The implications extend far beyond just RAM. But yes, depending on the access patterns involved and the (physical) spatial arrangement of data, binary tree traversal (on a 2D CMP or 2D cross-chip layout) is O(sqrt(N)) or O(log(N)sqrt(N)), and this result does not depend on the system boundaries of memory hierarchies - it is as true for dedicated on-chip scratchpad memories as it is for giant wafer-scale chips (such as Cerebras). There are some special cases where it can be O(1) or O(log(N)) on average that I won't go into, but you can read more here if interested (sections 2.1 and 8.3):

https://www.cl.cam.ac.uk/~swm11/research/greenfield.html
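As a back-of-envelope illustration of the sqrt(N) behaviour (a toy model of my own for this comment, not lifted from the thesis): assume the N tree nodes are laid out as an H-tree on a sqrt(N) x sqrt(N) grid, and that communication cost is proportional to wire length, with edge length roughly halving every two levels.

    import math

    def root_to_leaf_distance(n_nodes: int) -> float:
        """Approximate wire length walked on one root-to-leaf traversal of an
        H-tree layout, where edge length roughly halves every two levels."""
        levels = int(math.log2(n_nodes))
        side = math.sqrt(n_nodes)  # chip edge, in units of node pitch
        return sum(side / (2 ** (level / 2)) for level in range(levels))

    for n in (2**10, 2**16, 2**22):
        d = root_to_leaf_distance(n)
        print(f"N={n:>8}: {int(math.log2(n)):3d} hops, "
              f"~{d:9.1f} wire units (~{d / math.sqrt(n):.2f} * sqrt(N))")

The hop count grows as log(N) while the physical distance travelled grows as sqrt(N) - that gap is exactly the difference between the Platonic cost model and the physical one.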

A key way to think about things is that algorithms are not really running in some Platonic realm: each executed instruction occurs at some physical location and time, and data needs to move from one place and time to another place and time. To this end, physical wiring (or networks) moves the data spatially, and memory moves the data temporally. Together, an algorithm's executed instructions are situated somewhere spatio-temporally, and RAM (both on-chip and off-chip) serves as a temporal interconnect - but one that itself takes up physical space.


I like to think of this as level 2 systems analysis. It is implied in some of Feynman's papers on computation as well.

It gets even more interesting (to me) when you consider semantic entanglement of data in non-Platonic spaces. When you treat 'time to data access' as a fourth dimension and the state transition vector of an algorithm as the path, one can show that Ft(O(path(n))) (Ft being the function that converts the complexity of a path into the time such a path takes to transit) for some arbitrary n is rather difficult to nail down. It can also pop out the result that the lowest-complexity path (algorithm) is not faster than a high-complexity path (algorithm) if the entangled elements (data) are in a slow space. Crazy, I know, but it is a pretty straightforward path from Amdahl's law to this sort of analysis.


Hi, sorry I missed this. Yes, you can run Samba for SMB, and Ganesha for NFSv4. Feel free to email us if you think there's a better way, though.


Yes, I think there's a better way (which would make Samba faster on your system :-). Can you ping me at jra@samba.org and I'll explain!


Hi, author here.

Yes, I agree that you can't "just" put a POSIX API on S3, but that doesn't make it impossible. For the sake of keeping the article to a reasonable size, I left a lot of things out. There are tradeoffs between POSIX semantics, consistency and performance. Each application/process has different needs, but the great thing about running right inside the process is that we can see what those needs are and adapt. For example, many applications have no need for random-access writes - the only libc calls and syscalls exposed are purely sequential. Some processes do have random-access writes, but with POSIX record locks around them to protect against other concurrent processes - and we can see that. That means we treat these applications/files differently, with some corresponding performance implications. This is very different from a normal filesystem, which has to treat every process the same way because each process is a "black box" to it.
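As a purely illustrative sketch (not cunoFS internals), the kind of signal that is visible from inside the process can be as simple as whether a file's writes arrive strictly in order:

    def classify_writes(writes):
        """writes: iterable of (offset, length) pairs as the application
        issues them. Returns how the file is being written."""
        expected = 0
        for offset, length in writes:
            if offset != expected:
                return "random"      # seek-back or out-of-order write seen
            expected = offset + length
        return "sequential"

    print(classify_writes([(0, 4096), (4096, 4096), (8192, 1024)]))  # sequential
    print(classify_writes([(0, 4096), (1024, 512)]))                 # random

A purely sequential writer can be streamed straight to object storage, while a random-access writer (especially one holding POSIX record locks) needs different treatment.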

You're right that AFS, and for that matter NFS, can in principle return an error on close, which many existing applications unfortunately aren't written to handle. However, that doesn't mean NFS isn't practical - it is very widely used.

Our customers mostly run workloads in the same region as the object storage (whether in cloud or on-prem), typically with very high availability. You're right that, as an essentially networked file system, it can't make much stronger guarantees than the NFS protocol itself does, but operating inside cloud infrastructure you typically see four nines of availability.


I’m sure you’re aware of the issues, and this looks like highly useful work! But I’ve seen enough human nature to know that a lot of people will see “POSIX API” and assume they can just run anything on it without further thought. I know this because for years I’ve seen people run things on AFS and NFS, see weird concurrency behavior or data loss or latency, and blame the filesystem for not performing miracles, rather than blaming the application for not taking nonlocal storage into account.

The strongest argument for using a different API for object storage is that you don’t get that excuse. The API presents the true semantics and failure conditions and the application needs to think them through. (And your position that it may not be a strong enough argument is perfectly valid.)


Hi, author here.

Funny you should give the stream-encoding MP4 example, because yes, that has been people's experience with S3. We've solved that - no temporary local file needed - everything is streamed directly to S3, for example using ordinary ffmpeg. The trick is a deeper understanding of how multipart upload works, and, if necessary, server-side copy semantics on those parts.
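To make the multipart-upload half of that concrete, here is a minimal sketch (boto3, hypothetical bucket and key names, and emphatically not the cunoFS implementation) of streaming a pipe straight into S3 with no temporary file:

    import sys

    import boto3

    s3 = boto3.client("s3")
    BUCKET, KEY = "example-bucket", "output/video.mp4"   # hypothetical names
    PART_SIZE = 8 * 1024 * 1024    # non-final parts must be at least 5 MiB

    upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
    parts, part_no = [], 1
    while True:
        chunk = sys.stdin.buffer.read(PART_SIZE)
        if not chunk:
            break
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_no,
                              UploadId=upload["UploadId"], Body=chunk)
        parts.append({"PartNumber": part_no, "ETag": resp["ETag"]})
        part_no += 1

    s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY,
                                 UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": parts})

This only covers the purely sequential portion of the stream; the seek-back that the MP4 muxer does to patch earlier bytes is where server-side copy of parts comes in (see the reply below).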


So how does it work if, say, an application edits some metadata in an existing file on an existing object?


Depends on what you mean by metadata. MP4 metadata is data inside the file - and is modified by server-side copy semantics that replace only the bytes that are changed. If you mean POSIX metadata, we avoid storing that in the object, and for performance we store it elsewhere (it's encoded and compressed in the actual filenames of hidden files).
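For illustration only (hypothetical names, and simplified to ignore S3's 5 MiB minimum for non-final parts), "replace only the bytes that changed" can be expressed with standard S3 calls: unchanged ranges are copied server-side with UploadPartCopy, and only the modified bytes travel over the wire.

    import boto3

    s3 = boto3.client("s3")
    BUCKET, KEY = "example-bucket", "output/video.mp4"   # hypothetical names

    def replace_range(offset: int, new_bytes: bytes) -> None:
        size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
        upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
        uid, parts, part_no = upload["UploadId"], [], 1

        def copy(first, last):        # server-side copy of bytes [first, last]
            nonlocal part_no
            r = s3.upload_part_copy(Bucket=BUCKET, Key=KEY, PartNumber=part_no,
                                    UploadId=uid,
                                    CopySource={"Bucket": BUCKET, "Key": KEY},
                                    CopySourceRange=f"bytes={first}-{last}")
            parts.append({"PartNumber": part_no,
                          "ETag": r["CopyPartResult"]["ETag"]})
            part_no += 1

        if offset > 0:
            copy(0, offset - 1)                              # unchanged prefix
        r = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_no,
                           UploadId=uid, Body=new_bytes)     # modified bytes
        parts.append({"PartNumber": part_no, "ETag": r["ETag"]})
        part_no += 1
        end = offset + len(new_bytes)
        if end < size:
            copy(end, size - 1)                              # unchanged suffix

        s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=uid,
                                     MultipartUpload={"Parts": parts})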


Hi, author here.

Yes, we have many organisations using our software with this kind of interception, from large companies (Fortune Global 500) down to small ones (see for example https://cuno.io/about-us/). It took a decent-sized team many years to get right, because it is such a hard problem to crack. But we think it is worth it. And for those who don't want to use such interception, we also offer a FUSE layer that still delivers much higher performance than the alternatives.


Hi, author here.

Yes, we've had to do a lot of things to deal with object storage latency. Since cunoFS runs inside the process itself, it has greater visibility into what actions the application is taking and is likely to take next. This lets us make huge improvements in our prediction logic, so that we can prefetch within and across files much better, hiding latencies. For POSIX metadata, we have a caching mechanism shared between processes, and we've invented a better way to encode this metadata so that it is retrieved alongside the LIST operation.
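To illustrate the general idea only (this is a simplified sketch, not the actual encoding): POSIX attributes can be packed into a compact string and stored as the name of a zero-byte hidden sibling object, so that a single LIST of a prefix returns both the data objects and their attributes without extra GETs.

    import base64
    import struct

    def encode_attrs(mode: int, uid: int, gid: int, mtime: int) -> str:
        """Pack POSIX attributes into a short URL-safe token (a real scheme
        would also compress and version this)."""
        raw = struct.pack("<IIIQ", mode, uid, gid, mtime)
        return base64.urlsafe_b64encode(raw).decode().rstrip("=")

    def hidden_key_for(key: str, mode, uid, gid, mtime) -> str:
        # zero-byte object whose *name* carries the metadata
        return f".meta/{key}.{encode_attrs(mode, uid, gid, mtime)}"

    print(hidden_key_for("data/sample.bam", 0o100644, 1000, 1000, 1700000000))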


Hi, author here.

The problem is that it's actually really hard to write your own specialized glue layer, and so most developers who do so end up with poor implementations that have low performance and/or incompatibilities with non-AWS solutions. In the context of object storage, we've found lots of applications have tried to add S3 support, and while they work with AWS S3 and have basic functionality, they fail on a lot of other S3-compatible solutions. So they end up tied to AWS, and you can't use them on Microsoft Azure Storage, or often on Google Cloud Storage (despite its S3 gateway), or others.

For instance, a key workhorse of the genomics field is `samtools`, which works with AWS S3 in some ways, but not others (like Amazon Resource Names[1]). Our approach works across vendors transparently, and on S3 it is much faster than such native implementations.

[1]: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference-a...
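For what it's worth, the minimum an application needs to expose to avoid that lock-in is the endpoint (plus addressing-style quirks); in boto3 terms that looks roughly like the sketch below, where the endpoint URL is a placeholder rather than a real service:

    import boto3
    from botocore.config import Config

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example-compatible-store.invalid",  # placeholder
        config=Config(s3={"addressing_style": "path"}),  # many non-AWS stores need path-style
    )
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])

Many applications hard-code the AWS endpoints or AWS-only features (like ARNs), which is exactly what breaks them on other S3-compatible stores.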


Hi, author here.

Yes, we're a startup. With cunoFS, we've made it possible to leverage the lower cost and higher throughput of object storage like S3, and to make it work transparently with functions like mmap(), chmod(), execve() and renameat2().


> (...) leverage the lower cost and higher throughput of object storage like S3 (...)

What do you mean by "lower cost (...) of object storage like S3"? Isn't S3 terribly expensive? I recall S3 costs around $10/month per TB you park there without doing anything to it, and AWS charges for basically everything you do to those objects over a network.


Not to mention that apparently we're supposed to compare that to storing on your local hard disk. Isn't local storage always far higher throughput than accessing something across the internet? Being able to store and access stuff on the internet as if it's local is a great idea, but the way they sell it is unconvincing and sounds like they don't really understand what they're talking about.


Hi, author here, sorry I missed this post. The performance benchmarks and cost comparisons compare S3 against EBS (ext4-formatted), EFS, FSx for Lustre and others within the same datacenter (i.e. the LAN use case rather than the WAN use case). That means if you have an EC2 instance running in, say, AWS Ohio, and are comparing those storage options also within AWS Ohio, then cunoFS is both cheaper and higher throughput than those other options. It's a different story over WAN. In that case, your own local NVMe storage is going to be cheaper and generally faster than remote storage over a WAN. But that local NVMe storage (on, say, your own laptop) isn't going to have anywhere near the enterprise-grade redundancy, availability and scalability that AWS S3/Azure Blob/Storj/Wasabi/etc. have.

