
So after transforming multispectral satellite data into a 128-dimensional embedding vector you can play "Where's Wally" to pinpoint blackberry bushes? I hope they tasted good! I'm guessing you can pretty much pinpoint any other kind of thing as well then?


Yes it's very good fun just exploring the embeddings! It's all wrapped by the geotessera Python library, so with uv and gdal installed just try this for your favourite region to get a false-colour map of the 128-dimensional embeddings:

  # for cambridge
  # https://github.com/ucam-eo/geotessera/blob/main/example/CB.geojson
  curl -OL https://raw.githubusercontent.com/ucam-eo/geotessera/refs/heads/main/example/CB.geojson
  # download the embeddings as geotiffs
  uvx geotessera download --region-file CB.geojson -o cb2
  # do a false colour PCA down to 3 dimensions from 128
  uvx geotessera visualize cb2 cb2.tif
  # project onto webmercator and visualise using leafletjs over openstreetmap
  uvx geotessera webmap cb2.tif --output cb2-map --serve
Because the embeddings are precomputed, the library just has to download the tiles from our server. More at: https://anil.recoil.org/notes/geotessera-python

Downstream classifiers are really fast to train (seconds for small regions). You can try out a notebook in VSCode to mess around with it graphically using https://github.com/ucam-eo/tessera-interactive-map
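To give a flavour of how cheap that downstream step is, here's a minimal sketch of a per-pixel classifier over the downloaded embeddings (the tile filename and label coordinates are made up for illustration; the real workflow is nicer in the interactive notebook):

  # Sketch: train a per-pixel classifier on TESSERA embeddings.
  # The tile path and the hand-labelled pixels below are hypothetical.
  import numpy as np
  import rasterio
  from sklearn.linear_model import LogisticRegression

  with rasterio.open("cb2/tile.tif") as src:
      emb = src.read()                      # shape: (128, H, W)

  bands, h, w = emb.shape

  # A handful of hand labels as (row, col, class), e.g. 1 = bramble.
  labels = [(120, 340, 1), (88, 510, 0), (400, 12, 0)]
  X = np.array([emb[:, r, c] for r, c, _ in labels])
  y = np.array([cls for _, _, cls in labels])

  clf = LogisticRegression(max_iter=1000).fit(X, y)   # trains in seconds

  # Classify every pixel and reshape back onto the tile grid.
  pred = clf.predict(emb.reshape(bands, -1).T).reshape(h, w)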

The berries were a bit sour, summer is sadly over here!


This is all far outside of my wheelhouse, but I'm curious if there's any way to use this for rocks and geology? Identifying dikes and veins on cliff sides from satellites would be really cool.


A major limitation is that most rock types look essentially identical in the visual+NIR spectral ranges. Things separate once you get out to the SWIR bands. Sentinel-2 does have some SWIR bands, so it may work reasonably well with embeddings, but much of the signal the embeddings encode may not be the right features to distinguish rock types. Methods focused specifically on the SWIR range are more likely to work reliably, e.g. simple ratios of SWIR bands may give a cleaner signal than general-purpose embeddings in this case.
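For the band-ratio route, the computation itself is trivial. A sketch with Sentinel-2's two SWIR bands (B11 at ~1610nm, B12 at ~2190nm), assuming they've been exported to a two-band GeoTIFF (filename hypothetical):

  # Sketch: a simple SWIR band ratio, B11 (~1610nm) / B12 (~2190nm).
  # High values are commonly used as a clay/alteration-mineral indicator,
  # since clays absorb near 2200nm and depress B12.
  import numpy as np
  import rasterio

  with rasterio.open("sentinel2_swir.tif") as src:  # hypothetical export
      b11 = src.read(1).astype("float32")
      b12 = src.read(2).astype("float32")

  ratio = np.divide(b11, b12, out=np.zeros_like(b11), where=b12 != 0)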

Hyperspectral in the SWIR range is what you really want for this, but that's a whole different ball game.


> Hyperspectral in the SWIR range is what you really want for this, but that's a whole different ball game.

Are there any hyperspectral surveys with UAVs etc instead of satellites?


Usually airplanes because the instruments are heavy. But yeah, that's the most common case. Hyperspectral sats are much rarer than aerial hyperspectral.


An interesting 30x30m hyperspectral satellite that launched recently and started returning data last year is EnMAP https://www.enmap.org. Hooking it up to TESSERA is on our todo list as soon as we can get our mittens on the satellite data


It might work. TESSERA's embeddings are at a 10 metre resolution, so it might depend on the size of the features you are looking for. If those features have distinct changes in colour or texture over time or they scatter radar in different ways compared with their surroundings then you should be able to discriminate them.

The easiest way to test is to try out the interactive notebook and drop some labels in known areas.


Is there a way to cluster the embeddings spatially, or look for patterns isolated to some dimensions? (Again, way out of my wheelhouse)

What I mean is a vein is usually a few meters wide but can be hundreds of meters long so ten meter resolution is probably not very helpful unless the embeddings can encode some sort of pattern that stretches across many cells.


It's possible to use the embeddings as input to a convolutional network and then train that using labels. We've done that for at least one of the downstream tasks in the TESSERA paper (https://arxiv.org/abs/2506.20380), to estimate canopy height.

The downside of that approach is that you need to spend valuable labels on learning the spatial feature extraction during training. To fix that we're working on building some pre-trained spatial feature extractors that you should only need to minimally fine-tune.
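To make the first approach concrete, here's a minimal PyTorch sketch (not the paper's actual architecture) of a small convolutional regressor that takes the 128 embedding dimensions as input channels:

  # Sketch (not the paper's architecture): a small convolutional regressor
  # over TESSERA embeddings, e.g. for per-pixel canopy height.
  import torch
  import torch.nn as nn

  class EmbeddingConvRegressor(nn.Module):
      def __init__(self, in_channels: int = 128):
          super().__init__()
          self.net = nn.Sequential(
              nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
              nn.ReLU(),
              nn.Conv2d(64, 32, kernel_size=3, padding=1),
              nn.ReLU(),
              nn.Conv2d(32, 1, kernel_size=1),  # one regressed value per pixel
          )

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          return self.net(x)

  model = EmbeddingConvRegressor()
  tile = torch.randn(1, 128, 256, 256)  # a dummy 256x256 embedding tile
  height = model(tile)                  # shape: (1, 1, 256, 256)

The 3x3 convolutions are what buy you the cross-cell spatial context that a purely per-pixel classifier lacks.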


almost definitely!


I haven’t done this kind of thing since undergrad, but hyperspectral data is really frickin cool this way. Not only can you use spectral signatures to identify specific things, but also figure out what those things are made out of by unmixing the spectra.

For example, figure out what crop someone’s growing and decide how healthy it is. With sufficient temporal resolution, you can understand when things are planted and how well they’re growing, how weedy or infiltrated they are by pest plants, how long the soil remains wet or if rainwater runs off and leaves the crop dry earlier than desired. Etc.

If you’re a good guy, you’d leverage this data to empower farmers. If you’re an asshole, you’re looking to see who has planted your crop illegally, or who is breaking your insurance fine print, etc.


Hyperspectral data is really neat, though it's worth pointing out that TESSERA is only trained on multispectral (optical + SAR) data.

You are very right on the temporal aspect though, that's what makes the representation so powerful. Crops grow and change colour or scatter patterns in distinct ways.

It's worth pointing out the model and training code is under an Apache2 license and the global embeddings are under a CC-BY-A. We have a python library that makes working with them pretty easy: https://github.com/ucam-eo/geotessera


> If you’re a good guy, you’d leverage this data to empower farmers. If you’re an asshole, you’re looking to see who has planted your crop illegally, or who is breaking your insurance fine print, etc.

How does using it to speculate on crop futures rank?


Every time someone explains the way short selling or speculative markets work, I have a “oh, I get it…” moment and then forget months later.

Same with insurance… socialized risk for our food supply is objectively good, and protecting the insurance mechanism from fraud is good. People can always bastardize these things.


It is complex. I was going to write out how it works in a simple way that everyone could understand, but then I realized that even though it would be a gross simplification (and an unrealistic one), it would still be so complex that people would go "yep, I understand every step", then finish and not understand it. Every step alone makes perfect sense and is simple, but the total quickly gets complex.

Even calling this a speculative market is a gross simplification of the truth.


It is good to enable people to hedge against bad harvests.


There are two sides hedging against bad harvests, the farmer that grows the crop, and the industry (cattle, ethanol, food oils, and others) that buys that crop. The farmer wants to get paid, and the industry wants to get their crop.


Yes! TESSERA is very new so we're still exploring how well it works for various things.

We're hoping to try it with a few different things for our next field trip, maybe some that are much harder to find than brambles.


I've wondered this about finding hot springs.


That should be a pretty good use case; if you manually label just a few known hot springs, you should be able to find others quite quickly using the TESSERA interactive notebook. The embeddings capture the annual spectral-temporal signature, so a hot spring should be fairly distinctive against its surroundings.
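Under the hood, this kind of few-label search is close to a nearest-neighbour lookup in embedding space. A minimal sketch, assuming a tile loaded as a (128, H, W) array (the filename and label pixel are hypothetical):

  # Sketch: rank pixels by cosine similarity to one labelled hot-spring pixel.
  import numpy as np

  emb = np.load("tile_embeddings.npy")   # hypothetical, shape (128, H, W)
  bands, h, w = emb.shape
  flat = emb.reshape(bands, -1)          # (128, H*W)

  query = emb[:, 120, 340]               # a hand-labelled hot-spring pixel

  sim = (query @ flat) / np.maximum(
      np.linalg.norm(flat, axis=0) * np.linalg.norm(query), 1e-9)

  # Top 20 most similar pixels as (row, col) candidates to inspect.
  candidates = [(i // w, i % w) for i in np.argsort(sim)[-20:]]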

Video of the notebook in action https://crank.recoil.org/w/mDzPQ8vW7mkLjdmWsW8vpQ and the source https://github.com/ucam-eo/tessera-interactive-map


We ended up overriding it and replacing it with our own thread-safe version years ago when we also hit this.


Founder of cunoFS here, brilliant to see lots of activity in this space, and congrats on the launch! As you'll know, there's a whole galaxy of design decisions when building file storage, and as a storage geek it's fun to see what different choices people make!

I see you've made some similar decisions to what we did for similar reasons I think - making sure files are stored 1:1 exactly as an object without some proprietary backend scrambling, offering strong consistency and POSIX semantics on the file storage, with eventual consistency between S3 and POSIX interfaces, and targeting high performance. Looks like we differ on the managed service vs traditional download and install model, and the client-first vs server-first approach (though some of our users also run cunoFS on an NFS/SMB gateway server), and caching is a paid feature for us versus an included feature for yours.

Look forward to meeting and seeing you at storage conferences!


Great to hear from you, I think cunoFS is doing a lot of things right! It’s certainly a fun problem space!


Is that Gweo? Didn't know you were in the storage space, good to see you!


Good to see you too! Let's catch up sometime - tried to connect with you separately but no luck.


Dynasaur intercepts syscalls on Linux entirely in userspace, super fast and across static, semi-static and dynamic binaries. It even works inside containers and doesn't need admin or any special privileges like PTRACE, nor a VM like gVisor. We built it for our cunoFS virtual filesystem and we want to know if others are interested in building on top of it for other use cases?


Nintendo wasn't loyal to the company, it was loyal to the team, so when they decided to leave and form ArtX they took the customer with them... SGI was happy with the Nintendo contract: they earned $1 in additional royalties for every single N64 cartridge sold worldwide. Losing the team was a big blow.


I worked at SGI on the next generation (code-named Bali) in 1998 (a whole year as an intern) and 1999 (part time while finishing my degree, flying back and forth from Australia). Bali was revolutionary. The goal was realtime RenderMan, and it really would have delivered. I had an absolute blast. I ended up designing the high-speed data paths (shader operations) for the world's first floating-point frame buffer (FP16, though we called it S10E5), with the logic on embedded DRAM for maximum floating-point throughput. It was light years ahead of its time. But the plug got pulled just as we were taping out. Most of the team ended up at Nvidia or ArtX/ATI. The GPU industry was a small world of engineers back then. We'd have house parties with GPU engineers across all the company names you'd expect, and with beer flowing sometimes maybe a few secrets could, eh, spill. We had an immersive room to give visual demos, and Stephen Hawking came in once pitching for a discount.
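S10E5 is, incidentally, the same bit layout later standardised as IEEE 754 half precision: 1 sign bit, 5 exponent bits, 10 mantissa bits. A quick decoder sketch (whether the original hardware handled subnormals and NaNs exactly this way is an assumption):

  # Decode an S10E5 / FP16 bit pattern: 1 sign, 5 exponent, 10 mantissa bits.
  # Subnormal/NaN handling follows IEEE 754 half precision, which may or may
  # not match the original hardware exactly.
  def s10e5_to_float(bits: int) -> float:
      sign = -1.0 if (bits >> 15) & 1 else 1.0
      exp = (bits >> 10) & 0x1F
      mant = bits & 0x3FF
      if exp == 0:                  # subnormals
          return sign * (mant / 1024.0) * 2.0 ** -14
      if exp == 31:                 # infinities and NaNs
          return float("nan") if mant else sign * float("inf")
      return sign * (1.0 + mant / 1024.0) * 2.0 ** (exp - 15)

  assert s10e5_to_float(0x3C00) == 1.0  # exponent 15, mantissa 0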

For team building, we launched potato cannons into NASA Moffett Field and blew up or melted Sun machines for fun with thermite and explosives. Lots of amazing people and fond memories for a kid getting started.


Very cool. Was Bali going to be the next high-end architecture after InfiniteReality? (I think IR was code-named “Kona” so the tropical codenames fit)

Why did they cancel it, money running out? It’s sad to think they were close to a new architecture but then just kept selling IR for years (and even sold a FireGL-based “Onyx” by the end).

Also was it a separate team working on the lower-end graphics like VPro/Odyssey?


Yes, Bali was the next-gen architecture and incredibly scalable. It consisted of many different chips connected together in a network that could scale. The R chip was so big that existing tools couldn't handle it and people were writing their own tools. As a result it was very expensive to tape out so many hefty chips, and I think that's why, when it came time, and with a financial crisis on, upper management pulled the plug.

Yes there were separate teams working on the lower-end graphics.


Why was Bali cancelled?


The problem with these approaches is that the data is scrambled on the backend, so you can't access the files directly from S3 anymore. Instead you need an S3 gateway to convert from scrambled S3 to unscrambled S3, and they rely on a separate database to reassemble the pieces.


It depends on whether you want to expose filesystem semantics or metadata to the applications using it. For example, random-access writes are done by ffmpeg, which is a workhorse of the media industry, but most things can't handle that or are too slow. We had to build our own solution, cunoFS, to make it work properly at high speeds.


Actually we've found it's often much worse than that. Code written against AWS S3 using the AWS SDK often doesn't work on a great many "S3-compatible" vendors (including on-prem versions). Although there's documentation on S3, it's vague in many ways, and the AWS SDKs rely on actual AWS behaviour. We've had to deal with a lot of commercial and cloud vendors that subtly break things, including giant public cloud companies. In one case a major vendor only failed at high loads, making it appear to "work" until it didn't, because its backoff response was not what the AWS SDK expected. It's been a headache that we've had to deal with for cunoFS, as well as making it work with GCP and Azure. At the big HPC conference Supercomputing 2023, when we mentioned supporting "S3-compatible" systems, we would often be told stories about applications not working with a supposedly "S3-compatible" store (from a mix of vendors).
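One concrete place this bites: the SDKs encode specific retry/backoff expectations, and a store that responds differently under load breaks them in ways that only show up at scale. With boto3 you can at least make the retry behaviour explicit (the endpoint URL here is a placeholder):

  # Make the SDK's retry/backoff mode explicit when targeting an
  # "S3-compatible" store; the endpoint URL is a placeholder.
  import boto3
  from botocore.config import Config

  cfg = Config(retries={"max_attempts": 10, "mode": "adaptive"})
  s3 = boto3.client("s3", endpoint_url="https://s3.example.com", config=cfg)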


Back in 2011 when I was working on making Ceph's RadosGW more S3-compatible, it was pretty common that AWS S3 behavior differed from their documentation too. I wrote a test suite to run against AWS and Ceph, just to figure out the differences. That lives on at https://github.com/ceph/s3-tests


What differences in behaviour from the AWS docs did you find, out of interest?


What I can dig up today is that back in 2011, they documented that bucket names cannot look like IPv4 addresses and the character set was a-z0-9.-, but they failed to prevent 192.168.5.123 or _foo.
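The documented rule itself is easy to state in code. A sketch of the 2011-era constraints as documented (simplified: the full rules also restrict leading and trailing dots, among other things):

  # Sketch of the documented (2011-era) bucket-name rules that the service
  # itself didn't fully enforce: a-z, 0-9, '.', '-', and not IPv4-shaped.
  import re

  def valid_bucket_name(name: str) -> bool:
      if not re.fullmatch(r"[a-z0-9.-]{3,63}", name):
          return False
      if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):  # looks like IPv4
          return False
      return True

  assert not valid_bucket_name("192.168.5.123")  # AWS accepted this anyway
  assert not valid_bucket_name("_foo")           # and this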

I recall there were more edge cases around HTTP headers, but they don't seem to have been recorded as test cases -- it's been too long for me to remember details; I may have simply run out of time / real-world interop got good enough to prioritize something else.

2011 state, search for fails_on_aws: https://github.com/tv42/s3-tests/blob/master/s3tests/functio...

Current state, I can't speak to the exact semantics of the annotations today, they could simply be annotating non-AWS features: https://github.com/ceph/s3-tests/blob/master/s3tests/functio...


We and our customers use S3 as a POSIX filesystem, and we generally find it faster than a local filesystem on many benchmarks. For listing directories we find it faster than Lustre (a real high-performance filesystem). Our approach is to first try listing directories with a single ListObjectsV2 (which on AWS S3 returns keys in lexicographic order), and if that hasn't made much progress, we start listing with parallel ListObjectsV2 calls. Once you parallelise the ListObjectsV2 (rather than sequentially "continuing") you get massive speedups.


> find it faster than a local filesystem for many benchmarks.

What did you measure? How did you compare? This claim seems very contrary to my experience and understanding of how things work...

Let me refine the question: did you measure metadata or data operations? What kind of storage medium is used by the filesystem you use? How much memory (and subsequently the filesystem cache) does your system have?

----

The thing is: you should expect something like 5 ms latency on network calls over the Internet in an ideal case. Within a datacenter, maybe you can achieve sub-ms latency, but that's hard. AWS within a region but across availability zones tends to be around 1 ms latency.

Meanwhile, NVMe latency, even on consumer products, is 10-20 microseconds. I.e. we are talking about roughly 100 times faster than anything going over the network can offer.


For AWS, we're comparing against filesystems in the datacenter - so EBS, EFS and FSx Lustre. Compared to these, you can see in the graphs where S3 is much faster for workloads with big files and small files: https://cuno.io/technology/

and in even more detail across the different types of EBS/EFS/FSx Lustre here: https://cuno.io/blog/making-the-right-choice-comparing-the-c...


The tests are very weird...

Normally, from someone working in storage, you'd expect tests to be in IOPS, and the go-to tool for reproducible tests is FIO. I mean, of course "reproducibility" is a very broad subject, but people are so used to this tool that they've developed a certain intuition for it and an interpretation of its results.

On the other hand, throughput figures tell you very little about how the system performs. Just to give some reasons: a system can be configured to do compression or deduplication on the client or the server, and this will significantly impact your throughput depending on what you actually measure: the amount of useful information presented to the user, or the amount of information transferred. Also, throughput at the expense of higher latency may or may not be a good thing... Really, if you ask anyone who ever worked on a storage product how they could crank up throughput numbers, they'd tell you: "write bigger blocks asynchronously". That's the basic recipe, if that's what you want. Whether it makes a good all-around system... I'd say probably not.

Of course, there are many other concerns. Data consistency is a big one, and this is a typical tradeoff when choosing between an object store and a filesystem, since a filesystem offers more consistency guarantees, whereas an object store can do certain things faster while breaking them.

BTW, I don't think most readers would understand Lustre and similar to be "local filesystems", since they operate over the network and network performance has a significant impact; of course, that also puts them in the same ballpark as other networked systems.

I'd also say that Ceph is kinda missing from this benchmark... Again, if we are talking about a filesystem on top of an object store, it's the prime example...


IOPS is a really lazy benchmark that we believe can greatly diverge from most real-life workloads, except for truly random I/O in applications such as databases. For example, in machine learning, training usually consists of taking large datasets (sometimes many PBs in scale), randomly shuffling them each epoch, and feeding them into the engine as fast as possible. Because of this, we see storage vendors for ML workloads concentrate on IOPS numbers. The GPUs, however, only really care about throughput. Indeed, we find a great many applications only really care about throughput, and IOPS is only relevant if it helps accomplish that throughput.

For ML, we realised that the shuffling isn't actually random - there's no real reason for it to be random versus pseudo-random. And if it's pseudo-random then it is predictable, and if it's predictable then we can exploit that to great effect - yielding a 60x boost in throughput on S3, beating out a bunch of other solutions. S3 is not going to do great for truly random I/O; however, we find that most scientific, media and finance workloads are actually deterministic or semi-deterministic, and this is where cunoFS, by peering inside each process, can better predict intra-file and inter-file access patterns and hide the latencies present in S3. At the end of the day, the right benchmark is the one that reflects real-world usage of applications, but that's a lot of effort to document one by one.
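The shuffling point in code, as a minimal sketch: with a known seed, the epoch's "random" order is fully determined before the epoch starts, so a prefetcher can run ahead of the consumer (the fetch function here is a stand-in for a ranged GetObject):

  # Sketch: a seeded shuffle is predictable, so prefetch workers can run
  # ahead of the training loop. fetch() stands in for a ranged GetObject.
  import random
  from concurrent.futures import ThreadPoolExecutor

  def epoch_order(n_items: int, seed: int) -> list[int]:
      order = list(range(n_items))
      random.Random(seed).shuffle(order)  # deterministic given the seed
      return order

  def fetch(idx: int) -> bytes:
      return idx.to_bytes(8, "little")    # placeholder for the real read

  order = epoch_order(10_000, seed=42)    # known before the epoch starts
  with ThreadPoolExecutor(max_workers=32) as pool:
      for sample in pool.map(fetch, order):  # workers fetch ahead
          pass                               # feed sample to the training step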

I agree that things like dedupe and compression can affect things, so in our large file benchmarks each file is actually random. The small file benchmarks aren't affected by "write bigger blocks" because there's nothing bigger than the file itself. Yes, data consistency can be an issue, and we've had to do all sorts of things to ensure POSIX consistency guarantees beyond what S3 (or compatible) can provide. These come with restrictions (such as on concurrent writes to the same file on multiple nodes), but so does NFS. In practice, we introduced a cunoFS Fusion mode that relies on a traditional high-IOPS filesystem for such workloads and consistency (automatically migrating data to that tier), and high throughput object for other workloads that don't need it.


> And if its pseudo-random then it is predictable, and if its predictable then we can exploit that to great effect

This is an interesting hack. However, an IOP is an IOP: no matter how well you predict and prefetch it to hide the latency, it still gets translated into a GetObject.

I think what you really exploited here is that even though S3 is built on HDDs (and have very low IOPS per TiB) their scale is so large that even if you milk 1M+ IOPS out of it AWS still doesn't care and is happy to serve you. But if my back-of-envelope calculation is correct this isn't going to work well if everyone starts to do it.

How do you get around S3's 5.5k GET per second per prefix limit? If I only have ~200 20GiB files can you still get decent IOPS out of it?

and...

> IOPS is a really lazy benchmark that we believe can greatly diverge from most real life workloads

No, it's not. I have a workload training a DL model on time-series data which demands 600k 8KiB IOPS per compute instance. None of the things I tested worked well. I had to build a custom solution with bare-metal NVMe drives.


Sorry for the late response - I didn't see your comment until now.

Our aim is to unleash all the potential that S3/Object has to offer for file system workloads. Yes, the scale of AWS S3 helps, as does erasure coding (which enhances flexibility for better load balancing of reads).

Is it suitable for every possible workload? No, which is why we have a mode called cunoFS Fusion where we let people combine a regular high-performance filesystem for IOPS, and Object for throughput, with data automatically migrated between the two according to workload behaviour. What we find is that most data/workloads need high throughput rather than high IOPS, and this tends to be the bulk of data. So rather than paying for PBs of ultra-high IOPS storage, they only need to pay for TBs of it instead. Your particular workload might well need high IOPS, but a great many workloads do not. We do have organisations doing large scale workloads on time-series (market) data using cunoFS with S3 for performance reasons.


EFS is ridiculously slow though. Almost to the point where I fail to see how it’s actually useful for any of the traditional use cases for NFS.


> EFS is ridiculously slow though. Almost to the point where I fail to see how it’s actually useful for any of the traditional use cases for NFS.

Would you care to elaborate on your experience or use case a bit more? We've made a lot of improvements over the last few years (and are actively working on more), and we have many happy customers. I'd be happy to give a perspective of how well your use case would work with EFS.

Source: PMT turned engineer on EFS, with the team for over 6 years


Unfortunately I can’t say too much publicly on HN. But one of the big shortcomings is dealing with hundreds of files. It doesn’t even matter if those are big or small files (I’ve had experience with both).

Services like DataSync show that the underlying infra can be performant. But it feels almost impossible to replicate that on EFS via standard POSIX APIs. And unfortunately one of our use cases depend upon that.

It feels, to me at least, like EFS isn't where AWS's priorities lie. At least if you compare EFS to FSx Lustre and the recent developments to S3, both of which have been the direction our AWS SAs have pushed us.


if you turn all the EFS performance knobs up (at a high cost), it's quite fast.


Faster, sure. But I wouldn't go so far as to say it is fast.


Have you tried it recently? Because we've made it a lot faster over the years.


More recently and for more use cases and varied workflows than most people. But that’s as much as I can say without getting people to sign an NDA.

Our AWS spend is high enough to warrant a very close working relationship with AWS so this is something we have worked with you guys on already.


S3 is really high latency though. I store parquet files on S3, and querying them through DuckDB is much slower than against a local filesystem because of the random access patterns. I can see S3 being decent for bulk access, but definitely not for random access.

This is why there’s a new S3 Express offering that is low latency (but costs more).
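The access-pattern difference is easy to see with DuckDB itself; a sketch (bucket and paths hypothetical, S3 credentials assumed to be configured):

  # Sketch: the same query over S3 vs local files. Against S3, DuckDB
  # issues many small ranged GETs (parquet footer, then column chunks),
  # each paying S3's latency; locally these become cheap seeks.
  import duckdb

  con = duckdb.connect()
  con.sql("INSTALL httpfs")
  con.sql("LOAD httpfs")

  con.sql("SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet')")
  con.sql("SELECT count(*) FROM read_parquet('data/*.parquet')")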


It can't be a POSIX filesystem if it doesn't meet POSIX filesystem guarantees. I worked on an S3-compatible object store at a large storage company, and we also had distributed filesystem products. Those are completely different animals due to the different semantics and requirements. We've also built compliant filesystems over object stores and the other way around. Certain operations, like write-append, are tricky to simulate over object stores (S3 didn't use to support append; I haven't really stayed up to date, does it now?). At least when I worked on this it wasn't possible to simulate POSIX semantics over S3 at all without adding additional object-store primitives.


> Once you start parallelising the ListObjectV2 (rather than sequentially "continuing")

How are you "parallelizing" the ListObjectsV2? The continuation token can only be fed in once the previous ListObjectsV2 response has completed, unless you know the names or structure of the keys ahead of time, in which case listing objects isn't necessary.


For example, you can do separate parallel ListObjectsV2 calls for files starting a-f and g-k, etc., covering the whole key space. You can partition recursively based on what is found in the first 1000 entries so that it matches the statistics of the keys. Yes, there may be pathological cases, but in practice we find this works very well.
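A hedged boto3 sketch of the idea (bucket name hypothetical; a real implementation would partition adaptively rather than by a fixed alphabet):

  # Sketch: prefix-partitioned parallel listing instead of walking one
  # continuation-token chain. A fixed a-z split misses other leading
  # characters; partition adaptively in practice.
  import boto3
  from concurrent.futures import ThreadPoolExecutor

  s3 = boto3.client("s3")

  def list_prefix(prefix: str) -> list[str]:
      keys = []
      paginator = s3.get_paginator("list_objects_v2")
      for page in paginator.paginate(Bucket="my-bucket", Prefix=prefix):
          keys.extend(obj["Key"] for obj in page.get("Contents", []))
      return keys

  prefixes = [chr(c) for c in range(ord("a"), ord("z") + 1)]
  with ThreadPoolExecutor(max_workers=16) as pool:
      all_keys = [k for keys in pool.map(list_prefix, prefixes) for k in keys]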


You're right that it won't work for all use cases, but starting two threads with prefixes A and M, for example, is one way you might achieve this.


If you think s3 is fast, you should try FTP. It’s at least a hundred times faster. And combined with rsync, dozens of times more reliable.


Neither of those is true though? Not sure if this is sarcastic or not; if so, make it clearer in the future.

