I must point out Jim Gray's paper To Blob or Not To Blob[0]. His team considered NTFS vs. SQL Server, but most of the rationale applies to any filesystem vs. database decision.
The summary was "The study indicates that if objects are larger than one megabyte on average, NTFS has a clear advantage over SQL Server. If the objects are under 256 kilobytes, the database has a clear advantage. Inside this range, it depends on how write intensive the workload is," but keep in mind this is spinning media from 2006. Modern SSDs change the equation quite a bit, as they are much more friendly to random IO and benefit less from database write-ahead log and buffer pool behavior.
Also, when deciding between blob vs. filesystem, blobs bring transactional and recovery consistency. The DB is self-contained, and all blobs are contained in it. A restore of the DB on a different system yields a consistent system: it won't have links to missing files, and there won't be orphaned files left over (files not referenced by any record in the DB).
Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs, etc. Blobs bring additional overhead, require bigger DB storage (usually more expensive, think AWS RDS), and the increased size cascades into operational overhead (bigger backups, slower restores, etc.).
My experience (for an application which had a working set of under 1 GB of files in the 50kb to N MB range, and approximately 50 GB persisted at any given time) was that preserving access to the toolchain which operates trivially with files was worth the additional performance overhead of working with the files and, separately, occasionally having to retool things to e.g. not have 10e7 files in a single folder, which is something that Linux has some opinions (none of them good) about.
Trivial example: it's easy to delete any file most recently accessed more than N months ago [+] with a trivial line in cron (the exact line escapes me -- it involves find), but doing that with a database requires that you roll your own access tracking logic. Incremental backups of a directory structure are easy (rsync or tarsnap); incremental backups of a database appeared to me to be highly non-trivial.
[+] Since we could re-generate PDFs or gifs from our source of truth at will (with 5~10 seconds of added latency), we deleted anything not accessed in N months to limit our hard disk usage.
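For the curious, the cron job was presumably something along the lines of find with -atime and -delete. A rough Python equivalent of that cleanup, as a sketch only (the path and the cutoff are made up, and it assumes the filesystem is actually recording access times):

    import os, time

    ROOT = "/var/cache/generated"   # made-up path, for illustration
    CUTOFF = 180 * 24 * 3600        # "N months" expressed in seconds

    # Walk the tree and remove anything that hasn't been read recently.
    # Only works if atimes are being recorded (see the noatime comment
    # further down the thread).
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.stat(path).st_atime > CUTOFF:
                os.unlink(path)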
> e.g. not have 10e7 files in a single folder, which is something that Linux has some opinions (none of them good) about.
I keep running into this as a common anti-pattern in software, even with software developers who should know better (having fallen afoul of it before).
I ran into it with some semi-bespoke software once, used by a remarkable number of state departments across the country. It was one of many, many things wrong with that godawful thing. Worse, not only were they dumping files into a single directory, but almost any time they went to do something with the directory they'd request a directory listing too. Like many of the bizarre things the app did, it never actually used the information it acquired that way.
I never understood why it mattered whether files were in a single directory or split over multiple directories. At the end of the day it is the same file system with the same number of files. The directories are just a logical structure. Why would there be a difference?
If you want raw file access, your ideal filesystem is a giant key-value store that keys on the full path to the file. This choice means doing a directory listing will involve a lot of random access reads across the disk, and in the days before SSDs this would be a performance killer.
So instead, a directory would have a little section of disk to write its file entries to. These would be fairly small, because most directories are small, and if there were more files added than there was room, another block would be allocated. Doing a file listing would be reasonably quick as you're just iterating over a block and following the link to the next block, but random file access would get slower the more files are added.
It's probably a solvable problem in filesystem design, but because current filesystems fall down hard on the million-files-in-a-directory use case, nobody actually does that, so there's no drive to make a filesystem performant in that case.
OK, but listing 100 directories with 10,000 files each should take the same time as listing one directory with 1,000,000 files (what you are describing is a filesystem with more files vs. fewer files, not one with more subdirectories vs. fewer subdirectories).
Well, my Linux-fu isn't very up to date, but my guess (which actually goes for any file system) is that a directory is a bit more than just a logical organization. It's an actual "thing" on your hard drive which keeps track of everything beneath it. And so it grows in size when more files are added, which slows down access.
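For what it's worth, the usual retooling mentioned upthread is to fan files out into a fixed hierarchy of subdirectories keyed on a hash prefix, so no single directory ever accumulates millions of entries. A minimal sketch (the two-level, 256-way fan-out is an arbitrary illustrative choice):

    import hashlib, os

    def shard_path(root, filename):
        # Spread files across root/xx/yy/ subdirectories based on a hash
        # of the name, so no single directory grows huge.
        digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
        subdir = os.path.join(root, digest[:2], digest[2:4])
        os.makedirs(subdir, exist_ok=True)
        return os.path.join(subdir, filename)

    # e.g. shard_path("/srv/uploads", "invoice.pdf")
    #      -> "/srv/uploads/<xx>/<yy>/invoice.pdf"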
Just don't let anyone mount with noatime "for performance". I found that little gem on a drive where paring things down to recently used content would have been very helpful.
> Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs, etc.
This is especially true if you want to deliver them back to the users in the context of a web application. If you place this kind of content in a database, you'll need to serve them with your application.
If you use files for this content, you get two interesting options. For one, you can use any stock web server like nginx to serve these files - and nginx will outperform your application in this context. On top of that, it's easy to push this content onto a CDN in order to further cut the latency to the user.
Completely shameless plug here: My webserver[1] will outperform nginx for serving up static content like this.
It has many optimizations, like not opening the file it's serving (which reduces the latency of walking/interacting with the VFS layer); almost no runtime allocations on the heap (minimal stack frames and sizes as well), which can be completely disabled at the expense of logging; high-performance logging which does not interfere with processing requests; and the normal stuff in a fast webserver (sendfile, keep-alive, minimal code paths between reading the user request and delivering the first byte).
You can check its health by making a request to a known resource -- they're practically free.
It delivers an ETag header that it computes the first time it opens a file and maintains as long as that file is open. If it has to re-open the file it will generate a new ETag value.
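Not claiming this is how your server does it, but for anyone wondering how an ETag can be nearly free: one common approach is to derive it from a single fstat() on the descriptor the server already holds, with no hashing of file contents at all. A sketch:

    import os

    def etag_for_fd(fd):
        # Derive a validator from inode, size and mtime -- all available
        # from one fstat() call on the already-open descriptor.
        st = os.fstat(fd)
        return '"{:x}-{:x}-{:x}"'.format(st.st_ino, st.st_size, st.st_mtime_ns)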
In this case nginx is the 'application': the request is still going to be expressed as a SQL query, sent to Postgres, parsed, compiled, optimized, executed, and then the tabular response formatted as the HTTP response. Many more steps compared to a file-on-disk response.
Ah, but the filesystem has to do all that as well! It must receive a path, parse it, and then execute the query, with possible optimizations (e.g. ext4 even has directory indexes implemented with hashed B-trees).
Does your database transparently DMA the buffer cache to the NIC after loading the file in the background using read-ahead without any process context switching or upcalls to userland?
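For context, this is the sendfile(2) path, where the kernel pushes page-cache data straight to the socket without copying it through userspace buffers. A bare-bones, Linux-only Python sketch (headers would still have to be written separately):

    import os

    def send_whole_file(sock_fd, path):
        # Let the kernel move page-cache contents directly to the socket;
        # no read()/write() copies through a userspace buffer.
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                sent = os.sendfile(sock_fd, fd, offset, size - offset)
                if sent == 0:
                    break
                offset += sent
        finally:
            os.close(fd)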
Loosely. Today's file systems aren't transactional (i.e. ACID compliant), which is a basic property that most people consider necessary for a database.
Many filesystems are journaled and provide atomic operations. You can build multi-step transactions on top of them with various operations. And with CoW filesystems you can even get read consistency. Oh, and optimistic locking is also an option.
Are we talking about the POSIX interface to files or is there a new transactional API that allows for atomic writes (all pages are written to disk or none are), consistency, isolation (opening a file in another process which is currently being written won't show the pages that are being written), and makes the writes always durable (sync; not fsync)?
Multi-file transactions can be built by moving whole directories over symlinks to the previous version. The linux-specific RENAME_EXCHANGE flag can simplify this.
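A sketch of that symlink flip, for the record (Python, POSIX semantics; renameat2 with RENAME_EXCHANGE isn't exposed by the standard library, but a plain rename over the symlink is already atomic):

    import os

    def publish(base, new_tree):
        # Point base/current at a freshly written directory tree. Readers
        # following the symlink see either the old tree or the new one,
        # never a half-written mix; rename() replaces the link atomically.
        tmp = os.path.join(base, "current.tmp")
        if os.path.lexists(tmp):
            os.unlink(tmp)
        os.symlink(new_tree, tmp)
        os.rename(tmp, os.path.join(base, "current"))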
> or is there a new transactional API
CoW filesystems give you snapshots and reflink copies via ioctls. Under heavy concurrency this can provide cheaper isolation than locking.
> and makes the writes always durable (sync; not fsync)?
For durability fsync is sufficient if you do it on the file and the directory.
To combine atomic and durable you can do the write, fsync, rename, fsync dir dance.
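Spelled out in Python, the dance looks roughly like this (error handling trimmed):

    import os

    def atomic_durable_write(path, data):
        # 1. write the new content to a temp file and fsync it
        tmp = path + ".tmp"
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)
        finally:
            os.close(fd)
        # 2. atomically replace the old file
        os.rename(tmp, path)
        # 3. fsync the containing directory so the rename itself is durable
        dfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)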
> For durability fsync is sufficient if you do it on the file and the directory. To combine atomic and durable you can do the write, fsync, rename, fsync dir dance.
Right. This allows for the correct behaviour, but I wouldn't conclude that POSIX is a transactional API. I would regard this as being able to build a transactional API on top of it, not as it being transactional in itself. I mean, if you had to do the same dance in SQL (insert to a dummy pkey and then update the entry) you would rightly dismiss it.
> Also, when deciding between blob vs. filesystem, blobs bring transactional and recovery consistency.
Interesting. I would have thought the other way around, at least for crash-resistance: the I/O stack (including disk hardware) has a tendency to reorder writes, so updates that live within a single file will corrupt fairly easily. Separate files, not so much. I vaguely remember a paper on that (USENIX?), and SQLite generally did very well except for that point.
You have two consistency issues with storing the files in the filesystem:
- rollbacks in the DB can lead to orphaned files on disk. One can try to add logic in the app (e.g. a catch block that removes the file if the DB transaction rolled back), but that is not going to help on a crash; see the sketch after this list
- it is impossible to obtain a consistent backup of both the DB and the filesystem. You can back up the filesystem and the DB, but the two will not be consistent with each other unless you froze the app during the backup. When you restore the two backups (filesystem, DB) you may encounter anomalies: orphaned files (exist on the filesystem but have no entry in the DB), broken links (an entry in the DB referencing a non-existent file), etc. This is because the moments at which the backup 'views' the file and the DB record referencing it are distinct in time.
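To illustrate the first point, here is roughly what that catch-block approach looks like (a sketch only: the db handle is assumed to have sqlite3/DB-API-style commit-or-rollback semantics, and the uploads table is made up). It handles a clean rollback, but a crash between the write and the commit still leaves an orphan, which is exactly the problem:

    import os

    def save_upload(db, blob_path, data, owner):
        # Write the file first, then the metadata row; remove the file if
        # the transaction fails. A crash between the two steps still leaves
        # an orphaned file on disk -- there is no cross-domain transaction.
        with open(blob_path, "wb") as f:
            f.write(data)
        try:
            with db:  # commits on success, rolls back on exception
                db.execute("INSERT INTO uploads (path, owner) VALUES (?, ?)",
                           (blob_path, owner))
        except Exception:
            os.unlink(blob_path)  # best-effort cleanup on rollback
            raise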
As for write reordering: write-ahead log systems rely on correct write order. All DBs worth their name enforce this one way or another (via special APIs, via config requirements, etc.).
I'm surprised that there aren't any tools provided by the various databases to handle that use case.
Something which can abstract away the storage on-disk of large blobs and manage/maintain them over time to prevent a lot of the issues you talk about, but still give the ability for raw file access if/when it's needed.
I've given it all of 10 seconds of thought, but even something like a DB type of a file handle would be useful. Do a query, get back a handle to a file that you can treat just like you opened it yourself.
> Do a query, get back a handle to a file that you can treat just like you opened it yourself
Things are a bit more complex. For one, there's the trivial problem of client vs. server host. The DB cannot return a handle (an FD) from the server, because it has no meaning on the host running the app. The second problem is that any file manipulation must conform to the DB semantics for transactions, locking, rollback and recovery.
What you describe does exist: it's the FILESTREAM feature, which dates back to 2007 if I remember correctly. I'm describing the SQL Server feature since this is what I'm familiar with. The app queries the DB for a token using GET_FILESTREAM_TRANSACTION_CONTEXT[0] and then uses this token to get a Win32 handle for the 'file' using OpenSqlFilestream[1]. The resulting handle is valid for the usual file handle operations (read, write, seek, etc.).
There were great expectations for this feature, but in real life it flopped. For one, it caused all sorts of operational headaches, from the increased DB file sizes (increased backup sizes, etc.) to problems like having to investigate 'filestream tombstone status'[2]. But more importantly, adoption required an application rewrite (to use OpenSqlFilestream), which of course never materialized.
FileTables is a newer stab at this problem, and this one does allow exposing the DB files as a network share; apps can create and manipulate files on this share and everything is backed by the DB behind the scenes. But it turns out a lot of apps do all sorts of crazy things with files, like copy-rename-and-swap as a means of doing failure-safe saves, and many such operations are significantly more expensive in a DB context. And when the DB content is manipulated directly by apps that 'think' they are interacting with the filesystem, a lot of useful metadata is never collected in the DB, since the file API used never requires it (think file author, subject, etc.).
I figured there was something massively annoying with it.
I'm hoping that PostgreSQL can try to tackle this use case somehow, because it's so damn common, and 99% of the time the solution that is used completely throws out all the guarantees that the database gives you.
I'm not sure there's a non-annoying way to bridge the problem. The semantics are just so different that there's a lot of friction that needs to be handled somewhere. (What does O_DIRECT mean in this context? What happens when someone decides to store files that are never normally closed?)
In some ways, the problem mirrors O/R mappers. A lot of common cases can be handled, but there are always situations where you're confronted with the fact that relational logic just doesn't map cleanly to OO (or to hierarchical storage).
Databases use journaling to avoid these issues. The difference between an RDBMS's journaling and a filesystem's journaling is that the RDBMS journals application data while the filesystem does not (with few exceptions). Thus, an RDBMS gives you consistent application state, while the filesystem does not (with even fewer exceptions, basically only TxF).
Another pain point is backup; it's not possible to create a consistent snapshot of (rdbms state, other state on the file system). This is avoided entirely if you only need to snapshot (rdbms state,).
> Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs, etc.
If you store them on a filesystem, how do you deal with redundancy / failover / scaling ? At a previous job we did a bit of experimenting with using a clustered FS but they all introduced a lot of problems. However, this was a couple of years ago so the situation may be different now.
If you store them in SQLite, how do you deal with redundancy / failover / scaling? There are other databases with better tooling for this, but they weren't part of this comparison. The world is full of tradeoffs. Scale and availability vs. microbenchmark performance is one of them. It would be very interesting to see a comparison of clustered MySQL or Postgres vs. Gluster or Ceph. I very much doubt the performance side of it would favor the database so much.
> If you store them on a filesystem, how do you deal with redundancy / failover / scaling ?
In most cases, using a solution like Amazon S3 will have all three of these sorted out for you pretty well. And while it's not a full mountable filesystem, from a system development perspective it has a really good API for retrieving and storing files.
Technically there is s3fs, but it's fairly terrible and I actively recommend against it. Still, it does come in handy from time to time, in a limited set of cases.
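For reference, the store/retrieve side of that API is about this much code with boto3 (the bucket and key names are made up, and credentials are assumed to be configured in the environment):

    import boto3

    s3 = boto3.client("s3")

    # Store an uploaded file and fetch it back later.
    s3.upload_file("/tmp/invoice.pdf", "my-app-uploads", "invoices/42.pdf")
    s3.download_file("my-app-uploads", "invoices/42.pdf", "/tmp/invoice.pdf")

    # Or hand the client a short-lived URL instead of proxying the bytes.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-app-uploads", "Key": "invoices/42.pdf"},
        ExpiresIn=3600,
    )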
After years of development, databases and file systems borrow ideas from each other. I am not surprised that the breakeven range might be expanded. It would be interesting to re-visit the conclusions with current databases, file systems, hardware (SSD vs HDD) etc.
There are several very valid business cases to store files as blobs in a DB.
What's the problem if the DB is 150GB? It's not like the working set (which for the files will just be the metadata) will be that big for storing file blobs.
A 10GB database + 140GB of PDFs on the filesystem is not any different from a 150GB DB with everything in it.
And you have other issues (consistency, transactional issues, backups, etc).
How about a 100TB database with 95TB being pdf's? It seems there are better ways of managing lots of immutable small files (including for backups/replication) than shoving them all into a database.
> How about a 100TB database with 95TB being pdf's?
How about it?
> It seems there are better ways of managing lots of immutable small files (including for backups/replication) than shoving them all into a database.
Depends on the business case. A database gives certain guarantees you'll have to replicate (usually badly) in any other way.
Conceptually, it's absolutely cleaner.
As for a scientific or engineering standpoint, there are again no laws that dictate whether this is bad or good.
Even the performance characteristics depend on the implementation of the particular DB storage engine. They might be totally on par with storing in the filesystem (or close enough not to matter).
Not to mention that some filesystems might even have worse overhead depending on the type, size, etc. of files. In fact this very article speaks of "SQLite small blob storage" being "35% Faster Than the Filesystem".
Plus filesystem storage and DBs are not that different in most cases -- they share algorithms for storage and indexing (log-structured storage, B-trees, etc.), and the main difference is the access layer. The FreeBSD filesystem, for one, was more like a DB storage layer than a '70s-style Unix filesystem.
[0] https://www.microsoft.com/en-us/research/publication/to-blob...