I must point out Jim Gray's paper To Blob or Not To Blob[0]. His team considered NTFS vs. SQL Server, but most of the rationale applies to any filesystem vs. database decision.
The summary was "The study indicates that if objects are larger than one megabyte on average, NTFS has a clear advantage over SQL Server. If the objects are under 256 kilobytes, the database has a clear advantage. Inside this range, it depends on how write intensive the workload is," but keep in mind this is spinning media from 2006. Modern SSDs change the equation quite a bit, as they are much more friendly to random IO and benefit less from database write-ahead log and buffer pool behavior.
Also, when deciding between blob vs. filesystem, blobs bring transactional and recovery consistency. The DB is self-contained, and all blobs are contained in it. A restore of the DB on a different system yields a consistent system: it won't have links to missing files, and there won't be orphaned files left over (files not referenced by any record in the DB).
Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs, etc. Blobs bring additional overhead, require bigger DB storage (usually more expensive, think AWS RDS), and the increased size cascades into operational overhead (bigger backups, slower restores, etc.).
My experience (for an application which had a working set of under 1 GB of files in the 50kb to N MB range, and approximately 50 GB persisted at any given time) was that preserving access to the toolchain which operates trivially with files was worth the additional performance overhead of working with the files and, separately, occasionally having to retool things to e.g. not have 10e7 files in a single folder, which is something that Linux has some opinions (none of them good) about.
Trivial example: it's easy to delete any file most recently accessed more than N months ago [+] with a trivial line in cron (the exact line escapes me -- it involves find), but doing that with a database requires that you roll your own access tracking logic. Incremental backups of a directory structure are easy (rsync or tarsnap); incremental backups of a database appeared to me to be highly non-trivial.
[+] Since we could re-generate PDFs or gifs from our source of truth at will (with 5~10 seconds of added latency), we deleted anything not accessed in N months to limit our hard disk usage.
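For the curious, the cron job was presumably something along the lines of find with -atime and -delete. A rough Python equivalent of that cleanup, as a sketch only (the path and the cutoff are made up, and it assumes the filesystem is actually recording access times):

    import os, time

    ROOT = "/var/cache/generated"   # made-up path, for illustration
    CUTOFF = 180 * 24 * 3600        # "N months" expressed in seconds

    # Walk the tree and remove anything that hasn't been read recently.
    # Only works if atimes are being recorded (see the noatime comment
    # further down the thread).
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.stat(path).st_atime > CUTOFF:
                os.unlink(path)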
> e.g. not have 10e7 files in a single folder, which is something that Linux has some opinions (none of them good) about.
I keep running into this as a common anti-pattern in software, even with software developers who should know better (having fallen afoul of it before).
I ran into it with some semi-bespoke software once, used by a remarkable number of state departments across the country. It was one of many, many things wrong with that godawful thing. Worse, not only were they dumping files into a single directory, but almost any time they went to do something with the directory they'd request a directory listing too. Like many of the bizarre things the app did, it never actually used the information it acquired that way.
I never understood why it mattered whether files were in a single directory or split over multiple directories. At the end of the day it is the same file system with the same number of files. The directories are just a logical structure. Why would there be a difference?
If you want raw file access, your ideal filesystem is a giant key-value store that keys on the full path to the file. This choice means doing a directory listing will involve a lot of random access reads across the disk, and in the days before SSDs this would be a performance killer.
So instead, a directory would have a little section of disk to write its file entries to. These would be fairly small, because most directories are small, and if there were more files added than there was room, another block would be allocated. Doing a file listing would be reasonably quick as you're just iterating over a block and following the link to the next block, but random file access would get slower the more files are added.
It's probably a solvable problem in filesystem design, but because current filesystems fall down hard on the million-files-in-a-directory use case, nobody actually does that, so there's no drive to make a filesystem performant in that case.
OK, but listing 100 directories with 10,000 files each should take the same time as listing one directory with 1,000,000 files (what you are describing is a filesystem with more files vs. fewer files, not one with more subdirectories vs. fewer subdirectories).
Well, my Linux-fu isn't very up to date, but my guess (which actually goes for any file system) is that a directory is a bit more than just a logical organization. It's an actual "thing" on your hard drive which keeps track of everything beneath it. And so it grows in size when more files are added, which slows down access.
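For what it's worth, the usual retooling mentioned upthread is to fan files out into a fixed hierarchy of subdirectories keyed on a hash prefix, so no single directory ever accumulates millions of entries. A minimal sketch (the two-level, 256-way fan-out is an arbitrary illustrative choice):

    import hashlib, os

    def shard_path(root, filename):
        # Spread files across root/xx/yy/ subdirectories based on a hash
        # of the name, so no single directory grows huge.
        digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
        subdir = os.path.join(root, digest[:2], digest[2:4])
        os.makedirs(subdir, exist_ok=True)
        return os.path.join(subdir, filename)

    # e.g. shard_path("/srv/uploads", "invoice.pdf")
    #      -> "/srv/uploads/<xx>/<yy>/invoice.pdf"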
Just don't let anyone mount with noatime "for performance". I found that little gem on a drive where paring things down to recently used content would have been very helpful.
> Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs, etc.
This is especially true if you want to deliver them back to the users in the context of a web application. If you place this kind of content in a database, you'll need to serve them with your application.
If you use files for this content, you get two interesting options. For one, you can use any stock web server like nginx to serve these files - and nginx will outperform your application in this context. On top of that, it's easy to push this content onto a CDN in order to further cut the latency to the user.
Completely shameless plug here: My webserver[1] will outperform nginx for serving up static content like this.
It has many optimizations, like not opening the file it's serving (which reduces the latency of walking/interacting with the VFS layer); almost no runtime allocations on the heap (minimal stack frames and sizes as well), which can be completely disabled at the expense of logging; high-performance logging which does not interfere with processing requests; and the normal stuff in a fast webserver (sendfile, keep-alive, minimal code paths between reading the user request and delivering the first byte).
You can check its health by making a request to a known resource -- they're practically free.
It delivers an ETag header that it computes the first time it opens a file and maintains as long as that file is open. If it has to re-open the file it will generate a new ETag value.
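Not claiming this is how your server does it, but for anyone wondering how an ETag can be nearly free: one common approach is to derive it from a single fstat() on the descriptor the server already holds, with no hashing of file contents at all. A sketch:

    import os

    def etag_for_fd(fd):
        # Derive a validator from inode, size and mtime -- all available
        # from one fstat() call on the already-open descriptor.
        st = os.fstat(fd)
        return '"{:x}-{:x}-{:x}"'.format(st.st_ino, st.st_size, st.st_mtime_ns)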
In this case nginx is the 'application': the request is still going to be expressed as a SQL query, sent to Postgres, parsed, compiled, optimized, executed, and then the tabular response formatted as the HTTP response. Many more steps compared to a file-on-disk response.
Ah, but the filesystem has to do all that as well! It must receive a path, parse it, and then execute the query, with possible optimizations (e.g. ext4 even has directory indexes implemented with hashed B-trees).
Does your database transparently DMA the buffer cache to the NIC after loading the file in the background using read-ahead without any process context switching or upcalls to userland?
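For context, this is the sendfile(2) path, where the kernel pushes page-cache data straight to the socket without copying it through userspace buffers. A bare-bones, Linux-only Python sketch (headers would still have to be written separately):

    import os

    def send_whole_file(sock_fd, path):
        # Let the kernel move page-cache contents directly to the socket;
        # no read()/write() copies through a userspace buffer.
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                sent = os.sendfile(sock_fd, fd, offset, size - offset)
                if sent == 0:
                    break
                offset += sent
        finally:
            os.close(fd)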
Loosely. Today's file systems aren't transactional (i.e. ACID compliant), which is a basic property that most people consider necessary for a database.
Many filesystems are journaled and provide atomic operations. You can build multi-step transactions on top of them with various operations. And with CoW filesystems you can even get read consistency. Oh, and optimistic locking is also an option.
Are we talking about the POSIX interface to files or is there a new transactional API that allows for atomic writes (all pages are written to disk or none are), consistency, isolation (opening a file in another process which is currently being written won't show the pages that are being written), and makes the writes always durable (sync; not fsync)?
Multi-file transactions can be built by moving whole directories over symlinks to the previous version. The linux-specific RENAME_EXCHANGE flag can simplify this.
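A sketch of that symlink flip, for the record (Python, POSIX semantics; renameat2 with RENAME_EXCHANGE isn't exposed by the standard library, but a plain rename over the symlink is already atomic):

    import os

    def publish(base, new_tree):
        # Point base/current at a freshly written directory tree. Readers
        # following the symlink see either the old tree or the new one,
        # never a half-written mix; rename() replaces the link atomically.
        tmp = os.path.join(base, "current.tmp")
        if os.path.lexists(tmp):
            os.unlink(tmp)
        os.symlink(new_tree, tmp)
        os.rename(tmp, os.path.join(base, "current"))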
> or is there a new transactional API
CoW filesystems give you snapshots and reflink copies via ioctls. Under heavy concurrency this can provide cheaper isolation than locking.
> and makes the writes always durable (sync; not fsync)?
For durability fsync is sufficient if you do it on the file and the directory.
To combine atomic and durable you can do the write, fsync, rename, fsync dir dance.
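Spelled out in Python, the dance looks roughly like this (error handling trimmed):

    import os

    def atomic_durable_write(path, data):
        # 1. write the new content to a temp file and fsync it
        tmp = path + ".tmp"
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)
        finally:
            os.close(fd)
        # 2. atomically replace the old file
        os.rename(tmp, path)
        # 3. fsync the containing directory so the rename itself is durable
        dfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)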
> For durability fsync is sufficient if you do it on the file and the directory. To combine atomic and durable you can do the write, fsync, rename, fsync dir dance.
Right. This allows for the correct behaviour, but I wouldn't conclude that POSIX is a transactional API. I would regard this as being able to build a transactional API on top of it, not as it being transactional in itself. I mean, if you had to do the same dance in SQL (insert to a dummy pkey and then update the entry) you would rightly dismiss it.
> Also, when deciding between blob vs. filesystem, blobs bring transactional and recovery consistency.
Interesting. I would have thought the other way around, at least for crash-resistance: the I/O stack (including disk hardware) has a tendency to reorder writes, so updates that live within a single file will corrupt fairly easily. Separate files, not so much. I vaguely remember a paper on that (USENIX?), and SQLite generally did very well except for that point.
You have two consistency issues with storing the files in the filesystem:
- rollbacks in the DB can lead to orphaned files on disk. One can try to add logic in the app (e.g. a catch block that removes the file if the DB transaction rolled back), but that is not going to help on a crash; see the sketch after this list
- it is impossible to obtain a consistent backup of both the DB and the filesystem. You can back up the filesystem and the DB, but the two will not be consistent with each other unless you froze the app during the backup. When you restore the two backups (filesystem, DB) you may encounter anomalies: orphaned files (exist on the filesystem but have no entry in the DB), broken links (an entry in the DB referencing a non-existent file), etc. This is because the moments at which the backup 'views' the file and the DB record referencing it are distinct in time.
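To illustrate the first point, here is roughly what that catch-block approach looks like (a sketch only: the db handle is assumed to have sqlite3/DB-API-style commit-or-rollback semantics, and the uploads table is made up). It handles a clean rollback, but a crash between the write and the commit still leaves an orphan, which is exactly the problem:

    import os

    def save_upload(db, blob_path, data, owner):
        # Write the file first, then the metadata row; remove the file if
        # the transaction fails. A crash between the two steps still leaves
        # an orphaned file on disk -- there is no cross-domain transaction.
        with open(blob_path, "wb") as f:
            f.write(data)
        try:
            with db:  # commits on success, rolls back on exception
                db.execute("INSERT INTO uploads (path, owner) VALUES (?, ?)",
                           (blob_path, owner))
        except Exception:
            os.unlink(blob_path)  # best-effort cleanup on rollback
            raise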
As for write reordering: write-ahead log systems rely on correct write order. All DBs worth their name enforce this one way or another (via special APIs, via config requirements, etc.).
I'm surprised that there aren't any tools provided by the various databases to handle that use case.
Something which can abstract away the storage on-disk of large blobs and manage/maintain them over time to prevent a lot of the issues you talk about, but still give the ability for raw file access if/when it's needed.
I've given it all of 10 seconds of thought, but even something like a DB type of a file handle would be useful. Do a query, get back a handle to a file that you can treat just like you opened it yourself.
> Do a query, get back a handle to a file that you can treat just like you opened it yourself
Things are a bit more complex. For one, there's the trivial problem of client vs. server host. The DB cannot return a handle (an FD) from the server, because it has no meaning on the host running the app. The second problem is that any file manipulation must conform to the DB semantics for transactions, locking, rollback and recovery.
What you describe does exist: it's the FILESTREAM feature, which dates back to 2007 if I remember correctly. I'm describing the SQL Server feature since this is what I'm familiar with. The app queries the DB for a token using GET_FILESTREAM_TRANSACTION_CONTEXT[0] and then uses this token to get a Win32 handle for the 'file' using OpenSqlFilestream[1]. The resulting handle is valid for the usual file handle operations (read, write, seek, etc.).
There were great expectations for this feature, but in real life it flopped. For one, it caused all sorts of operational headaches, from the increased DB file sizes (increased backup sizes, etc.) to problems like having to investigate 'filestream tombstone status'[2]. But more importantly, adoption required an application rewrite (to use OpenSqlFilestream), which of course never materialized.
FileTables is a newer stab at this problem, and this one does allow exposing the DB files as a network share; apps can create and manipulate files on this share and everything is backed by the DB behind the scenes. But it turns out a lot of apps do all sorts of crazy things with files, like copy-rename-and-swap as a means of doing failure-safe saves, and many such operations are significantly more expensive in a DB context. And when the DB content is manipulated directly by apps that 'think' they are interacting with the filesystem, a lot of useful metadata is never collected in the DB, since the file API used never requires it (think file author, subject, etc.).
I figured there was something massively annoying with it.
I'm hoping that PostgreSQL can try to tackle this use case somehow, because it's so damn common, and 99% of the time the solution that is used completely throws out all the guarantees that the database gives you.
I'm not sure there's a non-annoying way to bridge the problem. The semantics are just so different that there's a lot of friction that needs to be handled somewhere. (What does O_DIRECT mean in this context? What happens when someone decides to store files that are never normally closed?)
In some ways, the problem mirrors O/R mappers. A lot of common cases can be handled, but there are always situations where you're confronted with the fact that relational logic just doesn't map cleanly to OO (or to hierarchical storage).
Databases use journaling to avoid these issues. The difference between an RDBMS's journaling and a filesystem's journaling is that the RDBMS journals application data while the filesystem does not (with few exceptions). Thus, an RDBMS gives you consistent application state, while the filesystem does not (with even fewer exceptions, basically only TxF).
Another pain point is backup; it's not possible to create a consistent snapshot of (rdbms state, other state on the file system). This is avoided entirely if you only need to snapshot (rdbms state,).
> Despite all this, my practical experience is that filesystem is better than blobs for things like uploaded content, images, pngs and jpgs, etc.
If you store them on a filesystem, how do you deal with redundancy / failover / scaling ? At a previous job we did a bit of experimenting with using a clustered FS but they all introduced a lot of problems. However, this was a couple of years ago so the situation may be different now.
If you store them in SQLite, how do you deal with redundancy / failover / scaling? There are other databases with better tooling for this, but they weren't part of this comparison. The world is full of tradeoffs. Scale and availability vs. microbenchmark performance is one of them. It would be very interesting to see a comparison of clustered MySQL or Postgres vs. Gluster or Ceph. I very much doubt the performance side of it would favor the database so much.
> If you store them on a filesystem, how do you deal with redundancy / failover / scaling ?
In most cases, using a solution like Amazon S3 will have all three of these sorted out for you pretty well. And while it's not a full mountable filesystem, from a system development perspective it has a really good API for retrieving and storing files.
Technically there is s3fs, but it's fairly terrible and I actively recommend against it. Still, it does come in handy from time to time, in a limited set of cases.
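For reference, the store/retrieve side of that API is about this much code with boto3 (the bucket and key names are made up, and credentials are assumed to be configured in the environment):

    import boto3

    s3 = boto3.client("s3")

    # Store an uploaded file and fetch it back later.
    s3.upload_file("/tmp/invoice.pdf", "my-app-uploads", "invoices/42.pdf")
    s3.download_file("my-app-uploads", "invoices/42.pdf", "/tmp/invoice.pdf")

    # Or hand the client a short-lived URL instead of proxying the bytes.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-app-uploads", "Key": "invoices/42.pdf"},
        ExpiresIn=3600,
    )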
After years of development, databases and file systems borrow ideas from each other. I am not surprised that the breakeven range might be expanded. It would be interesting to re-visit the conclusions with current databases, file systems, hardware (SSD vs HDD) etc.
There are several very valid business cases to store files as blobs in a DB.
What's the problem if the DB is 150GB? It's not like the working set (which for the files will just be the metadata) will be that big for storing file blobs.
A 10GB database + 140GB of PDFs on the filesystem is not any different from a 150GB DB with everything in it.
And you have other issues (consistency, transactional issues, backups, etc).
How about a 100TB database with 95TB being pdf's? It seems there are better ways of managing lots of immutable small files (including for backups/replication) than shoving them all into a database.
> How about a 100TB database with 95TB being pdf's?
How about it?
> It seems there are better ways of managing lots of immutable small files (including for backups/replication) than shoving them all into a database.
Depends on the business case. A database gives certain guarantees you'll have to replicate (usually badly) in any other way.
Conceptually, it's absolutely cleaner.
As for a scientific or engineering standpoint, there are again no laws that dictate whether this is bad or good.
Even the performance characteristics depend on the implementation of the particular DB storage engine. They might be totally on par with storing in the filesystem (or close enough not to matter).
Not to mention that some filesystems might even have worse overhead depending on the type, size, etc. of files. In fact this very article speaks of "SQLite small blob storage" being "35% Faster Than the Filesystem".
Plus filesystem storage and DBs are not that different in most cases -- they share algorithms for storage and indexing (log-structured storage, B-trees, etc.), and the main difference is the access layer. The FreeBSD filesystem, for one, was more like a DB storage layer than a '70s-style Unix filesystem.
[0] https://www.microsoft.com/en-us/research/publication/to-blob...