I'd be very interested to see RethinkDB analyzed, particularly with the 2.1 release promising high availability through Raft. RethinkDB and Aphyr have talked about doing Jepsen tests for it, but I'm not sure where that's landed (https://github.com/rethinkdb/rethinkdb/issues/1493).
Probably on the 2.1 branch, but it shouldn't matter too much -- the clustering codebase is very stable and there aren't any major changes to it between releases.
Not sure why anyone would want to play video (I guess it's a video? I don't do Flash either) through Flash nowadays... Isn't it easier and better to just use the HTML video tag?
Not really - there is no Flash on (most?) mobile devices.
You can also use JS detection. For instance, YouTube uses Flash unless you allow google.com & googlevideo.com, in which case it uses the HTML video tag. It's not like it costs you anything to add an HTML video tag.
Fair point. For RethinkDB, I think there's a good business case for getting it tested sooner rather than later (on both sides). With aphyr leaving Stripe, I think it would be reasonable for RethinkDB to compensate him for an unbiased study, given that they have $12M+ in funding. A positive report could generate a lot of attention for RethinkDB and give confidence to larger enterprises who are very hesitant to try anything remotely "new" or "experimental" in the DB space.
> A negative report could generate the opposite though. Are they ready to bet the new $12M+ on that?
I disagree; it all depends on how they act upon it.
So worst case scenario he finds something deeply flawed. What does RethinkDB do? Work to fix it while communicating and being responsive to the development community. Show developers that you're humble and, more importantly, you listen and fix things. That would generate a good amount of good will, in my opinion.
Exactly! And if some company is reluctant to do this in the open, they could use him for QA. That is, pay him to do a preliminary test, then fix things, run the tests again, fix again,... until there are no more critical issues. Of course the resulting blog post should then outline the whole process.
Everyone makes mistakes - what matters is what you do to discover and fix them.
Up to this point, nothing in Aphyr's history has suggested he'd be a biased tester. When he tested Riak, he was invited to Ricon to present the results for the world to see live. Many of the Basho engineers at the time were personal friends of his. As you know if you read that analysis, Aphyr held back nothing.
Right now I'd trust Aphyr more than any other analyst, consultant, or full-time employee (with employment to lose) to give truly impartial judgement.
No offense meant to Aphyr, whose work is awesome, but conflicts of interest are incredibly corrosive. The sales cycle for consulting services requires a "gotta keep the customer happy" attitude. That's hard to square with truly impartial judgment. It's not that you give in to the dark side; once you take somebody's money you're no longer impartial, so you end up having to try to simulate it.
It's even harder to square with the appearance of impartial judgment. What if a product is honestly great? If we decide that he's being fair because he says both positive and negative things, he has an incentive to create the appearance of fairness rather than actual fairness.
There's a reason that systems where truth is important (e.g., legal proceedings, medicine, science) have complicated rules around conflicts of interest. I hope he finds a way to find funding that doesn't put either his judgment or his reputation at risk. Consider, e.g., Consumer Reports: they get money from their subscribers, not advertisers or manufacturers. Maybe he could let database users fund all his public Jepsen work.
And if he deviated from his format of brutal honesty we would know something is up and his brand would be tarnished. Getting a B on Jepsen and fixing it is probably the best that any vendor can hope for at this time.
Oh yeah, I'm with you completely. I trust that he will stay impartial, and I hope that vendors allow him the freedom to write as he has in the past so that all appearances reflect what I believe to be the reality.
Why isn't PostgreSQL in the list on the first page? It is in the [blog posts] page though.
I would love to see the multixact data corruption problems introduced in 9.3 analyzed, and see if he can verify them to be solved in the latest version.
(Not the author) but from what I remember PostgreSQL was a bit of a funny case. It was about partitions between clients and a single database server. In most other cases, it was an analysis of distributed back-end server instances. Not that those failures are unimportant, but it seems harder to compare with the others.
Yeah, basically with PostgreSQL it boils down to the fact that even if a client receives an error on commit, its writes may in fact have succeeded -- and it cannot assume that they haven't. (Since the final "ack" that everything went well may not make it.)
As an example: This means that idempotency is really important if you're doing any kind of retries on commit failure.
EDIT: Another example: If you're doing application-level optimistic concurrency control (using e.g. an incrementing version number) you can end up in a situation where an update appears to conflict with itself. Usually, though, this isn't a big concern since client<->server partitions are usually catastrophic anyway.
(Off the top of my head that's what I can recall -- there may be other things to be gleaned from the original article.)
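To make the idempotency point concrete, here's a toy sketch (using an in-memory SQLite table as a stand-in for the real database; the `op_id` column and `record_payment` helper are made up for illustration). The idea is to key each logical write by a client-generated ID, so that retrying after an ambiguous commit error can't double-apply it:

```python
import sqlite3

# Stand-in for a real database connection (e.g. PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (op_id TEXT PRIMARY KEY, amount INTEGER)")

def record_payment(op_id, amount, retries=3):
    """Idempotent write: keyed by a client-generated op_id, so replaying
    the same operation after an ambiguous commit error is harmless."""
    for _attempt in range(retries):
        try:
            # INSERT OR IGNORE makes the retry a no-op if the first
            # attempt actually committed before the error was reported.
            conn.execute(
                "INSERT OR IGNORE INTO payments VALUES (?, ?)",
                (op_id, amount),
            )
            conn.commit()
            return
        except sqlite3.OperationalError:
            continue  # ambiguous failure: safe to retry thanks to op_id

# Even if the client retries the "same" write, only one row lands.
record_payment("op-123", 500)
record_payment("op-123", 500)  # retry of an ambiguous commit
count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(count)  # -> 1
```

Without the unique key, a retry after a commit error that had actually succeeded would silently apply the write twice.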
> Yeah, basically with PostgreSQL it boils down to the fact that even if a client receives an error on commit, its writes may in fact have succeeded -- and it cannot assume that they haven't.
This is true for almost every database. Otherwise they would require a distributed transaction between the client and server.
Not quite sure whether something like multixact data corruption is symptomatic of the kind of underlying issues in database system implementations that Jepsen is after. I may be wrong though.
It wasn't. It's also rather difficult to trigger. I have no doubt that aphyr could do stuff like that if he so chose, but I think pounding on getting a reproducible test case for PostgreSQL's subtle serializable bug would be more his speed.
I really dig the Riak test. One of the best comparisons of LWW and CRDTs. Also a very good conclusion, tl;dr: if you don't use idempotent writes, you're going to have a bad time.
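A toy illustration of the LWW-vs-CRDT difference from that test (this is not Riak's API, just the merge semantics): under last-write-wins, concurrent writes race on a timestamp and one silently clobbers the other, while a grow-only-set CRDT merges by set union so no write is ever lost:

```python
# LWW register: two replicas accept concurrent writes; merge keeps
# whichever has the later timestamp, silently dropping the other.
replica_a = {"ts": 1, "value": {"alice"}}
replica_b = {"ts": 2, "value": {"bob"}}  # concurrent, later clock

def lww_merge(a, b):
    return a if a["ts"] > b["ts"] else b

merged_lww = lww_merge(replica_a, replica_b)["value"]

# G-set CRDT: merge is set union, which is commutative, associative,
# and idempotent, so concurrent writes all survive.
def gset_merge(a, b):
    return a | b

merged_crdt = gset_merge({"alice"}, {"bob"})

print(merged_lww)   # {'bob'} -- alice's write was lost
print(merged_crdt)  # both writes survive
```

The idempotence of the merge is also what makes retries safe: merging the same element in twice changes nothing.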
On a human note: Kyle is brave as fuck for stepping out into the black unknown yawning abyss to try to do this under his own banner. Bravo, Kyle, and may you find incredible success.
The other side of that coin is that these reports are always at the top of the HN front page, and likely on Reddit too. They are popular because they are great, and I don't want to live in a world where people can't be paid fairly to do great things.
Jepsen was developed by aphyr/kyle. Stripe hired him early this year (January 2015[0]), I understand in part to support his Jepsen work, but he's been working on it since 2013[1].
So to answer the question: no, Jepsen was not "developed at Stripe", though part of its development probably happened there.
What I find really interesting is that while it appears to me that Katz has very sparse knowledge of databases, he has still managed to create one or two of them, and successfully run a company that sells one. I don't like the thought I had next, but: perhaps knowing about consensus algos and papers actually doesn't matter?
Well, matter for what? For building software that seems to work, sure, it doesn't. But for being able to speak to a system's capabilities -- or even know what they are -- it's definitely important, especially in the case of distributed systems.
This is why so many of the new DB offerings have docs that are either misleading or wrong -- often the doc writers, or even the devs themselves, don't know what invariants their system can guarantee and assume several that it can't.
Depends on the problem you want to solve. CouchDB is a very good document database when you have more reads than writes and when writers are mostly writing to different documents. If I understand correctly, replication conflict resolution is left up to the application. So knowing about consensus algorithms is unnecessary for CouchDB because it doesn't support consensus.
I think it's smart to know a little about the literature; enough to know which battles you don't want to fight. It is not smart to try to develop an engine on the premise that it will need to beat the laws of thermodynamics to be possible.
But also, “They did not know it was impossible so they did it”, Mark Twain