One of the coolest things about Postgres is how easy it is to make your own datatypes[1]. At previous jobs we've had custom types and aggregates that allowed super fast set operations with comparatively small data storage requirements.
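For example, a two-line composite type (a minimal sketch, following the pattern in the CREATE TYPE docs):

    CREATE TYPE complex AS (
        r double precision,
        i double precision
    );
    SELECT ROW(1.0, 2.0)::complex;  -- build a value of the new type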
One of the as-yet little-noticed features of the upcoming 9.4 release greatly increases the ability to optimize moving aggregates for certain types of numerical operations, including custom aggregates. See http://www.depesz.com/2014/04/13/waiting-for-9-4-provide-mov... for a better explanation.
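A minimal sketch of what that enables (the aggregate name is made up; numeric_add/numeric_sub are the built-in functions behind + and - on numeric):

    CREATE AGGREGATE mov_sum (numeric) (
        sfunc     = numeric_add,
        stype     = numeric,
        initcond  = '0',
        msfunc    = numeric_add,   -- moving-state forward transition
        minvfunc  = numeric_sub,   -- inverse transition: remove departing rows
        mstype    = numeric,
        minitcond = '0'
    );

Used over a sliding window frame, 9.4 can then subtract the rows leaving the frame instead of recomputing the whole frame from scratch.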
They already have a quantity domain... it's called "integer". While I agree that domains are important and should be utilized, I hope your examples are exaggerations.
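To be fair, an actual quantity domain is a one-liner on top of that integer (a minimal sketch; the name and check constraint are illustrative):

    CREATE DOMAIN quantity AS integer
        CHECK (VALUE >= 0);
    CREATE TABLE line_items (qty quantity NOT NULL);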
If your ORM doesn't let you take advantage of these goodies, maybe it's time you went beyond the ORM? Java's jOOQ library is stellar at this kind of raw SQL access stuff, while still being expressive and well-typed.
I was surprised to see jOOQ went commercial. It's still open source, though. The paid versions apparently just include direct support, a warranty, and out-of-the-box support for Big Corp databases.
What is wrong with paying for software? It's a very reasonable license, too: cost-free for open-source databases, just without immediate support. This is classic Free Software licensing.
Although, unlike us, Typesafe is completely opaque about pricing, both for licensing and for support subscriptions. Have you asked how much it will cost to go to production with Slick and an Oracle database? You will be in for a surprise!
With jOOQ's transparent licensing strategy there are no strings attached.
PostGIS absolutely deserves some attention, as it's one of the best (if not the best) GIS databases out there. It felt too big to include here, but I should at least add a small callout.
PostGIS is awesome... but the geometry functions built into vanilla Postgres provide an interesting set of functionality that almost no one seems to know about: points, boxes, circles, intersection/distance/etc. functions. Really cool stuff!
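For instance (all built into vanilla Postgres, no extension needed):

    SELECT point '(1,2)' <-> point '(4,6)';             -- distance: 5
    SELECT box '((0,0),(2,2))' && box '((1,1),(3,3))';  -- overlap: true
    SELECT circle '((0,0),5)' @> point '(1,1)';         -- containment: true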
Also worth mentioning that a lot of people go for PostGIS when really they just need the earthdistance module. Much more lightweight and very very useful.
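A quick sketch (the coordinates are just examples):

    CREATE EXTENSION cube;           -- earthdistance depends on cube
    CREATE EXTENSION earthdistance;
    -- great-circle distance in meters between two (lat, lon) points
    SELECT earth_distance(ll_to_earth(40.71, -74.00),   -- New York
                          ll_to_earth(51.51,  -0.13));  -- London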
Hstore, and the supplied functions for converting to/from records, has been very useful for making a "one size fits all" history table.
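Roughly like this (a sketch; the table and column names are hypothetical):

    CREATE EXTENSION hstore;
    CREATE TABLE history (
        tbl        text        NOT NULL,
        changed_at timestamptz NOT NULL DEFAULT now(),
        old_row    hstore
    );
    -- in a row-level audit trigger function:
    --   INSERT INTO history (tbl, old_row) VALUES (TG_TABLE_NAME, hstore(OLD));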
Currently I'm working with numeric range types to model order quantities and stock quantities, and forecasting which order is going to consume what stock by joining on rows with overlapping ranges, then taking the intersections of those overlaps. Again, Postgres supplies functions and operators for range overlaps, intersections, etc.
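Something along these lines (a sketch; the orders/stock tables and their numrange columns are hypothetical):

    -- && tests overlap, * takes the intersection of two ranges
    SELECT o.id, s.id, o.qty * s.qty AS allocation
    FROM   orders o
    JOIN   stock  s ON o.qty && s.qty;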
In the absence of those datatypes, there'd be a lot more work required to achieve either of these.
The ltree module is really awesome. If you have to do any hierarchical stuff I feel it's much better than nested set implementations. Great performance with gist indexes and high level operators for querying.
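A small sketch (table name made up):

    CREATE EXTENSION ltree;
    CREATE TABLE categories (path ltree);
    CREATE INDEX ON categories USING gist (path);
    SELECT * FROM categories WHERE path <@ 'Top.Science';    -- all descendants
    SELECT * FROM categories WHERE path ~ 'Top.*.Astronomy'; -- lquery pattern match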
I have to add that RhodiumToad is a class act on IRC, and an SQL wizard. I was using the ip4r datatype for something really annoying at work, and eventually needed some hairy SQL. He steered me away from recursive queries into window functions, and wrote a correct query to solve my problem (basically normalizing overlapping IP ranges and some associated metadata with each range) in less than 30 minutes.
I really couldn't say how much time and money he saved me, but it was a lot.
Joe, this is definitely an awesome datatype; what's painful is typically building it, or at least it was years ago when I dealt with it. I'm sure it's gotten better, but still, I'd love to see this packaged up as a contrib module just like hstore.
If you run Debian (or a derivative) you can install it using apt-get from apt.postgresql.org, and then just run "CREATE EXTENSION ip4r;" to install it into a specific database.
Why aren't they getting a lot of use? Is it because they're new, or because ORMs don't support them yet? (Note: this might sound like a false dichotomy, but these are the only two reasons I know of, so feel free to add more.)
Hstore is definitely an odd one in that it's an extension, so it's sort of core, sort of not. For the others, it's likely just that people default to what the ORM does. Hstore actually sees 13% usage across production databases on Heroku, so while that's something, I'd still consider it much lower than it probably should be.
Because these are Postgres-specific features I think.
I suspect that for a lot of developers, the only database features they use are the ones their ORM of choice supports. The popular ORMs out there tend to be database-agnostic, which means they'll only natively support the lowest-common-denominator database features.
I keep seeing the UUID idea for pkeys, but have yet to see them used by any big site. Last I looked, Twitter and Facebook were using bigints all over the place. Also, Instagram went bigint and blogged about it here: http://instagram-engineering.tumblr.com/post/10853187575/sha...
Also, I tried the uuid extension once... it is not well supported; I had to make a small change to a C header file to get it to compile on Ubuntu, and for dev on OS X I think I gave up.
The "api userID" for facebook/twitter/etc. is numeric because it's always been numeric (and facebook explicitly doesn't guarantee anything about their lengths IIRC); that doesn't imply anything about what their real pkey is, just that it was numeric back in 2003. It's pretty easy to transform a UUID back and forth into a number.
UUIDs aren't sortable by order of creation (on a per-server basis, which is useful in distributed algorithms), and while you'll hear a lot of talk about meteors, lightning strikes, and winning the lottery, you do risk collisions: when it comes to thousands of requests per second per server, millions of servers, and extended periods of time, you reach ridiculous collision rates even with a 128-bit UUID, thanks to the Birthday paradox.
UUID has no guaranteed properties you can rely on in high volumes.
The only desirable property it has is that it's an easy way out in tutorials.
The better solution is simply longer to explain: have a unique name for each server, an incremental generation number every time the server is restarted, and an incremental number on top of that for the runtime of a given generation (the way Erlang does it).
That's not as much as you might think in computer systems.
Now, if you want to mark your product SKU with a UUID, you have an outstanding chance it'll be unique amongst other UUID SKUs in the scope of every shop selling your SKUs.
But people have been captured by the idea that UUID is truly "universally unique", so you can use it for everything, at any volume.
Not at all. Let's say that we're implementing a distributed actor system (or a similar message-passing architecture) at the scale of Google. We'll be using UUIDs to tag every message, to guarantee its identity and various messaging properties (like delivering just once, etc.), because UUIDs are unique. While actor systems can reach millions of messages per server per second, here we'll use a humble 100k messages per second.
- They have over 2 million servers (2,000,000)
- Each of which generates 100k messages per second.
You reach a point of 50% collision rate (for every new UUID) in 3 years:
2^64 is roughly the number of 128-bit UUIDs you need to reach that collision rate due to the Birthday paradox.
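Working those numbers out:

    2,000,000 servers × 100,000 UUIDs/sec = 2×10^11 UUIDs/sec
    2^64 ≈ 1.8×10^19 UUIDs
    1.8×10^19 / 2×10^11 ≈ 9×10^7 seconds ≈ 3 years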
Now you might say, I have a lot of variables pegged to worst case scenario. Sure I have. But this is about 50 damn percent chance of collision on every new UUID generated at that point.
Things will become bad long before 50% collision chance if you use UUID.
"But this is about 50 damn percent chance of collision on every new UUID generated at that point."
That's incorrect. If you go back to the birthday paradox, what you say would mean that, if you have 23 people in a room, and a 24th walks in, there is a 50% chance that that new person shares their birthday with one of those 23. That clearly is incorrect; that probability is at most 23/365 < 0.10.
Also, I don't see how your formula proves it. You will go through 2^64 of the 2^128 keys in 3 years, but that means you only got through a 2^-64 fraction of the key space, so each next UUID has a chance of about 2^-64 of colliding.
It's the sheer number of lottery draws with ever increasing probability of a loss that introduces the birthday paradox, not the last lottery.
Indeed, thanks for clarifying my overly dramatic point with a more proper definition of the Birthday paradox.
Yet, even once corrected, a 50% chance of a collision somewhere in a set of 2^64 UUIDs (which can occur way earlier than 2^64) is way too much for me to go to sleep at night, knowing there are more efficient, smaller (often half the size, e.g. a 64-bit PK instead of a 128-bit one), guaranteed collision-free ways of producing PKs.
And there's another problem with UUID collision rates: many versions of the UUID rely on a PRNG, and PRNG quality varies wildly from system to system.
A defect in the PRNG (it has happened) can start producing UUIDs with orders-of-magnitude higher collision rates than the model predicts. It's a problem that's better not to have.
While I know it's out of favour, you don't have to create type 4 (random) UUIDs. Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'.
This may bring issues in untrusted contexts, but it doesn't involve global (or indeed any) synchronisation outside of each machine, and you may define your 'machine' for UUID purposes to be any size you like -- one per core may work, or one per application, or per application thread even.
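In Postgres terms, the uuid-ossp module exposes both flavours (a sketch):

    CREATE EXTENSION "uuid-ossp";
    SELECT uuid_generate_v1();  -- type 1: timestamp + MAC address
    SELECT uuid_generate_v4();  -- type 4: fully random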
> "Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'."
That's a pretty big "if" there, especially considering how unreliable timers are, and the fact that they can jump back and forth.
- Wall clock timers will jump forward or back and repeat time after NTP adjustment, or anyone else who adjusts the clock.
- Internal "elapsed time" timers, based on say the CPU clock may jump around or repeat when the source core changes on multicore machines, or you have processes sourcing their CPU clock from the core they're bound on.
- A node may be moved from machine to machine (and their clocks are not necessarily perfectly synched).
This is a hard lesson that distributed system designers learn over and over again: don't rely on randomness and don't rely on timers for identity. Good old incrementing counters have none of those issues and are absolutely trivial to implement. They never fail, never drift, never repeat, by definition. As I said a few posts above, the typical structure of a "unique id" based on a counter has three counters:
1. nodename (or machine name if you will)
2. namespace (you can have one per thread to avoid contention issues)
3. local (monotonically increases within the namespace).
Let's say your node id is 256, and it runs on 4 threads, one per core, plus 1 thread for the scheduler. You need five namespaces:

256.0-4.*
And once you have to reboot the node, you obtain five new namespaces:
256.5-9.*
And the * counters start over from zero in each. As for how big each segment should be, that depends on what you want to do with the nodes, I suppose. But the 128-bit UUID size should never be some kind of guide regarding size: maybe you need three shorts, maybe three longs, maybe short.medium.long, etc. It'll depend on the use case.
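A minimal SQL sketch of that layout, with made-up segment widths (16-bit node, 8-bit generation, 40-bit local counter):

    CREATE SEQUENCE local_counter;
    SELECT (256::bigint << 48)        -- node id
         | (5::bigint << 40)          -- generation, bumped on every restart
         | nextval('local_counter');  -- monotonic within the generation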
But that probability includes the possibility of the first generated message colliding with the last one, and storing that many messages (100k/sec, 2 million servers, 8 bytes/UUID) is 4×10^17 bytes (400 petabytes) alone.
I guess most of the messages created in such a system will be ephemeral...
There’s seldom a case you shouldn’t be using these.
For recording most event times that's true, but:
Birthdates (which can include times) shouldn't use them.
Calendaring should think very carefully too: should the 9am meeting move to 10am due to DST? Or should it "stay put" but potentially move for other timezones?
Definitely some great points and valid cases. In the calendaring case I actually might enjoy the shift, but that would probably be a surprise to most. I'll work on adding examples of where you wouldn't want the "with time zone" behavior.
I think that's because the common usage of birthdates doesn't take the timezone into account. E.g. if you were born on January 1st at 9pm PST, that's January 2nd at 12am EST, but your birthday is still considered January 1st no matter where you are in the world. So trying to correct such dates for timezones will show users an incorrect birthdate.
Birthdates are particularly difficult to handle properly in a database if there is even a slight chance that you'll need to enter an immigrant from another culture into the data.
Just for instance, older Iraqis (if I remember correctly) tend to be part of a culture where individual birthdates were never recorded, and each person was considered 1 year older on some date in September (maybe even reckoned on the Islamic calendar). Should you store that date in your database, pretending that it means the same thing as the birthdate of an American who knows they were born at 7:08pm EST on July 17th, 1977?
And that's just one example of many.
I think I learned about this one because intelligence officials were freaking out when they eventually noticed all these people sharing a "birthdate", and it made the news.
Data about human beings is highly un-normalizable, as evidenced by everyone trying to stuff anthroponyms into the American "first name, middle name, last name" paradigm.
The postgresql manpage says it loud: don't fucking use timestamp with time zone.
It is bound to political decisions, which makes it unusable compared to a UTC timestamp.
Plus, good practice says: store in UTC, present in local time.
Does anyone Read The Fantastic Manual these days?
Kind of confused here: I was under the impression that timestamp with time zone only did the converting to and from your session's timezone, but still always stored in UTC. Is that incorrect?
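One way to check the behavior yourself (a sketch; exact output formatting may vary):

    SET timezone = 'UTC';
    SELECT '2014-06-01 12:00-04'::timestamptz;  -- 2014-06-01 16:00:00+00
    SET timezone = 'America/New_York';
    SELECT '2014-06-01 16:00+00'::timestamptz;  -- 2014-06-01 12:00:00-04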
I wish arrays and hstore-like types were available in sqlite (not through plugins, but standard-wise).
I've used something else from my experiments in PostgreSQL: the TXID. This way I was able to track down changes in the database by keeping the previous TXID (and some other various bits), then polling again (or having the server call me) and only fetching the rows that have changed since my last TXID.
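A minimal sketch of that polling idea (table name hypothetical, and it glosses over in-flight transactions):

    ALTER TABLE items ADD COLUMN last_txid bigint NOT NULL DEFAULT txid_current();
    -- a row-level trigger would refresh last_txid on every UPDATE; each poll then runs:
    SELECT * FROM items WHERE last_txid > 123456;  -- 123456 = txid seen on the last poll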
I don't know; I followed the thread and it didn't seem conclusive, especially with the main critic stating "OTOH, it's a neat example of a fixed precision integer type".
I use it a lot, and yes, it is missing some casts but you can cast to decimal and go from there:
select '42.00'::money::decimal;
I've also used Stripe, and maybe it's a JavaScript thing, but they represent all their money in cents. Personally, I'd rather use the money type; it is easy for me to reason about.
Because pretty soon you will deal with a non-US currency...
For a "datatype" as commonly needed as "money", it has always struck me as strange that a more standardized and widely agreed upon approach to dealing with them, have not emerged.
Yes, amazing. I'm about to work on stats caching, and the range datatypes will be suuuper useful for ranges of timestamps when caching a stat every week, month, etc.
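E.g. something like (a sketch; names made up):

    CREATE TABLE stat_cache (
        period tstzrange NOT NULL,
        value  bigint    NOT NULL,
        EXCLUDE USING gist (period WITH &&)  -- forbid overlapping cache windows
    );
    SELECT value FROM stat_cache WHERE period @> now();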
[1]: http://www.postgresql.org/docs/9.3/static/sql-createtype.htm...