Rarely Used Postgres Datatypes (craigkerstiens.com)
129 points by narfz on May 7, 2014 | hide | past | favorite | 67 comments


One of the coolest things about Postgres is how easy it is to make your own datatypes[1]. At previous jobs we've had custom types and aggregates that allowed super fast set operations with comparatively small data storage requirements.

[1]: http://www.postgresql.org/docs/9.3/static/sql-createtype.htm...
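
For example, a minimal composite-type sketch (type and table names invented for illustration):

    create type complex as (r double precision, i double precision);

    create table measurements (
        id  serial primary key,
        val complex
    );

    insert into measurements (val) values (row(1.0, -2.5)::complex);
    select (val).r, (val).i from measurements;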


One of the (as yet) little-noticed features of the upcoming 9.4 release greatly increases the ability to optimize moving aggregates for certain types of numerical operations, including custom aggregates. See http://www.depesz.com/2014/04/13/waiting-for-9-4-provide-mov... for a better explanation.
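
Roughly, it's sliding-window queries like this hypothetical one (the ticks table and its columns are invented here) that benefit, since 9.4 lets aggregates such as avg() remove rows from the window frame instead of recomputing the whole frame each time:

    select ts,
           avg(price) over (order by ts
                            rows between 99 preceding and current row) as moving_avg
    from ticks;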


Yup, I recommend creating [Postgres domains] for everything that isn't already specifically a Postgres type.

Want to store an order id? Create an OrderId domain. Want to store a weight? Create a gram domain. Want to store a quantity? Create a quantity domain.

That way it becomes much easier to spot logical errors and maintain data consistency.

Also one can use it to create something like an auto ORM.

[A domain is like a user-defined child type of the system types that can carry some extra restrictions.]
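
A minimal sketch of that (domain and table names invented):

    create domain order_id as bigint check (value > 0);
    create domain gram     as numeric(12,3) check (value >= 0);

    create table shipments (
        id     order_id primary key,
        weight gram not null
    );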


They already have a quantity domain... it's called "integer". While I agree that domains are important and should be utilized, I hope your examples are exaggerations.


I love the idea of domains, but quantity does seem to be a poor use case.


You could use a domain to ensure a quantity can't be less than zero.
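
Something like:

    create domain quantity as integer check (value >= 0);

    select 5::quantity;     -- fine
    select (-1)::quantity;  -- ERROR: value for domain quantity violates check constraint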


This is also called "unsigned int". Postgres has those.


I guess quantity was a poor example, but I would disagree about it being the same as integer.

An integer is a whole number. Nothing else can be inferred from knowing that something is an integer.


If your ORM doesn't let you take advantage of these goodies, maybe it's time you went beyond the ORM? Java's jOOQ library is stellar at this kind of raw SQL access, while still being expressive and well-typed:

http://www.jooq.org/


How much is it?

"If you have to ask, you can't afford it."


I was surprised to see jOOQ go commercial. It's still open source, though. The paid versions apparently just add direct support, a warranty, and out-of-the-box support for Big Corp databases.

http://www.jooq.org/licensing


What is wrong with paying for software? It's a very reasonable license too: cost-free for open source databases, just without immediate support. This is classic Free Software licensing.


Slick, a DB access library for Scala, follows the same license pattern.

http://slick.typesafe.com/doc/2.0.1/extensions.html


Although, unlike us, Typesafe is completely opaque about pricing, both for licensing and for support subscriptions. Have you asked how much it will cost to go to production with Slick and an Oracle database? You will be in for a surprise!

With jOOQ's transparent licensing strategy there are no strings attached.


There's nothing wrong with paying for software, but enterprise support is a complete black box here.


Wrong: Commercial != Open Source
Correct: Commercial == Open Source

Postgres == Commercial & Open Source [1]: TRUE


Gratis for all open source databases: Postgres, MySQL, H2, SQLite... even NoSQLs. And worth every damn penny, should you pay for it.


Let's certainly not forget the awesome PostGIS extension with its spatial types and functions.


PostGIS absolutely deserves some attention, as it's one of, if not the, best GIS databases out there. It felt too big to include here, but I should at least add a small callout.


Thanks for the article! I started using postgres because of one of your slide decks and I've learned a lot from your writing over the last year or so.


PostGIS is awesome... but the geometry functions built into vanilla Postgres provide an interesting set of functionality that almost no one seems to know about: points, boxes, circles, intersection/distance/etc. functions. Really cool stuff!

http://www.postgresql.org/docs/9.3/static/datatype-geometric...
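
A quick taste of those built-in geometric operators:

    select point '(1,2)' <-> point '(4,6)';             -- distance: 5
    select circle '((0,0),5)' @> point '(3,3)';         -- circle contains point: true
    select box '((0,0),(2,2))' && box '((1,1),(3,3))';  -- boxes overlap: true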


Also worth mentioning that a lot of people go for PostGIS when really they just need the earthdistance module. It's much more lightweight and very useful.
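
A minimal sketch of the earthdistance route (the coordinates here are just examples):

    create extension cube;            -- earthdistance depends on cube
    create extension earthdistance;

    -- great-circle distance in meters between two (latitude, longitude) points
    select earth_distance(ll_to_earth(37.7749, -122.4194),   -- San Francisco
                          ll_to_earth(40.7128,  -74.0060));  -- New York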


Hadn't heard of it, thanks for the tip. PostGIS really shines when you need to manipulate geometry.


Hstore, with its supplied functions for converting to/from records, has been very useful for making a "one size fits all" history table.

Currently I'm working with numeric range types to model order quantities and stock quantities, and forecasting which order is going to consume what stock by joining on rows with overlapping ranges, then taking the intersections of those overlaps. Again, Postgres supplies functions and operators for range overlaps, intersections, etc.

In the absence of those datatypes, there'd be a lot more work required to achieve either of these.
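
Roughly what that looks like: the scalar operators first, then the join (table and column names are invented):

    select numrange(0, 100) && numrange(50, 150);  -- overlap: true
    select numrange(0, 100) *  numrange(50, 150);  -- intersection: [50,100)

    select o.order_no, s.batch_no,
           o.qty_range * s.qty_range as allocation
    from orders o
    join stock  s on o.qty_range && s.qty_range;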


The ltree module is really awesome. If you have to do any hierarchical stuff, I feel it's much better than nested set implementations. Great performance with GiST indexes, and high-level operators for querying.
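
A minimal ltree sketch (names invented):

    create extension ltree;

    create table categories (path ltree);
    create index on categories using gist (path);

    insert into categories values ('Top.Science.Astronomy');
    select * from categories where path <@ 'Top.Science';  -- all descendants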


The ip4/ip4r data type is fantastic as well if you just need to store an IP address.

(https://github.com/RhodiumToad/ip4r-historical/blob/master/R...)
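
If memory serves, usage looks roughly like this (operator spellings per the ip4r README; a bare address is a one-element range):

    create extension ip4r;

    select '10.0.0.0/8'::ip4r >>= '10.1.2.3'::ip4r;       -- range contains address: true
    select '10.0.0.0/8'::ip4r && '192.168.0.0/16'::ip4r;  -- ranges overlap: false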


I have to add that RhodiumToad is a class act on IRC, and an SQL wizard. I was using the ip4r datatype for something really annoying at work, and eventually needed some hairy SQL. He steered me away from recursive queries into window functions, and wrote a correct query to solve my problem (basically normalizing overlapping IP ranges and some associated metadata with each range) in less than 30 minutes.

I really couldn't say how much time and money he saved me, but it was a lot.


Joe, this is definitely an awesome datatype. What's typically painful is building it, or at least it was years ago when I dealt with it. I'm sure it's gotten better, but still, I'd love to see this packaged up as a contrib module just like hstore.


If you run Debian (or a derivative) you can download it using apt-get from apt.postgresql.org, and then just run "CREATE EXTENSION ip4r;" to install it into a specific database.

http://wiki.postgresql.org/wiki/Apt


It is an extension. You download the extension tarball[1], make, make install, then run 'create extension ip4r'.

[1] http://pgfoundry.org/frs/download.php/3383/ip4r-extension-2....


Why aren't they getting a lot of use? Is it because they're new, or because ORMs don't support them yet? (Note: this might sound like a false dichotomy, but these are the only two reasons I know of, so feel free to add more.)


Hstore is definitely an odd one in that it's an extension, so it's sort of core, sort of not. Otherwise it's likely just that people default to what the ORM does. Hstore actually sees 13% usage across production databases on Heroku, so while that's something, I'd still consider it much lower than it probably should be.


Because these are Postgres-specific features I think.

I suspect that for a lot of developers, the only database features they use are the ones their ORM of choice supports. The popular ORMs out there tend to be database-agnostic, which means they'll only natively support the lowest-common-denominator database features.


I keep seeing the uuid idea for pkeys, but have yet to see them used by any big site. Last I looked, Twitter and Facebook were using bigints all over the place. Instagram also went bigint and blogged about it here: http://instagram-engineering.tumblr.com/post/10853187575/sha...

Also, I tried the uuid extension once... It is not well supported; I had to make a small change to a C header file to get it to compile on Ubuntu, and for dev on OS X I think I gave up.
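
Once the build issues are sorted out, using it is at least simple (table name invented):

    create extension "uuid-ossp";  -- quotes needed because of the hyphen

    create table users (
        id uuid primary key default uuid_generate_v4()
    );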


The "api userID" for facebook/twitter/etc. is numeric because it's always been numeric (and facebook explicitly doesn't guarantee anything about their lengths IIRC); that doesn't imply anything about what their real pkey is, just that it was numeric back in 2003. It's pretty easy to transform a UUID back and forth into a number.

I've used UUID-as-pkey and had it work well.


UUIDs aren't sortable by order of creation (on a per-server basis, which is useful in distributed algorithms), and while you'll hear a lot of talk about meteors, lightning strikes, and winning the lottery, you do risk collisions (yes, when it comes to thousands of requests per second per server, millions of servers, and extended periods of time, you reach ridiculous collision rates even with a 128-bit UUID, thanks to the Birthday paradox).

UUID has no guaranteed properties you can rely on in high volumes.

The only desirable property it has is that it's an easy way out in tutorials.

The better solution is simply longer to explain (have a unique name for each server, an incremental generation number every time the server is restarted, and an incremental number on top of that for the runtime of a given generation; that's the way Erlang does it).


Wikipedia claims the birthday paradox means

"only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%."

http://en.wikipedia.org/wiki/Universally_unique_identifier#R...


That's not as much as you might think in computer systems.

Now, if you want to mark your product SKU with a UUID, you have an outstanding chance it'll be unique amongst other UUID SKUs in the scope of every shop selling your SKUs.

But people have been captured by the idea that a UUID is truly "universally unique", so you can use it for everything, at any volume.

Not at all. Let's say that we're implementing a distributed Actor system (or a similar message-passing architecture) at the scale of Google. We'll be using UUIDs to tag every message to guarantee its identity and various messaging properties (like deliver-exactly-once etc.), because UUIDs are unique. While actor systems can reach millions of messages per server per second, here we'll use a humble 100k messages per second.

- They have over 2 million servers (2,000,000)

- Each of which generates 100k messages per second.

You reach a point of 50% collision rate (for every new UUID) in 3 years:

    log2(2,000,000 * 100,000 * 3600 * 24 * 365 * 3) ≈ 64
2^64 is roughly the number of 128-bit UUIDs you need to reach that collision rate due to the Birthday paradox.

Now you might say I have a lot of variables pegged to the worst-case scenario. Sure I have. But this is about a 50 damn percent chance of collision on every new UUID generated at that point.

Things will become bad long before a 50% collision chance if you use UUIDs.


"But this is about 50 damn percent chance of collision on every new UUID generated at that point."

That's incorrect. If you go back to the birthday paradox, what you say would mean that, if you have 23 people in a room, and a 24th walks in, there is a 50% chance that that new person shares their birthday with one of those 23. That clearly is incorrect; that probability is at most 23/365 < 0.10.

Also, I don't see how your formula proves it. You will have generated about 2^64 UUIDs in 3 years, but that means you've only covered a 2^-64 fraction of the 2^128 key space, so each next UUID has a chance of about 2^-64 of a collision.

It's the sheer number of lottery draws with ever increasing probability of a loss that introduces the birthday paradox, not the last lottery.


Indeed, thanks for clarifying my overly dramatic point with a more proper definition of the Birthday paradox.

Yet, even once corrected, a 50% chance of a collision in a set of 2^64 UUIDs (which can occur way earlier than 2^64) is way too much for me to sleep at night, knowing there are more efficient, smaller (often half the size, e.g. a 64-bit PK instead of a 128-bit PK), guaranteed collision-free ways of producing PKs.

And there's another problem with UUID collision rates. Many UUID versions rely on a PRNG, and PRNG quality varies wildly from system to system.

A defect in the PRNG (it has happened) can start producing UUIDs with collision rates orders of magnitude higher than the model predicts. It's a problem that's better not to have.


While I know it's out of favour, you don't have to create version 4 (random) UUIDs. Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'.

This may bring issues in untrusted contexts, but it doesn't involve global (or indeed any) synchronisation outside of each machine, and you may define your 'machine' for UUID purposes to be any size you like -- one per core may work, or one per application, or per application thread even.


> "Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'."

That's a pretty big if there, especially considering how unreliable timers are, and the fact that they can jump back and forth.

- Wall clock timers will jump forward or back and repeat time after an NTP adjustment, or whenever anyone else adjusts the clock.

- Internal "elapsed time" timers, based on say the CPU clock may jump around or repeat when the source core changes on multicore machines, or you have processes sourcing their CPU clock from the core they're bound on.

- A node may be moved from machine to machine (and their clocks are not necessarily perfectly synched).

This is a hard lesson that distributed system designers learn over and over again: don't rely on randomness and don't rely on timers for identity. Good old incrementing counters have none of those issues and are absolutely trivial to implement. They never fail, never drift, never repeat, by definition. As I said a few posts above, the typical structure of a "unique id" based on a counter has three segments:

    1. nodename (or machine name if you will)

    2. namespace (you can have one per thread to avoid contention issues)

    3. local (monotonically increases within the namespace).
Let's say your node id is 256, and it runs 5 threads: one per core on 4 cores, plus 1 thread for the scheduler. You need five namespaces:

256.0-4.* (where * is increasing monotonically 0, 1, 2, 3, 4...)

And once you have to reboot the node, you obtain five new namespaces:

256.5-9.*

And the * counters start over from zero in each. As for how big each segment should be, that depends on what you want to do with the nodes, I suppose. But the 128-bit UUID size should never be some kind of guide; maybe you need three shorts, maybe you need three longs, maybe short.medium.long etc. It'll depend on the use case.
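
A rough Postgres flavoring of that scheme, as a sketch (all names invented; the generation row would be bumped once at node startup):

    create table node_generation (node int primary key, generation int not null);
    create sequence local_counter;

    -- at startup: update node_generation set generation = generation + 1 where node = 256;
    select format('%s.%s.%s',
                  256,                                                      -- nodename
                  (select generation from node_generation where node = 256),
                  nextval('local_counter')) as unique_id;                   -- local counter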


But that probability includes the possibility of the first generated message colliding with the last one, and storing that many messages (100k/sec, 2 million servers, 8 bytes per UUID) takes 4*10^17 bytes (400 petabytes) alone.

I guess most of the messages created in such a system will be ephemeral...


the only bit I'd prefer clarification on is:

Timestamp with Timezone

There’s seldom a case you shouldn’t be using these.

For recording most event times that's true but:

Birthdates (which can include times) shouldn't use them.

Calendaring applications should think very carefully too: should the 9am meeting move to 10am due to DST, or should it "stay put" but potentially move for other timezones?


Definitely some great points and valid cases. In the calendaring case I actually may enjoy the shift, but that would probably be a surprise to most. I'll work to add these examples of where you wouldn't want the with-timezone behavior.


Birthdates (which can include times) shouldn't use them.

I don't understand why you don't want to use timezone with birthdates (with time). Can you explain this, please?


I think because the common usage of birthdates doesn't take the timezone into account. E.g. if you were born on January 1st 9pm PST, that's January 2nd 12am EST, but your birthday is still considered January 1st no matter where you are in the world. So trying to correct such dates for timezones will show users an incorrect birthdate.
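
You can watch the shift happen in Postgres itself:

    set timezone = 'US/Eastern';
    select timestamptz '2000-01-01 21:00:00 PST';
    -- 2000-01-02 00:00:00-05: a different calendar day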


Birthdates are particularly difficult to handle properly in a database if there is even a slight chance that you'll need to enter an immigrant into the data.

Just for instance, older Iraqis (if I remember correctly) tend to be part of a culture where individual birthdates were never recorded, and each person was considered 1 year older on some September date (maybe even reckoned by the Islamic calendar). Should you store that date in your database, pretending that it means the same thing as it does for an American who knows they were born at 7:08pm EST on July 17th, 1977?

And that's just one example of many.

I think I learned about this one because intelligence officials were freaking out when they eventually noticed all these people sharing a "birthdate", and it made the news.

Data about human beings is highly un-normalizable, as evidenced by everyone trying to stuff anthroponyms into the American "first name, middle name, last name" paradigm.


http://www.postgresql.org/docs/9.1/static/datatype-datetime....

The PostgreSQL manual says it loud: don't fucking use timestamp with timezone. It is bound to political decisions that make it unusable compared to a UTC timestamp. Plus, good practice says: store UTC, present in local.

Does anyone Read The Fantastic Manual these days?


The docs say "don't use time with time zone".

That's very different from timestamp with time zone, which you should be using unless you have a very good reason not to.


Kind of confused here: I was under the impression that timestamp with timezone only did the conversion to and from your session timezone, but still always stored UTC. Is that incorrect?


You are correct.


I wish arrays and hstore-like types were available in SQLite (not through plugins, but as part of the standard distribution).

I've used something else from my experiments in PostgreSQL: the TXID. This way I was able to track changes in the database by keeping the previous TXID and some other various bits, then polling again (or making the server call me) and only asking for the rows that have changed since my last TXID.
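
A sketch of that polling approach (table name invented; the xid-to-bigint comparison via text is a crude hack that ignores transaction id wraparound):

    select txid_current();  -- save this as the high-water mark

    -- on the next poll, fetch rows written by later transactions
    select * from orders
    where xmin::text::bigint > 123456;  -- 123456 = previously saved txid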


The array type is actually one of the most interesting; I'd love to see the performance implications of those specifically covered in more detail.


Postgres is a great database, I'm particularly in love with hstore for the data that it makes sense for.


While I don't disagree with the coolness of uuid, how exactly do you 'run out of integers' ?


Here's a good example:

http://slashdot.org/story/06/11/09/1534204/slashdot-posting-...

Slashdot got over 16,777,216 (2^24) comments, overflowing the MySQL mediumint they were using for an index.
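
In Postgres terms, "running out" looks like this once a serial column hits the int4 ceiling:

    select 2147483647 + 1;          -- ERROR: integer out of range
    select 2147483647::bigint + 1;  -- 2147483648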



It's called transaction id wraparound, and you don't want to allow it to happen.

http://www.postgresql.org/docs/9.3/static/routine-vacuuming....

Fortunately PostgreSQL has autovacuum, so you probably don't need to worry about this.
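
You can keep an eye on it with the query from the routine-vacuuming docs:

    select datname, age(datfrozenxid) from pg_database;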


Yeah, running out of a bigint in PostgreSQL would be hard since they are 63 bits + sign.


I think the money type is awesome! Under the hood it uses int (I believe int32 even on 64bit) so it takes 50% less space than decimal.

So if you're only dealing with US currency, why not love the money type?


Postgres developers disagree and recommend numeric. It was once deprecated and then un-deprecated.

http://www.postgresql.org/message-id/b42b73150805270629h309f...


I don't know, I followed the thread and it didn't seem conclusive, especially with the main critic stating "OTOH, it's a neat example of a fixed precision integer type".

I use it a lot, and yes, it is missing some casts but you can cast to decimal and go from there:

select '42.00'::money::decimal;

I've also used Stripe, and maybe it's a JavaScript thing, but they represent all their money in cents. Personally, I'd rather use the money type; it's easier for me to reason about.


JavaScript represents all numbers as IEEE 754 floating point, and base-2 floating point isn't good for representing base-10 decimals.


Because pretty soon you will deal with a non-US currency...

For a datatype as commonly needed as money, it has always struck me as strange that a more standardized, widely agreed-upon approach to dealing with it has never emerged.


Why would it make any difference if one used different currencies?

Since the money type doesn't record which currency it holds, it's pretty obvious that one needs to store that in another column.


Yes, amazing. I'm about to work on stats caching, and the range datatypes will be suuuper useful for ranges of timestamps when caching a stat every week, month, etc.
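
Something along these lines, as a sketch (names invented; the exclusion constraint keeps cached periods from overlapping):

    create table stat_cache (
        period tstzrange not null,
        value  numeric   not null,
        exclude using gist (period with &&)
    );

    select value from stat_cache where period @> now();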



