So many thoughts on this. The community has definitely ebbed and flowed, on this...

andrenotgiant · on Oct 10, 2023

I have the honor of working with a Postgres ~committer~ contributor who was just over 25 when they first contributed! The story about their first commit is great:

They were testing SQL behavior for Materialize and thought to check that both systems handle interval functions identically. Being thorough, they tried something like:

  select interval '0.5 months 2147483647 days';

You can try it yourself on dbfiddle[0]. Instead of erroring, Postgres returned a bogus value, `{"days":-2147483634}`; you can read why here[1].

So naturally they decided to fix it in Postgres, which is why they contributed and why it's handled properly in 15+ [2]

[0] https://www.db-fiddle.com/f/ijT76fsmL99bHvXxhAtf7j/0

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...

[2] https://www.db-fiddle.com/f/i3KikCb72AN1EZpywErZvr/1
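The bogus value is consistent with a 32-bit signed overflow. Postgres spills fractional months into days at 30 days per month, so 0.5 months becomes 15 days, and adding that to 2147483647 wraps around. A minimal sketch of the arithmetic in shell, where `wrap_int32` is a hypothetical helper for illustration and not anything from the Postgres source:

```shell
# wrap_int32 (hypothetical helper): reduce a 64-bit shell integer to the
# value a 32-bit signed integer would hold after wrapping around.
wrap_int32() {
  v=$(( $1 & 4294967295 ))          # keep only the low 32 bits
  if [ "$v" -ge 2147483648 ]; then  # sign bit set: negative in two's complement
    v=$(( v - 4294967296 ))
  fi
  echo "$v"
}

days=2147483647      # INT32_MAX, the days field from the query
spill=$(( 30 / 2 ))  # assumption: 0.5 months spills over as 15 days
wrap_int32 $(( days + spill ))   # prints -2147483634
```

The printed value matches the `days` field in the output quoted above.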

koolba · on Oct 10, 2023

> I have the honor of working with a Postgres committer ...

That's not a committer, that's someone who submitted a patch that got committed. A committer is the one who actually applies the patch and can push the branch into the mainline repo. Committers decide if something is worthy of being merged.

Now that aside, yes this plus reviewing patches to get a wider feel for the codebase is how you eventually become a committer.

Best way to eat an elephant is one bite at a time.

craigkerstiens · on Oct 10, 2023

This is a common source of confusion for a ton of folks. Anyone can submit a patch, but commit bits are reserved for a much smaller list. The attitude is something like you commit it, you maintain it–so if bugs come in you'll spend your time fixing those for whatever time it takes vs. working on the next shiny feature that you're excited about for the next release.

There was a somewhat fuzzy list of "major" contributors (https://www.postgresql.org/community/contributors/), people who contributed major features, and then a list of other contributors. Depending on who you talk to, this is either dated or a pretty close, if imperfect, reflection of reality. In recent years they expanded the contributors list to include others who were contributing in non-code ways, though it's still a decent place to find people contributing to major feature sets.

Of course this is not to be confused with the core team, which is more like a steering committee, though not so much a steering committee for code and feature sets.

andrenotgiant · on Oct 10, 2023

Ahh, thanks for clarifying - now I better understand the significance of the OP's point about the rarity of younger COMMITTERs.

gavinray · on Oct 10, 2023

The thing about becoming a PG contributor is that the barrier to entry is fairly high.

I love Postgres so much I have a PG tattoo, but from the perspective of the two ways you can contribute:

- As a random user, in your free time: There's not a ton of "Good first issue" type tickets, where you can ease your way into PG dev by working on something that doesn't require context on many parts of the PG architecture and at least a little historical knowledge of why things are written the way they are. Also, it can be a bit intimidating to have your patches reviewed by the likes of Tom or Andres.

- As a developer for a paid PG company like EDB/PG Pros/Crunchy etc: It's a sort of Catch-22 scenario here, where it's difficult to get hired as a junior without having previous PG hacking experience, but the path to doing that is not the easiest thing in the world.

If I was going to work somewhere that wasn't $CURRENT_CO, it'd be somewhere doing PG work, but there's not a lot of viable avenues/inroads there.

hlinnaka · on Oct 10, 2023

PostgreSQL isn't that special as a codebase. Every codebase has its quirks, every project has its own processes and there's a learning curve. When you switch to a new job as a software engineer, you pick it up. PostgreSQL is no different: you can hire an engineer to work on PostgreSQL.

I'm not sure how well that path works in growing new contributors, though. In a usual company setting, the goals are better defined, and the company is in control. Once you reach the goals, mission accomplished. With an open source project it's more nebulous. Others might have different criteria and different priorities. You are not in control. Choosing the right problems to work on is important.

Other storage or database projects would be a good source of new contributors. If you have worked on another DBMS, you're already familiar with the domain, and the usual techniques and tradeoffs. But to stick around, you need some internal desire to contribute, not just achieve some specific goals.

harikb · on Oct 10, 2023

The biggest hurdle I see is that it is a C project, unfortunately something we can do nothing about. It is so much harder to trust that random code won't have serious implications for the database. It will take ages for someone to get comfortable with the pg-code-base way of handling errors, basic string manipulation, memory alloc/free etc.

I want to highlight the difference between "making a non-core contribution" and "understanding database internals". My point is that the former, not the latter, is the first hurdle.

I wanted to reuse builtin pg code to parse the printed statements from logs - I ended up writing a parser (in a non-C language) myself which was faster.

gavinray · on Oct 10, 2023

Couple of points in this post, so will address a few of them:

  "(Paraphrased) C is bad, and it takes forever to pick up the PG-specific C idioms"

There's probably not a productive conversation to be had about C as a language. I will say that as of C23, the language is not quite as barebones as it used to be and incorporates a lot of modern improvements.

On the topic of PG-specific C -- there are a handful of replacements for common operations that you use in PG. Things like "palloc/pfree", and the built-in macros for error and warning logging, etc.

I genuinely don't think it would take a motivated party more than a day or two to pick all of these up -- there aren't that many of them and they tend to map to things you're already used to.

  "I wanted to reuse builtin pg code to parse the printed statements from logs - I ended up writing a parser (in a non-C language) myself which was faster."

It's true that the core PG code isn't written in a modular way that's friendly to piecemeal integration into other projects (outside of libpq).

For THIS PARTICULAR case, the pganalyze team has actually extracted the PG parser so you can include it in your own projects:

https://github.com/pganalyze/libpg_query

zxexz · on Oct 10, 2023

libpg_query is a godsend of a library. I spent a lot of time writing a custom parser before I found it - was very happy to replace the whole thing. A major boon was the fingerprinting ability - one of my needs was to track query versions in metadata.

craigkerstiens · on Oct 10, 2023

I disagree on this. Yes it's C. But I've heard people comment "I don't like writing C, but I don't mind Postgres C".

The bigger hurdle which Peter mentioned in another thread is simply building up enough expertise with the system and having the right level of domain expertise.

stouset · on Oct 10, 2023

> Yes it's C. But I've heard people comment "I don't like writing C, but I don't mind Postgres C".

While "Postgres C" might be wonderful, in practice learning the project's unique idioms is yet another hurdle for newcomers to overcome.

eatonphil · on Oct 10, 2023

Every project has unique idioms. Let alone ones that are 30+ years old.

Idioms are a baked in cost of learning to contribute to any project.

fanf2 · on Oct 10, 2023

I found that I learned a lot when trying to write a logical decoding plugin. So I guess if you are a user of Postgres and there’s some small friction you could reduce by writing a plugin, it’s a good way to get started. Scratch your own itch, you don’t have to publish the results :-)

samaysharma · on Oct 10, 2023

I don't have the data for the average age, but I was recently in a conversation about how long it takes to become a committer after getting involved in Postgres by writing code for it.

So, I wrote a couple git commands like below [1] to figure out when someone was first named in a commit message vs when they made their first commit (as a committer) for the last 10 people who became committers.

The average time of involvement was ~8.9 years (just comparing month / year), with the lowest being ~6.5 years.

Obviously one could do better analysis but my goal was just to get an approximate understanding.

[1] git log --grep 'Name' --format=%cs | sort | head -1

git log --author 'Name' --format=%cs | sort | head -1
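The two commands above can be combined into a rough month-level comparison. This is a minimal sketch, not the script actually used: `first_date` and `months_between` are hypothetical helper names, and it assumes dates arrive in the YYYY-MM-DD form that `%cs` prints.

```shell
# first_date MODE NAME: earliest committer date (%cs, YYYY-MM-DD) among
# commits matching MODE, which is either --grep or --author.
first_date() {
  git log "$1" "$2" --format=%cs | sort | head -1
}

# months_between START END: gap in months between two YYYY-MM-DD dates,
# comparing only the month and year fields.
months_between() {
  sy=${1%%-*}; sm=${1#*-}; sm=${sm%%-*}
  ey=${2%%-*}; em=${2#*-}; em=${em%%-*}
  sm=${sm#0}; em=${em#0}   # drop a leading zero so 08/09 are not read as octal
  echo $(( (ey - sy) * 12 + (em - sm) ))
}

# Usage, inside a postgresql.git checkout ('Name' is a placeholder):
#   mentioned=$(first_date --grep 'Name')
#   committed=$(first_date --author 'Name')
#   months_between "$mentioned" "$committed"
```

Dividing the result by 12 gives a figure in years comparable to the averages quoted above.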

dist-epoch · on Oct 10, 2023

How much bigger (in lines of code) is Postgres now versus the one from 15 years ago?

Maybe it was more approachable for a 22yo then, you could figure out more of it.

Also, C was a standard language back then; today the kids are more likely to program in Rust than in C.

anarazel · on Oct 10, 2023

> How much bigger (in lines of code) is Postgres now versus the one from 15 years ago?

I was curious as well and wrote a very crude script to measure it:

  for t in $(git tag -l|grep -E 'REL.*_0$|REL[67]_[0-4]$'|grep -v REL2);do echo -ne "$t\t"; git ls-tree -r $t --object-only |xargs git show |grep -a -v '^\s+$'|wc -l;done
  REL6_1          270033
  REL6_2          320297
  REL6_3          386532
  REL7_0          630771
  REL7_1          843219
  REL7_2          986991
  REL7_3          1363668
  REL7_4          1492418
  REL8_0_0        1649775
  REL8_1_0        1702325
  REL8_2_0        1806170
  REL8_3_0        2017685
  REL8_4_0        1924918
  REL9_0_0        2011704
  REL9_1_0        2225796
  REL9_2_0        2290872
  REL9_3_0        2405598
  REL9_4_0        2487304
  REL9_5_0        2527906
  REL9_6_0        2632559
  REL_10_0        2534653
  REL_11_0        2771914
  REL_12_0        2697892
  REL_13_0        2822066
  REL_14_0        2980221
  REL_15_0        3054963
  REL_16_0        3351147

This is counting non-empty lines. It's definitely not a good measure of overall code size, as it includes things like regression tests "expected" files. But as that's true for all versions, it should still allow for a decent comparison.
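For anyone adapting the script, here is a minimal sketch of a non-blank line counter, assuming "non-empty" means a line with at least one non-whitespace character. `count_nonblank` is a hypothetical helper name. It uses a POSIX character class with `*` so that completely empty lines are excluded along with whitespace-only ones:

```shell
# count_nonblank (hypothetical helper): count stdin lines that contain
# at least one non-whitespace character. -a treats binary input as text.
count_nonblank() {
  grep -a -c -v '^[[:space:]]*$'
}

# Four input lines: code, empty, spaces plus a tab, code.
printf 'int x;\n\n   \t\nreturn x;\n' | count_nonblank   # prints 2
```

In the loop above it would take the place of the `grep ... | wc -l` stage, though the resulting counts would not necessarily match the table.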

8.3.0 was released 2008-02-01, with 2M non-empty lines, we're now at 3.4M.
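As a quick check of the growth between those two releases, using the REL8_3_0 and REL_16_0 counts from the table above (integer arithmetic only, since POSIX shell has no floating point):

```shell
old=2017685   # REL8_3_0 count from the table above
new=3351147   # REL_16_0 count from the table above

# REL_16_0 as an integer percentage of REL8_3_0.
echo "$(( new * 100 / old ))%"   # prints 166%
```

So by this crude measure the tree is roughly two thirds larger than it was at 8.3.0.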

pcthrowaway · on Oct 10, 2023

I suspect you'd get much more useful results by checking out the version tags and running `cloc` - https://github.com/AlDanial/cloc

monkchips · on Oct 16, 2023

great contribution here from Craig, in terms of the ebbs and flows and useful history. i had no idea about that cluster of folks under 22 with commit bits.