> "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
SQL has nothing to do with fast. Not sure what makes it any more testable than polars? Future-proof in what way? I guess they mean your SQL dialect won't have breaking changes?
I’m also a duckdb convert. All my notebooks have moved from Pandas and polars to Duckdb. It is faster to write and faster to read (after you return to a notebook after time away) and often faster to run. Certainly not slower to run.
My current habit is to suck down big datasets to parquet shards and then just query them with a wildcard in duckdb. I move to bigquery when doing true “big data” but a few GB of extract from BQ to a notebook VM disk and duckdb is super ergonomic and performant most of the time.
It’s the sql that I like. Being a veteran of when the world went mad for nosql it is just so nice to experience the revenge of sql.
I personally find polars easier to read/write than sql. Especially when you start doing UDFs with numpy/et. al. I think for me, duckdb's clear edge is the cli experience.
> It is faster to write and faster to read
At least on clickbench, polars and duckdb are roughly comparable (with polars edging out duckdb).
I use them both depending on which feels more natural for the task, often within the same project. The interop is easy and very high performance thanks to Apache Arrow: `df = duckdb.sql(sql).pl()` and `result = duckdb.sql("SELECT * FROM df")`.
For the bigger tasks, Exasol might also be a very neat option for you. We have a free personal edition that can scale regarding data volumes, #servers (MPP architecture) and complex workloads.
Recently, we have also compared ourselves against DuckDB and were 4 times faster even on a single node. We are in-memory optimized, but data doesn't need to fit in the RAM.
Author here. I wouldn't argue SQL or duckdb is _more_ testable than polars. But I think historically people have criticised SQL as being hard to test. Duckdb changes that.
I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines. I've seen this through the past five years of writing and maintaining a record linkage library. It generates SQL that can be executed against multiple backends. My library gets faster and faster year after year without me having to do anything, due to improvements in the SQL backends that handle things like vectorisation and parallelization for me. I imagine if I were to try and program the routines by hand, it would be significantly slower since so much work has gone into optimising SQL engines.
In terms of future proof - yes in the sense that the code will still be easy to run in 20 years time.
> I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines.
Yeah, but SQL isn't really portable between query all query engines. You always have to be speaking the same dialect. Also, SQL isn't the only "declarative" dsl, polars's lazyframe api is similarly declarative. Technically Ibis's dataframe dsl also works as a multi-frontend declarative query language. Or even substrait.
Anways my point is that SQL is not inherently a faster paradigm than "dataframes", but that you're conflating declarative query planning with SQL.
> "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
SQL has nothing to do with fast. Not sure what makes it any more testable than polars? Future-proof in what way? I guess they mean your SQL dialect won't have breaking changes?