People who take this attitude against distributed processing usually have never had to process more data than fits on one machine, or have always had the budget to fit everything on one very expensive large machine. It's lack of experience masquerading as cleverness. The only people with a right to make this argument are those who spread all their processing out as streams on one machine, or who use distributed streams, but even that has serious limitations.
If there were something better than Spark for distributed processing, we would be using it. The rest of your comment is a straw man, assuming everybody uses it for datasets that fit in the memory of a single node.
The first step is to decide whether you really need distributed data processing, and I think that's the point the author is making. I've seen GB-sized data treated as "BIG DATA", and the architectural patterns built to support it are unbelievable.
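For a sense of scale, here is a minimal single-machine sketch (assuming a hypothetical events.csv of a few GB with customer_id and amount columns) doing the kind of aggregation that often gets wrapped in a whole cluster:

    # Single-machine aggregation with DuckDB; no cluster, no Spark session.
    # 'events.csv' and its columns are hypothetical, just for illustration.
    import duckdb

    top = duckdb.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM read_csv_auto('events.csv')   -- multi-GB file, streamed from disk
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """).df()
    print(top)

Something like this (or plain pandas with chunked reads) runs comfortably on a laptop; reaching for Spark at that scale mostly adds operational overhead.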
Spark and PySpark are just a PITA to the max.