Yeah, MLflow is a shitshow. The docs seem designed to confuse, the API makes Pandas look good, and the internal data model is badly designed and exposed, as the article says.
But, hordes of architects and managers who almost have a clue have been conditioned to want and expect MLflow. And it's baked into Databricks too, so for most purposes you'll be stuck with it.
Props to the author for daring to challenge the status quo.
I have never seen a worse documented library. Initially I thought that they were lazy, now I realize that it cannot be documented because it is a total mess of a library held together with tape.
Docstrings are one thing, but functionality discovery, picking up from scratch, troubleshooting, etc. are... not fun, nor easy with the documentation. If you know it well already and use it a lot, it's easier to forgive its documentation faults since you can wave off the problems as "that's just learning something new".
But for a lot of people who use it infrequently its documentation is a frustrating mess. Simple problems turn into significant time sinks of trying to find which page of the documentation to look at.
A lot of issues are made worse by shit-awful interop between libraries that claim to fully support dataframes, but often fail in non-obvious ways... meaning back to the documentation mines.
I'd argue that the fact there's a market for a single author to write two books about it is indicative of documentation problems.
Fair enough. I'm highly biased and my recent book is the most popular Pandas book currently, so it is evidence that folks prefer opinionated documentation.
However, I always thought the 10 minutes to Pandas page was decent for getting started. I picked up Polars recently and thought it was more difficult than Pandas because there weren't any quick intro docs. What projects have great introductory docs for you?
Also, I am curious to learn more about the specifics of interop libraries you are referring to.
Learning a new tool is generally a challenge. I think another challenge with a lot of data tools is that non-programmers tend to be the major audience. I make my living teaching "non-programmers" how to use these tools.
That said, I always teach "go to the docstrings and stay in your environment (to not break flow) if you can." The pydata docstrings are better than most, including Python (the language).
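That "stay in your environment" advice is cheap to follow in practice; a minimal sketch using only the standard library and pandas (the specific method picked here is just an example):

```python
import inspect

import pandas as pd

# Pull the docstring for DataFrame.merge without leaving the session.
doc = inspect.getdoc(pd.DataFrame.merge)

# The first line of a pydata docstring is a one-sentence summary.
print(doc.splitlines()[0])
```

In IPython or Jupyter the same lookup is just `pd.DataFrame.merge?`, which keeps you in flow even better.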
Yeah, I think for your audience, pandas makes total sense! When I first started using it, it was through an ambitiously large project with tons of gaps in the data, untype-able text in 1% of rows, data that didn't fit in memory, etc. So my personal experience is a bit tainted by putting myself through a hell that could have been solved sooner by spending more time learning instead of bashing my keyboard with a hammer.
I've long suspected that Pandas has taken a similar stance to e-mail scammers. Where e-mail scammers inject all kinds of broken English and bad punctuation to ensure they get only their targets of choice, Pandas has broken and often inaccurate documentation in order to get only the chosen ones to work with their software.
However, maybe it makes more sense that it's just a mess that's hard to document.
The Pandas documentation has improved quite a bit. Last I checked, the only part of the reference docs with a big gap was the description of "extension arrays" and accessors.
The user guide material absolutely needs work, and the examples in the reference docs tend to be a little contrived. But I absolutely have seen worse-documented libraries, such as Gunicorn and Pydantic.
I'm surprised to see Pydantic in here; I've used Pandas and Pydantic both quite a lot, and have found the Pydantic docs to be quite good! Also a much smaller library with a saner API, and thus easier to document well.
What makes the documentation so bad in your opinion? I’m not arguing but curious since I use pandas all day at my job and can’t think of any times the docs weren’t clear to me. (Plotly I have had some annoying times with!)
What bothers me the most is the egregiously permissive set of types accepted for any given argument.
If it's a string, do this. If it's a list, do that. If it's a dictionary of lists, do this other thing.
No, I want you to force me to provide my data in the right way and raise a noisy exception if I don't.
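To make the complaint concrete, here's a toy illustration of that dispatch-on-type behaviour in plain indexing (toy data, nothing beyond pandas itself):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A string key gives you a Series...
col = df["a"]

# ...a one-element list gives you a DataFrame holding the same data...
sub = df[["a"]]

# ...and a boolean Series silently switches to row filtering.
rows = df[df["a"] > 1]

print(type(col).__name__, type(sub).__name__, len(rows))
```

Three structurally different operations, all spelled `df[...]`, distinguished only by the runtime type of the argument.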
Series and DataFrame have "alternate constructors" for this purpose, and the loc/iloc accessors give you a bit more control.
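For the record, those stricter entry points do exist in the public API; a quick sketch (the `"missing"` label is just a made-up example):

```python
import pandas as pd

# The alternate constructors are explicit about the input shape they expect.
by_col = pd.DataFrame.from_dict({"a": [1, 2]}, orient="columns")
by_row = pd.DataFrame.from_records([{"a": 1}, {"a": 2}])

# .loc is strictly label-based and raises KeyError on a missing label,
# rather than silently reinterpreting the argument as something else.
try:
    by_col.loc["missing"]
except KeyError:
    print("loc raised KeyError as expected")
```

So you can opt into noisy failures; the complaint is really that the permissive behaviour is the default.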
I agree that the magic type auto-detection is a bit too magical and sloppy, but you have to realize that data analysts and scientists have historically been incredibly sloppy programmers who wanted as much magic as possible. It's only in recent years that researchers have begun to value some amount of discipline in their research code.
Every time I open up pandas I jealously remember the expressive beauty of R for these tasks. But because we're all "serious" of course we must use Python for production lest we not be serious.
R is a trash language. It doesn't have any sense of coherency to it at all.
They keep trying to fix the underlying problems by duct-taping paradigms onto it over and over (S3, S4, R6, etc.). There's never a clear sense of the best way to do anything, but plenty of options to do a thing in a very hacky 'script-kiddy' way.
Looking out at the community of different projects it becomes clear that everyone is pretty lost as to what design principles should be used for certain tasks, so every repo has its own way of doing things (I know personal style occurs in other languages, but commonalities are much less recognizable in R projects).
It's tragic that such a large community uses it.
Trash language is a bit harsh. I'm not sure I would try to put an R project into production or build a huge project with it but, at the very least, R/RStudio was the best scientific calculator I've ever used. It was particularly great during college.
Yep, this is the mark of someone who's never used R but has heard a lot of incredibly ill-informed criticism of it.
One look at dplyr code next to the pandas equivalent would of course disabuse anyone of the notion that R is trash, and the tragedy is that Python, in its current state, will never have anything like it. That's the advantage of a language influenced by Lisp versus one that isn't.
I agree that it is a trash language and that, aside from the fact that many frontier academic ideas land there first and some of its plotting defaults are solidly prescriptive, it should be thrown into the trash bin.
Python, Julia when it gets its druthers for TTFP, Octave, Fortran, C, and eventually Rust. These are the tools I've found in use over and over and over again across business, government, and non-profits.
Everywhere I have seen an org use R, I have seen major gaps in capacity to deliver, specifically because R doesn't scale well.
I'm not emotionally invested in tools so am happy to identify the user experience and operational experience as "trash."
"Trash", despite its connotations of lacking value, is really just a chaotic disorganized mess of something made by artifice with dubious reclaim/reuse/recycle value. Being a subjective assessment, it is natural that one person's trash is a treasure to another.
I take issue with your implication that I'm emotionally invested in something when I shouldn't be. You are free to dislike R and not use it, but to claim that it's "trash" is to wrongly disavow its usefulness for the many people that do find it useful, and to cast aspersions on the judgement of all those people.
Hey, I apologize here; my point on emotional investment was that I, personally, am not emotionally invested in it, and I did not mean to cast aspersions at you for your defense of the language, nor at people who have preferences for it. Specifically, I meant that I'm comfortable enough in my understanding of the language to classify it and its standard library as better off in the garbage bin relative to the alternatives available.
It's fine that people like it. What's good about it isn't unique, and what's unique about it isn't that great. And there are certainly switching costs for some orgs to consider.
It's forced upon many of them that are in finance, banking, insurance, ...
Mainly because those tend to run on Microsoft Azure, which has no decent analytics offering of its own and pushes Databricks extremely hard. The CTO or whatever just pushes Databricks. On paper it checks all the boxes: MLOps, notebooks, experiment management. It just does all of those things very badly, but the exec doesn't care. They only care about the Microsoft credits.
Partly it's just to avoid using Jupyter, so the compliance teams stay happy too, because Microsoft sales people scared them away from open source.
We pushed back on it very, very, very hard, and finally convinced "IT" to not turn off our big Linux server running JupyterHub. We actually ended up using Databricks (PySpark, Delta Lake, hosted MLFlow) quite a bit for various purposes, and were happy to have it available.
But the thought of forcing us into it as our only computing platform was a spine-chilling nightmare. Something that only a person who has no idea what data analysts and data scientists actually do all day would decide to do.
What would you go with instead for collaborative notebooks?
I ask because normally I tend pretty strongly towards the "NO just let the DSes/analysts work how they want to", which in this case would be running Jupyter locally. However DBr's notebooks seem genuinely useful.
Is your issue "but I don't need Spark", or "I wanna code in a Python project, not a notebook", or something else?
Imo if DBr cut their wedding to Spark and provided a Python-only nb environment they'd have a killer offering on their hands.
> What would you go with instead for collaborative notebooks?
Production workloads should be code. In source control. Like everybody else.
Notebooks inevitably degrade into confusing, messy blocks of “maybe applicable, maybe not” text, old results and plots embedded in the file because nobody stripped them before committing and comments like “don’t run cells below here”.
They’re acceptable only as a prototyping and exploration tool. Unfortunately, a whole “generation” of data scientists and engineers has been trained to basically only use notebooks.
It's ubiquitous. I've consulted for a 100 person company that built a data product on top of some IoT data. Everything was in databricks, literally everything. (Not endorsing that, just an observation)
Talking to a 2000+ person org now that is standardizing data science across the org using... you guessed it
Pretty interesting. I think this is part of the trend of releasing half-baked products: some of the stuff in there is really cool, just enough to get you in, but it doesn't scale and is usually complex to deploy/use.