
I don't understand the point of this function, when Excel already has Power Query. It doesn't seem like anyone who is literate in functional programming would want to use this, and anyone who isn't up to it wouldn't either.

One of the most annoying things about Excel is that it has so many parts apparently designed by people or groups that didn't talk to each other and didn't have a grasp of all the rest of it, let alone the world of the (various groups of) users.

Who ordered another Turing-complete system in Excel? One that is, like all the others, a pain and a half to debug or analyze? Has anyone figured out how to turn this into a security vulnerability yet?

Saying "yay people are making videos" only makes me think of all the horrific tutorials on Power Automate. And this: https://xkcd.com/763/



The fact that this is part of the main formula language and not some bolted-on thing means that it is somewhat incremental. Apply more of the incremental-Datalog-type research, and this really could be something neat. (The "grid calculus" would seem to indicate better linear algebra and maybe even tensor things are on the way, too!)

I hope this gets implemented in LibreOffice too; I will certainly tell non-programmers to stop using Python or whatever and go back to spreadsheets!


> I don't understand the point of this function, when Excel already has Power Query.

Because Power Query is not a spreadsheet application, and has some much more severe performance cliffs than Excel proper does.


To you and easton, my point is that even if Power Query has shortcomings, it's clearly the best thing to build on and improve, assuming VBA is dying a slow death and can't be revived. Even if, like, you wanted to make another separate language, it should still resemble Power Query, only better.

I don't think people at Microsoft are looking at Excel as a whole. They're like lost souls squatting in a mansion, building sand castles in the rooms they live in - castles that have no relationship to the actual building and what it needs to keep from falling down.

I'm not sure what you mean by performance cliffs. Can you give an example of where and how you would better accomplish something without Power Query? Are you talking about processing data in the range of a few hundred megabytes?


Power Query isn’t a replacement for Excel though - it’s a data preprocessing tool for analysts.

It’s not going to replace the functionality of core spreadsheet-based Excel for accountants, for instance, who typically won’t have a use for Power Query as their data is structured differently.


I am not an accountant per se, but I work for an accounting organization. Could you give an example of what you mean by "structured differently"?


Power Query assumes tabular data, where each column has a single data type and each row is a data element / entity. It is structured similarly to a database.

In a spreadsheet the data is much less structured, which is where a lot of the power comes from - for instance, Power Query doesn’t really support things like subtotals easily, or doing scratch calculations, or building quick financial models. A spreadsheet is closer to a paper ledger with calculations scribbled into the margins than to a big-data database.

Power Query is more about ingesting lots of data and cleaning it, while finance is often about working stuff out and playing with numbers to see what happens - and playing with numbers is easier in a less structured, loosely typed environment.


> Power Query doesn’t really support things like subtotals easily,

Subtotals? I was used to using GROUPING SETS with Oracle SQL, and found I could roll my own in Power Query. It's a good example of exactly why I like it.
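
For what it's worth, a minimal sketch of what rolling your own looks like in M (mock table, made-up column names):

    let
        Source = #table(
            {"Region", "Amount"},
            {{"East", 10}, {"East", 20}, {"West", 5}}),
        // one subtotal row per region, GROUPING SETS style
        Subtotals = Table.Group(
            Source, {"Region"},
            {{"Amount", each List.Sum([Amount]), type number}}),
        // plus a grand-total row
        GrandTotal = #table(
            {"Region", "Amount"},
            {{"Total", List.Sum(Source[Amount])}})
    in
        Table.Combine({Subtotals, GrandTotal})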

Also, Power Query doesn't prevent you from using the regular table total feature or a pivot table based off of the Power Query output.

That is, even if Power Query doesn't provide all the subtotaling features you'd like in the way you'd like, it doesn't restrict you from anything, does it?

> or doing scratch-calculations, or building quick financial models

I do use it to do all sorts of ad hoc calculations - for instance, it can ingest PDF files or HTML with tables.

It sounds as if you're saying it's too complicated for really trivial calculations?


> It sounds as if you're saying it's too complicated for really trivial calculations?

I'm saying it's not the right tool for some classes of calculations.

For instance, I work in designing warehouses and use both tools. Here are some use cases where Excel doesn't do well and I would use Power Query:

* Ingesting millions of historical orders

* Handling relational data

* Data cleaning and aggregations

Here are some example use cases where Power Query doesn't work as well, but Excel is perfectly good:

* What height should the pallet racking bays be in this warehouse, and how many pallets am I likely to fit in the building envelope? (considering my other space requirements)

* What's the likely transport impact of opening a new distribution point?

* Running lots of scenarios or sensitivities.

Why are these better in Excel? Well, there are just some things Power Query doesn't do well - for instance, an Excel cell can take any other arbitrary cell's value into account in its own calculation, while in Power Query you generally have to use an intermediary table and joins to handle this.

Can both tools physically do it? Yes, it's just that some problems suit one rather than the other, and identifying the right tool for the right problem saves you lots of time. One thing that makes Excel better for scratch calculations, for example, is the fact that it's a live environment (with Power Query you have to run the query after changes to get the results back, and this can be really slow compared to Excel).


Power Query's streaming semantics for tables and lists can lead to severe performance issues with even modest data volumes. Table.Buffer and List.Buffer offer some small amount of control, but it's likely that you have a pipeline that creates a series of intermediate table and/or list values. Every single table and list function (with the exception of the buffer functions mentioned above) creates a new lazy stream.
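
For illustration, a minimal sketch of where a buffer goes (file path and column name are made up):

    let
        Source = Csv.Document(File.Contents("C:\data\movements.csv")),
        Promoted = Table.PromoteHeaders(Source),
        // without this, every downstream step re-evaluates the lazy stream above
        Buffered = Table.Buffer(Promoted),
        Filtered = Table.SelectRows(Buffered, each [Qty] <> null)
    in
        Filtered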

Accumulation patterns perform abysmally even with data in the 100Ks of elements. Say you have a table of inventory movements and want instead a snapshot table of inventory at a point in time. You can do an O(n^2) self-join, matching each record to all records with a lesser date and summing those movements to derive a total quantity at that time.

If you want to use an accumulation pattern, you can sort and cast your table to a list of records and then use List.Accumulate to iterate over each list element, deriving a new field with the running total of inventory amount. If you do this, you will find that it falls right over even with 1Ks or 10Ks of records. This is because the intermediate list that you're appending to through the accumulation is itself a lazy stream. Thus, you have to use List.Buffer at each step. Even with List.Buffer at each step, this solution falls over at high 10Ks or low 100Ks of records.
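
A sketch of that pattern with mock data and made-up column names - note the List.Buffer wrapped around the appended state on every single iteration:

    let
        Movements = #table(
            {"Date", "Qty"},
            {{#date(2021, 1, 1), 5}, {#date(2021, 1, 2), -2}, {#date(2021, 1, 3), 7}}),
        Sorted = Table.Buffer(Table.Sort(Movements, {{"Date", Order.Ascending}})),
        Records = Table.ToRecords(Sorted),
        // accumulate a list of records, each carrying a running total;
        // without List.Buffer here, the appended state stays a lazy stream
        // that is re-evaluated on every iteration
        Accumulated = List.Accumulate(
            Records,
            {},
            (state, current) =>
                List.Buffer(
                    state & {Record.AddField(
                        current,
                        "RunningQty",
                        (if List.IsEmpty(state)
                         then 0
                         else List.Last(state)[RunningQty]) + current[Qty])}))
    in
        Table.FromRecords(Accumulated)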

Incredibly unintuitively, you can use List.Generate with an already-buffered input list to derive a new list that can then be cast back to a table, though this still struggles with 100Ks of records.
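
Something like this, as a hedged sketch (same mock data); the unintuitive part is that the running state lives in the generator's state record rather than in a growing list:

    let
        Movements = #table(
            {"Date", "Qty"},
            {{#date(2021, 1, 1), 5}, {#date(2021, 1, 2), -2}, {#date(2021, 1, 3), 7}}),
        Sorted = Table.Sort(Movements, {{"Date", Order.Ascending}}),
        // the input list must already be buffered or this gains nothing
        Records = List.Buffer(Table.ToRecords(Sorted)),
        n = List.Count(Records),
        Generated = List.Generate(
            () => [i = 0, total = Records{0}[Qty]],
            (s) => s[i] < n,
            (s) => [i = s[i] + 1,
                    total = s[total] +
                        (if s[i] + 1 < n then Records{s[i] + 1}[Qty] else 0)],
            (s) => Record.AddField(Records{s[i]}, "RunningQty", s[total]))
    in
        Table.FromRecords(Generated)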

If your snapshots can be aggregates, then you can happily throw out the idea of such an accumulation pattern and just join to a date table at the appropriate grain with all movement records less than or equal to the date in that date table.
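
A hedged sketch of that aggregate shape - the non-equi "join" is expressed here as an added column plus a filter, with made-up names:

    let
        Movements = #table(
            {"Date", "Qty"},
            {{#date(2021, 1, 1), 5}, {#date(2021, 1, 2), -2}, {#date(2021, 1, 3), 7}}),
        Dates = #table({"AsOf"}, {{#date(2021, 1, 2)}, {#date(2021, 1, 3)}}),
        // for each snapshot date, sum every movement on or before it
        Snapshot = Table.AddColumn(
            Dates, "OnHand",
            (d) => List.Sum(
                Table.SelectRows(Movements, each [Date] <= d[AsOf])[Qty]))
    in
        Snapshot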

I'll note that I regularly speak with several of the people whose blogs you will inevitably come across when performance tuning Power Query. The approaches above are the current state of the art in PQ for iteration and accumulation patterns. This is not an appeal to authority or a brag. This is to highlight the difference with the Excel spreadsheet formula approach below, which even beginners can derive from first principles.

In an Excel spreadsheet, for the same challenge, you just define a new column with a special first-row formula, with each subsequent cell referencing the row above. This will happily run right up to the spreadsheet row limit with no performance concerns. If you really want, you can spill over to multiple sheets, which is clunky to manage, but it still performs just fine and degrades slowly. The M approaches above hit a cliff and start hanging.
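
Concretely, with movements in column B, the running-total column is nothing more than this, filled down to the last row:

    C2:  =B2
    C3:  =C2+B3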

Excel formulas make it trivial to reference arbitrary cells. M is a nearly general-purpose language. PQ uses M, but as a framework for writing M, it has a strong emphasis on a query/table paradigm. A table-based operation model cuts against the grain of a spreadsheet, because a spreadsheet is a collection of arbitrary cells, while a tabular approach is a collection of similarly shaped records stacked one upon the other. These two paradigms have a fair amount of overlap, but are not isomorphic. There are things trivial to express in one that become difficult bordering on impossible in the other.


As someone developing essentially a competitor to Excel-and-PowerQuery/M, I find all this very interesting.

My language is strict and statically typed. However, once arrays (tables are conceptually arrays of records) exceed a certain length, rather than being processed in-memory as arrays, they are transparently offloaded to storage and processed in a streaming fashion.

I’m surprised that this doesn’t work well in Power Query. I would have thought that 100K records would be peanuts for it.

Mine is a SaaS however, so the user’s laptop isn’t a constraint, and I can transparently throw a million records into BigQuery or some other data warehouse and use its aggregates if needed. Although at the 100K scale you can use SQLite, which handles that volume trivially on commodity laptops.

So your experience is interesting indeed.


Feel free to reach out via email if you want to follow up. My address is in my profile.

I'll note, as I did to a sibling reply of yours, I made observations about a specific pattern that showcases performance issues in PQ/M. PQ/M easily scales beyond 100Ks of records, but not for arbitrary processing patterns.


I'm skeptical. I want an example, because my experience differs.


Power Query falls over with thousands of items?

That's not my experience. At work, the data usually isn't very large, but I have experimented on my own time with, for instance, a public COVID data file that I think was several GB.

I also thought lazy semantics is a good thing, not a fundamental flaw.

Rather than debate, I would be interested enough to spend some time on a sample problem where you believe Power Query is inadequate, if you could provide one, along with an alternate solution to serve as a benchmark of what is adequate.


Not "PQ falls over with 1Ks of items," but rather "the M language does not do well with accumulation patterns on tables; naive approaches can hit significant performance issues in the 1Ks of records and sophisticated approaches struggle with 100Ks."

These are two very different statements. I've happily used PQ to ingest GBs of data. Its streaming semantics are fine to great for some types of processing and introduce performance cliffs for others. There's no binary judgment to be made here. Laziness is neither a fundamental flaw nor an unmitigated good.

I've already shared one specific pattern above. I can share some mocked-up data if you need me to, but that might take a day or two. Also, feel free to reach out via email (in my profile).


>I've already shared one specific pattern above

If you mean this:

"Say you have a table of inventory movements and want instead a snapshot table of inventory at point in time"

Then I can make my own data to play with - I only want to be clear about the constraints. Would 500K records be enough to bring out the distinction between naive and non-naive approaches? Can you quantify (not precisely) "struggle"?

I have used Table.Buffer, but I probably don't thoroughly understand its use yet.

(I belatedly realized your problem is something I've done with SharePoint list history recently, but with not that many records, so I'm going to look for a public dataset to try)

P.P.S. I guess it also makes me think - I frequently am getting my data from an Oracle database, so if something is easier done there, I'd put it in the SQL. Analytic functions are convenient.

P.P.P.S. Aha! I found a file of parking meter transactions for 2020 in San Diego, which is about 140MB and almost 2 million records. This seems like a good test because not only is it well over the number you said was problematic, but it's well over the number of rows you can have directly in one Excel sheet.

https://data.sandiego.gov/datasets/parking-meters-transactio...


Ok, I agree that PQ is slow. It is possible to calculate a running total of a column in a million-row table before the sun burns out, though.

I am very much not an algorithms person, but I got a huge speedup from a "parallel prefix sum" instead of the obvious sequential approach or the even worse O(n^2).

I translated this to M by rote and trial and error (page 2): https://www.cs.utexas.edu/~plaxton/c/337/05f/slides/Parallel...

Implementing the parallel, recursive solution got me a million rows in about three and a half minutes.
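
Roughly, the recursive scheme looks like this in M (a sketch of the idea rather than the exact code I ran):

    let
        PrefixSum = (xs as list) as list =>
            let n = List.Count(xs) in
            if n <= 1 then xs
            else
                let
                    // sum adjacent pairs: {x0+x1, x2+x3, ...}
                    pairs = List.Transform(List.Split(xs, 2), List.Sum),
                    // recursively scan the half-size list
                    scanned = @PrefixSum(List.Buffer(pairs)),
                    // scanned{i} is the inclusive sum through pair i, so the
                    // even position is recovered by subtracting the odd element
                    expanded = List.Combine(
                        List.Transform(
                            List.Positions(scanned),
                            (i) =>
                                if 2 * i + 1 < n
                                then {scanned{i} - xs{2 * i + 1}, scanned{i}}
                                else {scanned{i}}))
                in
                    expanded
    in
        PrefixSum({1, 2, 3, 4, 5})  // {1, 3, 6, 10, 15}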

Fill down (which I had to do anyway to compare) was about 10 seconds.

So...probably not the first choice in this scenario but could be worse?


Nobody at MSFT talks to their counterparts. I cringe every time I have to copy and paste between Teams, Outlook, and OneNote. There have to be internal customers there with the same problems.


One that threw me for a loop recently was the spell check dictionary and suggestions.

Mess up in Outlook, and you can right-click and get a couple of suggestions. It'll call out the typo right after you finish the word.

Mess up in Teams? It'll wait until you finish the next word (charitably, giving you a second to figure it out?) and then suggest a different word than Outlook would.


It seems like they are long past the days of a CEO being able to say “Make this stuff work together.”

I’m still impressed with their renewal as a company. Rare for a stagnant tech firm to come back.


The truth is you can basically already do this in Excel without involving more complicated things like VBA, purely in the spreadsheet. You just need a lot more cells and longer formulas to set it up. Lots of VLOOKUPs. This will provide the people who are already doing this stuff some much-appreciated shortcuts.


You definitely don't want to write anything complicated with it, but I think it is a nice intermediate: more secure than macros, and closer in feel to writing Excel formulas than to implementing custom macros.

One thing that would greatly improve the experience would be to allow a formula to contain just a lambda and then to reference that lambda from another cell as a cell reference. Currently you have to manage lambdas under Formulas > Name Manager. This would make debugging a lot easier in my opinion, since you could freely mix data entry with computation. Not sure why they haven't done this already, but I suspect it is because of assumptions baked into Excel.
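
For context, the current workflow - the lambda has to be given a name first (Hypot is a made-up example):

    Defined under Formulas > Name Manager:
        Hypot = LAMBDA(x, y, SQRT(x^2 + y^2))

    Then, in any cell:
        =Hypot(3, 4)

A lambda typed directly into a cell can only be used by invoking it immediately, e.g. =LAMBDA(x, y, SQRT(x^2 + y^2))(3, 4); left bare it just evaluates to a #CALC! error, so there is currently no way to point another cell at it.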


> allow for formulas to contain just a lambda and then reference that lambda from another cell as a cell reference

My pet project from a couple of years ago[1] had cells-as-functions. I think it works really well. I also think names are important, but yeah they should either be easy or optional. Glad the Excel folks liked my rad idea though, even if they didn't quite hit all the high notes :-).

1: https://6gu.nz/, IMO worth watching the first minute of the video to see it in action


Power Query is only a thing in Excel for Windows though; it’s likely it will eventually be replaced with something cross-platform. And if you don’t want to make .NET load every time you open your workbook, you have this.


That doesn't sound like a reason to me. What in the fundamental nature of Power Query prevents it from being cross-platform? Or, conversely, what prevents something new that is cross-platform from working basically the same as Power Query, even if it must be strictly incompatible?

It's possible we're not talking about the same thing. Microsoft has slapped "Power" on so many different things. When I google "Power Query" I get a lot of "Power BI" stuff and I try to avoid that like the plague. In my limited experience, it's flaky, unstable, and adds negative value to my reports.

From my perspective, Power Query appears to be similar to the scripting language in something like QlikView, except much less painful (for me). I also think "grokking" Power Query could even lead to improving SQL. The split between SQL and things like PL/SQL or T-SQL always felt wrong to me. Having functions as a seamless part of the language seems like the thing that was always missing.



I think it’s mostly that Power Query (the thing in Excel that pops up when you click “Get and Transform Data” is what I’m talking about) requires .NET 4 at the moment, which means that they’d need to get it on .NET Core to get it on the Mac, and they would still have no way to get it on the web and on mobile. It’s a serious pain point for cross-platform compatibility in Excel at the moment, but I concur with you that we need to keep it or something similar around (especially since Microsoft keeps dragging their feet on adding first-class support for a scripting language that isn’t VBA).


I just clicked on the XKCD link and chuckled a bit. Then I thought "Hey, I wonder if there's a new XKCD".

https://xkcd.com/2453/ Wouldn't you know it?


This is going to make writing spreadsheets to do basic engineering calculations far more clear.



