
This was prompted by unexpectedly finding that in Python 2.7.4, the fragment:

    a2, c = a2+c, c+2
ran faster than the fragment:

    a2 += c
    c += 2
Context and timings in the post. Speculation welcome as to why this might be true, and your predictions for Python 3 invited.
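
For anyone wanting to reproduce this, here's a minimal harness using the standard timeit module (it runs under both Python 2 and 3; absolute numbers will vary by machine and interpreter version):

    import timeit

    # Each statement re-initialises the variables, then runs one fragment.
    for frag in ("a2, c = a2+c, c+2", "a2 += c; c += 2"):
        t = timeit.timeit("a2 = 1; c = 1; " + frag, number=10**7)
        print("%-20s %.3fs" % (frag, t))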


You're so far away from the hardware in an interpreted environment that reasoning about performance is fraught with danger. Happy paths can be an order of magnitude faster than code that is very similar but just different enough.

Usually, in an interpreted environment, I reason about other things: number of database calls, HTTP fetches, that kind of thing. If I need CPU performance, I'll fire up something faster.

Finding the meatiest primitive that does the job in one go is often best: batching, where the batch operation is implemented in native code. But not always; you're at the mercy of the implementer.
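
For example, a sketch of the batching idea (illustrative, not from the post): summing a list with the native-code sum() builtin versus an explicit interpreter-level loop:

    import timeit

    setup = "xs = list(range(10000))"
    loop = "total = 0\nfor x in xs: total += x"   # every iteration is interpreted
    batch = "total = sum(xs)"                     # one call into native code
    print(timeit.timeit(loop, setup=setup, number=1000))
    print(timeit.timeit(batch, setup=setup, number=1000))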


If you have performance bottlenecks, Python is the wrong language. The only place where this kind of investigation makes sense is if you're a Python VM dev.


> The only place where this kind of investigation makes sense is if you're a Python VM dev.

You are mistaken - this investigation does make sense in my context.


If you're having Python bottlenecks, it is the wrong language. First try PyPy, and if that doesn't/can't work, switch to Julia/Go/R/C++ as your domain requires.


You are right when you say:

> If you're having Python bottlenecks, it is the wrong language.

But you are missing the point. The object of the underlying exercise was not to make this specific instance of the algorithm run fast; Python was not the bottleneck. The activity was exploring the effects of algorithmic changes and investigating the Big-O performance of the different variants. As part of that, the underlying code was being reworked and modified to allow the larger algorithmic changes to be made, and this was one of those changes. It says so in the post:

> Then I made a small change to the code layout ready for another shot at an optimisation.

As always, I ran the timing tests (in addition to the functional tests) to make sure nothing substantial had changed, and there was an anomaly that caught my attention and piqued my curiosity. Hence the write-up.

And I believe you are wrong when you say that this sort of thing is, or should be, only of interest to Python VM devs. I think that curiosity about this sort of thing is, or should be, important to everyone. Perhaps you will then decide you don't have time to pursue it, or that your time is better spent elsewhere, and that's fair enough, but curiosity and a desire to learn have been a pervasive theme among the best devs I've ever worked with.

So again, you are wrong, this investigation does make sense in my context. Your mistake is in assuming the post is about a Python bottleneck. As I say, you've missed the point.


The issue is that this investigation is so biased by what is slow specifically in the Python VM that I'm not sure what the effects would be in a case where performance is actually something you're trying to optimize for. The moment you translate this into C, C++, Java, etc., you suddenly have a smart compiler that can deal with things like function inlining, and things like cache sizes and memory streaming speed present as bottlenecks instead of object-to-object pointer jumping and bytecode interpreter overhead.

Languages don't get slower on a single dimension, and computers have multiple dimensions of performance. Hitting a performance bottleneck in one dimension in one language doesn't mean you'll hit the same bottleneck in a different language. Different scales might have different bottlenecks as well; think exceeding the L2 cache, or finally using all your GPU cores on problems that parallelize with scale. If you're using Python, you'll never hit those other bottlenecks, because the VM is always 40x slower than a naive native implementation.

Unless, of course, your bottleneck is in native code, which is where the easy prototyping of Python really shines.
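
For instance (a sketch assuming NumPy as the native-backed library; not from the thread):

    import numpy as np

    a = np.arange(10_000_000, dtype=np.float64)
    # One bytecode dispatch at the Python level; the ten million adds
    # happen in compiled native code, so VM overhead is negligible.
    print(a.sum())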


> My guess is that in the first case the two evaluations and assignments can happen in parallel, and so may happen on different cores, whereas in the second case they must happen in serial, but that's just a guess.

That would require using multiple threads, which I'm pretty sure Python won't do automagically. And even if it did, trying to do those calculations concurrently would almost certainly be slower due to the additional synchronization overhead.

It would be interesting to know how different python implementations react to that change.


I thought that modern CPUs would send instruction streams down multiple cores, if available, and that the interpreter could and would then, in some sections, get executed in parallel for you. Perhaps I was wrong, and the parallelism happens at too fine a scale for the interpreter to get any advantage from it that would be visible for code like this. I'm probably suffering from the hangover of my experiences from 20 years ago writing schedulers on parallel machines for serial code.


A modern CPU may have multiple cores, each executing a serial instruction stream (or, with SMT aka hyperthreading, each core may execute some specific number of instruction streams). Within a core there may be several execution units, such as instruction decoders and adders and whatnot; if you have multiple adders, you can sometimes process arithmetic instructions in parallel. A CPU that can do that is a superscalar CPU, and the parallelism shown is often discussed as instruction-level parallelism.

Instruction-level parallelism (on common platforms) is an optimization the processor applies, based on the conditions when an instruction is decoded, but it isn't visible directly in the code. You would really need to inspect the low-level interpreter loop, with some idea of the specific bytecode it was processing and specifics about the processor you're using. Setting up the order of instructions to reduce data dependencies is one of the things an optimizing C compiler does to help the processor utilize instruction-level parallelism, but computing things with thread-level parallelism would require too much coordination: starting a thread is a many-stepped thing, and then transferring the values is too; even when writing to memory addresses, there's hidden coordination.


Red herring, perhaps - it seems due to the in-place add. It's quicker still if you take out the multiple assignment:

    $ for STMT in "a2, c = a2+c, c+2" "a2 += c; c += 2" "a2 = a2+c; c = c+2"; do
        echo \[$(python3 -m timeit "a2 = 1; c = 1; $STMT")\] $STMT
    done | sort -n
    [... 0.0547 usec per loop] a2 = a2+c; c = c+2
    [... 0.0569 usec per loop] a2, c = a2+c, c+2
    [... 0.0608 usec per loop] a2 += c; c += 2


You're using python3. My brief experiments showed it actually ran faster in python2. You might want to re-run your experiments to see if I've done anything stupid with that test.

But definitely it seemed that python2 and python3 had significantly different performance.


Oh you're right, that's interesting - have replicated your result.

"a2 = a2+c; c = c+2" is still fastest on both 2 and 3, though don't know enough to say why.


Try Python 3.7; IIRC it should be faster again.


Well that's... surprising.

Edit: did you compare the bytecode?


I could, but I would have a lot to learn, and the benefit would be minimal. Better to learn simply to be wary of these sorts of things, and always measure.

I'm hoping someone with a better background will find this sufficiently surprising and be provoked into doing that analysis, writing it up, and letting us know.

I've got it in my boomerang queue.


What is a boomerang queue?


I use a wiki, email, and a bunch of scripts to organise my projects (short and long term), tasks, reference material, and idle thoughts. As part of that system I use the "Do, Delegate, Defer, Delete" technique for things that come to hand.

So everything that comes to hand gets a unique, auto-generated reference, and I put tags on it. I don't try to be complete, I don't try to make sure tags are unique; I just put what's on my mind and seems relevant in a meta field. The system then auto-cross-references by content and tags, so when I put tags related to stuff I'm working on, it uses a sort of TF-IDF system to find relevant documents and other reference material and puts links between them.
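
A toy sketch of that cross-referencing idea (my reconstruction, not the actual scripts; document names are made up):

    import math
    from collections import Counter

    def tfidf(docs):
        # docs: {name: list of tokens (tags + words)}
        df = Counter()
        for toks in docs.values():
            df.update(set(toks))
        n = len(docs)
        return {name: {t: cnt * math.log(n / df[t])
                       for t, cnt in Counter(toks).items()}
                for name, toks in docs.items()}

    def cosine(a, b):
        # similarity between two sparse term-weight dicts
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    vecs = tfidf({"note1": ["python", "timing", "bytecode"],
                  "note2": ["python", "wiki", "tags"],
                  "note3": ["garden", "shed"]})
    print(cosine(vecs["note1"], vecs["note2"]))  # link if above a threshold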

But sometimes I don't want to deal with something just yet, so there is a tool that lets me send things into the future - I call that my "Boomerang Queue".

Edit: it's been a while since you asked the question, so I tried to find some way to let you know I've replied, but you have no contact information in your profile. I hope you see this.


Thanks for caring that I get this.

I've long considered and wanted a system like you've described. It seems like it would have incredible long-term value. Is everything in yours self-hosted? What wiki do you use?


For years I've used the wiki I wrote. That has the feature that links are auto-created from free text, so if a pagename happens to turn up in the text anywhere, a link is made. This requires a little discipline - you don't want to create a page called "And".

But recently I've been experimenting with zim to see if I can use that instead. It saves pages in markdown in a directory structure, so my external scripts can be adapted to run over them and create the augmenting pages. It could possibly insert the free-text links as well in the background. Just an idle thought.
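
A toy sketch of the free-text linking (my guess at the mechanism; page names are made up):

    import re

    def autolink(text, pagenames):
        # Longest names first so "ProjectPlan" wins over "Project".
        for name in sorted(pagenames, key=len, reverse=True):
            text = re.sub(r"\b%s\b" % re.escape(name), r"[[\g<0>]]", text)
        return text

    print(autolink("See ProjectPlan for details.", {"ProjectPlan", "Project"}))
    # -> See [[ProjectPlan]] for details.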

It's all running on my home machine, but I can connect to it from (nearly) anywhere. It's effectively unusable on my phone, so I have a staging area in which I make notes, and then transfer things later. The act of transferring implies a review of the material, and that's of value anyway, so it's not wasted effort.

For years, decades, I've been thinking of polishing, bundling, and releasing it, but it would be one of a gazillion organisation systems and no one would find it. If they did, I'd then be committed to supporting it, so that's not going to happen. I might, one day, convert these replies into a blog post, but again, limited value.


Well I appreciate you sharing this with me. It's moved me a little closer towards putting something like this together for myself. The free text linking is an interesting idea.

I worry about not having the discipline to regularly curate such a system. That's probably my biggest roadblock towards just going for it.


What I found is that it was important to (a) not over-engineer what I was trying to do, (b) do something that worked for the document I had in my hand[0], and (c) be able to extend what I was doing. If I then "fell off the perch" I could come back to it later, look at the thing in my hand, and put that into the system, thus climbing back on the perch.

Quite quickly the system was useful, and I had fewer decisions to make by using it than by not, so it saved me effort. That was the key.

[0] ... or file/email/document on the screen


When optimizing, don't forget to consider the bytecode. Without looking, I'd guess the first one has one less assignment operation.
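
One way to check is the standard dis module (a sketch; the exact instructions vary by CPython version):

    import dis

    for src in ("a2, c = a2+c, c+2", "a2 += c; c += 2"):
        print(src)
        dis.dis(compile(src, "<fragment>", "exec"))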



