It's not only possible, it's not even uncommon for a C programmer to get a 90x improvement in speed in their own C program. If you have naive memory management, or incorrectly implemented concurrency or parallelism, you can easily lose two orders of magnitude in speed.
Haha :) Maybe they shipped a better product, but management said "No, it's not possible that this could run that fast. Something must be wrong.", so they put in some "waiting".
Because the CPU in this case (ARM) has only 32k of cache, the memcpy was evicting the entire cache several times per loop iteration as a side effect of doing the work. The actual work of the loop had good cache locality, since the data was six stack variables totalling about 8k.
So? Copying a megabyte is a really expensive thing to do inside a loop, even ignoring caches. (A full-speed memcpy of a megabyte would take about 40 microseconds at a memory bandwidth of 24 GB/s, which is a long time.)