I've written my own strlen equivalents and benchmarked them against the defaults on different compilers, processors and environments, and they are almost always faster or the same speed.
Default libs are sometimes very optimized but very often they are not, unfortunately.
If you care about performance, you should not rely blindly on the defaults.
A long time ago I wondered about the performance of memcpy on the Nintendo DS: surely they would have provided a hand-optimized version? And yes, it was handcrafted ARM assembly code, but my own version turned out to be twice as fast.
They had simply not used a simple prefetching trick in their implem.
The Nintendo DS has probably had a lot less scrutiny than a major libc or recent GCC or clang [though you can probably target its ARM processor with those]. Also, for an older embedded platform they may have chosen to optimize for code size rather than cycles or wall-clock time.
I'm going to have to doubt the start of your comment. Having seen a lot of libc implementations, I think you are better off not wasting time optimizing strlen. The same goes for memcpy, probably even more so. Most memcpy()s I've seen in the current century use SIMD instructions and the like. And compilers often don't even bother emitting a call to libc for it anymore; they expand it as a builtin.
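To illustrate the builtin point, here is a minimal sketch, assuming GCC or Clang with optimization enabled. `load_u32` and `store_u32` are hypothetical helper names, not from any library. With a small constant size, these compilers typically replace the memcpy call with a single load or store instead of calling into libc, but that is worth confirming in the disassembly for your own target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from a possibly unaligned address.
   The size is a compile-time constant, so the compiler is free to expand
   memcpy inline rather than emit a library call. */
uint32_t load_u32(const void *p)
{
    assert(p != NULL);
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write a 32-bit value to a possibly unaligned address. */
void store_u32(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```

Compiling this with something like `-O2 -S` and looking for a `call memcpy` (or `bl memcpy` on ARM) is a quick way to see what your toolchain actually does.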
On the contrary, I expected the Nintendo DS SDK to be well optimized; performance of memcpy can be critical on such constrained hardware. And it was optimized, just not with the best tricks.
I got the prefetching trick from Intel source code, except that I replaced the PLD instruction with a simple dummy load.
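For readers who have not seen the dummy-load idea, here is a rough sketch in C. It is not the original ARM assembly, and the details are assumptions: the function name `copy_with_touch` is made up, and the 32-byte cache line size is a guess that has to be adjusted per CPU. The idea is to read one byte from the next cache line before copying the current one, so the line fill is started early. Whether this helps at all depends on the cache and memory system of the target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed cache line size in bytes; the real value depends on the CPU. */
#define CACHE_LINE 32

/* Hypothetical name. Copies n bytes from src to dst (regions must not overlap),
   touching the next source cache line before copying the current one. */
void *copy_with_touch(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    unsigned char *d = dst;
    const unsigned char *s = src;

    /* The loop condition guarantees that s + CACHE_LINE is still inside src. */
    while (n >= 2 * CACHE_LINE) {
        /* Dummy load: the value is discarded, but the volatile access
           keeps the compiler from removing the read. */
        (void)*(const volatile unsigned char *)(s + CACHE_LINE);

        memcpy(d, s, CACHE_LINE);
        d += CACHE_LINE;
        s += CACHE_LINE;
        n -= CACHE_LINE;
    }

    /* Copy whatever is left (fewer than two cache lines). */
    if (n > 0) {
        memcpy(d, s, n);
    }
    return dst;
}
```

A real version would also align the source to a cache line boundary first and use wide load/store instructions for the inner copy; this sketch only shows where the dummy load goes.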
And about strlen, you'd be surprised: some implems are very good and some are not, depending on the compiler and the library. I've run benchmarks, and I was surprised too.
To be honest, I don't really need super fast strlen, but I was curious and also learning to write fast SIMD code, basic string handling is a nice exercise.
I think this expectation doesn't fit my understanding of how people used to think about embedded systems or consoles. You shipped them and they were done. The games industry was also often trying to ship quickly, with small teams too. Tweaking memcpy, fine tuning, or revisiting the finer points of an already adequate SDK was low priority.
By contrast, many more people are updating optimizations to GCC or clang for arm, more frequently and over a longer timeframe.
You're probably right about consoles, and I was surprised to be wrong, but I checked, just to be sure.
GCC and Clang are very nice compilers, but they are a different thing than the std lib. glibc, musl, the Windows C Runtime, iOS, Android, all have different implementations, sometimes outdated.
Of course I have data, do you think I am pulling benchmarks out of a hat?
But I have not published those benchmarks, if that is what you're asking. The Nintendo one I'm afraid I cannot easily reproduce, as I no longer have the devkit on hand.
As for the strlen benchmark, that is something I did a few years ago and could easily run again, but I am not sure it is worth the effort just to convince a random dude on the internet...
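For anyone who wants to check on their own platform, a harness for this kind of comparison can be quite small. This is only a sketch under assumptions: `my_strlen` is a placeholder byte loop standing in for whatever custom version is being tested, and `time_strlen` and `libc_strlen` are made-up names. It uses `clock()` from the standard library, which is coarse, so the iteration count needs to be large enough to get a measurable time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef size_t (*strlen_fn)(const char *);

/* Placeholder candidate: a plain byte loop. Swap in the version under test. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p) {
        p++;
    }
    return (size_t)(p - s);
}

/* Wrapper so the libc strlen can be passed as a strlen_fn. */
size_t libc_strlen(const char *s)
{
    return strlen(s);
}

/* Returns the CPU time in seconds for `iters` calls of fn on s. */
double time_strlen(strlen_fn fn, const char *s, int iters)
{
    /* Check correctness before timing anything. */
    assert(fn(s) == strlen(s));

    /* Volatile sink so the results are not discarded outright. */
    volatile size_t sink = 0;

    clock_t start = clock();
    for (int i = 0; i < iters; i++) {
        sink += fn(s);
    }
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_strlen(my_strlen, buf, iters)` and `time_strlen(libc_strlen, buf, iters)` on strings of several lengths and alignments gives a first comparison. An optimizing compiler may still hoist or inline the call, so it is worth checking the generated code before trusting the numbers.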