Typically, letting the compiler decide whether or not to inline a function makes code faster than disallowing inlining, especially for functions like strcpy, which have a fairly small body and are therefore good inlining targets. You're right that there could be cases where the inliner gets it wrong, or even cases where the inliner got it right but inlining shifted around some other parts of the executable in a way that happened to cause a slowdown. But inliners are good enough that, in aggregate, they increase performance rather than hurt it.
> Even with LTO today you're talking 2-3% overall improvement in execution time
Is this comparing inlining vs no inlining or LTO vs no LTO?
In any case, I didn't mean to imply that the difference is large. We're literally talking about a couple clock cycles at most per call to strcpy.
What I was trying to point out is that you're essentially talking about LTO. Getting into the weeds, the compiler _can't_ optimize strcpy(*) in practice because it's not going to be defined in a header-only library, it's going to be in a different translation unit that gets either dynamically or statically linked. The only way to optimize the function call is with LTO - and in practice, LTO only buys you a 2-3% performance improvement.
And at runtime, there is no meaningful difference between strcpy being linked at runtime and it being linked ahead of time. libc symbols get loaded first by the loader, and after relocation the instruction sequence is identical to the statically linked binary. There is a tiny difference in startup time but it's negligible.
Essentially, the C compilation and linkage model makes it impossible for functions like strcpy to be optimized beyond the point of a function call. The compiler often has exceptions for hot stdlib functions (like memcpy, strcpy, and friends) where it will emit an optimized sequence for the target, but this is the exception that proves the rule. In practice, statically linking in dependencies (like you're talking about) does not have a meaningful performance benefit in my experience.
(*) strcpy is weird: like many libc functions, it's accessible via __builtin_strcpy in gcc, which may (but probably won't) emit a different sequence of instructions than the call to libc. I say "probably" because there are semantics undefined by the C standard that the compiler cannot reason about but the linker must support, like preloads and injection. In these cases symbols cannot be inlined, because it would break the ability of someone to inject a replacement for the symbol at runtime.
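To make the footnote concrete, here is a minimal sketch of the two spellings side by side. The function names are made up for illustration. Whether either call is expanded inline or left as a call to libc depends on the compiler, the optimization level, and flags such as -fno-builtin, so the only way to know is to look at the generated assembly (for example with gcc -O2 -S).

```c
#include <string.h>

/* Hypothetical helpers, named only for this example. */

/* Plain call: the compiler may treat strcpy as a builtin and expand the
   copy of this short constant string inline, or it may emit a real call. */
void fill_greeting(char *dst)
{
    strcpy(dst, "hi");
}

/* Explicit builtin spelling (a GCC/Clang extension): same semantics,
   and the compiler is still free to fall back to calling libc's strcpy. */
void fill_greeting_builtin(char *dst)
{
    __builtin_strcpy(dst, "hi");
}
```

Both functions produce the same bytes in dst; the only thing that can differ is the instruction sequence the compiler chooses.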
> What I was trying to point out is that you're essentially talking about LTO. Getting into the weeds, the compiler _can't_ optimize strcpy(*) in practice because it's not going to be defined in a header-only library, it's going to be in a different translation unit that gets either dynamically or statically linked.
Repeating the part of my post that you took issue with:
> If there was only one implementation of strcpy and it was the version that happens to be picked on my particular computer, and that implementation was in a header so that it could be inlined by my compiler, my programs would execute faster.
So no, I'm not talking about LTO. I'm talking about a hypothetical alternate reality where strcpy is in a glibc header so that the compiler can inline it.
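As a rough sketch of that alternate reality, assume a header that defines the copy routine directly. The name hdr_strcpy and the byte-at-a-time loop are mine, not glibc's; the point is only that the body is visible in every translation unit that includes the header, so the inliner gets to make the call.

```c
#include <stddef.h>

/* Hypothetical header-only copy routine (not glibc's implementation).
   Because the definition lives in the header, every caller sees the
   body and the compiler is free to inline it at each call site. */
static inline char *hdr_strcpy(char *dst, const char *src)
{
    char *out = dst;
    while ((*out++ = *src++) != '\0') {
        /* copy bytes up to and including the terminating NUL */
    }
    return dst;
}
```

Like strcpy, it returns dst and assumes the destination buffer is large enough for the source string plus its terminator.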
There are reasons why strcpy can't be in a header, and the primary technical one is that glibc wants the linker to pick between many different implementations of strcpy based on processor capabilities. I'm discussing the loss of inlining as a cost of having many different implementations picked at dynamic link time.
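For illustration, here is a portable sketch of the trade-off using a plain function pointer. This is not how glibc does it (glibc resolves the symbol at dynamic link time), and every name here, including the cpu_has_fast_path probe that always returns 0, is a placeholder. What it does show is the cost I mean: the call goes through an indirection that is resolved at runtime, so the compiler can't inline the chosen implementation at the call site.

```c
#include <stddef.h>

typedef char *(*strcpy_fn)(char *, const char *);

/* Baseline variant: one byte per iteration. */
static char *copy_bytewise(char *dst, const char *src)
{
    char *out = dst;
    while ((*out++ = *src++) != '\0') {
    }
    return dst;
}

/* Stand-in for a "tuned" variant: two bytes per iteration.
   A real one would use wide vector loads and stores. */
static char *copy_unrolled(char *dst, const char *src)
{
    char *out = dst;
    for (;;) {
        if ((*out++ = *src++) == '\0') break;
        if ((*out++ = *src++) == '\0') break;
    }
    return dst;
}

/* Placeholder capability probe (assumption): a real resolver would
   query the CPU's feature flags instead of returning a constant. */
static int cpu_has_fast_path(void)
{
    return 0;
}

/* Pick an implementation once, based on the probe. */
static strcpy_fn resolve_strcpy(void)
{
    return cpu_has_fast_path() ? copy_unrolled : copy_bytewise;
}

static strcpy_fn selected_strcpy;

/* Every call goes through the pointer, so the selected body is
   opaque to the inliner at the call site. */
char *dispatch_strcpy(char *dst, const char *src)
{
    if (selected_strcpy == NULL) {
        selected_strcpy = resolve_strcpy();
    }
    return selected_strcpy(dst, src);
}
```

Both variants produce the same result; only the path taken to reach them differs, and that indirection is the thing I'm counting as a cost.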