This isn’t the full story though. I professionally analyzed GOPS/$ and GOPS/Watt for big multi-chip GPU and FPGA systems as a consultant from 2006 to 2011.
Xilinx routinely had more I/O (SerDes, 100/200/400G MACs on-die) and, more recently, at times more HBM bandwidth than contemporary GPUs. Also deterministic latency and perfectly acceptable DSP primitives.
The gap has always been the software.
Of course NVidia wasn’t such an obvious hit either: they flubbed the tablet market due to yield issues, and it really only went exponential in 2014. I invested heavily in NVidia from 2007 to 2014 because of their CUDA edge, but sold my $40K of stock at my cost basis.
I currently do DSP for radar, and from 2020 to 2023 I implemented the same system both on FPGA and in CUDA. I know for a fact that in 2022 the FFT performance of a $9,000 FPGA was equal to a $16,000 A100 that also needed a $10,000 host computer (the FPGA used fixed point instead of float, so not quite apples-to-apples, but definitely application-equivalent).
I think you are making the mistake of thinking that Xilinx software can fix the programmability of their hardware. It cannot. If you have to solve a place-and-route problem or do timing closure in your software, you have made a design mistake in your hardware. You cannot design hardware such that a single FFT kernel takes 2 hours to compile and then fails, when nvcc takes 30 seconds and always succeeds. You have taken your software into the domain of RTL design, and that is a consequence of the hardware design.

Xilinx could have made their Versal hardware a cache-coherent programmable parallel processor array where every core has access to global memory. It fundamentally isn't that. It's a bizarre dataflow-graph system that requires DMA engines, a place-and-route step, and a static fabric configuration. That's a fault of your hardware design!
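To make that contrast concrete, here is roughly what the whole GPU-side FFT path can look like; a minimal cuFFT sketch (the FFT length, batch count, and in-place layout are just illustrative, not the actual radar pipeline), and nvcc compiles it in seconds:

    // Minimal cuFFT sketch: batched 1D complex-to-complex FFT.
    // Sizes are illustrative; error handling trimmed for brevity.
    #include <cufft.h>
    #include <cuda_runtime.h>

    int main() {
        const int N = 4096;      // FFT length (e.g. samples per pulse)
        const int BATCH = 1024;  // number of pulses transformed at once

        cufftComplex* d_data;
        cudaMalloc(&d_data, sizeof(cufftComplex) * N * BATCH);
        // ... copy samples to the device with cudaMemcpy ...

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, BATCH);           // plan a batched C2C FFT
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place forward transform
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }

The equivalent FPGA flow has to push an FFT core through synthesis, place-and-route, and timing closure before you ever see first data.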