I think that's mostly wrong though, because as the P6 demonstrated complicated CISC addressing modes can be trivially decomposed and issued to a superscalar RISC core.
What really killed 68k was the thing no one here is qualified to talk about: Motorola simply fell off the cutting edge as a semiconductor manufacturer. The 68k was groundbreaking and way ahead of its time (shipped in 1978!), the 68020 was market leading, the '030 was still very competitive but starting to fall behind the newer RISC designs, leading its target market to switch. The 68040 was late and slow. The 68060 pretty much never shipped at all (it eventually had some success as an embedded device).
It's just that posters here are software people and so we want to talk about ISA all the time as if that's the most important thing. But it's not and never has been. Apple is winning now not because of "ARMness" but because TSMC pulled ahead of Intel on density and power/performance.
ISA certainly isn't the most important factor, but your ISA has to be a good enough baseline. History is littered with ISAs that made choices bad enough to be limiting at the time (VLIW, Itanium) or to handicap future generations (MIPS delay slots).
Arguably x86 and ARM are the "RISCiest CISC" and "CISCiest RISC" architectures, and have succeeded due to ISA pragmatism (and having the flexibility to be pragmatic without breaking compatibility) as much as anything else.
People get caught up in intuitive notions of “complex” or “reduced” when talking about RISC and CISC, so it’s very helpful to have specific ideas about what kind of problems you’re creating for yourself years down the line when you’re designing an architecture in the 1970s or 1980s. The M68K has instructions that can do these complicated copies from memory to memory, through layers of indirection, and that creates a lot of implementation complexity even though the instruction encoding itself is orthogonal. Meanwhile, the x86 instructions are less orthogonal, but the instructions that operate on memory only operate on one memory location, and the other operands must be in registers. That turned out to be the better tradeoff, long-term, IMO.
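A rough C sketch of what that means in practice (the helper name is just illustrative): the double dereference below is the kind of operand a 68020 memory-indirect mode could in principle fold into a single MOVE's effective address, while x86 needs a separate load instruction for each level of indirection, since an instruction gets at most one memory operand.

    #include <stdio.h>

    /* Two dependent memory reads to produce one source operand:
       first load *pp, then load **pp. A 68020 memory-indirect mode can
       express that chain inside one instruction's operand; x86 needs a
       separate load instruction per level. */
    int read_through(int **pp) {
        return **pp;
    }

    int main(void) {
        int value = 42;
        int *p = &value;
        printf("%d\n", read_through(&p));
        return 0;
    }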
Yes, and that was exactly my point (yes, I wrote that SO answer). The indirect addressing modes that were introduced with the 68020 are insane. Most people don't know them, as they didn't exist on the 68000/68010/68008, and most compilers did not use them because they were slower than composing simpler addressing modes.
It is interesting to see what Motorola cut from the instruction set when they defined the ColdFire subset (for those who don't know, the ColdFire family of CPUs used the same instruction set as the 68000 but radically simplified: the indirect addressing modes, the BCD instructions, a lot of RMW instructions, etc. were removed. The first ColdFire, arriving after the 68060, which had been limited to 75 MHz, ran at up to 300 MHz).
BCD modes only take a few gates to implement; their biggest cost is probably the instruction encoding space they occupy. On a chip as small as the 6502 it makes sense that some users would want to avoid the expensive decimal/binary conversions and do arithmetic in decimal.
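To see how cheap it is, here's a minimal C model of a packed-BCD add (two decimal digits per byte), roughly the operation 6502 decimal mode or the 68k's ABCD does in hardware; the helper name and carry handling are illustrative rather than any specific chip's flag semantics:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two packed-BCD bytes: add the nibbles, then decimal-adjust each one.
       No binary<->decimal conversion is ever needed. */
    uint8_t bcd_add(uint8_t a, uint8_t b, int *carry) {
        unsigned lo = (a & 0x0F) + (b & 0x0F);
        unsigned hi = (a >> 4) + (b >> 4);
        if (lo > 9) { lo -= 10; hi += 1; }
        *carry = 0;
        if (hi > 9) { hi -= 10; *carry = 1; }
        return (uint8_t)((hi << 4) | lo);
    }

    int main(void) {
        int carry;
        uint8_t r = bcd_add(0x38, 0x45, &carry);   /* decimal 38 + 45 */
        printf("%02X carry=%d\n", r, carry);       /* prints "83 carry=0" */
        return 0;
    }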
In modern x86 we have things like the AVX scatter/gather instructions, which implement indirect addressing as well as accessing potentially lots of pages in a single instruction.
Of course nowadays we have the transistor budgets to make even complicated instructions work.
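For instance, a minimal gather sketch (assuming an AVX2-capable x86-64 compiler, e.g. gcc -mavx2): one VPGATHERDD loads eight ints through eight independent indices, so a single instruction can legitimately touch up to eight different cache lines or pages.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        int table[16] = {0, 10, 20, 30, 40, 50, 60, 70,
                         80, 90, 100, 110, 120, 130, 140, 150};

        /* Eight indices gathered in one instruction; scale 4 = sizeof(int). */
        __m256i idx  = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14);
        __m256i vals = _mm256_i32gather_epi32(table, idx, 4);

        int out[8];
        _mm256_storeu_si256((__m256i *)out, vals);
        for (int i = 0; i < 8; i++)
            printf("%d ", out[i]);   /* 0 20 40 60 80 100 120 140 */
        printf("\n");
        return 0;
    }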
Itanium and MIPS were... just fine though. Both architectures have parts that were very competitive along essentially all metrics the market cares about. ia64 failed for compatibility reasons, and because it wasn't "faster enough" than x86_64. No one saw much of a reason to run it, but Intel made them well and they made them fast.
And MIPS failed for the same reason ARM pulled ahead: in the late 90s Intel took a huge (really, huge) lead over the rest of the industry in process, and MIPS failed along with basically every other CPU architecture of the era.
Amusingly, the reason ARM survived this bottleneck is that it was an "embedded" architecture in a market Intel wasn't targeting. But there's absolutely nothing technical that would prevent us from running very performant MIPS Macs or whatever in the modern world.
Itanium failed in the market because it wasn't fast enough running general purpose code (HPC workloads were a different story). As a consequence, its ability to run existing x86 code was poor (both in hardware and software emulation). At the root, Itanium's poor performance was a direct consequence of the design assumption that a sufficiently smart compiler could extract sufficient instruction-level parallelism. While the ISA didn't prohibit later improvements like OoO execution, it always required a better compiler (and more specific workloads) than real users would have.
The performance (or complete lack thereof) of x86 compatibility mode was one problem. But VLIW's reliance on software to do the right thing (with compilers etc.) was another big thing even for native IA64 code. And a decent-performing Itanium was delayed enough to arrive during dot-bomb vs. dot-com.
If Itanium had really been the only 64-bit chip available for use in systems from multiple vendors maybe it would have succeeded by dint of an Intel monopoly position. But once x86_64 arrived from AMD and Intel ended up following suit, it was pretty much game over for Itanium.
That's kinda revisionist. In fact ia64 owned the FP benchmark space for almost a decade, taking over from Alpha and only falling behind P6 derivatives once those started getting all the high end process slots (itself because Intel was demolishing other manufacturers in the high-margin datacenter market).
The point was that ia64 was certainly not an "inferior" ISA; it did just fine by any circuit-design measure you want. Like every other ISA, it failed in the market for reasons other than logic design.
No, that's pretty much exactly what happened. Were you there?
Beating everyone else at synthetic benchmarks is about as impressive as blowing away paper targets at the firing range. Real applications (and their operating systems) shoot back.
Itanium had some specific performance strong points. FP was one. Certain types of security code were another. In fact (and I may have the details somewhat wrong) an HP CTO co-founded a security startup, Secure64?, on the back of Itanium security code performance.
But logic design is irrelevant beyond a certain point. Intel certainly had the process chops at the time and knew how to design microprocessors given a certain set of parameters. It's just that the parameters were wrong (see, above all else, Intel's late step-down from frequency on x86--driven, I was told at a very high level, by Microsoft's nervousness around multicore).
FP benchmarks are just about the most synthetic benchmarks there are though. They mostly matter to people doing HPC, and those kinds of workloads are often in the realm of supercomputers.
Which I think means everybody is saying basically the same thing, ia64 was overly specialized and had far too small of a market niche to survive.
Besides the points already mentioned concerning the Itanium, there is another one that is important to consider: the Itanium was simply too expensive for many customers/applications (think: "typical" companies who buy PCs so that their employees can use them for office work; private PC users who might be enthusiasts, but are not that deep-pocketed; ...).
The '020 was earlier than the 80386 and the '030 was better. But yes, by the time of the 80486 and m68040, Motorola was better but quite late, and Motorola never had the impetus to bring it to higher clocks, whereas the 80486 got to 100 MHz.
It's certainly not correct to say that the m68060 "pretty much never shipped at all". I have several m68060 systems that would disagree with you. The chip even went through six revisions, with the last two often overclocked to 200% of its official speed. It actually competed well with the Pentium on integer and mixed code, although the Pentium's FPU was faster. Considering the popularity of m68k and x86 at the time, that was pretty darned impressive.
The m68060 is in countless Amiga accelerators, including ones made in the last few years such as Terrible Fire accelerators. Personally, I have a phase5 Cyberstorm MK III and Blizzard 1260. There were Atari '060 accelerators, a Sinclair QL motherboard that has the option of taking an '060, they were a native option in Amiga 4000 Towers, in DraCo computers / video processors, and in VME boards.
I'd love to get one of these to help do more m68k NetBSD pkgsrc package building:
> as the P6 demonstrated complicated CISC addressing modes can be trivially decomposed and issued to a superscalar RISC core.
Doing that required a very large amount of area and transistors in its early days. So much that very smart people thought that the extra area requirements would kill that approach. It still does take a large amount of area, but less and less relative to the available die. Moore's law basically blew past any concerns there.
But it wasn't always obvious that that would be the case.
max instruction size (bytes): 486, 12; 040, 22 (x86 has since grown to 15)
number of addressing modes: 486, 15; 040, 44
indirect addressing? 486, no; 040, yes
max number of MMU lookups: 486, 4 (but usually 1); 040, 8 (but frequently 2)
So it wasn’t just manufacturing, it was also the difficulty of the task. 680x0 was second only to the VAX in terms of the complexity of its instruction set.
> The x86 addressing modes are a lot simpler than the 680x0.
Anything but simpler, and completely irrational to someone coming from the orthogonal ISA design perspective.
For a start, on x86, the loading of an address is a separate instruction with its own, dedicated opcode.
On m68k (and its spiritual predecessor, the PDP-11), it is «mov» (00ssssss) – the same instruction is used to move data around and to load addresses. Logically, there is no distinction between the two, as an address and a numeric constant are the same thing to the CPU (the execution context defines the semantics of the number loaded into the CPU register), so why bother making an explicit distinction?
Having two separate instructions for loading addresses and moving data around would have made more sense if data and address registers were two distinct register files, which m68k had and x86 did not. And speaking of registers, x86 was completely starved of general purpose registers anyway, effectively having five of them (the index registers are only semi-general purpose, so they do not count). Even x86-64 today has 16 kinda-general-purpose registers, which is very poor. The AMD 29k, at the other extreme, could have 256 general purpose registers, and 32 has been the sweet spot for many ISAs for a long time.
Secondly, there was the explicitly segmented memory model with near and far addresses. Intel unceremoniously threw the programmer under the bus by making them explicitly manage segments and offsets within each segment to calculate the actual address, and an address could not cross a 64 kB segment boundary. Memory segments were commonplace and predate x86, yet the complexity of handling them is typically hidden at the supervisor (kernel) level. m68k, on the other hand, has had a flat memory space since day 1 – something that, for all practical purposes, only took off on x86 with Windows 2000, almost two decades after m68k got it.
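The arithmetic itself was trivial; the problem is that it was the programmer's job. A minimal sketch of real-mode address formation (physical = segment * 16 + offset, 64 kB per window, and many segment:offset pairs aliasing the same byte):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode 8086 address formation: 20-bit physical = segment*16 + offset. */
    uint32_t phys_addr(uint16_t segment, uint16_t offset) {
        return ((uint32_t)segment << 4) + offset;
    }

    int main(void) {
        /* Two different segment:offset pairs naming the same physical byte. */
        printf("%05" PRIX32 "\n", phys_addr(0x1234, 0x0010));  /* 12350 */
        printf("%05" PRIX32 "\n", phys_addr(0x1235, 0x0000));  /* 12350 */
        return 0;
    }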
Lastly, comparing max instruction sizes for m68k and x86 is a bit cheeky. m68k instructions are encoded as a fixed-size 16-bit opcode word plus extension words, which lets the CPU use a simple lookup table to route the processing flow, and extracting the addressing mode(s) from the opcode word instantly gives an indication of the total instruction length.
x86, by contrast, has variable-length encodings that require a state machine within the CPU to decode them, especially as the x86 ISA grew in size, often stalling the opcode decoder pipeline due to the non-deterministic nature of the x86 opcode encoding.
Addresses are normally loaded with MOV on x86 too. LEA is intended to compute an address from base/index registers, using the same addressing modes that can be used to access that memory location.
For example, if you had an array of 32 bit integers on the stack, an instruction like "MOV EAX,[ESP+offset+ESI*4]" would load the value of the element indexed by ESI. Change that MOV to LEA, and it would instead give you a pointer to that element that can be passed around to another function. Without LEA, this operation would require two extra additions and one shift instruction.
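In C terms it's the difference between loading arr[i] and taking &arr[i] with the same addressing mode; a small sketch (function names made up, and the assembly in the comments is just typical x86-64 codegen, not guaranteed):

    #include <stdint.h>

    int load_element(int *arr, intptr_t i) {
        return arr[i];     /* typically: mov eax, [rdi + rsi*4] */
    }

    int *addr_of_element(int *arr, intptr_t i) {
        return &arr[i];    /* typically: lea rax, [rdi + rsi*4] */
    }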
Microsoft Assembler syntax confused this issue by being designed to make some operations more convenient, "type safe", and familiar to high-level programmers, while being clumsier at getting at raw memory addresses.
That led to people using "LEA reg,MyVar" when it wasn't necessary, simply because it was shorter to type than "MOV reg,OFFSET MyVar" :)
This is it. Nobody could compete with Intel's R&D budget. Selling low volume high margin chips to low volume high margin Unix workstation companies ended up being a far worse profit proposition than flooding the market with crappy but cheap chips in PCs that anybody can afford. It wasn't even close.
This is what Andy Bechtolsheim essentially said when he talked about why Sun had to develop the SPARC chip. Motorola was just too slow. Great initial architecture not iterated on fast enough.
And when Apple started the PowerPC partnership, they deliberately made sure they had two sources (Motorola and IBM) competing on the same architecture so they wouldn't be beholden to Motorola again (otherwise they were also considering the Motorola 88k)
> I think that's mostly wrong though, because as the P6 demonstrated complicated CISC addressing modes can be trivially decomposed and issued to a superscalar RISC core.
It's not the sequencing that's the issue, it's the exception correctness. Rolling back correctly and describing to the OS exactly which of the many memory accesses actually faulted in a memory-indirect access is very complex. x86 doesn't have memory-indirect addressing modes and never had to deal with that.
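One way to see what "describing to the OS which access faulted" means (a POSIX-specific sketch): the kernel hands a fault handler the exact faulting data address in si_addr. When each instruction does at most one load, that mapping is unambiguous; a single memory-indirect instruction performing several loads has to track internally which one blew up.

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void handler(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        /* The exact data address that faulted, as reported by the kernel. */
        fprintf(stderr, "SIGSEGV at %p\n", info->si_addr);
        _exit(1);
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        int **pp = (int **)0x1000;  /* bogus pointer: the first load faults */
        printf("%d\n", **pp);       /* x86 does this as two separate load
                                       instructions, so the faulting access is
                                       obvious; a memory-indirect ISA folds both
                                       loads into one instruction and must still
                                       report which of the two failed */
        return 0;
    }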
> What really killed 68k was the thing no one here is qualified to talk about: Motorola simply fell off the cutting edge as a semiconductor manufacturer.
Great answer but the root cause is Motorola failed to find enough customers to drive volume. Increased volume => increased profitability => money to invest in solving scaling problems. Intel otoh gained the customers and was able to scale.
Part of why they failed to get volume is that every single workstation manufacturer dropped them, because they were clearly doing very badly.
Sun was a big customer pushing them to make aggressive new chips, and they were so disappointed with Motorola that they developed SPARC and released their first SPARC machine at the same time as the 68030. Most customers preferred the SPARC unless they had software compatibility issues.
The Apple Mac did get a fair amount of volume too. And there were many other uses as well.
Sure it wasn't the wealth that Intel had, but they were hardly struggling. They clearly sold enough chips to pay a design team.
As @philwelch suggests, it wasn't really a foundry problem.
The Tier 1 Unix system companies (many of whom had been Moto 68K customers) already had their own RISC designs and a lot of the second and certainly third tier companies were getting acquired or going out of business. So by the time there was really a solid 88K product--at least for the server market--almost no one was lined up to design systems around the chip.
Data General did for a while. I forget who else did. But it just never got critical mass.
I mean at some point Apple made it work with their in-house ARM-based CPUs - first for phones then for laptops and desktops. But that was many years later and with a truly massive budget commitment to help the foundries and with enough volume that inhouse designs made sense and with control of the software environment on top. Not impossible but not exactly circumstances available to many companies.
Sun and DEC and IBM made CPUs for their own computers too - but not to compete for basic PCs. Motorola made a lot of phones at one point but not to the degree that they could lock in top of the line fabs.
It's not that the 68000 family was necessarily impossible to use in lower priced PCs by the way. Philips built the 68070 for use in CD-I and other consumer machines. And Apple and Amiga made it work for a while with more mainstream parts.
During this entire period there was an explosion in microprocessor designs. Let's say most of them were better than the contemporary x86 generation. That would still not be enough to compete with the IBM PC compatible in rough price-performance and in the amount and ease of procuring "personal computer" types of software. A superior technical solution is NOT sufficient to win.
I remember a technical presentation by Intel on the microarchitecture of their latest generation. And thinking "Wow, they threw everything and the kitchen sink in there - each piece to gain ~1% performance." And the thing is, it was enough. That chip was arguably a mess - and it was sufficient to keep their market going.
This is true only because today the borders between RISC and CISC do not exist anymore; modern designs decode everything into uops anyway. But in the 1980s and early 90s, this was not true. CISC was indeed more difficult to scale.
Yes, now you can do that. But that 'trivial' thing you talk about isn't that trivial, and it also requires quite a few gates. And this was an era when you didn't just have lots of gates left over.
> '030 was still very competitive
Questionable. A simple, low-power ARM chip beat it. And the RISC designs destroyed it.
ARM2, even in 1986, basically doubled up the 68020, not to mention the bigger RISC designs.
Lots of companies were taping out RISC designs by 1985, and basically all of them beat the 68k parts.
Aren't they? I recently read some comparison between the latest Ryzen 7840U and Z1 vs the Apple M2, and the Ryzens were significantly faster at the same TDP.
And that is without getting into the high-end segment, where Apple doesn't have anything that can compete with the Threadrippers and Epycs.
Is this true in anything besides cinebench? It’s always cinebench I see cited and that’s possibly the most useless efficiency figure available (bulk FPU tasks with no front end load is stuff that should be done on gpu, largely, CFD and other HPC workloads aside).
How does it do in JVM efficiency (IDE) or GCC (chrome compile) efficiency? How does it do in openFOAM efficiency?
Not on the same processes they aren't, no. Zen 3 is on 7nm. Apple is shipping chips on 5nm, and has reportedly bought up the entire fab production schedule of 3nm for the next year for a to-be-announced product.
Again, everything comes down to process. ISA isn't important.
This talking point is outdated. AMD has been shipping 5nm CPUs for a year now.
The 3nm parts have just barely dropped in phones due to process delays and poor yields (N3B will miss targets and only N3E will meet the original N3 targets a year or 18 months late). So they are a non-factor in the pc/laptop market.
Happy to see efficiency numbers (something other than cinebench please) but there is no node deficit anymore. Apple laptops and zen laptops are on even footing now.
If there is still an efficiency deficit then the goalposts will have to be moved to something like “os advantage” (but of course Asahi exists) or “designed for different goals!” (yeah no shit, doesn’t mean it’s not more efficient).
Again, I am fairly sure that apple creams x86 efficiency in stuff like gcc (chrome compiles) or JVM / JavaScript interpreting (IntelliJ/VS Code) even without “the apple software”, but, everyone treats cinebench like it’s the end-all of efficiency benchmarking.
I absolutely know the apple stuff creams x86 by multiples of efficiency (possibly 10x or more) in openFOAM - a Xeon-w 28c (or even an epyc) pulls like, 10x the power for half the performance of a m1 ultra Mac Studio.
(And yes, “but that’s bandwidth-limited!” and yes, so are workloads like gcc too! Cinebench doesn’t work like 90% of the core, it doesn’t work cache at all, it doesn’t care about latency. That’s my whole point, treating this one microbenchmark as the sole metric of efficiency is misleading when other workloads don’t work the processor in the same way.)
Node process. Apple can out-spend AMD 100 to 1; they use their deep pockets to buy all the fabrication capacity of TSMC's latest process for a while, meaning AMD only has access to the previous one. This is where Apple's perf/watt advantage comes from.