Disclaimer: I don’t know anything about this stuff. I mostly code Python.
Setup
Computer chips (that includes CPUs, GPUs, RAM, and solid-state storage) are expensive. They’re expensive because they’re made in ultra-high-tech lithographic fabs that cost over a billion dollars to build. The chips have to be expensive because the fab’s cost must be amortized in the limited amount of time before the next process shrink, at which point the old fabs are obsolete (or at least behind the times).
As many have noted, there’s a fundamental limit to how many more process shrinks we can have. The latest chips are being produced at a 32 nm process. If you’re willing to extrapolate out Moore’s law for transistor density, then in about 20 years we’ll be at a 1 nm process, and further shrinkage is impossible because features that small are only a handful of atoms across. If you’re a pessimist you think progress will stop sooner, and if you’re an optimist you think we’ll reach that limit sooner, so either way, we’re likely to be mostly done within a decade or two.
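The 20-year figure is easy to check. This is a back-of-the-envelope sketch, assuming the common reading of Moore’s law as a doubling of transistor density every two years (the two-year period is my assumption) and assuming density scales with the inverse square of the feature size. The function name is made up for illustration.

```python
import math


def years_to_reach(start_nm, target_nm, doubling_period_years=2.0):
    """Years for the feature size to shrink from start_nm to target_nm.

    Assumes transistor density doubles every doubling_period_years and
    that density is proportional to 1 / feature_size**2.
    """
    density_gain = (start_nm / target_nm) ** 2  # 32 nm -> 1 nm is a 1024x gain
    doublings = math.log2(density_gain)         # 1024x is 10 doublings
    return doublings * doubling_period_years


print(years_to_reach(32, 1))  # 20.0
```

A slower doubling period stretches the timeline proportionally, but the number of remaining doublings stays at about ten either way.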
As process improvement becomes incremental, the commercial lifetime of fabs will increase dramatically, and the same is true of semiconductor designs. An Intel Core i7 is faster than a Pentium II mostly because of process shrinks. Without changes in process, a single chip design could be sold until the dominant cost consideration is the raw materials themselves… and with high volumes, I suspect that even ultra high purity silicon isn’t all that expensive.
Prediction
The question then is: if silicon is cheap, what do you put on it? From my perspective, the answer is “everything”. Different kinds of tasks require very different kinds of functions, and the most efficient implementation of any task, measured in time or power, is always a special-purpose circuit. A general-purpose computer might be used for word processing, massively multithreaded relational databases, sound manipulation, 3D graphics, physics simulation, videoconferencing, or some obscure low-latency stream processing. These tasks are respectively most efficiently accomplished by a few general-purpose CPUs, a massive number of logic-focused processors, DSP with hardware FFT, a GPU pipeline, an enormously wide floating-point vector machine, codec-specific encode/decode/muxing, and an FPGA. A future main chip might contain all these functions on a single die. If you’re not using them, they don’t draw power, and so are “free”.
This trend has very much already started, to the point that the most dubious thing about this prediction may be calling it “long term”. Today, AMD/ATi, Intel, and VIA each sell a complete package of: a general-purpose CPU, a vector unit attached to the CPU, a GPU with a 3D pipeline, a huge vector unit accessible on the GPU, and codec-specific demux and decode for roughly 5 different codecs. Texas Instruments’ OMAP3 includes all that (though smaller), plus a DSP. Both AMD and Intel have vowed to combine these functions onto a single die in the near future. Sun has demonstrated the effectiveness of Niagara in certain workloads, and Tilera and Intel’s Larrabee have moved even further down the polycore path.
Software Architecture Implications
One remarkable issue apparent with the growth of GPUs is how hard it is to allocate resources in heterogeneous environments. I’m not aware of any operating system that actually attempts to schedule processes on graphics cards or DSPs. This lack of scheduling hasn’t been a terrible issue in practice … because software that makes use of these special-purpose processors is so hard to write that most of a user’s software doesn’t require it!
The best approach I’ve seen to making use of special-purpose hardware is the one beginning to bubble up in the GStreamer project. GStreamer is a semi-special-purpose dataflow framework that knows something about the nature of the data that is flowing. Specifically, it knows the type of the data, the series of high-level operations that are needed, and the available implementations of these operations. Soon, it will know something about the underlying hardware, and the costs of performing operations in various places. The goal, then, is to be able to say “overlay the text from this file, rendered in this font, over this video, and display to this screen”, and let GStreamer work out which system components should be responsible for which steps. For multimedia, this is exactly the right way.
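To make the cost-based idea concrete, here is a toy sketch of the placement decision. It is not GStreamer’s API or its actual algorithm: the step names, unit names, costs, and the `place` function are all invented for illustration. Each step lists the units that could run it and a made-up cost, and moving a frame between two different units carries an assumed fixed penalty.

```python
# Hypothetical pipeline: each step maps the units able to run it to an
# invented cost in milliseconds per frame. None of these numbers are measured.
PIPELINE = [
    ("decode_video", {"cpu": 12.0, "codec_unit": 2.0}),
    ("overlay_text", {"cpu": 6.0, "gpu": 1.0}),
    ("scale",        {"cpu": 4.0, "gpu": 0.5}),
    ("display",      {"gpu": 0.5}),
]
TRANSFER_COST = 1.5  # assumed cost of moving a frame between different units


def place(pipeline, transfer_cost):
    """Pick a unit for each step of a linear pipeline, minimising total cost.

    Dynamic programming: best[u] holds the cheapest (cost, placement) for
    the steps seen so far, given that the latest step runs on unit u.
    """
    best = {None: (0.0, [])}  # None means "no step placed yet"
    for _name, options in pipeline:
        nxt = {}
        for unit, run_cost in options.items():
            candidates = []
            for prev_unit, (cost, path) in best.items():
                # Pay the transfer penalty only when the data changes units.
                move = transfer_cost if prev_unit not in (None, unit) else 0.0
                candidates.append((cost + move + run_cost, path + [unit]))
            nxt[unit] = min(candidates, key=lambda c: c[0])
        best = nxt
    return min(best.values(), key=lambda c: c[0])


total, placement = place(PIPELINE, TRANSFER_COST)
for (name, _options), unit in zip(PIPELINE, placement):
    print(f"{name:>12} -> {unit}")
print(f"estimated cost per frame: {total} ms")
```

With these invented numbers, decoding lands on the codec unit and everything after it on the GPU. A real framework would also have to handle branching pipelines and costs that change with load, which this sketch ignores.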
I think this is the way forward: a framework in which one composes high-level operations on typed inputs. If this becomes popular enough, then we really will have a scheduling problem, which leads me to a prediction: GStreamer or its successor will be integrated with the kernel, and especially the scheduler. The only way a scheduler will be able to allocate these heterogeneous resources effectively is if it can see the detailed structure of the tasks themselves. It needs to see things like the relationship between different pieces’ realtime deadlines, and the different possible processor allocations for all running pipelines. This is especially important given the bizarre topologies that seem to be inevitable in designs like Tilera’s.
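Here is a toy illustration of why the scheduler needs to see every pipeline’s alternatives at once. Everything in it is hypothetical: the pipeline names, the units, the load figures, and the brute-force `schedule` function. Each pipeline lists the allocations that would let it meet its realtime deadline, expressed as the percentage of each unit’s time it would consume.

```python
from itertools import product

# Hypothetical workloads. Each pipeline offers alternative allocations; an
# allocation maps a unit to the percentage of that unit's time the pipeline
# needs to meet its deadline. All numbers are invented.
PIPELINES = {
    "video_call": [{"codec_unit": 60}, {"cpu": 90}],
    "movie":      [{"codec_unit": 50}, {"gpu": 40}, {"cpu": 80}],
    "game":       [{"gpu": 70, "cpu": 10}],
}
CAPACITY = 100  # each unit has 100% of its time to give


def total_load(allocations):
    """Sum the per-unit load over one chosen allocation per pipeline."""
    load = {}
    for alloc in allocations:
        for unit, share in alloc.items():
            load[unit] = load.get(unit, 0) + share
    return load


def schedule(pipelines):
    """Return the feasible {pipeline: allocation} with the lowest peak load,
    or None if every combination overcommits some unit."""
    names = list(pipelines)
    best = None
    # Brute force: try one allocation per pipeline, in every combination.
    for combo in product(*(pipelines[n] for n in names)):
        peak = max(total_load(combo).values())
        if peak <= CAPACITY and (best is None or peak < best[0]):
            best = (peak, dict(zip(names, combo)))
    return None if best is None else best[1]


for name, alloc in schedule(PIPELINES).items():
    print(f"{name:>10} -> {alloc}")
```

Taken alone, the movie would be cheapest on the codec unit or the GPU. Once the call and the game are running, the only plans that fit put it on the CPU, and the scheduler can only discover that if each pipeline exposes its options. A real scheduler could not afford this exhaustive search, so treat it purely as a statement of the problem.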
Software Politics Implications
Right now, integrated chips like TI’s OMAP are also among the worst offenders in the area of proprietary drivers and undocumented functionality. To a businessman focused on differential gains, this makes perfect sense, because there’s nothing a corporation hates more than commoditization. By keeping the public abstraction barrier high, the manufacturer raises the costs for others to reproduce its work, limiting the ability of competition to squeeze prices against manufacturing costs. As semiconductor production becomes increasingly commoditized, the incentive to hide the functioning of the chip becomes higher and higher.
The other problem is that even if the designs aren’t secret, the choice of which functions to implement has a huge impact on the choice of software. The example of the moment is the ubiquitous MPEG-4 accelerator chips. They implement the patented algorithms required to decode MPEG-4, requiring licensing fees both to produce and to use. They are largely undocumented, perhaps due to fear by their manufacturers that releasing documentation would be seen as encouraging patent infringement. As long as patents on software are enforced, some component of a complex chip will likely require a patent license to use. It’s not just video codecs, either: an obvious example of the moment might be a .NET/CLI bytecode acceleration unit.
At a higher level, there’s a political question about compatibility. Currently, CPUs (even of different architectures) are all essentially C processing units, and so most code written in C (or a higher-level language) will run across any of them with nothing more than a recompile (and often less than a recompile). As chips acquire more special-purpose functional units, making use of them is going to require something more than ISO C. If chipmakers don’t agree ahead of time to standard abstraction barriers (like OpenGL or VHDL), it could quickly become quite difficult to build an operating system that runs the same applications on multiple architectures. DSPs are already in this position, requiring hand-tuned assembler that’s different for every DSP. Moreover, manufacturers who fear commoditization will shun standards, rendering it almost impossible to use the whole chip without tying yourself to it. We can work around this by providing high-level operators (e.g. a Theora decoder) with many different backends, but an enormous amount of labor will be required.
Open Issues
My number one unanswered question is: what will the memory topology look like? I’m fairly certain it’s going to look complicated, and severely nonuniform, but beyond that I’m stumped. My best guess is that the components of a chip will be wired together by an on-chip bus, with independent caches and maybe a few different main RAM banks.
As silicon gets faster, though, I wonder if the latency to main memory will become unbearable. The obvious solution is to move the memory closer to the processors, onto the same die. IBM’s POWER designs seem to be headed this way already, with DRAM on the CPU. If you already have NUMA, then naturally you want the memory to be nearest to its functional unit… or perhaps the reverse! Maybe the future is widely separated functional units, each surrounded by its own RAM bank.
The really interesting question here, though, is what happens if nonvolatile RAM picks up. Will we have memory, storage, and processing, all mixed up in a single bank of identical chips? Maybe cache, RAM, and disk will be replaced by a continuum, from the bits that are closest, and therefore fastest, to the ones that are furthest away.
That would be cool.