D. Sulik, Bournemouth University, et al,
Design of a RISC Microcontroller Core in 48 Hours (PDF).
8-bit RISC; "48 man-hours", using Celoxica's Handel-C; targets the XS40-010XL;
338 CLBs; clocks at about 12 MHz, but each instruction requires four clocks.
Robert Ristelhueber, Electronics Buyers News:
Xilinx set to sample FPGAs built on 300mm wafers.
"The company is also pushing the envelope of process technology for its
products, Sevcik said, with its first 0.13-micron chip scheduled to be
released around January 1."
"That device will have nine mask layers and be built using IBM's copper
process. Called the Virtex-II Pro, the chip will have an embedded PowerPC
processor and will run at 300MHz and higher speeds, said Sevcik."
On the fpga-cpu list, Veronica Merryfield notes
that it is important to consider such issues as cache invalidation,
context switching, MMUs, and IPC, in designing a processor:
"In short, think about the kernel software and the core features together not
Good advice! Later, I posted
these various ideas and comments on fast context switches in FPGA CPUs:
First, let us tip our caps to the designers of
the Xerox Alto [Chuck Thacker et al] (frequent microtask switching,
as often as every few cycles, doing some I/O work in the same datapath
as the CPU);
the Inmos Transputer [David May et al] (fast task switch -- stack architecture with limited state to switch);
the Denelcor HEP and the Cray (nee Tera) MTA [Burton Smith et al] (multiple thread contexts with thread switch each cycle);
and the i960CA [Glenn Hinton et al] (fast reg file save/reload via wide buses).
Thinking about FPGA CPUs, one can, of course, use block RAMs (BRAMs on
Xilinx, ESBs on Altera) as really deep register files, vector register
files, windowed register files, or multi-context register files.
(See Using block RAM.)
Block RAM register files tend to be slower and less-ported than
register files built with LUT RAM. On Altera, which lacks LUT RAMs,
you might as well pursue one of these avenues, and that's just what
Altera Nios does -- register windows. (See also Flex10K CPUs (1997) and
"EABs could certainly excel at building
LARGE register files (e.g. for vector registers or multiple thread
contexts or register windows) ... "
There is a duality between a windowed reg file and a multi-context
reg file. If I am not mistaken, you can do limited fast context
switches on a SPARC with a window rotation or two (8 global registers
notwithstanding). Perhaps the same idea would work on Altera Nios.
One can also build a simple barrel processor (say 4 threads (slots) x 32
regs = 128 entries of 32 bits = 2 16-bit ports on a single 256x16 BRAM,
triple cycled, or two BRAMs double cycled) and switch threads on each
cycle. Then you can have a 4-deep pipeline without need for any result
forwarding muxes (by the time you read an operand on thread[i], you have
already retired that thread's previous result to the register file).
This seems to me to be a perfectly simple and practical basis to
issue instructions faster than the ALU + result forwarding mux +
operand register recurrence critical path. Unfortunately single-thread
performance is not so hot, but in workloads such as "network processing",
This idea was taken to sublime levels in the 20-stage pipelined
5-threaded 1 GHz MicroUnity MediaProcessor (which would have needed some
result forwarding, but not 18 stages worth).
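The barrel processor argument above can be checked with a toy timing model. This is only a sketch: the function names, and the assumption that an instruction reads its operands one cycle after issue and writes back in the last pipeline stage, are mine for illustration and do not describe any particular design.

```python
def regfile_bits(threads, regs_per_thread, reg_width):
    """Total storage for a multi-context register file, in bits."""
    return threads * regs_per_thread * reg_width


def count_hazards(num_threads, depth, read_stage=1, instrs_per_thread=4):
    """Issue one instruction per cycle, round-robin across threads.

    Assumed timing (hypothetical): an instruction issued at cycle c reads
    its operands at c + read_stage and writes its result at c + depth - 1.
    A hazard is counted whenever a thread reads operands before, or in the
    same cycle as, the write-back of its own previous instruction, i.e.
    whenever a forwarding mux or a stall would have been needed.
    """
    last_writeback = {}  # thread -> cycle of its most recent write-back
    hazards = 0
    for cycle in range(num_threads * instrs_per_thread):
        thread = cycle % num_threads
        read_cycle = cycle + read_stage
        if thread in last_writeback and last_writeback[thread] >= read_cycle:
            hazards += 1
        last_writeback[thread] = cycle + depth - 1
    return hazards


if __name__ == "__main__":
    # 4 threads x 32 regs x 32 bits fits one 256x16 BRAM
    print(regfile_bits(4, 32, 32), 256 * 16)
    print(count_hazards(num_threads=4, depth=4))  # barrel: 4 threads
    print(count_hazards(num_threads=1, depth=4))  # same pipeline, 1 thread
```

Under these assumptions the 4-thread barrel reports no hazards, while the same 4-deep pipeline running a single thread hits one on every instruction after the first.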
You can do the same thing (multiple context register files) with LUT
RAM, of course. In fact, it is quite trivial to make the xr16 (with its
PC/DMA register file LUT RAM) multi-threaded, so long as you divvy
up the available 16 general purpose registers to the available threads
(or make the general purpose register file larger). Just don't switch
threads on interlocked instructions, such as immediate prefix, which
are stateful between instructions. (Or, of course, make the imm prefix
register a register file too -- not a good use of LUTs though.)
(Of course, in the context of the XSOC system, the xr16 uses this facility
to do cycle-stealing DMA transfers using the xr16's PC reg
file and PC incrementer.)
The old superscalar i960CA achieved a fast procedure call/ret by having
a wide (128-bit) and fast bus between the 6-ported reg file and the
internal RAM and register file cache. On a CALL it could save the
16 local registers in just 4 cycles. This is entirely feasible in a
pedal-to-the-metal multithreaded FPGA CPU using (say) 4 BRAMs configured
each as 2x128x16 (or else 2 BRAMs at 2x256x32 in Virtex-II).
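The arithmetic behind those figures is easy to check. A small sketch, assuming 32-bit registers (the i960 is a 32-bit architecture); the helper names are made up for illustration:

```python
def bus_width_bits(num_brams, ports_per_bram, port_width):
    """Aggregate bits moved per cycle across all BRAM ports."""
    return num_brams * ports_per_bram * port_width


def save_cycles(num_regs, reg_width, bus_bits):
    """Cycles to move a register set over a bus (ceiling division)."""
    total_bits = num_regs * reg_width
    return -(-total_bits // bus_bits)


if __name__ == "__main__":
    virtex = bus_width_bits(num_brams=4, ports_per_bram=2, port_width=16)
    virtex2 = bus_width_bits(num_brams=2, ports_per_bram=2, port_width=32)
    print(virtex, virtex2)                # both give a 128-bit path
    print(save_cycles(16, 32, virtex))    # 16 local registers
```

Both BRAM configurations give a 128-bit path per cycle, so 16 registers of 32 bits (512 bits) move in 4 cycles.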
Like the good old Transputer, you can build a stack machine backed by
BRAM, so that a context switch is simply saving/restoring the stack
pointer register and perhaps a very few other task/process related
Finally, I have a (new?) wacky idea for doing fast context switches using
a LUT RAM register file backed by BRAM. (I don't like this idea enough
to actually try it, but you may find it interesting anyway.)
Assume we can't or won't use a purely BRAM-based multi-context register
file because it is not as fast or as multiported as we want (esp. if
we are doing a 2-issue super or an LIW -- BTW I sketched a simple
2-issue 6R2W-register file LIW 7 years back -- see the latter half
of Homebuilt processors). No, we must use a single-thread-context LUT RAM
register file. In that case, on a context switch, we would like to save
the current reg file from LUT RAM to BRAM and reload the new thread's
reg file from BRAM into LUT RAM.
First I must note the idea (that follows) is a win only if the new
thread only reads a subset of its registers before another context
switch occurs. But that's fine. If you aren't going to switch threads
very often, the amortized cost of the context switch is insignificant.
If you are going to switch threads as often as every 20-100 instructions,
then this idea might pay off for you.
Here's the idea. For concreteness, assume 8 threads of 32 registers,
with a 32x32 LUT RAM reg file and a 256x32 BRAM-based 8-thread-context
backing store. Build 32 "valid register" flip-flops. On a context
switch, these can be reset in one cycle. For each read port into the
reg file, build a 32-1 mux to fetch that port's register's valid bit.
For each write port into the reg file, allocate a corresponding write
port into the BRAM. As each instruction result is retired into the LUT
RAM, it is also retired into the BRAM.
After a context switch, all valid register flops are cleared. Then on an
instruction like "add r3,r1,r2", we'll find that r1 and r2 are not yet
valid (present) in LUT RAM and stall and fetch them from the BRAM-based
multi-context reg file backing store. This may well take a cycle or
two per register "read miss" (perhaps fewer if you do heroic things with
double-cycled LUT RAMs and multiple BRAM ports).
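Here is a behavioral sketch of that scheme, to make the bookkeeping concrete. It is a software model only, not RTL. The class and method names are hypothetical, the one-cycle miss penalty is an assumed parameter, and I assume (the text above does not say so explicitly) that retiring a result also sets that register's valid bit.

```python
class LazyRegFile:
    """LUT RAM register file with valid bits, backed by a multi-context BRAM."""

    def __init__(self, num_threads=8, num_regs=32, miss_penalty=1):
        self.num_regs = num_regs
        self.bram = [0] * (num_threads * num_regs)  # e.g. 256x32 backing store
        self.lut = [0] * num_regs                   # e.g. 32x32 LUT RAM
        self.valid = [False] * num_regs             # "valid register" flip-flops
        self.thread = 0
        self.miss_penalty = miss_penalty            # assumed cycles per read miss
        self.stall_cycles = 0

    def _bram_addr(self, reg):
        return self.thread * self.num_regs + reg

    def context_switch(self, thread):
        """Select a new context and flash-clear every valid bit."""
        self.thread = thread
        self.valid = [False] * self.num_regs

    def read(self, reg):
        """Read a register, stalling to fetch from BRAM on a read miss."""
        if not self.valid[reg]:
            self.lut[reg] = self.bram[self._bram_addr(reg)]
            self.valid[reg] = True
            self.stall_cycles += self.miss_penalty
        return self.lut[reg]

    def write(self, reg, value):
        """Retire a result into the LUT RAM and, concurrently, into the BRAM."""
        self.lut[reg] = value
        self.valid[reg] = True  # assumption: a write validates the entry
        self.bram[self._bram_addr(reg)] = value


if __name__ == "__main__":
    rf = LazyRegFile()
    rf.write(1, 10)
    rf.write(2, 20)                       # thread 0 sets r1, r2
    rf.context_switch(1)
    rf.write(1, 111)                      # thread 1 has its own r1
    rf.context_switch(0)
    rf.write(3, rf.read(1) + rf.read(2))  # add r3,r1,r2: two read misses
    print(rf.read(3), rf.stall_cycles)
```

Nothing is saved or restored at the switch itself. The cost is paid later, one miss per register the resumed thread actually reads.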
(Again, remembering the duality of thread context switch and function
call, I note that this same mechanism can be used to do very fast function
call/return -- on each CALL or RET update the block RAM register window
address counter and clear all the register file valid bits. This provides
all the fast call/return benefits of deep register windows plus the
benefits of a fast small register file. You never need to save registers
in a function prolog (because they're always concurrently retired into
the backing store), and you never need to reload registers in an epilog
(they'll be reloaded on demand in the return site continuation). Again,
in typical C/C++/Java code, with a lot of function calls, "much of the
time" (hand waving) you typically don't read more than a quarter of the
registers in the register file before making another call or return.)
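The call/return variant can be modeled the same way, with a window counter in place of a thread number. Again this is only a sketch under my own assumptions: the names are hypothetical, the windows do not overlap (so parameter passing is not modeled), and window overflow or underflow simply raises an error rather than spilling to memory.

```python
class WindowedRegFile:
    """Small fast register file whose BRAM backing store is indexed by call depth."""

    def __init__(self, num_windows=8, num_regs=32):
        self.num_windows = num_windows
        self.num_regs = num_regs
        self.bram = [0] * (num_windows * num_regs)
        self.lut = [0] * num_regs
        self.valid = [False] * num_regs
        self.window = 0   # BRAM register window address counter
        self.reloads = 0  # registers fetched on demand from BRAM

    def _move_window(self, delta):
        new_window = self.window + delta
        if not 0 <= new_window < self.num_windows:
            raise OverflowError("window overflow/underflow is not modeled")
        self.window = new_window
        self.valid = [False] * self.num_regs  # flash-clear the valid bits

    def call(self):
        self._move_window(+1)

    def ret(self):
        self._move_window(-1)

    def read(self, reg):
        if not self.valid[reg]:  # reload on demand
            self.lut[reg] = self.bram[self.window * self.num_regs + reg]
            self.valid[reg] = True
            self.reloads += 1
        return self.lut[reg]

    def write(self, reg, value):
        self.lut[reg] = value
        self.valid[reg] = True  # assumption: a write validates the entry
        self.bram[self.window * self.num_regs + reg] = value


if __name__ == "__main__":
    rf = WindowedRegFile()
    rf.write(5, 42)  # caller's r5
    rf.call()
    rf.write(5, 7)   # callee uses r5 freely, no prolog save
    rf.ret()
    print(rf.read(5), rf.reloads)  # caller's r5 comes back on demand
```

In this model the only cost after the return is the single on-demand reload of the one register the caller reads.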
This idea is an example of the hybrid LUT RAM + BRAM idea I mention
in the aforementioned Using block RAM article/disclosure:
"... hybrid uses of large embedded RAM blocks together with smaller distributed RAM
blocks to achieve large storage capacity with highly multiported access
to a subset of that storage".
(By the way: the above valid bit per entry discrete FF + mux trick can
also be used to flash invalidate a small (e.g. LUT-RAM-based) cache
the 0.18 micron 1.8V Spartan-IIE.
(Please, Xilinx, also give us the option of a single PDF.)
You might think that,
as Virtex-E is to Virtex, so is Spartan-IIE to Spartan-II.
But you would be wrong. According to data sheets,
whereas an XCV200 has 14 BRAMs (56 Kb) and the XCV200E has 28 BRAMs (112 Kb),
in the Spartan-II/E family, both the XC2S200 and (alas) the XC2S200E have the
same 14 BRAMs (56 Kb).
If your work is "BRAM bound", as is my multiprocessor research, this is
Anthony Cataldo, EE Times:
Xilinx spins cost-reduced FPGA for digital video.
'The company said stripping away some of the RAM is a safe bet. "We're
finding that even in Spartan 2, designers are not using all the block
memory that's there," said Steve Sharp, senior manager of silicon
But let us count our blessings. The new Spartan-IIE family is lower-voltage,
faster (470 ps TILO (2SxxxE-6) vs. 700 ps TILO (2Sxxx-5)),
offers a larger part (the 32x48 CLB = 6144 logic cell XC2S300E),
supports tons of different I/O signalling standards, and
(thank you, Xilinx) comes in TQ144 and PQ208 QFP packages.
Crista Souza, Electronics Buyers News:
Xilinx's new FPGAs aimed at consumers.
Murray Disman, ChipCenter:
Xilinx Introduces Spartan-IIE.
Return address linkage
Goran Bilski, the designer of the Xilinx
MicroBlaze soft CPU core,
on the benefits of keeping return addresses in general purpose registers in
this fpga-cpu list thread.
My two cents.
Happy 30th birthday to the Intel 4004:
"Introduction date: November 15, 1971
Ah, the good
Clock speed: 108 kilohertz
Number of transistors: 2,300 (10 microns)
Bus width: 4 bits
Addressable memory: 640 bytes"
Gordon Moore (1965):
Cramming more components onto integrated circuits.
"Over the longer term, the rate of increase is a bit more uncertain, although
there is no reason to believe it will not remain nearly constant for at
least 10 years. That means by 1975, the number of components per
integrated circuit for minimum cost will be 65,000."
Andy Shaw: Intel 4004 History: A Rashomon Story.
Ron Wilson, EE Times:
Inventor recalls birth of the MPU.
"... Remember that I was working with a very small number of gates. I ran a little calculation the other day, and in today's processes a 4004 would fit under a bonding pad."
PDP-8/X is now joined
by his new PDP-4/X
(in an XC4010E).
Tom Cantrell, Circuit Cellar Online:
An insightful introduction to MicroBlaze and its instruction set architecture.
It looks like a nice clean and
simple ISA after my own heart.
Once again Tom beautifully frames the business model considerations of FPGA CPU IP from FPGA vendors vs. IP from third parties:
"It all boils down to the fact that burying the IP price in a chip is the
most streamlined way to accomplish the transaction. In a world wracked by
Napster-like IP angst, the bit of plastic and silicon we call a chip is
(just like plastic and paper that go into an audio CD) a handy place to
hang the price tag. In essence, it's a royalty scheme without all the
handwringing about opening the books, audits, dongles, and the like."
See also IP business models.
Peter Clarke, EE Times:
Student's ARM7 clone disappears from Web.
FPGA CPU News, Vol. 2, No. 11
Back issues: Vol. 2 (2001): Jan Feb Mar Apr May Jun Jul Aug Sep Oct; Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.