Unfortunate Yahoogroups marketing preferences change
Our fpga-cpu list was originally hosted by egroups.com. They were
acquired by Yahoo!, and since then, some members have subscribed via a
Please take a moment to read
'Yahoogroups just did something outrageous: they silently added something
called "Marketing Preferences" to your user profile, and the default is
opt-in for spam. ...'
If you have a Yahoo! account, I encourage you to review your
Ron Wilson, EE Times:
The Dirty Little Secret.
New problems caused by die level process variations.
'"Look at interconnect," he said. "When you make a coast-to-coast phone
call, you don't depend on uniform wire thickness between here and New
York. You use protocols that make your transmission independent of
the physical medium. We will do the same thing on ICs: protocol-based
communications with error recovery instead of point-to-point wiring."'
You know, if you stop to think about it,
EE Times is just so damn good.
It's their wonderful reporters. It's a joy to read their pieces.
They go beyond regurgitating press releases and find new angles on stories,
relate inside perspectives and motivations behind the PR spin.
Often, as in this case, they go to conferences that we can't,
and allow us to share, vicariously, in some of the interesting buzz.
They're obviously techies. You don't catch EE Times reporters
confusing megabits and megabytes. They seem more resistant to
bamboozling by marketing. And they seem to report more news stories,
faster, than the other publications (many of which are also very good).
I visited the Cahners/EE Times booth at ESC and said as much.
Happy second birthday to XSOC/xr16, and to this web site, and to the
fpga-cpu list. Two years ago this month, Circuit Cellar ran part one of the XSOC
Since then, we have witnessed high powered FPGA SoC/CPU product
offerings, first from Altera, whose Nios product truly legitimized the
market, and then Xilinx, and have seen a groundswell of interest
in this field.
Sometime this winter this site passed a half million "page views",
and of late we average about 600 "visitors" per day.
The mailing list has over 450 subscribers and has seen over 1,000 messages.
I wish you all, dear readers, a sincere thank you for your interest.
a Shared Source
implementation of the
Common Language Infrastructure.
Developmentor's DOTNET-ROTOR mailing list.
David Stutz, Microsoft:
An Architectural Tour of Rotor.
Get Your Rotor Running.
(Meanwhile, work continues apace on Mono,
"an Open Source implementation of the .NET Development Framework".
Recently Mono reached the important milestone of self-hosting
most of itself.)
"You may be wondering, how is .NET relevant to FPGA CPUs? Among other
things, the .NET CLR looks like a great platform to build multi-language
cross-compilers for new instruction set architectures."
New FPGA CPU Projects
(Java Optimized Processor). You'll enjoy the pages detailing
the evolution of the design, and the comprehensive set of
to other Java processors.
JOP Hardware Status:
Note "JVM: about 25% of byte codes implemented"...
- "processor core runs on an Altera ACEX 1K30-3 (a $13 FPGA) with 24 MHz"
- "JOP1 core: about 600 LCs and 6 EABs (1/3 of the FPGA), max. clock 30 MHz"
- "JOP2 core: 380-450 LCs with 40 - 45 MHz in an ACEX 1K30-3"
- "JOP3 core: 900 LCs, max. clock 44 MHz in an ACEX 1K30-3"
- "two JOP2 cores fit in one 1K30 (and are running)"
- "periphery: IO port, SRAM- and Flash interface, UART, ECP, IIC"
- "performance: JOP3 is about 11 times faster than interpreting JVM on an Intel 486"
Ali Mashtizadeh announces
his HyperMTA project
It is to be a MP-MT-VLIW. My thoughts.
- It is easy to implement a VLIW in an FPGA -- although personally I'd stick to an LIW.
- It is straightforward to implement a MT-VLIW in an FPGA,
using BRAM for multi-context register files.
- It is challenging to implement a memory system for a MT-VLIW in an FPGA,
assuming it is expected to keep up with one load or store per cycle.
- It is very challenging to implement a memory system for a multiprocessor
- It is very, very hard to implement an effective suite of compilation tools
for a MP-MT-VLIW.
That said, good luck. I think MTAs are a plausible and promising approach
to extracting performance and available parallelism from dusty deck source code.
We discussed multithreaded FPGA processor architectures back in November.
Programmable World 2002:
April 17, 2002, at a simulcast near you.
2002 IEEE Symposium on Field-Programmable Custom Computing Machines,
April 21 - April 24, 2002, Napa Valley, CA. Early registration ends April 1!
Xilinx Introduces Breakthrough Cost Reduction Technology for Virtex-II FPGA Family.
Press pitch (PDF slides).
Hmm, I thought I saw some slides which discussed die failure modes,
bridging faults, used and unused LUTs and PIPs, and the like,
but now I can't find them to link to them. Help, anyone?
The general idea, of hosting logic in a partially-defective FPGA, is not new
(and may well be one of the most important ideas in the coming era of
Bruce Culbertson et al, HP Labs,
Defect Tolerance on the Teramac Custom Computer (PS)
Philip Kuekes, HP Labs, on
Defect Tolerant Molecular Electronics.
However, prior to EasyPath, I am not aware of any proposal to take
one or more fixed designs and test partially defective FPGA lots
against the resources those designs use.
I wonder: is the cost reduction a function of utilization?
If I design a floorplanned datapath that (more or less) uses every
input of every 4-LUT; uses every block RAM;
uses as many interconnect resources as possible -- would that
design cost more to EasyPath-ize than a more scattershot one produced
by a synthesis tool and scrambled up by PAR?
And I wonder if Xilinx will also permit a customer with a reconfigurable
design to test each configuration of that design?
Taken to the extreme, the answer would have to be no, otherwise a
cost-sensitive smart aleck customer could submit their own set of
reconfigurable 100% test coverage designs, then buy lots of defect-free
EasyPath FPGAs. :-)
Anthony Cataldo, EE Times:
Xilinx looks to ease path to custom FPGAs.
Crista Souza, EBN:
Xilinx's EasyPath seeks to skirt costly redesigns.
Murray Disman, ChipCenter:
Xilinx Introduces EasyPath.
Altera SOPC Builder and Microtronix Embedded Linux for Nios
For me, one of the highlights of the
Embedded Systems Conference
was the Altera booth demonstrating Altera's SOPC Builder 2.5 for Nios,
Embedded Linux Development Kit
To the Nios development board, Altera and Microtronix added a
card with some FLASH and RAM, and another card with interface to an IDE
hard disk, and another with a CS8900A-based 10baseT ethernet interface.
What a thrill to power the thing on, watch uCLinux boot, uncompress
itself, autoconfigure, create a RAM file system, and then bring up
a login: prompt.
I logged in and found myself in /bin/sh (or facsimile). The file system
was necessarily austere and (no surprise) the Nios GCC dev tools were
not there, but it was still great to run "ps" and find a dozen processes
running including inetd. Yes, this demo was configured to serve up
telnet and http back to the PC through the ethernet connection.
Now let me tell you about SOPC Builder. Altera has put a lot of good
thinking, and polish, into this tool. It lives in Quartus II (I think)
and presents you with a helpful GUI to configure your Nios processor,
your peripheral modules, your Avalon bus, bus masters, interrupt
priority assignments, and so forth.
There was considerable depth to the product. For example, it seemed
to ease the complicated job of configuring a system with arbitrary
topologies of multiple buses, bus masters, peripherals, interrupt
sources, etc. by presenting the entire configuration in a clever
"spreadsheet" format that shows which modules connect to which
modules on which buses.
And SOPC Builder is slick. For example, as you check or uncheck
feature boxes, you see a running total of number of LEs required
for the design. This works even for third party modules (cores).
Another example: as you add modules to your design,
SOPC Builder helps configure a basic test bench to test those
modules. For example, if you instantiate a UART, you can also specify
a string of characters to transmit into the UART,
in the test bench.
Significantly, the design of SOPC Builder facilitates third party
extensions. Third party module authors write scripts (text files)
to define new tabbed dialog box property pages for their modules.
It appeared that you can even embed Perl code into the scripts to
perform arbitrary actions (e.g. LE counts) as the system designer
specifies module options and parameters.
One of the more impressive SOPC modules (in beta?) was the Microtronix
uCLinux kit itself.
It seemed to allow easy configuration of many uCLinux parameters
in the software domain. Here's a Microtronix PowerPoint
demonstrating this module.
Now here's a scenario that shows the potential of all this integration.
You add the uCLinux module. You check a box to select some uCLinux networking
features. Besides configuring the software, SOPC Builder could
potentially (or perhaps it already does)
prompt you to add (and automatically configure) a network interface or
serial line (SLIP) interface to the hardware configuration
of your system design.
(One tools integration challenge: can you handle arbitrary incremental
changes forwards and backwards? For example, if later I remove
the network interface, I should see a dialog noting this change
is inconsistent with my specification of uCLinux networking features.)
Do you know what SOPC Builder reminds me of? Visual C++ 1.0.
Ten years ago, I had the privilege of working on Microsoft C/C++ 7.0,
and the first several releases of Visual C++. C7 was an OK compiler,
an enabling technology. But it did not begin to solve the
biggest problem facing our customers, which was writing decent
Windows platform applications. A whole generation of potential Windows
developers faced the daunting challenge of learning both the Windows API
plumbing details, and more difficult, the overall "application
architecture" of a working Windows application. How to architect
the application into documents and views, how to do scrolling windows,
how to manage coordinate systems, how to do printing and print preview,
how to do menus, tool bar buttons, and so forth. And even if they did grok
Windows application architecture, the day-to-day grind of composing
dialog boxes and then adding object oriented method declarations and
bodies to do UI event handling was just too tedious.
In those days, the Visual C++ 1.0 designers added a class library, MFC,
to provide a prepackaged, working application architecture,
integrated the GUI design tools with the code editing tools,
and provided Wizards (new in VC++) to pre-configure a new working project,
and (ClassWizard) to do the boilerplate code and event dispatching
you always need as you added gadgets to your GUI.
This integration was a great advance. The new project Wizard allowed
every Windows developer newbie to get their first Windows app running in
one minute. In all, VC++ probably cut several months of time,
and at least one throwaway product, out of everyone's Windows
development learning curve.
One effect of Visual C++ was that the industry dialog moved away
from benchmark wars and "whose compiler optimizes better?" and towards
arguing "whose development environment helps you get your work done quickest?"
In a similar way, SOPC Builder looks like it could eliminate many barriers
to getting that first crucial design working, providing a nice "out of
box experience" for newbie Altera SOC designers. And with all the system
parameter data it manages, it also has the potential to help the
system designer with day-to-day integration tasks and bookkeeping.
So perhaps Altera has an early Visual C++ on their hands.
It remains to be seen if they can provide a sufficiently deep and
still easy to use integrated environment, but the early returns look good.
Murray Disman, ChipCenter:
Altera Announces SOPC Builder.
Crista Souza, EB News:
Altera casts spotlight on IP integration tool.
Disclaimer: all of my remarks are based upon booth demos and two
Altera Nios and Excalibur seminars. I have precious little experience
with actually building anything original with Nios or SOPC Builder.
Your mileage may vary. That said...
I want you to know that after speaking with them, Altera volunteered to
donate a NIOS Kit to me (my company), an unexpected and generous offer
which I have gratefully accepted. There was no discussion of quid pro
quo as regards this web site.
For the record, we have to date been given low cost teaching kits
by two other companies, never at my instigation.
Except for these donations, we've paid for everything else we use,
which includes not inexpensive licenses from Synplicity, Aldec,
Products and ideas are mentioned here only because they are
interesting to me or because I think they may be interesting
or useful to you, dear reader.
And you may as well assume I have invested in one or more programmable
logic related companies.
The end of monolithic microprocessors
The keynote speech by Henry Samueli of
aptly demonstrated that we have left the era of
monolithic microprocessors (except perhaps for personal computers)
and are now working into the era of highly integrated systems.
Take for example, the Broadcom
VoIP Broadband Gateway, which integrates a myriad functions into
a single chip. Oh yes, by the way, there is a tiny MIPS core down
in there, somewhere.
John Mashey's greatest hits
John Mashey, then mash (at) mips.com, used to write these amazing,
thought provoking, authoritative, long pieces for comp.arch. You could
learn an awful lot about real world computer architecture just by
reading the discussion threads he contributed to.
This site has some
of his best pieces. Do yourself a favor and spend a few minutes
reading just the
last of these.
Paul Glover, Xilinx, in ISD:
FPGA Is as Good as its Embedded Plan.
"Floor planning allows designers to control placement and grouping of
the embedded processor, the associated intellectual property and their
custom logic, thereby simplifying the process of complex system-on-chip
development and increasing design performance."
More on Jan's Razor
Re: last week's Jan's Razor piece,
there was an interesting discussion on the fpga-cpu list.
the idea that a cluster of uniprocessors with some shared
resources is the best approach, and noted that in a VLIW you can
indeed allocate fractional resources to each execution unit.
"The limitations you claim for uniprocessor design exist only if you
restrict yourself to scalar processors. For superscalar and VLIW
designs, you have a similar freedom to ratio function unit types.
For example, a typical superscalar design with three integer issue
slots will support multiply on only one of those."
Applications of racks full of FPGA multiprocessors?
Reinoud also asked
"Finally, there may of course be applications where the
relatively large amount of control provided by the many
instruction streams in your approach (Sea of Cores - SoC?;)
are an advantage. The challenge will be in finding those
applications... Can you think of any?"
[First, let me note that most of the time, I too would prefer one 1000
MIPS processor to ten 200 MIPS processors or one hundred 50 MIPS processors.
That said ...]
Read along with me, to the sound of future patents flushing...
Looking at the V600E 60-way MP I described recently, or its
logical follow ons in V2 and so forth, I confess that these are paper
tigers, with a lot of integer MIPS, in want of an application.
Aggregate "D-MIPS" is not an application!
I suppose my pet hand-wavy application for these concept chip-MPs is
lexing and parsing XML and filtering that (and/or
parse table construction
for same). Let me set the stage for you.
Imagine a future in which "web services" are ubiquitous -- the internet
has evolved into a true distributed operating system, a cloud offering
services to several billion connected devices. Imagine that the current
leading transport candidate for internet RPC, namely SOAP -- (Simple
Object Access Protocol, e.g. XML encoded RPC arguments and return
values, on an HTTP transport, with interfaces described in WSDL (itself
based upon XML Schema)) -- imagine SOAP indeed becomes the standard
internet RPC. That's a ton of XML flying around. You will want your
routers and firewalls, etc. of the future to filter, classify, route,
etc. that XML at wire speed. That's a ton of ASCII lexing, parsing,
and filtering. It's trivially parallelizable -- every second a thousand
or a million separate HTTP sessions flash past your ports -- and
therefore potentially a nice application for a rack full of FPGAs, most
FPGAs implementing a 100-way parsing and classification multiprocessor.
Reinoud brought up VLIWs. Rob Finch is designing
one and had
"I want to do a VLIW processor in an FPGA, but I'm not confident what
I'm doing. Does anyone have links to sample (educational) VLIW
processor designs ? I've searched the net but can't seem to find
(which in retrospect didn't really answer Finch's questions):
VLIW ::= very long instruction word: a machine with one instruction
pointer (how's that, Reinoud? :-) ) that selects a long instruction word
that issues a plurality of operations to a plurality of function units
In a VLIW system, the compiler schedules the multiple
operations into instructions; in contrast, in a superscalar RISC, the
instruction issue hardware schedules one or more scalar instructions to
issue at the same time. You may also see the term LIW vs. VLIW. I
think of an LIW as a 2 or 3 issue machine, a VLIW as a 3+ issue machine.
In some VLIWs, to improve code density, a variable number of operations
can be issued each cycle.
There's a rich literature on VLIWs (that I don't purport to have read),
but be sure to see: John R. Ellis, Bulldog: A Compiler for VLIW
The IA64 EPIC (explicitly parallel instruction computing) architecture
is the most notable VLIW derivative. Several DSPs like the TI C6x are
LIWs. A famous early VLIW was the Multiflow Trace family, from Multiflow,
a supercomputer startup that lived approximately 1984-1990.
The challenging part of a VLIW project is the compiler. Unless you're a
glutton for tricky assembly programming, or have a very small and
specific problem, it's hardly worth designing the hardware if you don't
have a way to compile code to use the parallelism presented by the
hardware. Indeed, some LIW design suites (IIRC the Philips LIFE) allow
you to pour C code through a compiler and simulator and configure the
number and kind of function units to best trade off performance and area
against the specific computation kernel you care about.
On to FPGA VLIW CPU issues. Here are some quick comments.
To summarize my opinions: a 2-issue machine is straightforward and
even worthwhile, whereas (depending upon computation kernel) a 4+ issue
machine may well bog down in its multiplexers and therefore may not make
the best use of FPGA resources.
Instruction fetch and issue
No sweat. Store instructions in a block RAM based instruction cache.
Each BRAM in Virtex derived architectures can read 2x16=32 bits each
cycle; each BRAM in Virtex-II can read 2x36=72 bits each cycle. Use 2-4
BRAMs and you can easily fetch 100-200 bits of instruction word each
Keeping the I-cache fed is another matter, left as an exercise.
Register file design
The key design problem is the register file organization. If you are
going to issue n operations each cycle, you must fetch between n and 2n
operands and retire up to n results each cycle. That implies a register
file with n to 2n read ports and n write ports.
As a rule of thumb, in a modern LUT based FPGA, if a pipelined ALU
operation requires about T ns, then you can read or write to a single
port LUT RAM (or read and write to a dual port LUT RAM) in about T/2 ns.
(In Virtex-II, T is about 5 ns).
Let's discuss an n=4 operation machine. It is a challenge to retire 4
results to a LUT RAM based RF in T ns. Sure, if the cycle time balloons
to 2T, you can sequentially retire all four results in 4*T/2 ns, and
read operands in parallel using dual port LUT RAM.
Thus in a fully general organization, where any of the four results can
retire to any four registers, "it can't be done" in T ns as defined
Instead, consider a partitioned register file design. For instance,
imagine a 64-register machine partitioned into 4 banks of 16 registers
each. Each cycle, one result can be retired into each bank. That is
easily targeted to a LUT RAM implementation. Indeed, you can fix the
machine such that each issue slot is associated with one write port on
the register file.
We can easily issue four independent instructions such as
r0=r1+r2; r16=r17+r18; r32=r33+r34; r48=r49+r50
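The one-write-port-per-bank rule can be sketched as a simple legality check. This is a minimal illustration only, assuming the 4 banks of 16 registers described above; the names `bank_of` and `bundle_is_legal` are hypothetical, not part of any real tool.

```python
# Hypothetical model: 64 registers split into 4 banks of 16,
# with issue slot i owning the single write port of bank i.
NUM_BANKS = 4
REGS_PER_BANK = 16

def bank_of(reg):
    """Return the bank index (0-3) that holds register number `reg`."""
    if not 0 <= reg < NUM_BANKS * REGS_PER_BANK:
        raise ValueError(f"no such register r{reg}")
    return reg // REGS_PER_BANK

def bundle_is_legal(dest_regs):
    """True if the operation in issue slot i writes only bank i.

    `dest_regs` lists one destination register per issue slot,
    or None for an empty slot.
    """
    if len(dest_regs) > NUM_BANKS:
        return False
    return all(d is None or bank_of(d) == slot
               for slot, d in enumerate(dest_regs))

# The four-operation example from the text writes r0, r16, r32, r48.
print(bundle_is_legal([0, 16, 32, 48]))   # one result per bank
print(bundle_is_legal([0, 1, 32, 48]))    # slot 1 tries to write bank 0
```

A scheduling compiler would apply a check like this (or bake it into its register allocator) before packing operations into an instruction word.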
One $64 design question, properly answerable only with a good compiler
at your side, is how to then arrange the read ports. At one extreme, if
each of the four banks is entirely isolated from the other, (like a
left-brain/right-brain patient with their corpus callosum cut), then you
only need 2 read ports on each bank. In this organization, if you need
a register or two from another bank, you would typically burn an issue
slot in that sourcing bank to read the registers, and perhaps
another in the destination bank to save the registers. (Alternately in the
destination bank you can directly mux it into the dependent operand
At the other extreme, if any of the four banks can with full generality
read operands from any of the other three banks, e.g.
r0=r1+r2; r16=r3+r4; r32=r5+r6; r48=r7+r8 // cycle 1
r0=r17+r18; r16=r19+r20; r32=r21+r22; r48=r23+r24 // cycle 2
r0=r33+r34; r16=r35+r36; r32=r37+r38; r48=r39+r40 // cycle 3
r0=r49+r50; r16=r51+r52; r32=r53+r54; r48=r55+r56 // cycle 4
Then each of the four banks would need to provide 8 read ports, and each
of the 8 operand multiplexers (in front of the four ALUs) would need at
least a 4-1 mux.
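To see where the 8 read ports come from, here is a small sketch that counts how many operand reads land on each bank in one cycle. It assumes the same 4 x 16 register partitioning; the tuple encoding of an operation and the name `reads_per_bank` are made up for illustration.

```python
from collections import Counter

BANK_SIZE = 16  # assumed: 4 banks of 16 registers, as in the text

def reads_per_bank(bundle):
    """Count source-operand reads that land on each bank in one cycle.

    `bundle` is a list of (dest, src1, src2) register-number tuples,
    one per issue slot.
    """
    counts = Counter()
    for _dest, src1, src2 in bundle:
        counts[src1 // BANK_SIZE] += 1
        counts[src2 // BANK_SIZE] += 1
    return dict(counts)

# Cycle 1 of the fully general example: every operand comes from bank 0.
cycle1 = [(0, 1, 2), (16, 3, 4), (32, 5, 6), (48, 7, 8)]
# The isolated-bank example: each slot reads only its own bank.
isolated = [(0, 1, 2), (16, 17, 18), (32, 33, 34), (48, 49, 50)]

print(reads_per_bank(cycle1))
print(reads_per_bank(isolated))
```

Running a counter like this over compiler output is one way to estimate how many read ports per bank a real instruction mix actually needs.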
You can of course provide all these ports by replicating or
multi-cycling the register file LUT RAMs, but it won't be pretty.
In practice, I believe that the expected degree of cross-bank register
reading is limited. Maybe you only need 3 read ports per
bank, perhaps 1 1/2 for that bank's ALU and perhaps 1 1/2 for other
slots' accesses. Again you need a scheduling compiler to help you
make these tradeoffs.
By the way, in my 1994
with the sketch of the NVLIW1 2-issue LIW, I used exactly this partitioned
register file technique, combining two banks of 3r-1w register files.
For each issue slot, one operand was required to be a register in that
bank; but the other operand could be a register in either bank.
Another option is to combine multibanked registers with some heavily
multiported registers. For instance, assume each issue slot can read or
write 16 bank registers and 16 shared registers, subject (across all issue
slots) to some maximum number of shared register file read and write port
accesses each cycle. The designers of the Sun MAJC used a similar
Assume a design with 4 ALUs, one branch unit, one memory port.
Loads/stores: Perhaps you limit generality and only allow loads/stores
to issue from a given slot (and then shuttle those results to other slot
register banks using the above multiported reg file description). The
general alternative, allowing any slot to issue loads/stores, requires
you to mux any ALU output to be the effective address, and a per-bank
mux to allow the MDR input to be the result.
Result forwarding (register file bypass): if an operation in slot i,
cycle 1, is permitted with full generality, to consume the result of an
operation from slot j, cycle 0, then you need n copies of an n:1 result
forwarding mux. Again, this is very expensive, so you will be sorely
tempted to reduce the generality of result forwarding, or eliminate it
entirely. Again, this is a design tradeoff your compiler must help you
You want to reduce the number of branches. In average C code you can
expect a branch every five (rule of thumb) generic RISC instructions or so.
In a four issue machine you will spend your life branching.
Instead, you will want to use some form of predicated execution. Some
computations will set predicate bits that represent the outcomes of
conditional tests. Then other instructions' execution (result
write-back and other side effects) will be predicated upon these
specified predicate bits.
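As a rough sketch of the idea, the toy interpreter below guards each operation with an optional predicate bit. The operation encoding and the names `run_predicated` and `abs_diff` are hypothetical, chosen only to show an if/else turned into straight-line code.

```python
def run_predicated(program, regs, preds):
    """Execute a list of predicated operations in order.

    Each operation is (pred, kind, dest, a, b):
      pred - index of the predicate bit guarding the operation, or None
      kind - "cmplt" / "cmpge" set predicate bit `dest` from regs[a], regs[b];
             "add" / "sub" write regs[dest]
    An operation whose guarding predicate is false has no effect.
    """
    for pred, kind, dest, a, b in program:
        if pred is not None and not preds[pred]:
            continue  # squashed: no write-back, no side effects
        if kind == "cmplt":
            preds[dest] = regs[a] < regs[b]
        elif kind == "cmpge":
            preds[dest] = regs[a] >= regs[b]
        elif kind == "add":
            regs[dest] = regs[a] + regs[b]
        elif kind == "sub":
            regs[dest] = regs[a] - regs[b]
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return regs

# Branch-free form of: if (r1 < r2) r0 = r2 - r1; else r0 = r1 - r2;
abs_diff = [
    (None, "cmplt", 0, 1, 2),   # p0 = r1 < r2
    (None, "cmpge", 1, 1, 2),   # p1 = r1 >= r2
    (0,    "sub",   0, 2, 1),   # if p0: r0 = r2 - r1
    (1,    "sub",   0, 1, 2),   # if p1: r0 = r1 - r2
]
result = run_predicated(abs_diff, {0: 0, 1: 3, 2: 10}, {0: False, 1: False})
print(result[0])
```

In a real VLIW the two compares could share one issue cycle and the two subtracts another, so the whole if/else costs two cycles and no branch.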
In this organization, it seems reasonable to allow any issue slot to
establish predicates that can be used by any other issue slot; but for
simplicity you will only need and want one, or perhaps two, of the issue
slots to be able to issue (predicated) branch and jump instructions.
You will want a compiler to help you design the specific details of the
(By the way, predicate bit registers are not the only approach
to predicated execution...)
The sky's the limit. It is easy to build an 8- or 16- or 32-issue VLIW
in an FPGA. That's just stamping out more instances of the i-cache +
execution unit core slice. Whether the resulting machine runs much
faster than a much simpler 2-issue LIW is critically dependent upon your
As an aside, I'll just briefly mention Wulf's
WM, a simple architecture
with many interesting features, but of particular interest here, each execute
instruction is of the form
rd = (rs1 op rs2) op rs3 ,
an explicitly parallel representation that affords some additional
parallelism over the canonical scalar RISC, but which does not increase
the number of write ports required on the register file.
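A minimal sketch of that instruction form follows. It assumes the two "op" fields can be chosen independently, and the operation table and the name `wm_execute` are mine, for illustration only.

```python
import operator

# Assumed operation set; a real machine would define its own.
WM_OPS = {"add": operator.add, "sub": operator.sub,
          "and": operator.and_, "xor": operator.xor}

def wm_execute(regs, rd, rs1, op1, rs2, op2, rs3):
    """Model one execute instruction: rd = (rs1 op1 rs2) op2 rs3.

    Three register reads, two dependent operations, one register write.
    """
    inner = WM_OPS[op1](regs[rs1], regs[rs2])
    regs[rd] = WM_OPS[op2](inner, regs[rs3])
    return regs[rd]

wm_regs = [0, 5, 7, 3]
wm_execute(wm_regs, 0, 1, "add", 2, "sub", 3)   # r0 = (r1 + r2) - r3
print(wm_regs[0])
```

Note that each call performs two operations but only one write to `regs`, which is the point: more work per instruction without another write port.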
Embedded Systems Conference
I'm planning to head down to
On Tuesday, I'm looking forward to the panel, hosted by Jim Turley, titled
Drawing the Line Between Hardware and Software (scroll down).
"Two trends seem obvious, yet both are at odds with each
other. As embedded systems become more complex, they require more
programming. Software adds features, functions, and differentiation. But
at the same time, custom ASIC and SoC hardware are on the rise. Custom
chips are becoming far more common, replacing complex algorithms with
special-purpose hardware. Going forward, which will be the best way
to add value to a product: generic hardware with lots of software, or
specialized ASICs with little code overhead, or the entirely programmable
platforms such as FPGAs or reconfigurable logic? Is it easier, cheaper,
and faster to program gates or bytes? This panel will explore the blurring
distinction between hardware and software."
Speaking of which: Jim Turley, Embedded Systems Programming:
The Death of Hardware Engineering.
This ought to be fun:
Challenge the Expert.
"Tim Allen, Altera's chief architect of the Nios embedded processor
will be accepting challenges on the amount of time it takes to build a
complete system on a programmable chip (SOPC)."
That reminds me.
As early as ten years ago, in the software development world, at
conferences like SD, we have had the
programming equivalent of a bake-off, a live programming contest
in which teams of developers (e.g. best of the best from Borland,
Microsoft, and Symantec) would be given one or more problems and
would race each other to produce solutions as quickly as possible,
using their dev tools, libraries, and so forth. Here for example
is this year's
SD Developer Bowl.
[updated 3/26/02]: No, oops, that sounds like a trivia contest. Well,
take my word for it. Those old bake-offs were great.
Wouldn't it be pleasant to see engineers from Xilinx, Altera, Celoxica,
Triscend, Atmel, Cypress, Quicklogic, Chameleon, etc. up on stage, racing
against time to build solutions with their combination of soft and hard cores,
dev tools, system builders, and what have you?
One slow place-and-route step could ruin your whole day.
Multiprocessors, Jan's Razor, resource sharing, and all that
Continuing our multiprocessor (MP) discussion of Saturday.
Today's theme: Compared to uniprocessor design, building a large-N (N>20)
multiprocessor out of compact, austere soft processor core + BRAM
processing elements (PEs) is challenging and ... liberating!
Liberating in the sense that the hard trade-offs are easier to
make due to Jan's Razor (to coin a phrase), the principle that
once you achieve a minimalist instruction set that covers your computation,
any additional functionality per processor is usually unprofitable
and therefore rejected. This drive to leave things out is moderated
by the new opportunity, unique to MPs, to apply resource factoring to
share instances of critical but relatively-infrequently-used resources
Permit me to explain.
The greater part of processor and computer system design is deciding what
features to put in, and what features to leave out.
You include a feature if its benefits outweigh
its costs. Usually these costs and benefits are measured
in terms of throughput, latency, area, power, complexity, and so forth.
Designers typically focus on throughput, that is,
the reciprocal of execution time.
In a uniprocessor, the execution time of a computation
is the product of no. of instructions (I), times the average no.
of cycles per instruction (C/I or CPI), times the cycle time (T).
You can reduce the execution time by:
- reducing I: provide more powerful, more expressive instructions and/or
function units that "cover" the computation in fewer instructions;
- reducing CPI: reduce penalties, interlocks, overheads, and latencies (branches, loads); or
- reducing T: pipeline deeper, and/or replicate resources instead of sharing them.
The trick, of course, is to reduce some of I, CPI, or T, without
increasing the other terms! For example, if you add hardware to do
sign-extending load byte instructions (reducing I) you must be careful
not to increase the cycle time T.
But in a multiprocessor running a parallelizable computation,
we can spread the work out over NPE processing elements,
so the execution time equation (in ideal circumstances) approximates
time = I * CPI * T / NPE
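Here is the equation as a few lines of code, just to make the terms concrete. The workload numbers are invented for illustration, and the function name `execution_time` is hypothetical.

```python
def execution_time(instructions, cpi, cycle_time_ns, num_pes=1):
    """Ideal execution time in ns: I * CPI * T / NPE.

    Assumes the computation parallelizes perfectly across num_pes
    processing elements, which real workloads only approximate.
    """
    return instructions * cpi * cycle_time_ns / num_pes

# Illustrative numbers only (not from the text):
# one million instructions, CPI of 1.5, 10 ns cycle.
uni_time = execution_time(1_000_000, 1.5, 10.0)
mp_time = execution_time(1_000_000, 1.5, 10.0, num_pes=20)
print(uni_time, mp_time)
```

With NPE in the denominator, anything that shrinks NPE has to be paid back by a matching improvement in I, CPI, or T.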
So here is how this manifests itself at design time, and where "liberating"
comes into the picture.
Say you have a very austere PE CPU. This processor provides enough
primitive operations to "cover" the integer C operator repertoire, in
one or more instructions. For example, like the GR CPUs, your machine
may provide AND and XOR natively, but implement OR through
A OR B = (A AND B) XOR A XOR B. Or, it may have single-bit shift left (ADD)
and right (SRx), but lack multi-bit shift instructions. Or, it may
implement multiply as a series of adds.
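These three tricks are easy to check in software. The sketch below assumes a 16-bit datapath and uses hypothetical function names; it models the instruction sequences an austere PE would run, not any particular core's microarchitecture.

```python
WORD_MASK = 0xFFFF  # assumed 16-bit datapath

def or_via_and_xor(a, b):
    """A OR B = (A AND B) XOR A XOR B, using only AND and XOR."""
    return ((a & b) ^ a ^ b) & WORD_MASK

def shift_left_multi(a, count):
    """Multi-bit left shift as `count` single-bit shifts (each an ADD of a to itself)."""
    for _ in range(count):
        a = (a + a) & WORD_MASK
    return a

def multiply_by_adds(a, b):
    """Shift-and-add multiply using only add, single-bit shifts, and a bit test."""
    product = 0
    while b:
        if b & 1:
            product = (product + a) & WORD_MASK
        a = (a + a) & WORD_MASK   # single-bit shift left, done as an ADD
        b >>= 1                   # single-bit shift right
    return product

print(hex(or_via_and_xor(0x0F0F, 0x00FF)))
print(shift_left_multi(3, 4))
print(multiply_by_adds(123, 45))
```

Each helper costs several instructions where a richer instruction set would need one, which is exactly the trade under discussion.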
Now, as a designer, imagine you are staring at a benchmark result, and you see
that your benchmark would run 5% faster if your PE had a
hardware (multi-bit) barrel shifter function unit.
You know that a barrel shifter would add (say) 64 LUTs to your 180 LUTs
processor. Maybe you even do a little experiment, and try adding the
barrel shifter to your design, to verify that the larger CPU datapath
does not inadvertently impact cycle time.
So, do you add the feature or leave it out?
In a "pedal to the metal" uniprocessor, you might indeed choose to grow
your processor by 30% to get a 5% speedup. Especially if that change would
help to keep your product ahead of your rival's offering.
But in a chip multiprocessor, where the entire die is tiled with PEs, if
you make your PE 30% larger, this could well reduce NPE by 30%, or more.
(Or it might not. It's a nonlinear function and some other resource,
like BRAM ports, might be the critical constraint.)
When n% larger PEs implies n% fewer PEs
Against this backdrop of "n% larger PEs implies n% fewer PEs", many
"conventional wisdom" must-have features, such as barrel shifters,
multipliers, perhaps even byte load/store alignment logic, are indefensible.
The problem for our friend the barrel shifter is this. Even if
it dramatically speeds up a bitfield extract operation, from 10 cycles
down to 1 cycle, there just aren't (dynamically) that many shift
instructions in the instruction mix. So the total speed up (as posited
above) was only 5%.
Compare the barrel shifter multiplexer LUTs to the other parts of the
CPU. The instruction decoder, the register file, the ALU,
the PC incrementer, the instruction RAM port -- these are used nearly
every darn cycle. Putting these resources in hardware is hardware
well spent. Lots of utilization. But our friend the barrel shifter
would only be used maybe every 20 instructions.
Overall, the multiprocessor will have higher aggregate throughput
if you leave out these occasional-use features and thereby
build more PEs per die.
Indulge me as I name this principle Jan's Razor:
In a chip multiprocessor design, strive to leave out all but the minimal
kernel set of features from each processing element, so as to maximize
processing elements per die.
Fractional function units, resource factoring and sharing
Yet as we design multiprocessor systems-on-chip, we have a new
flexibility, and the barrel shifters of the world get a reprieve.
When designing a uniprocessor, you are essentially solving
an integer (as opposed to linear) programming problem.
You must provide some whole number (0, 1, 2, ...) of ALUs, shifters, memory ports, and so on.
But in the context of a multiprocessor, you can provide fractional
resources to each processing element.
In our scenario above, we really wanted to add, say, 1/10
of a barrel shifter to our processor -- something that ran at
full speed but only for one cycle in every ten, on average,
at a cost of 1/10 of the area of a full barrel shifter.
You can't do that in a uniprocessor.
You can do that in a multiprocessor.
Imagine you have a cluster of ten tiny, austere PEs, and assume
these PEs already have some kind of shared access bus, for example
for access to cluster-shared-memory. To the cluster, add a shared
barrel shifter, accessed over this same shared access bus. Now,
on average, each PE can issue a fast shift instruction every tenth
instruction, for a cost per PE of only 1/10th of a barrel shifter.
Continuing with this example, imagine the ten processors are running
a computation kernel that needs to issue a multiply about every five
instructions. Here we add two multipliers to the cluster.
Statistically the multipliers will run hot (fully utilized),
and each PE will run about as fast as if it had a dedicated multiplier,
but at a cost of only about 1/5 of a multiplier per PE.
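The arithmetic behind these two examples can be sketched as follows. This is a back-of-the-envelope model, not a simulation: it assumes each PE issues one instruction per cycle, each shared unit accepts one operation per cycle, and it ignores arbitration stalls on the shared bus. The function name is hypothetical.

```python
def shared_units(n_pes, instrs_between_uses, n_units):
    """Return (utilization of each shared unit, units per PE).

    Assumes one instruction per PE per cycle and one operation
    per unit per cycle; contention stalls are not modeled.
    """
    demand_per_cycle = n_pes / instrs_between_uses
    return demand_per_cycle / n_units, n_units / n_pes

print(shared_units(10, 10, 1))  # one shared barrel shifter: (1.0, 0.1)
print(shared_units(10, 5, 2))   # two shared multipliers:    (1.0, 0.2)
```

A utilization of 1.0 is the "running hot" case; in practice you would want some headroom, since requests from different PEs will sometimes collide.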
Here we see this fractional function unit approach also provides
a new dimension of design flexibility. Depending upon the computational
requirements (e.g. the inner loops you find yourself running today),
one may be able to add arbitrary numbers of new
function units and coprocessors to a cluster without disturbing
the carefully hand-tuned processing element tiles themselves.
Rethinking multiprocessor architecture
Once you start really thinking about resource factoring and sharing
across PEs in a multiprocessor, it turns a lot of conventional
wisdom on its head, and you look at old problems in new ways.
In particular, you start inventorying the parts of a CPU that are not
used every cycle, and you start asking yourself whether each processing
element deserves an integral or, rather, a fractional instance
of that resource.
Such occasional-use resources include the barrel shifter and multiplier, of course,
but also load/store byte align hardware, i-cache miss handling logic,
d-cache tag checking, d-cache miss handling, and (most significantly) an MMU.
Let me address the last few of these. Loads and stores happen
every few cycles. Maybe each PE doesn't deserve more than one third
of a data cache! ... Perhaps it is not unreasonable to have a single
shared d-cache between a small cluster of processors.
The MMU is another example. Even if each PE has its own private i-cache
and d-cache, if these caches are virtually indexed and tagged, then
only on a cache miss do we need to do an MMU address translation and
access validation. If the average processor has a cache miss (say) every
twenty cycles, and if an MMU access can issue each cycle, then it is
not unreasonable to consider sharing a single MMU across ten PEs
in a cluster.
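A quick check of the MMU numbers, as a sketch under the stated averages only. It assumes cache misses are the sole source of translations and does not model bursts of misses; the function name is hypothetical.

```python
def shared_mmu_utilization(n_pes, cycles_between_misses, issues_per_cycle=1):
    """Fraction of MMU issue slots consumed by a cluster of PEs.

    Assumes misses are the only source of translations and arrive
    at the stated average rate; bursts and stalls are not modeled.
    """
    translations_per_cycle = n_pes / cycles_between_misses
    return translations_per_cycle / issues_per_cycle

print(shared_mmu_utilization(10, 20))  # prints 0.5
```

With ten PEs the shared MMU would be busy only about half the time, which leaves slack for the cycles when several PEs miss at once.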
It is wrong to design chip multiprocessors by tiling together
some number of instances of die shrunk monolithic uniprocessor cores,
each with a full complement of mostly-idling function units.
A more thoughtful approach, achieving fractional function units
by careful resource sharing, would appear to yield significant
PE area reductions, maximizing overall multiprocessor throughput,
and providing the system designer with a new dimension of design flexibility.
Xilinx announces it is now shipping Virtex-II Pro.
"High-volume pricing (25,000 units) in 2004 for the XC2VP4, XC2VP7, and XC2VP20 devices is $120, $180, and $525, respectively."
V2Pro is a blockbuster product, combining the blazing fast Virtex-II
programmable logic fabric with 0-4 embedded PowerPC 405s and 4-16 embedded
3.125 Gb/s serial links, and also (approximately) doubling the
average number of block RAMs and multipliers per LUT (compared with Virtex-II).
Virtex-II Pro Handbook.
Our coverage one year ago.
Here are the new devices in context.
Device   Array    KLUTs  BRAMs  Lnks  PPCs  BR/KL
2V40     8x8        0.5      4     0     0    8.0
2V80     16x8         1      8     0     0    8.0
2VP2     16x22        3     12     4     0    4.0
2V250    24x16        3     24     0     0    8.0
2VP4     40x22        6     28     4     1    4.7
2V500    32x24        6     32     0     0    5.3
2VP7     40x34       10     44     8     1    4.4
2V1000   40x32       10     40     0     0    4.0
2V1500   48x40       15     48     0     0    3.2
2VP20    56x46       19     88     8     2    4.6
2V2000   56x48       22     56     0     0    2.5
2V3000   64x56       29     96     0     0    3.3
2VP50    88x70       45    216    16     4    4.8
2V4000   80x72       46    120     0     0    2.6
2V6000   112x104     67    144     0     0    2.1
Array: base CLB array (rows x cols)
KLUTs: thousands of LUTs (rounded to nearest thousand)
BRAMs: no. of 18Kb dual port block RAMs
Lnks: no. of 3.125 Gb/s serial links
PPCs: no. of PowerPC 405 embedded hard CPUs
BR/KL: no. of BRAMs per thousand LUTs
[Source: analysis based upon data from Virtex-II and Virtex-II Pro data sheets]
(See also Marketing gates redux.)
Notice that for Virtex-II Pro, the number of LUTs is not simply
the CLB rows times columns times eight, since some CLBs are displaced
by "immersed" IP.
With all the excitement about the embedded PowerPCs and
gigabit links, analysts will overlook the huge improvement
that these newer devices have relative to Virtex-II:
an approximate doubling of the number of BRAMs per LUT
in the larger devices.
(2VP50 has more than twice the BRAM per LUT of the 2V6000.)
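For readers who want to reproduce the comparison, here is a short sketch using the rounded KLUTs and BRAMs figures from the table above for the three largest devices. Because the LUT counts are rounded to the nearest thousand, the ratios are approximate.

```python
# (KLUTs, BRAMs) copied from the table above; KLUTs are rounded values.
devices = {
    "2VP50":  (45, 216),
    "2V4000": (46, 120),
    "2V6000": (67, 144),
}

def brams_per_klut(name):
    """BRAMs per thousand LUTs for one device in the table."""
    kluts, brams = devices[name]
    return brams / kluts

for name in devices:
    print(f"{name:7s} {brams_per_klut(name):.1f}")

# How much richer in BRAM is the 2VP50 than the 2V6000, per LUT?
advantage = brams_per_klut("2VP50") / brams_per_klut("2V6000")
```

Here advantage comes out a little above 2, consistent with the "more than twice" remark.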
As noted last year:
"I had been disappointed that the larger Virtex-II devices seemed relatively block-RAM-port poor. For example, the Virtex-II data sheet states the 120-CLB-column '2V10000 will have only 6 columns of block RAMs, or on average, only one column of block RAMs per 20 columns of CLBs."
Indeed, Virtex-II Pro devices do appear to have about one BRAM per
4-row by 6-column tile of CLBs, which is a near-perfect match for
a multiprocessor tiled from processing elements that each use 24 CLBs
and one BRAM.
"But at 6 CLB columns per block RAM column, this monster FPGA assuages this concern -- and is quite reminiscent of the generously-RAM-endowed Virtex-EM family."
Anthony Cataldo, EE Times:
Xilinx enhances FPGAs with embedded PowerPCs.
Crista Souza, EBN:
Xilinx counters Altera with new FPGA offering.
Loring Wirbel, Communications System Design:
Xilinx Tiles PowerPC and I/O on Virtex-II.
Anthony Clark, EE Times, UK:
Xilinx begins sampling PowerPC-based FPGA.
Murray Disman, ChipCenter:
Virtex-II Pro Arrives. Includes a helpful comparison of the Virtex-II Pro and Excalibur embedded processor and bus architecture choices.
Large-N Chip Multiprocessors in FPGAs
On the fpga-cpu list an anonymous correspondent asked "Does anybody try to design multiprocessor in FPGA?"
60 RISC CPUs in one V600E
Last spring I designed (through to place-and-route) 60 16-bit pipelined
RISC gr1040i's in one XCV600E, and also 36 32-bit pipelined gr1050i's in
one XCV600E. Sorry, a trip got in the way of finishing the project
and writing up the results.
gr1040i  16-bit  8Rx6C CLBs   2 BRAM ports  12.5 ns (80 MHz)
gr1050i  32-bit  16Rx6C CLBs  4 BRAM ports  15 ns (67 MHz)
                 (alt. 8Rx10C CLBs)
The above cycle times get derated by 1-2 ns when you build a big array of
processing elements -- PAR gets slow and stupid on big chips, even
though the above processors' datapaths were completely hand technology
mapped and floorplanned.
For the 60-way MP, each processor got a private 512 B BRAM for its code
and data, and each cluster of 5 processors on a half-row shared a sixth
512 B BRAM for shared memory. The unfinished part of the design was a
switch so that any processor in cluster[i] could access the shared
memory BRAM of cluster[j].
bB PA0 PA1 BB PA2 PA3 BB PA4 F6 F6 PG4 BB PG3 PG2 BB PG1 PG0 Bb
bB PB0 PB1 BB PB2 PB3 BB PB4 F6 F6 PH4 BB PH3 PH2 BB PH1 PH0 Bb
bB PC0 PC1 BB PC2 PC3 BB PC4 F6 F6 PI4 BB PI3 PI2 BB PI1 PI0 Bb
bB PD0 PD1 BB PD2 PD3 BB PD4 F6 F6 PJ4 BB PJ3 PJ2 BB PJ1 PJ0 Bb
bB PE0 PE1 BB PE2 PE3 BB PE4 F6 F6 PK4 BB PK3 PK2 BB PK1 PK0 Bb
bB PF0 PF1 BB PF2 PF3 BB PF4 F6 F6 PL4 BB PL3 PL2 BB PL1 PL0 Bb
The floorplan (and the interconnect design, etc.) were completely
determined by the V600E's arrangement of BRAMs. For compact-area soft
cores, the BRAM number and placement is the greatest determinant of how
many processors you can implement. The V600E is a sweet spot, a lovely
balance of CLBs and BRAMs.
- Recall that an XCV600E is a 48R x 72C FPGA with 72 512 B BRAMs arranged as 12R x 6C.
- P[c][i]: 16-bit pipelined RISC processing element in cluster c, column i; each uses 8Rx6C of CLBs
- F6: 8Rx6C of "free" unused CLBs (pending the inter-cluster switch interconnect design).
- B: one 512 B block RAM used for private instruction and data memory for one processing element
- b: one 512 B block RAM used for shared data memory for one cluster of 5 PEs
on that half-row (e.g. b at top left shared by PA[0:4]).
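The tile and BRAM budget implied by this floorplan can be checked with a few lines of arithmetic. This is only a sketch of the counts given above; the variable names are mine.

```python
ROWS, COLS = 48, 72        # XCV600E CLB array (rows x columns)
PE_ROWS, PE_COLS = 8, 6    # CLB footprint of one 16-bit PE tile
TOTAL_BRAMS = 72           # 512 B block RAMs on the device
PES_PER_CLUSTER = 5        # PEs sharing one BRAM on a half-row

tile_rows = ROWS // PE_ROWS            # 6 rows of tiles
tile_cols = COLS // PE_COLS            # 12 tiles per row
tiles = tile_rows * tile_cols          # 72 tile sites in all

clusters = 2 * tile_rows               # two half-rows per tile row
pes = clusters * PES_PER_CLUSTER       # processing elements
free_tiles = tiles - pes               # the "F6" sites

private_brams = pes                    # one "B" per PE
shared_brams = clusters                # one "b" per cluster

print(tiles, pes, free_tiles, private_brams + shared_brams)
# prints 72 60 12 72
```

Every BRAM on the device is spoken for, which is why the BRAM count and placement, rather than the CLB count, sets the number of processors.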
The programming model was to have been one thread per PE, with separate
private-memory and shared-memory address spaces, rather than message passing.
(I believe shared-memory multiprocessors are the easiest targets for
parallelizing dusty-deck C code, even though they are harder to scale up
to thousands of processors.)
"Parallelism: the new imperative."
-- Jim Gray, Microsoft
FPGA CPU News, Vol. 3, No. 3
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.