Ken Chapman's new Xilinx techXclusive series, part one:
Creating Embedded Microcontrollers
(Programmable State Machines).
"I hope you will be inspired to create your own application-specific PSM processors as well as find new applications for existing PSM macros."
Earlier KCPSM coverage.
Programmable World 2002 observations
I attended the Bellevue, WA satellite downlink.
I found this to be a somewhat disappointing seminar. Some sessions
were not technical enough, while others were curiously not oriented
to FPGA designers.
(See also Greg Neff's review.)
The first keynote was a rah-rah "internet exponential growth"
talk that might as well have been given two years ago, before the bust,
and even back then would have left us engineers hungry for some
scraps of hard technical content.
Nick Tredennick, on the other hand, was thought-provoking (as usual).
Once again, the highlight was the presentation by Erich Goetting on
Virtex-II Pro. It was somewhat of a repeat of
earlier stunners but still
there were several interesting disclosures that I haven't seen elsewhere.
(If you haven't already pored over the
Virtex-II Pro Handbook, go off and do so, and then come back here
when you have more context to interpret the following trivia.)
(The following points are transcribed from hastily typed notes
I took during the talk -- please send corrections if necessary.)
I was surprised that the PowerPC talk that followed was so focused on
IBM Microelectronics' ASIC/ASSP/CSSP PowerPC business, and spent
so little time exploring the new opportunities inherent in the
Xilinx PowerPC + FPGA platform.
The 1.5V V2Pro is designed for 130 nm, 9 layer metal,
all Cu, low-K dielectric, with 92 nm gate lengths, and
22 angstrom gate oxide thickness.
The RocketIO MGT (multi-gigabit serial transceivers) contain hard logic
equivalent to 50K "ASIC gates" plus additional analog circuitry.
Since the 300 MHz, 456 D-MIPS, PowerPC 405 core uses just 0.9 mW/MHz
(0.59 mW/D-MIPS), it's a "low Power PC", if you will.
Indeed, Goetting humorously compared the 456 D-MIPS embedded PPC
to an "acre full" of (1 D-MIPS) DEC VAX-11/780s. Later he compared
the 100 mW used by a typical LED and the 169 D-MIPS of 405 computation
that would use the same amount of power -- and compared that to the 157
D-MIPS of horsepower of the original Cray-1
(an apples-to-oranges comparison if I've ever heard one).
The PowerPC 405 core
(including the CPU, 16 KB I- and D-caches, MMU, etc.)
occupies 3.8 mm², or 2% of the 2VP50
(and displaces approximately 1000 LUTs),
so I suppose we can infer that a 2VP50 is approximately 200 mm².
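The power and area figures above hang together arithmetically. Here is a quick sanity check, as a minimal sketch that uses only the numbers from my notes (the function and constant names are mine, purely for illustration):

```python
# Figures as quoted in the talk notes above.
CLOCK_MHZ = 300          # PowerPC 405 clock
DMIPS = 456              # Dhrystone MIPS at that clock
MW_PER_MHZ = 0.9         # quoted core power per MHz
LED_MW = 100             # "typical LED" power
CORE_AREA_MM2 = 3.8      # core area, caches and MMU included
CORE_FRACTION = 0.02     # "2% of 2VP50"


def core_power_mw():
    """Core power at full clock rate, in mW."""
    return CLOCK_MHZ * MW_PER_MHZ


def mw_per_dmips():
    """Power cost of one D-MIPS, in mW."""
    return core_power_mw() / DMIPS


def dmips_for_power(budget_mw):
    """D-MIPS of 405 computation available within a power budget."""
    return budget_mw / mw_per_dmips()


def implied_die_area_mm2():
    """Die area implied by the core area and its share of the die."""
    return CORE_AREA_MM2 / CORE_FRACTION


if __name__ == "__main__":
    print(f"core power:     {core_power_mw():.0f} mW")
    print(f"mW per D-MIPS:  {mw_per_dmips():.2f}")
    print(f"D-MIPS per LED: {dmips_for_power(LED_MW):.0f}")
    print(f"implied die:    {implied_die_area_mm2():.0f} mm^2")
```

So 0.9 mW/MHz at 300 MHz is 270 mW, which gives the quoted 0.59 mW/D-MIPS and the 169 D-MIPS per LED; the die area works out nearer 190 mm², which I rounded to 200.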
Xilinx will provide a Data2BlockRAM tool to insert compiled code and other
data into the block RAM initialization bitstream.
(It was unclear to me whether you can also arrange to pre-initialize
the 405's I-cache and D-caches via the configuration bitstream, for
those applications content to boot out of, or run entirely out of,
the core's caches.)
There will be a System Generator for PowerPC later this year.
I'd wager that over the next few years, IBM will gather considerably
more CoreConnect licensees, partners, tools, and more CoreConnect
reusable IP, based upon this Xilinx alliance, than they've
seen in the whole history of embedded PowerPC ASIC products.
Conversely, I was left with the impression that John Fogelin of
Wind River Systems really appreciated the potential of these new
system platform FPGAs and also recognized the new challenges facing
engineers. I thought he did a good job laying out
Wind River's value proposition to FPGA SoC designers.
More on RocketIO MGT links
One of the talks I attended was on using the MGT links for building
serial backplanes. Here the recommendation was to use a proven
soft core to simplify interfacing to the rich and nontrivial
MGT hard core. For example, you might use the forthcoming
Aurora protocol interface (350 slices)
to handle packet framing, data alignment, etc.
(The Aurora part of the presentation was rather light on details
-- specific control signals and so forth -- I was left with the impression
that it is still being designed.)
Or, you might use Xilinx's XAUI interface cores and thereby
speak 10 Gb ethernet to XAUI switches in your backplane.
I learned some new things about the MGT links.
I was wondering about clock mismatch between transmitter and receiver.
Each MGT receiver has an elastic buffer. As I understand it,
if the transmitter is clocked slightly slower than the receiver,
the receiver's client may clock out more characters than have been
received. As the buffer drains, I understand the MGT receiver inserts
protocol-specific IDLE characters into the buffer to keep it
from underflowing. Presumably the protocol adapter interface
receiver soft core will then drop these IDLE characters as they
Each instance needs 9 pins -- TX+/- and RX+/-, of course, but also
AVCCAUXRX, AVCCAUXTX, VTTX, VTRX, and GNDA. These must be accompanied
by the (recommended) 4 ferrite beads and 4 capacitors.
We were told that in FF (flip chip BGA) packages the links can run
at the full 3.125 Gb/s speed, but not so in FG (regular BGA) packages.
This is puzzling since the links are also supposed to run reliably
through 20" of FR4 and a couple of backplane connectors.
Confirmation: see this Xilinx support forum
MGT links may use 350 mW each.
Apparently you can directly use the MGT transceivers as ethernet PHYs.
Conversely, if the transmitter is running a little faster than the receiver,
it may start to fill the elastic buffer faster than the receiver can drain it.
In that case, as I understand it, the elastic buffer will start to
delete protocol-specific IDLE characters from the buffer.
Ah, but how did those IDLE characters get into the buffer? I believe
the transmit-side protocol adapter has to insert a certain number of
IDLE characters, perhaps between packets, or otherwise, into the
stream of characters to be transmitted, so as to give the receiver's
elastic buffer something to drop in the event that the transmitter
is outrunning the receiver.
Is that right?
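To check my own understanding, here is a toy model of the elastic buffer as I have described it. It is a sketch of my reading of the talk, not of the actual MGT logic: the class, the half-full threshold, and the tick-based clocks are all assumptions of mine. The buffer hands out an IDLE when it is empty, and discards an incoming IDLE when it is at least half full.

```python
from collections import deque

IDLE = "IDLE"


class ElasticBuffer:
    """Toy receive-side elastic buffer (hypothetical, for illustration)."""

    def __init__(self, depth=16):
        self.depth = depth
        self.high_water = depth // 2   # assumed threshold for dropping IDLEs
        self.fifo = deque()

    def write(self, ch):
        """A character arrives from the link (recovered-clock domain)."""
        if ch == IDLE and len(self.fifo) >= self.high_water:
            return                     # filling up: drop this IDLE
        if len(self.fifo) >= self.depth:
            raise OverflowError("buffer full and no IDLE to drop")
        self.fifo.append(ch)

    def read(self):
        """The client takes one character (user-clock domain)."""
        if not self.fifo:
            return IDLE                # drained: insert an IDLE
        return self.fifo.popleft()


def make_stream(packets, data_len=16, idle_len=4):
    """Transmit side: packets of data characters separated by IDLEs."""
    stream = []
    for p in range(packets):
        stream += [f"D{p}.{i}" for i in range(data_len)]
        stream += [IDLE] * idle_len
    return stream


def run_link(tx_stream, tx_period, rx_period, depth=16):
    """Write every tx_period ticks, read every rx_period ticks."""
    buf = ElasticBuffer(depth)
    pending = deque(tx_stream)
    received = []
    t = 0
    while pending or buf.fifo:
        if pending and t % tx_period == 0:
            buf.write(pending.popleft())
        if t % rx_period == 0:
            received.append(buf.read())
        t += 1
    return received


def strip_idles(chars):
    """What the protocol adapter would pass up: the payload only."""
    return [c for c in chars if c != IDLE]


if __name__ == "__main__":
    stream = make_stream(20)
    fast_tx = run_link(stream, tx_period=9, rx_period=10)
    slow_tx = run_link(stream, tx_period=10, rx_period=9)
    print(len(stream), len(fast_tx), len(slow_tx))
    print(strip_idles(fast_tx) == strip_idles(stream))
    print(strip_idles(slow_tx) == strip_idles(stream))
```

In this model the payload survives a clock mismatch in either direction, but only because the transmit side leaves IDLEs between packets; with `idle_len=0` and a faster transmitter, the buffer eventually overflows.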
Another issue: I asked about latency through the MGT. Say you have one
MGT send just two words (8 bytes) to an adjacent MGT; and then user
logic at that second MGT sends the two words right back to the first.
What is the round trip time? I was told it may be some tens of user
clocks of latency before the first receiver sees the 8 bytes, and
then another some tens of user clocks before the data is received back
at the source.
Oh well, if that is so, it may reduce the utility of these links as
a low-latency interprocessor-cluster-interconnect in a scalable MP scenario.
High bandwidth yes, low latency, maybe not.
Is this information specified somewhere?
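For a rough feel for the numbers, here is a back-of-the-envelope estimate. It is only a sketch under assumptions I have not confirmed: that the link uses 8b/10b coding (10 line bits per byte), that the user interface is 4 bytes wide, and that "some tens" of clocks means about 30 each way.

```python
LINE_RATE_BPS = 3.125e9   # full-speed line rate quoted above
LINE_BITS_PER_BYTE = 10   # assumption: 8b/10b line coding
WORD_BYTES = 4            # "two words (8 bytes)"


def user_clock_hz():
    """User clock rate if one 4-byte word moves per user clock."""
    return LINE_RATE_BPS / LINE_BITS_PER_BYTE / WORD_BYTES


def round_trip_ns(clocks_each_way):
    """Round-trip time in ns for a given one-way latency in user clocks."""
    return 2 * clocks_each_way / user_clock_hz() * 1e9


if __name__ == "__main__":
    print(f"user clock: {user_clock_hz() / 1e6:.3f} MHz")
    for clocks in (20, 30, 40):   # hypothetical values for "some tens"
        print(f"{clocks} clocks each way: {round_trip_ns(clocks):.0f} ns")
```

Under those assumptions the user clock is about 78 MHz, and 30 clocks each way comes to roughly three quarters of a microsecond for the round trip.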
Xilinx is to be congratulated for democratizing these advanced technologies
and putting them and the tools needed to access them in the hands of
thousands of designers who are not necessarily "big companies".
Nevertheless, one is taken aback at the considerable detail and complexity
of this new system platform. One could reasonably absorb 90%
of the details of the XC4000 programmable logic fabric in
a few hours. For Virtex, a few days might be required to also grok
BRAMs, DLLs, etc. For Virtex-II, more time. But in Virtex-II Pro,
there is tremendous flexibility, power, and yes, complexity, inherent
in interfacing to the embedded hard cores; there are mixed software
and hardware design scenarios; and so on and on.
Xilinx and its partners are going to be challenged to tie a neat bow
around these technologies, using reference designs, using
complexity-busting "easy IP", using tools like their forthcoming System Generator
for PowerPC, so that engineers new to SoC design, embedded processors,
or high speed interconnects, can successfully apply all this great silicon.
That's a different kind of challenge than designing a great FPGA fabric
or a better place-and-route algorithm. It will be interesting to see
how they do.
(If I am not mistaken, the current best offering in this department is the
Virtex-II Pro Developer's Kit, $95.)
Will Apple's OS X ever be ported to a Virtex-II-Pro-based platform?
Might FPGA-based hardware media acceleration and/or reconfigurable
computing make a compelling platform for some future
Macintosh Media machine?
What is the next hard IP block destined for IP immersion?
Will demand ever lead Xilinx to field a Virtex-II-Pro+, which,
by analogy with Virtex-EM, contains a factor of 2-4x again more
Anthony Cataldo, EE Times:
FPGA vendors close in on 3.125-Gbit/s serial I/O.
'"There's no comparison between a standalone processor and one that's
immersed in an FPGA," said Kent Dahlgren, a member of the technical
marketing staff. "The bandwidth we have is far more important than [CPU]
Mips or megahertz."'
(Just as the bandwidth to
of instances of compact soft CPU cores
will dwarf the bandwidth to a handful of hard cores.)
SoC designers describe their 'best practices'.
'"Increasingly, in the future, we are going to see multiprocessor
SoC devices and multithreading cores."'
On the fpga-cpu list, Anand Gopal Shirahatti
"... What I was wondering is, are there are Implementations of the TCP/IP
Implementation over a Single FPGA, for mutilple connections. ..."
The simplest thing to do is run a software TCP/IP stack on a soft CPU core.
For example, at ESC I saw
TCP/IP running on uClinux on Altera Nios with a CS8900A Ethernet MAC.
Note that a compact FPGA CPU core with integral DMA (e.g. xr16)
may be hybridized into the data shovel aspect of an ethernet MAC.
(Flexibly shovel the incoming bits to/from buffers, etc.) Indeed, one
enhanced FPGA CPU might (time multiplexed or otherwise) manage several
You can also build hardware implementations of the TCP/IP protocol itself. There are several such implementations in custom VLSI. For FPGA approaches, see:
And related things:
Smith et al's XCoNet.
"The SiliconServer runs all normal TCP/IP functionality in state machine
logic with a few exceptions that are currently dealt with by software
running on the systems attached processor (e.g. ICMP traffic, fragmented
Legacy ISA soft cores in FPGAs?
Peter Alfke of Xilinx
"...IMHO, both PowerPC and ARM are too complex to be implemented as soft
Implementations of integer subsets of MIPS, ARM, and PowerPC architectures
are not too complex to be implemented as soft cores. One can produce an
integer MIPS-I soft core as "small" as MicroBlaze; and I have done a
spreadsheet analysis/design study for an FPGA-optimized PowerPC Book I soft
core that costs between 1200 and 2000 LUTs (1.3-2.2x the size of MicroBlaze),
depending upon performance tradeoffs and whether or not you trap and emulate
certain rare and expensive instructions.
The only thing holding back fast (100 MHz) relatively compact (800-2000
LUTs) FPGA-optimized soft core implementations of subsetted commercial RISC
instruction set architectures is the intellectual property landscape.
I am surprised that certain processor IP companies, that lack a hard core
programmable logic platform, and may therefore be losing certain design wins
to ARM and PPC, have not yet launched soft FPGA-optimized processor core
products. Perhaps they too think it infeasible or impractical.
(Advertising: my company can help prove otherwise -- we may be available
to develop FPGA-optimized soft cores for processor IP licensors.)
I predict that sooner or later all processor IP licensors will come to the
realization that programmable logic has become the air that a great many of
their designers breathe, and that eventually all processor IP licensors will
offer or endorse FPGA-optimized soft processor core implementations of their
ISAs. To not do so would be to surrender a quickly growing market segment
to their competitors. I put that date around 2005.
There is no defense against the ATTACK OF THE KILLER FPGAS!
I also feel that binary translation (static or dynamic) will become
important and then commonplace, both as a way to run legacy ISAs on
streamlined FPGA-optimized cores, and as a way to run full ISAs on subsetted
Billion transistor FPGAs and defects
After yesterday's entry,
asked (on the fpga-cpu list),
"Now with that many transistors how is failing/defective
transistors/CLB's handled? Need one design error detecting
logic in the new cpu ISA's? While I know the decimal machines
of the 1950's often had error detecting codes like 2 bits out
5 that not only detected storage problems it detected alu
problems too. Is there anything simple for today's binary
machines in re-coding information for storage and arithmetic
to detect possible problems?"
Each and every one of those transistors tests out "perfectly" at the
factory. I understand that the tester downloads a number of configuration
bitstreams that fully exercise and cover the configuration memory,
the CLBs, interconnect, etc.
(((Wacky idea: I understand that testers step over each die on the
unsawn wafer, pressing probe wires to the die's pads, powering it up,
and running some test circuits. I wonder, is it practical to add power,
ground, and JTAG-like test paths, between dice, to interconnect the
dice on the unsawn wafer and thereby test entire wafers in parallel?
You would still need to step the tester over each die to check out I/O
defects, but since most internal logic defects would already have been
diagnosed, the tester would not need to spend much time on known bad dice.
Then you collect the self-test and tester-based test results and saw
and keep the good dice, the EasyPath dice, etc.)))
Altera APEX: BTW in APEX parts, Altera
reportedly uses redundancy to improve yield and hence lower cost.
EasyPath: Since only a fraction of FPGA transistors matter for a given
configuration, the keen idea of the EasyPath product, as I understand
it, is to qualify partially defective dice against a fixed configuration
(or at any rate, a set of test configurations that covers the resources
required by the fixed configuration).
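As I understand the idea, qualification then reduces to a set-disjointness test: a die passes for a given design if none of its defective resources is one the design uses. A minimal sketch follows; the resource naming is hypothetical, and the real flow presumably runs test configurations rather than consulting a defect map.

```python
def die_qualifies(defective, required):
    """True if no resource the fixed configuration needs is defective."""
    return defective.isdisjoint(required)


# Hypothetical resource identifiers, for illustration only.
design_needs = {("CLB", 0, 0), ("CLB", 0, 1), ("BRAM", 2)}
die_a_defects = {("CLB", 5, 9)}   # defect in a resource the design leaves unused
die_b_defects = {("BRAM", 2)}     # defect in a resource the design uses

if __name__ == "__main__":
    print(die_qualifies(die_a_defects, design_needs))
    print(die_qualifies(die_b_defects, design_needs))
```

Die A would ship for this design even though it is not a perfect part; die B would not.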
That said, factory perfect FPGAs may still have failures in the field.
Coping with those failures is a rich subject. Here are just a few
- You can use readback to read the configuration bitstream. You can
even read it back to an internal circuit within the FPGA. There you can
compute a signature on the bitstream and so detect if it has changed
through some kind of SRAM upset. You can even continuously readback
the configuration and test it is pristine every second (or more often
- In one FPGA you can build two or more processors, and run them in
lock step, comparing the write-back results of each processor each cycle.
This can detect when one diverges from the other. I really think this
is the easiest thing to do, at least to detect faults.
You can also build a TMR (triple modular redundancy) system. (And I think I would have more confidence in a system done across three FPGAs than all on one.) And as in big systems you can always put EDAC (ECC) on the buses and/or RAMs in your system.
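The checks described above are simple at heart. Here is a software sketch of three of them, with hypothetical names throughout: a signature over readback frames (CRC-32 standing in for whatever signature you prefer), a lock-step comparison of two processors' write-back values, and a bitwise majority vote for TMR. A real readback check would presumably also need to mask out bits that legitimately change at run time, such as LUT RAM contents and flip-flop state.

```python
import zlib


def config_signature(frames):
    """CRC-32 over a sequence of readback frames (bytes objects)."""
    crc = 0
    for frame in frames:
        crc = zlib.crc32(frame, crc)   # chain the running CRC across frames
    return crc


def config_is_pristine(frames, golden_signature):
    """True if the readback still matches the known-good signature."""
    return config_signature(frames) == golden_signature


def lockstep_diverged(writeback_a, writeback_b):
    """Compare this cycle's write-back values of two lock-stepped CPUs."""
    return writeback_a != writeback_b


def tmr_vote(a, b, c):
    """Bitwise majority of three redundant results."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    frames = [bytes([0x12, 0x34]), bytes([0x56, 0x78])]
    golden = config_signature(frames)

    # Simulate a single-bit SRAM upset in the second frame.
    upset = [frames[0], bytes([0x56, 0x78 ^ 0x01])]
    print(config_is_pristine(frames, golden), config_is_pristine(upset, golden))

    # One of three replicas has a flipped bit; the vote masks it.
    print(hex(tmr_vote(0xCAFE, 0xCAFE ^ 0x10, 0xCAFE)))
```

Note that the lock-step comparison only detects a fault; it takes the third copy and the vote to mask one.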
Designers of aerospace systems have to worry about this all the time.
See for example the
Here are PDF slides of Xilinx's Peter Alfke's talk,
Evolution, Revolution, and Convolution: Recent Progress in Programmable Logic.
(Xilinx techXclusives version noted here earlier).
It's quite Xilinx-centric, but still well worth reading,
chock full of important issues and good ideas.
"FPGAs circa 2005"
The theme of one of the best issues of IEEE Computer ever,
Sept. 1997, was The Future of Microprocessors.
The introductory article was
Burger et al, Billion-Transistor Architectures.
It seems very likely to me that the first billion transistor microprocessors
will be FPGA chip-multiprocessors.
- "50 Million system gates"
- "2 Billion transistors on one chip" (my emphasis)
- "70-nm process technology"
- "10-layer Cu technology"
- "Hard and soft IP blocks"
- "1 GHz embedded processor"
- "Mixed-signal Intellectual Property"
- "10-Giga-bps I/O channels"
FPGA CPU News, Vol. 3, No. 4
Back issues: Vol.3 (2002): Jan Feb Mar; Vol. 2 (2001): Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.