So that's what a magnitude 6.8 earthquake (40 miles away) feels like.
Impressions of Mercury
More on the new 1.8V Altera Mercury family, based upon my impressions of
the Mercury data sheet.
First let's compare this family, which offers 4800-14400 LEs,
with APEX 20KE devices offering 1200-51000 LEs. My take is that Mercury
should be regarded as complementary to other Altera families.
Or is it a peek at the shape of things to come?
It's interesting that the Mercury family does not seem to use the MegaLAB
architecture of the 20K family (per se). For example, the ESBs are
schematically located at the top and bottom of the device rather than
distributed about the fabric.
Compared with earlier Altera families, Mercury seems to have an additional
category of fast horizontal inter-LAB interconnect called RapidLAB.
"The local interconnect can drive LEs within the same LAB or adjacent LABs.
This feature minimizes use of the row and column interconnects, providing
higher performance and flexibility. Each LAB structure can drive 30 LEs
through fast local interconnects."
I am not surprised. As proposed FPGA architectures are increasingly
validated and tuned by recompiling existing customer designs against
them, perhaps over time all of the various FPGA architectures will evolve
into the same basic topology, varying only on the way the hierarchies
of local, intermediate, and global interconnect are "chunked".
So for example, we'll have Virtex-II with 8 LUTs per CLB competing with
10K/20K/Mercury at 10 LEs per LAB. And now we'll have a more "local"
extra-LAB interconnect in Mercury.
On another tangent, I wonder if Altera's "Advanced Redundancy"
yield enhancement technology
is applicable with this new RapidLAB and Leap Line local inter-LAB
interconnect.
Arithmetic now uses an interesting carry-select lookahead architecture,
which treats each 4-LUT as 4 2-LUTs. I suppose I still prefer the
Virtex architecture, which makes it possible to implement
o[i] = addsub ? (addf1 ? a[i]+b[i] : a[i]-b[i])
: (addf1 ? f1(a[i],b[i]) : f2(a[i],b[i]))
in one 4-LUT per bit.
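For concreteness, here's a behavioral Python sketch of the carry-select idea (my own model, not Altera's circuit): each block's sum is precomputed for both possible carry-ins, and the real carry arriving at the block selects between them.

```python
def carry_select_add(a, b, width=16, block=4):
    """Carry-select addition: each block's sum is precomputed for
    carry-in 0 and carry-in 1; the incoming carry picks one."""
    mask = (1 << block) - 1
    carry = 0
    result = 0
    for i in range(0, width, block):
        ab = (a >> i) & mask
        bb = (b >> i) & mask
        s0 = ab + bb            # block sum assuming carry-in 0
        s1 = ab + bb + 1        # block sum assuming carry-in 1
        s = s1 if carry else s0
        result |= (s & mask) << i
        carry = s >> block      # carry out of this block
    return result & ((1 << width) - 1), carry
```

In hardware the two candidate sums are computed in parallel and the late-arriving carry merely steers a mux, which is the whole point of the scheme.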
The Mercury multipliers are not separate dedicated multipliers
(as in Virtex-II) but rather are built from programmable logic LEs,
aided by a dedicated multiplier mode for forming partial products
and adding them together (binary tree adder).
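A rough Python model of that multiply scheme -- AND-gated, shifted partial products summed by a balanced binary adder tree. This is a behavioral sketch, not the actual LE mapping:

```python
def tree_multiply(a, b, width=8):
    """Form one shifted partial product per multiplier bit, then
    reduce them with a balanced binary adder tree."""
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    # Pairwise tree reduction: log2(width) levels of adders.
    while len(partials) > 1:
        partials = [partials[j] + (partials[j + 1] if j + 1 < len(partials) else 0)
                    for j in range(0, len(partials), 2)]
    return partials[0]
```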
Quad-port ESB RAM
Figure 17, the "ESB Quad-Port Block Diagram", diagrams a 4-port
RAM with two read and two write ports, and indeed with four sets
of address lines and control lines.
In this mode, the ports can only be up to x16. Nonetheless if this mode
is fast, it could make a nice building block for a two-issue
superscalar RISC, using two ESBs to provide a 2-write 4-read register
file. (One challenge in implementing such architectures is retiring two
results per cycle.)
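Here's a behavioral Python sketch of that register file trick, using replication: both ESB copies see every write, and each copy serves two of the four reads. (Read-before-write ordering within a cycle is my assumption, not something the data sheet promises.)

```python
class QuadPortESB:
    """Behavioral model of a 2-write, 2-read quad-port RAM block."""
    def __init__(self, depth=256):
        self.mem = [0] * depth
    def cycle(self, writes, reads):
        # Assumed read-before-write: reads sample old contents.
        data = [self.mem[a] for a in reads]
        for addr, val in writes:
            self.mem[addr] = val
        return data

class RegFile2W4R:
    """Two replicated ESBs: both receive every write,
    and each serves two of the four read ports."""
    def __init__(self):
        self.bank0, self.bank1 = QuadPortESB(), QuadPortESB()
    def cycle(self, writes, reads):
        assert len(writes) <= 2 and len(reads) <= 4
        d0 = self.bank0.cycle(writes, reads[:2])
        d1 = self.bank1.cycle(writes, reads[2:])
        return d0 + d1
```

Since both copies always hold identical contents, any read port can see any register, which is exactly what a two-issue machine needs.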
Compare this to Virtex/II, where you can write two locations per cycle
but only read back what you wrote -- or optionally in -II, what used
to be at those overwritten locations. In contrast, with Mercury
ESBs you seem to be able to read back two other arbitrary locations.
Just as with Virtex, the write port can use a different data width
than the read port. This is essential for providing support for
SERDES -- receive and deserialize a high speed port into a RAM-based FIFO,
process the data at lower frequency but wider width, then deposit
results into another variable-width FIFO to serialize and transmit it.
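The deserializing half of that scheme might look like this in Python -- a behavioral sketch with an assumed MSB-first bit order, not any vendor's FIFO primitive:

```python
from collections import deque

class WidthConvertingFIFO:
    """FIFO written one bit at a time (serial side) and read
    out_width bits at a time (parallel side), MSB-first assumed."""
    def __init__(self, out_width=8):
        self.out_width = out_width
        self.bits = deque()
    def push_bit(self, b):
        self.bits.append(b & 1)
    def pop_word(self):
        if len(self.bits) < self.out_width:
            return None          # not enough bits deserialized yet
        w = 0
        for _ in range(self.out_width):
            w = (w << 1) | self.bits.popleft()
        return w
```

In a real design the dual-port RAM's differing port widths do this conversion for free; the model just makes the data movement explicit.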
Unfortunately, the Mercury data sheet does not seem to specify the
effect of simultaneously reading/writing a memory cell on more than one port.
It is also notable that the ESBs provide a TurboBit to run
less-speed-critical RAMs at lower speed and lower power.
"ESBs can implement synchronous RAM, which is easier to use than
asynchronous RAM. A circuit using asynchronous RAM must generate the RAM
write enable (WE) signal while ensuring that its data and address signals
meet setup and hold time specifications relative to the WE signal."
Hmm. Is this marketing? If so, I don't think it's effective. As I recall,
the 1995 XC4000E ended the XC4000-era of WE glitch generators.
Soliciting guest commentary
Although it doesn't happen much, we're open to publishing
relevant and interesting guest commentary here.
For example, if you work at Altera and wish to comment
or elaborate upon this commentary, or fill in the details, please
drop me a line or send your
comments to the fpga-cpu list.
Murray Disman, ChipCenter:
Altera Ships Mercury Family.
This appears to be Altera's response to Virtex-II/Pro, and it has some
interesting features. Besides adding 8-18 1.25 Gb/s channels, Mercury
provides up to 100 8x8 "Distributed Multipliers". But perhaps of most
utility to FPGA CPU designers:
"Altera's Mercury devices include embedded memory via new quad-port embedded system blocks (ESBs) each of which contains 4,096 programmable bits and can support up to four independent operations at once."
David Lammers, EE Times:
Altera chips join PLD with gigabit transceiver.
In the Teaching dept., here's a fun write-up from 1998 of
four New Mexico Tech CS331 students' experiences building FPGA CPUs:
How the Puerco was born.
This is so good and echoes so many of our themes that
I can't help but quote three whole paragraphs:
"The hardest part was making our CPU fit on the chip. Our first
synthesization produced something that used up %400 of our CLB's
(combinational logic blocks). We minimized and minimized and threw out
every extraneous bit of logic and went from 32 bits to 9 bits, the minimum
needed to hold our opcode and any useful number of addresses. Once we
got the 9-bit CPU to fit, we began discovering all sorts of sneaky ways
to avoid using CLB's. Eventually, we got the whole pipelined 32 bit CPU
on the chip, which made us ecstatically happy. The whole experience was
good preparation for the frustration that must be found in, say, tax law,
or counting poppy seeds."
"From Puerco Jr. to Puerco. The entire time we working on the Puerco Jr.,
our non pipelined CPU, we were dreading the Puerco. Pipelining sounds
hard. It must be more difficult to design a processor that runs three
instructions at the same time than one."
"Then we actually started designing it. It was easy! Basically, it did
the same thing every cycle instead of different things each cycle. The
way we handled invalid instructions was to simply add an invalid bit to
each stage. If the instruction was valid, the results got written out
to memory or latched into a register. If it wasn't valid, the results
of that stage just got ignored. Piece of cake."
(See also Puerco links from Ben Sittler's old
Clearly these former undergraduates learned some things that are not
taught in any textbook nor observed in any simulator.
(But note also the "average of 20 hours per person per week" times
four students. Ouch, that's a tough workload. In my teaching paper I argue
that a well designed course can provide students a working framework
so that the total workload need not be so oppressive.)
[update 02/19/01] Henson:
"The actual design took us only a couple of weeks, figuring out how to
just plain use the software took us much longer. ...
The most useful way to decrease the coursework load
would be a short training session with the software, better documented
software, and better software, period. Designing the state machine
and physical layout of the CPU in each group individually was
definitely worth the time. Since we had to use what we designed, we
put a lot more thought and effort into making our designs simple and
robust. The rest of the semester was spent cussing at the software."
Yet more FPGAs in EE Times
Bernard Cole, EE Times:
Programmable-chip methods get fresh look.
Anna S. Chiang, Altera, in EE Times:
Programming enters designer's core.
An interesting, exhaustive list of requirements for a development platform
for an embedded processor with programmable logic, as exemplified by
the Altera Excalibur program, including its Quartus II with SoPC Builder.
The Xilinx XtremeDSP/Virtex-II Simulcast is now available as a series of
sessions with accompanying PDF slide sets.
My earlier comments.
I highly recommend viewing Erich Goetting's presentation
on Virtex-II. There's some great motivation on features like
the XCITE controlled impedance technology, the IP Immersion architecture,
and so forth.
After registering with Xilinx, you can also download the corresponding
slide set, named module7.pdf.
The shape of things to come
In particular, I would like to bring to your attention slide 83 of 88.
It shows a diagram of a huge FPGA with four embedded PowerPC cores and
what appear to be 12 (top edge) + 12 (bottom edge) Conexant 3.125 Gb/s
serial link cores -- which would be about 75 Gb/s/chip of link bandwidth.
Zooming way in, this diagram appears to depict a monster 136x104 CLB part,
which would be well over 100,000 logic cells, and
apparently with 18 columns of block RAMs and multipliers (apparently
556 in all) and about six columns of CLBs per block RAM.
I had been disappointed that the larger
Virtex-II devices seemed relatively block-RAM-port poor.
For example, the Virtex-II data sheet states the 120-CLB-column '2V10000
will have only 6 columns of block RAMs, or on average, only one column
of block RAMs per 20 columns of CLBs.
But at 6 CLB columns per block RAM column, this monster FPGA
assuages this concern -- and is quite reminiscent of the generously-RAM-endowed
Indeed, the apparent 556 18 Kb block RAMs would total about 1.2 MB of
block RAM, and might offer a total bandwidth of about
556 * 2 * 36 b at (say) 200 MHz = 8 Tb/s!
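Checking that arithmetic (the 200 MHz clock is my assumption, as above):

```python
# Back-of-the-envelope capacity and bandwidth for 556 18 Kb block RAMs.
block_rams = 556
capacity_bits = block_rams * 18 * 1024            # total block RAM bits
capacity_mbytes = capacity_bits / 8 / 2**20       # ~1.2 MB
ports = 2                                         # dual-ported
bits_per_port = 36
clock_hz = 200e6                                  # assumed clock rate
bandwidth_bps = block_rams * ports * bits_per_port * clock_hz  # ~8 Tb/s
```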
Immersed IP footprints
Each PowerPC core appears to displace 4 rows by 2 columns of block RAMs, plus
apparently 16 rows by 2+6+2 columns of CLBs.
Each serial link core appears to displace one single
block RAM and hardware multiplier block.
500 soft CPUs per chip?
At apparently 4 rows by 6 columns of CLBs per block RAM, this could be just about
the perfect pitch for those of us with delusions of chip multiprocessors,
since our area-optimized 16-bit CPU core (which uses one block RAM)
should use 4 rows by 6-7 columns of CLBs in Virtex-II.
In such a device without IP immersion, this would yield about 34 rows by 16 columns of processors = 544 16-bit CPUs per die.
Subtracting the areas partially covered by serial-link or processor
hard cores (spanning apparently 13 of "my soft processor core tiles"
per quadrant) would leave about 492 simple 16-bit CPUs per monster die.
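The tile arithmetic, spelled out (the tile dimensions are my reading of the diagram, with 6.5 CLB columns as the midpoint of the 6-7 estimate):

```python
# Soft-CPU tiling of the apparent 136 x 104 CLB array.
clb_rows, clb_cols = 136, 104
tile_rows, tile_cols = 4, 6.5         # ~4 x 6-7 CLBs per 16-bit CPU tile
cpu_rows = clb_rows // tile_rows      # 34 rows of processor tiles
cpu_cols = int(clb_cols / tile_cols)  # 16 columns of processor tiles
cpus = cpu_rows * cpu_cols            # 544 CPUs without IP immersion
displaced = 4 * 13                    # ~13 tiles lost per quadrant to hard cores
remaining = cpus - displaced          # 492 CPUs after immersion
```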
Since the area of a Virtex-II-optimized compact simple 32-bit RISC core
will be about 8 rows by 6-8 columns of CLBs, and since each PowerPC core
seems to displace twice that many CLBs, we obtain this counterintuitive
rule of thumb:
one streamlined 32-bit soft CPU core optimized for programmable logic
might need only half the silicon area of an elaborate 32-bit hard CPU core.
It's apples to oranges -- the PowerPC hard core runs much faster,
has much more cache memory and many additional instructions and features
-- but it does kind of turn conventional wisdom on its ear!
Put on your thinking caps
Whether this diagram depicts a hypothetical planned device,
a trial balloon, a clever misdirection, or something else, does not matter.
It is clear that this shows some flavor of the shape of
things to come. It's time to start thinking imaginatively
about how to best use such a monster -- not to mention a rack full of them.
In some ways, a 500 CPU MIMD or a 1000 CPU SIMD per chip,
or even a 100-trillion-instructions-per-second
1,000,000 CPU MIMD (20 boards, 100 chips per board,
500 CPUs per chip, 100 MIPS per CPU) is just about the
least imaginative use possible.
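For the record, the arithmetic behind that machine:

```python
# A rack-scale FPGA chip multiprocessor, by the numbers.
boards = 20
chips_per_board = 100
cpus_per_chip = 500
mips_per_cpu = 100
cpus = boards * chips_per_board * cpus_per_chip    # 1,000,000 CPUs
instrs_per_sec = cpus * mips_per_cpu * 1e6         # 1e14 = 100 trillion/s
```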
Mind boggling stuff.
FPGA CPU News, Vol. 2, No. 2
Back issues: Vol. 2 (2001): Jan; Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.