FPGA CPU News of March 2002



Sunday, March 31, 2002
Unfortunate Yahoogroups marketing preferences change
Our fpga-cpu list was originally hosted by egroups.com. They were acquired by Yahoo!, and since then, some members have subscribed via a Yahoo! account.

Please take a moment to read this:

'Yahoogroups just did something outrageous: they silently added something called "Marketing Preferences" to your user profile, and the default is opt-in for spam. ...'
If you have a Yahoo! account, I encourage you to review your "Marketing Preferences".

Saturday, March 30, 2002
Ron Wilson, EE Times: The Dirty Little Secret. New problems caused by die level process variations.
'"Look at interconnect," he said. "When you make a coast-to-coast phone call, you don't depend on uniform wire thickness between here and New York. You use protocols that make your transmission independent of the physical medium. We will do the same thing on ICs: protocol-based communications with error recovery instead of point-to-point wiring."'
You know, if you stop to think about it, EE Times is just so damn good. It's their wonderful reporters. It's a joy to read their pieces. They go beyond regurgitating press releases and find new angles on stories, relate inside perspectives and motivations behind the PR spin. Often, as in this case, they go to conferences that we can't, and allow us to share, vicariously, in some of the interesting buzz. They're obviously techies. You don't catch EE Times reporters confusing megabits and megabytes. They seem more resistant to bamboozling by marketing. And they seem to report more news stories, faster, than the other publications (many of which are also very good).

I visited the Cahners/EE Times booth at ESC and said as much.

Friday, March 29, 2002
Two years
Happy second birthday to XSOC/xr16, and to this web site, and to the fpga-cpu list. Two years ago this month, Circuit Cellar ran part one of the XSOC series.

Since then, we have witnessed high powered FPGA SoC/CPU product offerings, first from Altera, whose Nios product truly legitimized the market, and then Xilinx, and have seen a groundswell of interest in this field.

Sometime this winter this site passed a half million "page views", and of late we average about 600 "visitors" per day. The mailing list has over 450 subscribers and has seen over 1,000 messages.

To all of you, dear readers, a sincere thank you for your interest.

Microsoft ships Rotor, a Shared Source implementation of the Common Language Infrastructure. Beta download. The License. DevelopMentor's DOTNET-ROTOR mailing list.

David Stutz, Microsoft: An Architectural Tour of Rotor.

Brian Jepson: Get Your Rotor Running.

(Meanwhile, work continues apace on Mono, "an Open Source implementation of the .NET Development Framework". Recently Mono reached the important milestone of self-hosting most of itself.)

My earlier raves on .NET.

"You may be wondering, how is .NET relevant to FPGA CPUs? Among other things, the .NET CLR looks like a great platform to build multi-language cross-compilers for new instruction set architectures."

Got .NET?

New FPGA CPU Projects
Martin Schoeberl's JOP (Java Optimized Processor). You'll enjoy the pages detailing the evolution of the design, and the comprehensive set of links to other Java processors.

JOP Hardware Status:

  • "processor core runs on an Altera ACEX 1K30-3 (a $13 FPGA) with 24 MHz"
  • "JOP1 core: about 600 LCs and 6 EABs (1/3 of the FPGA), max. clock 30 MHz"
  • "JOP2 core: 380-450 LCs with 40 - 45 MHz in an ACEX 1K30-3"
  • "JOP3 core: 900 LCs, max. clock 44 MHz in an ACEX 1K30-3"
  • "two JOP2 cores fit in one 1K30 (and are running)"
  • "periphery: IO port, SRAM- and Flash interface, UART, ECP, IIC"
  • "performance: JOP3 is about 11 times faster than interpreting JVM on an Intel 486"
Note "JVM: about 25% of byte codes implemented"...

Ali Mashtizadeh announces his HyperMTA project at OpenCores.org.

It is to be an MP-MT-VLIW (a multiprocessor of multithreaded VLIWs). My thoughts.

  1. It is easy to implement a VLIW in an FPGA -- although personally I'd stick to an LIW.
  2. It is straightforward to implement an MT-VLIW in an FPGA, using BRAM for multi-context register files.
  3. It is challenging to implement a memory system for an MT-VLIW in an FPGA, assuming it is expected to keep up with one load or store per cycle.
  4. It is very challenging to implement a memory system for a multiprocessor of MT-VLIWs.
  5. It is very, very hard to implement an effective suite of compilation tools for an MP-MT-VLIW.
That said, good luck, I think MTAs are a plausible and promising approach to extracting performance and available parallelism from dusty deck source code.

We discussed multithreaded FPGA processor architectures back in November.

Upcoming events
Programmable World 2002: April 17, 2002, at a simulcast near you.

FCCM'02: 2002 IEEE Symposium on Field-Programmable Custom Computing Machines, April 21 - April 24, 2002, Napa Valley, CA. Early registration ends April 1!

Xilinx EasyPath
Xilinx: Xilinx Introduces Breakthrough Cost Reduction Technology for Virtex-II FPGA Family. FAQ (PDF). Press pitch (PDF slides).

Hmm, I thought I saw some slides which discussed die failure modes, bridging faults, used and unused LUTs and PIPs, and the like, but now I can't find them to link to them. Help, anyone?

The general idea, of hosting logic in a partially-defective FPGA, is not new (and may well be one of the most important ideas in the coming era of molecular electronics).

Bruce Culbertson et al, HP Labs, Defect Tolerance on the Teramac Custom Computer (PS) (PDF).

Philip Kuekes, HP Labs, on MURL: Defect Tolerant Molecular Electronics.

However, prior to EasyPath, I am not aware of any proposal to take one or more fixed designs and test partially defective FPGA lots against those designs' resources.

I wonder: is the cost reduction a function of utilization? If I design a floorplanned datapath that (more or less) uses every input of every 4-LUT; uses every block RAM; uses as many interconnect resources as possible -- would that design cost more to EasyPath-ize than a more scattershot one produced by a synthesis tool and scrambled up by PAR?

And I wonder if Xilinx will also permit a customer with a reconfigurable design to test each configuration of that design? Taken to the extreme, the answer would have to be no, otherwise a cost-sensitive smart aleck customer could submit their own set of reconfigurable 100% test coverage designs, then buy lots of defect-free EasyPath FPGAs. :-)

Anthony Cataldo, EE Times: Xilinx looks to ease path to custom FPGAs.

Crista Souza, EBN: Xilinx's EasyPath seeks to skirt costly redesigns.

[updated 04/16/02]
Murray Disman, ChipCenter: Xilinx Introduces EasyPath.

Altera SOPC Builder and Microtronix Embedded Linux for Nios
For me, one of the highlights of the Embedded Systems Conference was the Altera booth demonstrating Altera's SOPC Builder 2.5 for Nios, alongside Microtronix's Embedded Linux Development Kit for Nios (previous sighting).

To the Nios development board, Altera and Microtronix added a card with some FLASH and RAM, another with an interface to an IDE hard disk, and another with a CS8900A-based 10baseT Ethernet interface.

What a thrill to power the thing on, watch uCLinux boot, uncompress itself, autoconfigure, create a RAM file system, and then bring up a login: prompt.

I logged in and found myself in /bin/sh (or facsimile). The file system was necessarily austere and (no surprise) the Nios GCC dev tools were not there, but it was still great to run "ps" and find a dozen processes running including inetd. Yes, this demo was configured to serve up telnet and http back to the PC through the ethernet connection.

Now let me tell you about SOPC Builder. Altera has put a lot of good thinking, and polish, into this tool. It lives in Quartus II (I think) and presents you with a helpful GUI to configure your Nios processor, your peripheral modules, your Avalon bus, bus masters, interrupt priority assignments, and so forth.

There was considerable depth to the product. For example, it seemed to ease the complicated job of configuring a system with arbitrary topologies of multiple buses, bus masters, peripherals, interrupt sources, etc. by presenting the entire configuration in a clever "spreadsheet" format that shows which modules connect to which modules on which buses.

And SOPC Builder is slick. For example, as you check or uncheck feature boxes, you see a running total of the number of LEs required for the design. This works even for third party modules (cores). Another example: as you add modules to your design, SOPC Builder helps configure a basic test bench to test those modules. For example, if you instantiate a UART, you can also specify a string of characters to transmit into the UART in the test bench.

Significantly, the design of SOPC Builder facilitates third party extensions. Third party module authors write scripts (text files) to define new tabbed dialog box property pages for their modules. It appeared that you can even embed Perl code into the scripts to perform arbitrary actions (e.g. LE counts) as the system designer specifies module options and parameters.

One of the more impressive SOPC modules (in beta?) was the Microtronix uCLinux kit itself. It seemed to allow easy configuration of many uCLinux parameters in the software domain. Here's a Microtronix PowerPoint deck demonstrating this module.

Now here's a scenario that shows the potential of all this integration. You add the uCLinux module. You check a box to select some uCLinux networking features. Besides configuring the software, SOPC Builder could potentially (or perhaps it already does) prompt you to add (and automatically configure) a network interface or serial line (SLIP) interface to the hardware configuration of your system design.

(One tools integration challenge: can you handle arbitrary incremental changes forwards and backwards? For example, if later I remove the network interface, I should see a dialog noting this change is inconsistent with my specification of uCLinux networking features.)

Deja vu
Do you know what SOPC Builder reminds me of? Visual C++ 1.0. Ten years ago, I had the privilege of working on Microsoft C/C++ 7.0, and the first several releases of Visual C++. C7 was an OK compiler, an enabling technology. But it did not begin to solve the biggest problem facing our customers, which was writing decent Windows platform applications. A whole generation of potential Windows developers faced the daunting challenge of learning both the Windows API plumbing details, and more difficult, the overall "application architecture" of a working Windows application. How to architect the application into documents and views, how to do scrolling windows, how to manage coordinate systems, how to do printing and print preview, how to do menus, tool bar buttons, and so forth. And even if they did grok Windows application architecture, the day-to-day grind of composing dialog boxes and then adding object oriented method declarations and bodies to do UI event handling was just too tedious.

In those days, the Visual C++ 1.0 designers added a class library, MFC, to provide a prepackaged, working application architecture, integrated the GUI design tools with the code editing tools, and provided Wizards (new in VC++) to pre-configure a new working project, and (ClassWizard) to do the boilerplate code and event dispatching you always need as you added gadgets to your GUI.

This integration was a great advance. The new project Wizard allowed every Windows developer newbie to get their first Windows app running in one minute. In all, VC++ probably cut several months of time, and at least one throwaway product, out of everyone's Windows development learning curve.

One effect of Visual C++ was that the industry dialog moved away from benchmark wars and "whose compiler optimizes better?" and towards arguing "whose development environment helps you get your work done quickest and easiest?".

In a similar way, SOPC Builder looks like it could eliminate many barriers to getting that first crucial design working, providing a nice "out of box experience" for newbie Altera SOC designers. And with all the system parameter data it manages, it also has the potential to help the system designer with day-to-day integration tasks and bookkeeping.

So perhaps Altera has an early Visual C++ on their hands. It remains to be seen if they can provide a sufficiently deep and still easy to use integrated environment, but the early returns look good.

Murray Disman, ChipCenter: Altera Announces SOPC Builder.

[updated 04/05/02:]
Crista Souza, EB News: Altera casts spotlight on IP integration tool.

Disclaimer: all of my remarks are based upon booth demos and two Altera Nios and Excalibur seminars. I have precious little experience with actually building anything original with Nios or SOPC Builder. Your mileage may vary. That said...

I want you to know that after speaking with them, Altera volunteered to donate a NIOS Kit to me (my company), an unexpected and generous offer which I have gratefully accepted. There was no discussion of quid pro quo as regards this web site. For the record, we have to date been given low cost teaching kits by two other companies, never at my instigation. Except for these donations, we've paid for everything else we use, which includes not inexpensive licenses from Synplicity, Aldec, and Xilinx. Products and ideas are mentioned here only because they are interesting to me or because I think they may be interesting or useful to you, dear reader. And you may as well assume I have invested in one or more programmable logic related companies.

The end of monolithic microprocessors
The keynote speech by Henry Samueli of Broadcom aptly demonstrated that we have left the era of monolithic microprocessors (except perhaps for personal computers) and are now working into the era of highly integrated systems. Take for example, the Broadcom BCM3351 (PDF) VoIP Broadband Gateway, which integrates a myriad functions into a single chip. Oh yes, by the way, there is a tiny MIPS core down in there, somewhere.

John Mashey's greatest hits
John Mashey, then mash (at) mips.com, used to write these amazing, thought provoking, authoritative, long pieces for comp.arch. You could learn an awful lot about real world computer architecture just by reading the discussion threads he contributed to.

This site has some of his best pieces. Do yourself a favor and spend a few minutes reading just the first and the last of these. Wonderful stuff.

Tuesday, March 12, 2002
Paul Glover, Xilinx, in ISD: FPGA Is as Good as its Embedded Plan.
"Floor planning allows designers to control placement and grouping of the embedded processor, the associated intellectual property and their custom logic, thereby simplifying the process of complex system-on-chip development and increasing design performance."

Monday, March 11, 2002
More on Jan's Razor
Re: last week's Jan's Razor piece, there was an interesting discussion on the fpga-cpu list. Reinoud challenged the idea that a cluster of uniprocessors with some shared resources is the best approach, and noted that in a VLIW you can indeed allocate fractional resources to each execution unit.
"The limitations you claim for uniprocessor design exist only if you restrict yourself to scalar processors. For superscalar and VLIW designs, you have a similar freedom to ratio function unit types. For example, a typical superscalar design with three integer issue slots will support multiply on only one of those."
Indeed! My response.

Applications of racks full of FPGA multiprocessors?
Reinoud also asked

"Finally, there may of course be applications where the relatively large amount of control provided by the many instruction streams in your approach (Sea of Cores - SoC?;) are an advantage. The challenge will be in finding those applications... Can you think of any?"
[First, let me note that most of the time, I too would prefer one 1000 MIPS processor to ten 200 MIPS processors or a hundred 50 MIPS processors. That said ...] Read along with me, to the sound of future patents flushing...

Looking at the V600E 60-way MP I described recently, or its logical follow-ons in V2 and so forth, I confess that these are paper tigers, with a lot of integer MIPS, in want of an application. Aggregate "D-MIPS" is not an application!

I suppose my pet hand-wavy application for these concept chip-MPs is lexing and parsing XML and filtering that (and/or parse table construction for same). Let me set the stage for you.

Imagine a future in which "web services" are ubiquitous -- the internet has evolved into a true distributed operating system, a cloud offering services to several billion connected devices. Imagine that the current leading transport candidate for internet RPC, namely SOAP -- (Simple Object Access Protocol, e.g. XML encoded RPC arguments and return values, on an HTTP transport, with interfaces described in WSDL (itself based upon XML Schema)) -- imagine SOAP indeed becomes the standard internet RPC. That's a ton of XML flying around. You will want your routers and firewalls, etc. of the future to filter, classify, route, etc. that XML at wire speed. That's a ton of ASCII lexing, parsing, and filtering. It's trivially parallelizable -- every second a thousand or a million separate HTTP sessions flash past your ports -- and therefore potentially a nice application for a rack full of FPGAs, most FPGAs implementing a 100-way parsing and classification multiprocessor.

Reinoud brought up VLIWs. Rob Finch is designing one and had some questions.

"I want to do a VLIW processor in an FPGA, but I'm not confident what I'm doing. Does anyone have links to sample (educational) VLIW processor designs ? I've searched the net but can't seem to find anything."

My comments (which in retrospect didn't really answer Finch's questions):

VLIW ::= very long instruction word: a machine with one instruction pointer (how's that, Reinoud? :-) ) that selects a long instruction word that issues a plurality of operations to a plurality of function units each cycle.

In a VLIW system, the compiler schedules the multiple operations into instructions; in contrast, in a superscalar RISC, the instruction issue hardware schedules one or more scalar instructions to issue at the same time. You may also see the term LIW vs. VLIW. I think of an LIW as a 2 or 3 issue machine, a VLIW as a 3+ issue machine. In some VLIWs, to improve code density, a variable number of operations can be issued each cycle.

There's a rich literature on VLIWs (that I don't purport to have read), but be sure to see: John R. Ellis, Bulldog: A Compiler for VLIW Architectures.

The IA64 EPIC (explicitly parallel instruction computing) architecture is the most notable VLIW derivative. Several DSPs like the TI C6x are LIWs. A famous early VLIW was the Multiflow Trace family, a supercomputer startup that lived approximately 1984-1990.

The challenging part of a VLIW project is the compiler. Unless you're a glutton for tricky assembly programming, or have a very small and specific problem, it's hardly worth designing the hardware if you don't have a way to compile code to use the parallelism presented by the hardware. Indeed, some LIW design suites (IIRC the Philips LIFE) allow you to pour C code through a compiler and simulator and configure the number and kind of function units to best trade off performance and area against the specific computation kernel you care about.

On to FPGA VLIW CPU issues. Here are some quick comments.

To summarize my opinions: a 2-issue machine is straightforward and even worthwhile, whereas (depending upon the computation kernel) a 4+ issue machine may well bog down in its multiplexers and therefore may not make the best use of FPGA resources.

Instruction fetch and issue
No sweat. Store instructions in a block RAM based instruction cache. Each BRAM in Virtex derived architectures can read 2x16=32 bits each cycle; each BRAM in Virtex-II can read 2x36=72 bits each cycle. Use 2-4 BRAMs and you can easily fetch 100-200 bits of instruction word each cycle.

Keeping the I-cache fed is another matter, left as an exercise.

Register file design
The key design problem is the register file organization. If you are going to issue n operations each cycle, you must fetch up to 2*n operands and retire up to n results each cycle. That implies a register file with up to 2*n read ports and n write ports per cycle.

As a rule of thumb, in a modern LUT based FPGA, if a pipelined ALU operation requires about T ns, then you can read or write to a single port LUT RAM (or read and write to a dual port LUT RAM) in about T/2 ns. (In Virtex-II, T is about 5 ns).

Let's discuss an n=4 operation machine. It is a challenge to retire 4 results to a LUT RAM based RF in T ns. Sure, if the cycle time balloons to 2T, you can sequentially retire all four results in 4*T/2 ns, and read operands in parallel using dual port LUT RAM.

Thus in a fully general organization, where any of the four results can retire to any four registers, "it can't be done" in T ns as defined above.

Instead, consider a partitioned register file design. For instance, imagine a 64-register machine partitioned into 4 banks of 16 registers each. Each cycle, one result can be retired into each bank. That is easily targeted to a LUT RAM implementation. Indeed, you can fix the machine such that each issue slot is associated with one write port on the register file.

We can easily issue four independent instructions such as

  r0=r1+r2; r16=r17+r18; r32=r33+r34; r48=r49+r50
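
The banked scheme can be sketched behaviorally; the bank/slot mapping below is mine, purely for illustration:

```python
# Behavioral model of a 64-register file split into 4 banks of 16.
# Each cycle, each bank accepts at most one write (one write port per
# bank, i.e. per issue slot); reads are unrestricted in this sketch.

class BankedRegFile:
    NBANKS, BANKSIZE = 4, 16

    def __init__(self):
        self.regs = [0] * (self.NBANKS * self.BANKSIZE)

    def read(self, r):
        return self.regs[r]

    def write_cycle(self, writes):
        """writes: list of (reg, value) pairs retired this cycle.
        Raises if two writes target the same bank -- the structural
        hazard the partitioning rules out at schedule time."""
        banks = [r // self.BANKSIZE for r, _ in writes]
        if len(banks) != len(set(banks)):
            raise ValueError("two results retired to one bank in one cycle")
        for r, v in writes:
            self.regs[r] = v

rf = BankedRegFile()
# r0=r1+r2; r16=r17+r18; r32=r33+r34; r48=r49+r50 -- one write per bank: OK
rf.write_cycle([(0, 3), (16, 7), (32, 11), (48, 15)])
# Retiring to r0 and r1 in one cycle would need a second bank-0 write port.
```
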
One $64 design question, properly answerable only with a good compiler at your side, is how to then arrange the read ports. At one extreme, if each of the four banks is entirely isolated from the others, (like a left-brain/right-brain patient with their corpus callosum cut), then you only need 2 read ports on each bank. In this organization, if you need a register or two from another bank, you would typically burn an issue slot in that sourcing bank to read the registers, and perhaps another in the destination bank to save the registers. (Alternately in the destination bank you can directly mux it into the dependent operand register).

At the other extreme, if any of the four banks can with full generality read operands from any of the other three banks, e.g.

  r0=r1+r2;   r16=r3+r4;   r32=r5+r6;   r48=r7+r8    // cycle 1
  r0=r17+r18; r16=r19+r20; r32=r21+r22; r48=r23+r24  // cycle 2
  r0=r33+r34; r16=r35+r36; r32=r37+r38; r48=r39+r40  // cycle 3
  r0=r49+r50; r16=r51+r52; r32=r53+r54; r48=r55+r56  // cycle 4
Then each of the four banks would need to provide 8 read ports, and each of the 8 operand multiplexers (in front of the four ALUs) would need at least a 4-1 mux.

You can of course provide all these ports by replicating or multi-cycling the register file LUT RAMs, but it won't be pretty.

In practice, I believe that the expected degree of cross-bank register reading is limited. Maybe you only need 3 read ports per bank, perhaps 1 1/2 for that bank's ALU and perhaps 1 1/2 for other slots' accesses. Again you need a scheduling compiler to help you make these tradeoffs.

By the way, in my 1994 brain dump with the sketch of the NVLIW1 2-issue LIW, I used exactly this partitioned register file technique, combining two banks of 3r-1w register files. For each issue slot, one operand was required to be a register in that bank; but the other operand could be a register in either bank.

Another option is to combine multibanked registers with some heavily multiported registers. For instance, assume each issue slot can read or write 16 bank registers and 16 shared registers, subject (across all issue slots) to some maximum number of shared register file read and write port accesses each cycle. The designers of the Sun MAJC used a similar approach.

Execution unit
Assume a design with 4 ALUs, one branch unit, one memory port.

Loads/stores: Perhaps you limit generality and only allow loads/stores to issue from a given slot (and then shuttle those results to other slot register banks using the above multiported reg file description). The general alternative, allowing any slot to issue loads/stores, requires you to mux any ALU output to be the effective address, and a per-bank mux to allow the MDR input to be the result.

Result forwarding (register file bypass): if an operation in slot i, cycle 1, is permitted with full generality, to consume the result of an operation from slot j, cycle 0, then you need n copies of an n:1 result forwarding mux. Again, this is very expensive, so you will be sorely tempted to reduce the generality of result forwarding, or eliminate it entirely. Again, this is a design tradeoff your compiler must help you refine.

Control flow
You want to reduce the number of branches. In average C code you can expect a branch every five (rule of thumb) generic RISC instructions or so. In a four issue machine you will spend your life branching.

Instead, you will want to use some form of predicated execution. Some computations will set predicate bits that represent the outcomes of conditional tests. Then other instructions' execution (result write-back and other side effects) will be predicated upon these specified predicate bits.

In this organization, it seems reasonable to allow any issue slot to establish predicates that can be used by any other issue slot; but for simplicity you will only need and want one, or perhaps two, of the issue slots to be able to issue (predicated) branch and jump instructions.
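
The predicate mechanism can be sketched as follows; the two instruction forms here are invented for illustration, not taken from any particular ISA:

```python
# Minimal predicated-execution sketch: a compare op sets a predicate
# bit; other ops retire their result only if their guarding predicate
# is set, so no branch is needed for simple conditionals.

regs = {"r1": 5, "r2": 9, "r3": 0}
preds = {"p0": False}

def setp_lt(p, a, b):            # p = (a < b)
    preds[p] = regs[a] < regs[b]

def add_if(p, rd, a, b):         # if (p) rd = a + b  -- writeback gated
    if preds[p]:
        regs[rd] = regs[a] + regs[b]

setp_lt("p0", "r1", "r2")        # p0 = (5 < 9) = True
add_if("p0", "r3", "r1", "r2")   # executes: r3 = 5 + 9 = 14
```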

You will want a compiler to help you design the specific details of the predication system.

(By the way, predicate bit registers are not the only approach to predicated execution...)

In summary
The sky's the limit. It is easy to build an 8- or 16- or 32-issue VLIW in an FPGA. That's just stamping out more instances of the i-cache + execution unit core slice. Whether the resulting machine runs much faster than a much simpler 2-issue LIW is critically dependent upon your compiler technology.

Aside: WM
As an aside, I'll just briefly mention Wulf's WM, a simple architecture with many interesting features, but of particular interest here, each execute instruction is of the form

  rd = (rs1 op rs2) op rs3 ,
an explicitly parallel representation that affords some additional parallelism over the canonical scalar RISC, but which does not increase the number of write ports required on the register file.
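
A toy evaluator shows why the form helps: two operations chain per instruction, yet only one result retires, so the register file still needs only one write port per issue slot.

```python
# Sketch of the WM execute form rd = (rs1 op1 rs2) op2 rs3.
# Operator set and encoding are illustrative, not WM's actual ISA.
import operator

OPS = {"+": operator.add, "-": operator.sub, "&": operator.and_}

def wm_execute(rs1, op1, rs2, op2, rs3):
    """Chain two ALU operations, retiring a single result."""
    return OPS[op2](OPS[op1](rs1, rs2), rs3)

r = wm_execute(6, "+", 2, "-", 3)   # (6 + 2) - 3 = 5
```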

Embedded Systems Conference
I'm planning to head down to ESC tomorrow.

On Tuesday, I'm looking forward to the panel, hosted by Jim Turley, titled Drawing the Line Between Hardware and Software (scroll down).

"Two trends seem obvious, yet both are at odds with each other. As embedded systems become more complex, they require more programming. Software adds features, functions, and differentiation. But at the same time, custom ASIC and SoC hardware are on the rise. Custom chips are becoming far more common, replacing complex algorithms with special-purpose hardware. Going forward, which will be the best way to add value to a product: generic hardware with lots of software, or specialized ASICs with little code overhead, or the entirely programmable platforms such as FPGAs or reconfigurable logic? Is it easier, cheaper, and faster to program gates or bytes? This panel will explore the blurring distinction between hardware and software."
Speaking of which: Jim Turley, Embedded Systems Programming: The Death of Hardware Engineering.

This ought to be fun: Challenge the Expert.

"Tim Allen, Altera's chief architect of the Nios embedded processor will be accepting challenges on the amount of time it takes to build a complete system on a programmable chip (SOPC)."
That reminds me.

As early as ten years ago, in the software development world, at conferences like SD, we have had the programming equivalent of a bake-off, a live programming contest in which teams of developers (e.g. best of the best from Borland, Microsoft, and Symantec) would be given one or more problems and would race each other to produce solutions as quickly as possible, using their dev tools, libraries, and so forth. Here for example is this year's SD Developer Bowl. [updated 3/26/02]: No, oops, that sounds like a trivia contest. Well, take my word for it. Those old bake-offs were great.

Wouldn't it be pleasant to see engineers from Xilinx, Altera, Celoxica, Triscend, Atmel, Cypress, Quicklogic, Chameleon, etc. up on stage, racing against time to build solutions with their combination of soft and hard cores, dev tools, system builders, and what have you? One slow place-and-route step could ruin your whole day.

Tuesday, March 5, 2002
Multiprocessors, Jan's Razor, resource sharing, and all that
Continuing our multiprocessor (MP) discussion of Saturday.

Today's theme: Compared to uniprocessor design, building a large-N (N>20) multiprocessor out of compact, austere soft processor core + BRAM processing elements (PEs) is challenging and ... liberating! Liberating in the sense that the hard trade-offs are easier to make due to Jan's Razor (to coin a phrase), the principle that once you achieve a minimalist instruction set that covers your computation, any additional functionality per processor is usually unprofitable and therefore rejected. This drive to leave things out is moderated by the new opportunity, unique to MPs, to apply resource factoring to share instances of critical but relatively-infrequently-used resources between PEs.

Permit me to explain.

Uniprocessor throughput
The greater part of processor and computer system design is deciding what features to put in, and what features to leave out. You include a feature if its benefits outweigh its costs. Usually these costs and benefits are measured in terms of throughput, latency, area, power, complexity, and so forth.

Designers typically focus on throughput, that is, the reciprocal of execution time. In a uniprocessor, the execution time of a computation is the product of no. of instructions (I), times the average no. of cycles per instruction (C/I or CPI), times the cycle time (T). You can reduce the execution time by:

  • reducing I: provide more powerful, more expressive instructions and/or function units that "cover" the computation in fewer instructions;
  • reducing CPI: reduce penalties, interlocks, overheads, and latencies (branches, loads); or
  • reducing T: pipeline deeper, and/or replicate resources instead of sharing them.
The trick, of course, is to reduce some of I, CPI, or T, without increasing the other terms! For example, if you add hardware to do sign-extending load byte instructions (reducing I) you must be careful not to increase the cycle time T.

Multiprocessor throughput
But in a multiprocessor running a parallelizable computation, we can spread the work out over NPE processing elements, so the execution time equation (in ideal circumstances) approximates

time = I * CPI * T / NPE
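
The two throughput equations can be put side by side numerically; the instruction count, CPI, cycle time, and PE count below are illustrative, not measurements.

```python
# Idealized execution-time model: time = I * CPI * T / NPE,
# which reduces to the uniprocessor case when NPE = 1.

def exec_time(I, CPI, T_ns, NPE=1):
    """Execution time in ns for I instructions at CPI cycles each,
    cycle time T_ns, work spread (ideally) over NPE elements."""
    return I * CPI * T_ns / NPE

uni = exec_time(I=1_000_000, CPI=1.5, T_ns=10)          # uniprocessor
mp  = exec_time(I=1_000_000, CPI=1.5, T_ns=10, NPE=20)  # ideal 20-way MP

print(uni / mp)  # ideal speedup equals NPE: 20.0
```
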
So here is how this manifests itself at design time, and where "liberating" comes into the picture.

Say you have a very austere PE CPU. This processor provides enough primitive operations to "cover" the integer C operator repertoire, in one or more instructions. For example, like the GR CPUs, your machine may provide AND and XOR natively, but implement OR through A OR B = (A AND B) XOR A XOR B. Or, it may have single-bit shift left (ADD) and right (SRx), but lack multi-bit shift instructions. Or, it may implement multiply as a series of adds.
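
The OR identity above is easy to verify exhaustively:

```python
# Verify A OR B == (A AND B) XOR A XOR B for all 1-bit operands,
# and hence (bitwise) for operands of any width.
for a in (0, 1):
    for b in (0, 1):
        assert (a | b) == ((a & b) ^ a ^ b)

# The same identity applied bitwise on wider words:
a, b = 0b1100, 0b1010
assert (a | b) == ((a & b) ^ a ^ b)  # both sides are 0b1110
```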

Now, as a designer, imagine you are staring at a benchmark result, and you see that your benchmark would run 5% faster if your PE had a hardware (multi-bit) barrel shifter function unit. You know that a barrel shifter would add (say) 64 LUTs to your 180-LUT processor. Maybe you even do a little experiment, and try adding the barrel shifter to your design, to verify that the larger CPU datapath does not inadvertently impact cycle time. So, do you add the feature or leave it out?

In a "pedal to the metal" uniprocessor, you might indeed choose to grow your processor by 30% to get a 5% speedup. Especially if that change would help to keep your product ahead of your rival's offering.

But in a chip multiprocessor, where the entire die is tiled with PEs, if you make your PE 30% larger, this could well reduce NPE by 30%, or more. (Or it might not. It's a nonlinear function and some other resource, like BRAM ports, might be the critical constraint.)
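A back-of-envelope model makes the tradeoff concrete. The die LUT budget below is hypothetical; the PE sizes are the ones from the scenario above:

```python
# Aggregate throughput of a die tiled with identical PEs.
def aggregate_throughput(die_luts, pe_luts, pe_speed):
    n_pe = die_luts // pe_luts      # whole PEs that fit on the die
    return n_pe * pe_speed

DIE_LUTS = 13_824                   # hypothetical LUT budget

austere = aggregate_throughput(DIE_LUTS, 180, 1.00)        # 76 PEs
with_bs = aggregate_throughput(DIE_LUTS, 180 + 64, 1.05)   # 56 PEs

assert austere > with_bs   # the smaller, slower PE wins in aggregate
```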

When n% larger PEs implies n% fewer PEs
Against this backdrop of "n% larger PEs implies n% fewer PEs", many "conventional wisdom" must-have features, such as barrel shifters, multipliers, perhaps even byte load/store alignment logic, are indefensible.

The problem for our friend the barrel shifter is this. Even if it dramatically speeds up a bitfield extract operation, from 10 cycles down to 1 cycle, there just aren't (dynamically) that many shift instructions in the instruction mix. So the total speedup (as posited above) was only 5%. Compare the barrel shifter multiplexer LUTs to the other parts of the CPU. The instruction decoder, the register file, the ALU, the PC incrementer, the instruction RAM port -- these are used nearly every darn cycle. Putting those resources in hardware is hardware well spent. Lots of utilization. But our friend the barrel shifter would only be used maybe once every 20 instructions.
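An Amdahl-style back-of-envelope shows why a rare-but-slow operation yields only a small overall speedup. The base CPI below is an assumption of mine, chosen so the result lands near the 5% posited above:

```python
# Per 20 instructions: 19 ordinary instructions plus one bitfield
# extract, which costs 10 cycles without a barrel shifter, 1 with one.
def cycles_per_20(base_cpi, shift_cycles):
    return 19 * base_cpi + shift_cycles

BASE_CPI = 9.5   # assumed CPI of an austere multi-cycle PE

without = cycles_per_20(BASE_CPI, 10)   # 190.5 cycles
with_bs = cycles_per_20(BASE_CPI, 1)    # 181.5 cycles
speedup = without / with_bs

assert 1.04 < speedup < 1.06   # roughly the 5% posited above
```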

Overall, the multiprocessor will have higher aggregate throughput if you leave out these occasional-use features and thereby build more PEs per die. Indulge me to name this principle Jan's Razor.

Jan's Razor: In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.

Fractional function units, resource factoring and sharing
Yet as we design multiprocessor systems-on-chip, we have a new flexibility, and the barrel shifters of the world get a reprieve. When designing a uniprocessor, you are essentially solving an integer (as opposed to linear) programming problem: you must provide some whole number -- 0, 1, 2, etc. -- of ALUs, shifters, memory ports, and so forth. But in the context of a multiprocessor, you can provide fractional resources to each processing element.

In our scenario above, we really wanted to add, say, 1/10 of a barrel shifter to our processor -- something that ran at full speed but only for one cycle in every ten, on average, at a cost of 1/10 of the area of a full barrel shifter. You can't do that in a uniprocessor. You can do that in a multiprocessor.

Imagine you have a cluster of ten tiny, austere PEs, and assume these PEs already have some kind of shared access bus, for example for access to cluster-shared-memory. To the cluster, add a shared barrel shifter, accessed over this same shared access bus. Now, on average, each PE can issue a fast shift instruction every tenth instruction, for a cost per PE of only 1/10th of a barrel shifter.

Continuing with this example, imagine the ten processors are running a computation kernel that needs to issue a multiply about every five instructions. Here we add two multipliers to the cluster. Statistically the multipliers will run hot (fully utilized), and each PE will run about as fast as if it had a dedicated multiplier, but at a cost of only about 1/5 of a multiplier per PE.
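The sizing rule implicit in both examples: if each PE issues a given operation once every `period` instructions, a cluster of `n_pe` PEs needs about `ceil(n_pe / period)` shared full-speed units. A sketch (my formulation of the arithmetic above):

```python
import math

# Shared function units needed to meet a cluster's average demand,
# assuming each PE issues one op every `period` instructions.
def units_needed(n_pe, period):
    return math.ceil(n_pe / period)

assert units_needed(10, 10) == 1   # shifter: 1/10 of a shifter per PE
assert units_needed(10, 5)  == 2   # multiplier: 1/5 per PE, running hot
```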

This fractional function unit approach also provides a new dimension of design flexibility. Depending upon the computational requirements (e.g. the inner loops you find yourself running today), one may be able to add arbitrary numbers of new function units and coprocessors to a cluster without disturbing the carefully hand-tuned processing element tiles themselves.

Rethinking multiprocessor architecture
Once you start really thinking about resource factoring and sharing across PEs in a multiprocessor, it turns a lot of conventional wisdom on its head, and you look at old problems in new ways.

In particular, you start inventorying the parts of a CPU that are not used every cycle, and you start asking yourself whether each processing element deserves an integral or, rather, a fractional instance of that resource.

Such occasional-use resources include the barrel shifter and multiplier, of course, but also load/store byte align hardware, i-cache miss handling logic, d-cache tag checking, d-cache miss handling, and (most significantly) an MMU.

Let me address the last few of these. Loads and stores happen every few cycles. Maybe each PE doesn't deserve more than one third of a data cache! ... Perhaps it is not unreasonable to have a single d-cache shared among a small cluster of processors.

The MMU is another example. Even if each PE has its own private i-cache and d-cache, if these caches are virtually indexed and tagged, then only on a cache miss do we need to do an MMU address translation and access validation. If the average processor has a cache miss (say) every twenty cycles, and if an MMU access can issue each cycle, then it is not unreasonable to consider sharing a single MMU across ten PEs in a cluster.
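Using the numbers assumed above, the cluster-wide demand on a shared MMU is easily computed:

```python
# Each of 10 PEs misses its (virtually indexed and tagged) caches
# about every 20 cycles; the MMU accepts one translation per cycle.
n_pe = 10
miss_interval = 20                # cycles between misses, per PE
demand = n_pe / miss_interval     # translations requested per cycle

assert demand == 0.5              # one shared MMU is only half subscribed
```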

In summary
It is wrong to design chip multiprocessors by tiling together some number of instances of die-shrunk monolithic uniprocessor cores, each with a full complement of mostly-idling function units.

A more thoughtful approach, achieving fractional function units by careful resource sharing, would appear to yield significant PE area reductions, maximizing overall multiprocessor throughput, and providing the system designer with a new dimension of design flexibility.

Monday, March 4, 2002
Xilinx announces it is now shipping Virtex-II Pro.
"High-volume pricing (25,000 units) in 2004 for the XC2VP4, XC2VP7, and XC2VP20 devices is $120, $180, and $525, respectively."
V2Pro is a blockbuster product, combining the blazing fast Virtex-II programmable logic fabric with 0-4 embedded PowerPC 405s and 4-16 embedded 3.125 Gb/s serial links, and also (approximately) doubling the average number of block RAMs and multipliers per LUT (compared with Virtex-II). Virtex-II Pro Handbook.

Our coverage one year ago.

Here are the new devices in context.

  Device   Array  KLUTs BRAMs Lnks PPCs BR/KL
  2V40      8x 8    .5     4    0    0   8.0
  2V80     16x 8     1     8    0    0   8.0
  2VP2     16x22     3    12    4    0   4.0
  2V250    24x16     3    24    0    0   8.0
  2VP4     40x22     6    28    4    1   4.7
  2V500    32x24     6    32    0    0   5.3
  2VP7     40x34    10    44    8    1   4.4
  2V1000   40x32    10    40    0    0   4.0
  2V1500   48x40    15    48    0    0   3.2
  2VP20    56x46    19    88    8    2   4.6
  2V2000   56x48    22    56    0    0   2.5
  2V3000   64x56    29    96    0    0   3.3
  2VP50    88x70    45   216   16    4   4.8
  2V4000   80x72    46   120    0    0   2.6
  2V6000 112x104    67   144    0    0   2.1

  Array: base CLB array (rows x cols)
  KLUTs: thousands of LUTs (rounded to nearest thousand)
  BRAMs: no. of 18Kb dual port block RAMs
  Lnks:  no. of 3.125 Gb/s serial links
  PPCs:  no. of PowerPC 405 embedded hard CPUs
  BR/KL: block RAMs per thousand LUTs
[Source: analysis based upon data from Virtex-II and Virtex-II Pro data sheets]
(See also Marketing gates redux.)

Notice that for Virtex-II Pro, the number of LUTs is not simply the CLB rows times columns times eight, since some CLBs are displaced by "immersed" IP.

With all the excitement about the embedded PowerPCs and gigabit links, analysts may overlook the huge improvement that these newer devices have relative to Virtex-II: an approximate doubling of the number of BRAMs per LUT in the larger devices. (The 2VP50 has more than twice the BRAM per LUT of the 2V6000.) As noted last year:

"I had been disappointed that the larger Virtex-II devices seemed relatively block-RAM-port poor. For example, the Virtex-II data sheet states the 120-CLB-column '2V10000 will have only 6 columns of block RAMs, or on average, only one column of block RAMs per 20 columns of CLBs."

"But at 6 CLB columns per block RAM column, this monster FPGA assuages this concern -- and is quite reminiscent of the generously-RAM-endowed Virtex-EM family."

Indeed Virtex-II Pro devices do appear to have about one BRAM per 4-row by 6-column tile of CLBs, and that is a near-perfect match for a multiprocessor made of tiles of processing elements, one per 24 CLBs plus one BRAM.
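A quick arithmetic check (mine) of that tile claim against the BR/KL column in the table above:

```python
# One BRAM per 4x6 tile of CLBs; a Virtex-II CLB holds 8 LUTs.
clbs_per_bram = 4 * 6                 # 24 CLBs per BRAM
luts_per_bram = clbs_per_bram * 8     # 192 LUTs per BRAM
brams_per_kluts = 1000 / luts_per_bram

# ~5.2 BRAMs/KLUT, in the ballpark of the 4.4-4.8 V2Pro figures above
assert 4.0 < brams_per_kluts < 6.0
```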

Anthony Cataldo, EE Times: Xilinx enhances FPGAs with embedded PowerPCs.

Crista Souza, EBN: Xilinx counters Altera with new FPGA offering.

Loring Wirbel, Communications System Design: Xilinx Tiles PowerPC and I/O on Virtex-II.

Anthony Clark, EE Times, UK: Xilinx begins sampling PowerPC-based FPGA.

Murray Disman, ChipCenter: Virtex-II Pro Arrives. Includes a helpful comparison of the Virtex-II Pro and Excalibur embedded processor and bus architecture choices.

More later.

Saturday, March 2, 2002
Large-N Chip Multiprocessors in FPGAs
On the fpga-cpu list an anonymous correspondent asked "Does anybody try to design multiprocessor in FPGA?"

Emphatic yes!

Some links:

60 RISC CPUs in one V600E
Last spring I designed (through to place-and-route) 60 16-bit pipelined RISC gr1040i's in one XCV600E, and also 36 32-bit pipelined gr1050i's in one XCV600E. Sorry, a trip got in the way of finishing the project and writing up the results.


gr1040i  16-bit, 8Rx6C CLBs,  2 BRAM ports  12.5 ns (80 MHz)
gr1050i  32-bit, 16Rx6C CLBs, 4 BRAM ports  15 ns (67 MHz)
         (alt. 8Rx10C CLBs)
The above cycle times get derated by 1-2 ns when you build a big array of processing elements -- PAR gets slow and stupid on big chips, even though the above processors' datapaths were completely hand technology-mapped and floorplanned.

For the 60-way MP, each processor got a private 512 B BRAM for its code and data, and each cluster of 5 processors on a half-row shared a sixth 512 B BRAM for shared memory. The unfinished part of the design was a switch so that any processor in cluster[i] could access the shared memory BRAM of cluster[j].


  bB PA0 PA1 BB PA2 PA3 BB PA4  F6 F6  PG4 BB PG3 PG2 BB PG1 PG0 Bb
  bB PB0 PB1 BB PB2 PB3 BB PB4  F6 F6  PH4 BB PH3 PH2 BB PH1 PH0 Bb
  bB PC0 PC1 BB PC2 PC3 BB PC4  F6 F6  PI4 BB PI3 PI2 BB PI1 PI0 Bb
  bB PD0 PD1 BB PD2 PD3 BB PD4  F6 F6  PJ4 BB PJ3 PJ2 BB PJ1 PJ0 Bb
  bB PE0 PE1 BB PE2 PE3 BB PE4  F6 F6  PK4 BB PK3 PK2 BB PK1 PK0 Bb
  bB PF0 PF1 BB PF2 PF3 BB PF4  F6 F6  PL4 BB PL3 PL2 BB PL1 PL0 Bb
  • Recall an XCV600E is a 48R x 72C FPGA with 72 512 B BRAMs (arranged 12R x 6C).

  • P[c][i]: 16-bit pipelined RISC processing element in cluster c, column i; each uses 8Rx6C of CLBs

  • F6: 8Rx6C of "free" unused CLBs (pending the inter-cluster switch interconnect design).

  • B: one 512 B block RAM used for private instruction and data memory for one processing element

  • b: one 512 B block RAM used for shared data memory for one cluster of 5 PEs on that half-row (e.g. b at top left shared by PA[0:4]).

The floorplan (and the interconnect design, etc.) were completely determined by the V600E's arrangement of BRAMs. For compact-area soft cores, the BRAM number and placement is the greatest determinant of how many processors you can implement. The V600E is a sweet spot, a lovely balance of CLBs and BRAMs.
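The floorplan's resource accounting works out exactly. Here is my arithmetic, from the figures above:

```python
# XCV600E: 48R x 72C CLBs, 72 block RAMs.
rows, cols = 48, 72
pe_clbs = 8 * 6          # each gr1040i PE is 8R x 6C of CLBs
clusters = 12            # 6 PE rows x 2 half-rows
pes = clusters * 5       # 5 PEs per cluster
f6_blocks = 12           # two free 8Rx6C blocks per PE row

assert pes == 60                                   # the 60-way MP
assert pes + clusters == 72                        # private + shared BRAMs
assert (pes + f6_blocks) * pe_clbs == rows * cols  # CLBs fully tiled
```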

The programming model was to have been one thread per PE with separate private memory and shared memory address spaces -- not message passing, etc. (I believe shared memory multiprocessors are the easiest targets for parallelizing dusty deck C code, even though they're harder to scale up into the thousands of processors.)

"Parallelism: the new imperative." -- Jim Gray, Microsoft

FPGA CPU News, Vol. 3, No. 3
Back issues: Vol.3 (2002): Jan Feb; Vol. 2 (2001): Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.

Copyright © 2000-2002, Gray Research LLC. All rights reserved.
Last updated: Apr 18 2002