fpgacpu.org - FPGA Multis and Superscalars

FPGA Multis and Superscalars

Home

Inner loop datapaths >>
<< Multiprocessors

Usenet Postings
  By Subject
  By Date

FPGA CPUs
  Why FPGA CPUs?
  Homebuilt processors
  Altera, Xilinx Announce
  Soft cores
  Porting lcc
  32-bit RISC CPU
  Superscalar FPGA CPUs
  Java processors
  Forth processors
  Reimplementing Alto
  Transputers
  FPGA CPU Speeds
  Synthesized CPUs
  Register files
  Register files (2)
  Floating point
  Using block RAM
  Flex10K CPUs
  Flex10KE CPUs

Multiprocessors
  Multis and fast unis
  Inner loop datapaths
  Supercomputers

Systems-on-a-Chip
  SoC On-Chip Buses
  On-chip Memory
  VGA controller
  Small footprints

CNets
  CNets and Datapaths
  Generators vs. synthesis

FPGAs vs. Processors
  CPUs vs. FPGAs
  Emulating FPGAs
  FPGAs as coprocessors
  Regexps in FPGAs
  Life in an FPGA
  Maximum element

Miscellaneous
  Floorplanning
  Pushing on a rope
  Virtex speculation
  Rambus for FPGAs
  3-D rendering
  LFSR Design

Subject: Re: FPGA multiprocessors => vs. uniprocessors
Date: 07 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

Jack Greenbaum <spamfilt-@greenbaum.us.com> wrote in article
<ljg1qehm38.fsf@greenbaum.us.com>...
> "Jan Gray" <jsgray-@acm.org.nospam> writes:
> > Assuming careful floorplanning, it should be possible to place six
32-bit
> > processor tiles, or twelve 16-bit processor tiles, in a single 56x56
> > XC4085XL with space left over for interprocessor interconnect. 
> 
> You might be interested in another view of single chip multiprocessors.
> 
> Patt, et. al. "One Billion Transistors, One Uniprocessor, One Chip",
> IEEE Computer, Sept 1997, pp 51-57.
> 
> They argue against multiple processors on a single chip, because it
> makes what is already an I/O bound system even worse. Just because you
> can put multiple processors on a dies doesn't mean you can feed them
> instructions and data.
> 
> Jack Greenbaum -- jack at greenbaum.us.com
> 

This month's IEEE Computer was certainly a blockbuster.
With all respect to Dr. Patt and the U.Mich guys, whose
earlier HPS ideas are proven and shipped in nearly every
high-end microprocessor, and whose present paper
is most intriguing, I'm afraid I was somewhat more convinced
by the Hammond et al paper "A Single-Chip Multiprocessor"
in the same issue.  In particular, I guess I don't yet
believe that branch predictors can get as good as they
say, especially on real world spaghetti object-oriented code.
But who knows where compiler technology, especially dynamic
recompilation, will take us in the years to come.

Also, Patt et al.'s statement is:
	"In our view, if the design point is performance, the
	implementation of choice is a very high-performance
	uniprocessor on each chip, with chips interconnected
	to form a shared memory multiprocessor."

This does not appear to be a statement about throughput or
about price/performance.  Indeed, today, processor implementations
are usually compared using single threaded benchmarks.  But
I think this will change in the next decade.  For example, my
real job is writing infrastructure for transaction processing
for distributed COM objects, and we can certainly keep many
many threads busy with useful work.
  (//www.microsoft.com/transaction)
  (//www.microsoft.com/backoffice/scalability/billion.htm)

Anyway, back to my posting about multiprocessor FPGAs.
I wrote it not because I seriously think it's the best way
to use FPGAs, but because I thought it remarkable that
FPGAs are now large enough to host half a dozen or a dozen
modest simple RISC processors.

And although I pointed out it was a paper design, I did
account for providing adequate instruction/data bandwidth
to each processor.  In the case of the 6-way 32-bit RISC
multiprocessor, there were 6 small I-caches and six separate
memory ports to DRAM (could be SRAM, I suppose).
The XC4085XL has enough IOBs and pins to do this.

(Certainly if Xilinx hurries up and licenses the RAMBUS interface
there will be pins and bandwidth galore.)

Also, FPGA RISC processors are of course relatively slow.
A straightforward 3- or 4-stage pipelined implementation
of a single-issue machine should go 40 MHz or so.
A more deeply pipelined microarch. could approach 80 MHz.
(In today's XC4000XLs we are unlikely to significantly
exceed 100 MHz because that is approximately the
register file (distributed RAM block) cycle rate.)

Their slowness means their bandwidth needs are more modest.
And their issue penalty for cache misses is also less severe.
So you don't need big caches, just a fairly efficient
memory cycle -- SRAM, page mode EDO, or open-bank
SDRAM accesses.

Now, the multiprocessor I originally described had (say)
six separate memory banks, each private to a processor.
A more useful and more readily programmed machine
would provide a single shared memory.  I'm thinking
of a design where (say) 4 processors contend for 2 or 4
address-interleaved banks of memory.  You use the
center part of the FPGA as a 4x2 or maybe 4x4 x32
crossbar switch so that several processors
can simultaneously access separate memory banks.

Of course, I haven't simulated this, so don't take it too
seriously.

Finally, let's discuss where we can go with a fast uniprocessor
on a large FPGA.  I have given this some thought over the
years.

One big challenge is register file design.  The custom guys don't
blink (much) at producing 1-cycle 8-read 4-write register files for
their wide issue superscalars.  But given today's Xilinx distributed
RAMs this is unachievable.  In my processors I do 1-cycle
2-read 1-write register files using 2 copies of the 1-read 1-write
distributed RAM.  But doing 1-cycle fully arbitrary n-read 2-write
reg files is damn hard.  Instead it is better to move to an LIW
ISA with multiple independent register files.  For instance a
2-issue LIW would have instructions like:
	op1 dest1 src1a src1b    op2 dest2 src2a src2b
where dest1 is retired into reg file 1 and dest2 into reg file 2.
With a few more copies of reg file 1 and reg file 2 we can allow
some or all the source operands to read from either reg file.
(For instance we can build dual 3-read 1-write reg files with
six words of distributed RAM.)

Another challenge is ALU delay.  Including clock-to-out,
ALU latency, and setup time and interconnect, etc.,
this can be >20 ns.  To speed this up requires either
putting a register in the middle of the adder (not good) or
duplicate adders for even/odd cycles (good) and a two cycle
adder delay.  Using this technique you can probably clock
a processor at 66 or 80 MHz.

Put these ideas together and one can certainly see a
66 MHz 2 issue LIW in a XC4013E and perhaps a 4 issue
VLIW in a XC4036XL.  But for the latter you need a very
good optimizing compiler.

Cheers,
Jan Gray

Subject: Re: FPGA multiprocessors => vs. uniprocessors
Date: 07 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

Me again.  I wrote:
> Put these ideas together and one can certainly see a
> 66 MHz 2 issue LIW in a XC4013E and perhaps a 4 issue
> VLIW in a XC4036XL.  But for the latter you need a very
> good optimizing compiler.

It can be done, but I think I chose the wrong parts.  First,
for this speed we need each half LIW processor
to get an I-cache slice or at least a loop buffer.

This widens the datapath from 2x16x13 (say) to
2x16x20 CLBs and forces you up into the larger
XC4000XL parts.

Jan Gray