fpgacpu.org - On FPGAs as PC Coprocessors

On FPGAs as PC Coprocessors

Newsgroups: comp.arch.fpga
Subject: On FPGAs as PC coprocessors
Date: 6 May 1996 22:12:11 GMT

One of the "way out speculation" questions asked at FCCM 96 (IEEE 
Symposium on FPGAs for Custom Computing Machines) was "when, if ever, 
will an FPGA coprocessor ship on every PC motherboard?"

Ignoring the daunting language and interface standards issues, and just 
looking at current hardware approaches to FPGA "coprocessors", the 
answer must be "not any time soon".

Consider that today's power user's CPU (such as a 200 MHz Pentium Pro 
or a 400 MHz Alpha 21164A) *running out of L1 cache* issues a peak of 
three or four instructions per 5 ns or 2.5 ns clock, respectively.  Also consider that the 
current "low latency high bandwidth" approach to FPGA coprocessor 
integration is to hang the FPGA on the PCI bus.

In this scenario, the kinds of quasi-general purpose computing problems 
that an FPGA coprocessor can usefully assist with are quite limited.  
Issuing a write and then a read back operation to the FPGA could easily 
take 10 PCI bus cycles (300 ns), assuming no PCI bus contention.  In 
that time, assuming hand crafted code (itself less effort than an FCCM 
design), a Pentium Pro could issue as many as 180 instructions; the Alpha, 480 
64-bit instructions. Future versions of such processors will soon be 
doing VIS- or MMX-like limited bytewise 8+-way SIMD parallelism, and 
eventually superscalar versions of same.  Such designs might issue 
between 500 (*) and 4000 (**) hand coded packed byte operations while 
that single 300 ns FPGA write/read is still in progress.
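The figures in this paragraph, and in footnotes (*) and (**), follow from simple clock arithmetic. Here is a minimal sketch of that arithmetic; the function name and parameters are illustrative only, and it assumes every clock issues at the peak rate with no stalls.

```python
# Rough peak-issue arithmetic for one 300 ns PCI write/read round trip.
# Assumption: the CPU issues at its peak rate on every clock (no stalls).

def ops_during(latency_ns, clock_mhz, issue_width, bytes_per_op=1):
    """Peak operations issued while one I/O round trip is in flight."""
    clocks = latency_ns * clock_mhz // 1000  # ns * MHz / 1000 = clocks
    return clocks * issue_width * bytes_per_op

PCI_ROUND_TRIP_NS = 300  # 10 PCI cycles at 33 MHz, no contention

ppro_insns = ops_during(PCI_ROUND_TRIP_NS, 200, 3)         # 3-issue PPro
alpha_insns = ops_during(PCI_ROUND_TRIP_NS, 400, 4)        # 4-issue Alpha
ppro_mmx_ops = ops_during(PCI_ROUND_TRIP_NS, 200, 1, 8)    # footnote (*)
alpha_simd_ops = ops_during(PCI_ROUND_TRIP_NS, 400, 4, 8)  # footnote (**)

print(ppro_insns, alpha_insns, ppro_mmx_ops, alpha_simd_ops)
```

The last two values come to 480 and 3840 byte operations, which the post rounds to 500 and 4000.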

(At peak speeds like 400e6 clock/s * 4 issue/clock * 8 byte ops/issue, 
i.e. about 12.8e9 byte ops/s, you have to agree that superscalar micros 
with bytewise SIMD are going to displace many FPGA applications on PC 
and workstation platforms.)

So as long as FPGAs are attached to comparatively glacial I/O buses 
-- including 32-bit 33 MHz PCI -- it seems unlikely they will be of 
much use in general purpose PC processor acceleration.  Sure, for 
applications such as cryptography, image and signal processing, they 
might be a win (***), given a semi-autonomous problem which either fits 
in the FPGA and local storage, or which can employ DMA to stream data 
into or through the FPGA without much CPU intervention or management.

Of course, the PCI ASIC crowd has the same latency problems, but they 
don’t share FCCM aspirations of accelerating general purpose computing; 
rather, they focus on the special purpose applications mentioned 
above.

Five times better latency and four times better bandwidth could be 
achieved if FPGA vendors invent a way to directly connect their parts 
to the Pentium Pro external bus, as a peer of the memory/bus 
controller.  A custom, dedicated Pentium Pro interface would probably 
be required, since FPGA configurable logic would be too slow and 
electrically incompatible.
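The "four times better bandwidth" figure can be checked with a back-of-envelope peak-rate calculation. The sketch below is illustrative only: the 64-bit data width of the Pentium Pro external bus is an assumption not stated in this posting, the 66 MHz clock is taken from the Pentium bus figure quoted later, and real buses sustain less than these peaks. The "five times better latency" estimate is not derived here.

```python
# Peak bus bandwidth = data width in bytes * bus clock.
# Assumption (not from the posting): PPro external bus is 64 bits wide.

def peak_mb_per_s(width_bits, clock_mhz):
    """Peak transfer rate in MB/s (1 MB = 1e6 bytes), one transfer per clock."""
    return width_bits // 8 * clock_mhz

pci_peak = peak_mb_per_s(32, 33)       # 32-bit 33 MHz PCI
ppro_bus_peak = peak_mb_per_s(64, 66)  # assumed 64-bit 66 MHz PPro bus

print(pci_peak, ppro_bus_peak, ppro_bus_peak / pci_peak)
```

Under these assumptions the ratio of peak rates is exactly four.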

This could be a good volume business, and not quite the moving target 
it might appear -- I expect the PPro external bus to be just as 
ubiquitous and as long lived as have been the 486 and Pentium buses.  
Someone could make a plug in card which sits in the PPro ZIF socket and 
which hosts a PPro and its FPGA(s).

Alternatively, the FPGA coprocessor could be attached on the new advanced 
graphics memory port, or whatever it is to be called, that will be 
available in future Intel memory/PCI controller chipsets.

One might argue that Xilinx made a big mistake in not offering a 
version of the XC6200 with a dedicated 66 MHz Pentium external bus 
interface -- after all, it is by far the most popular and most supported 
processor interface for the most lucrative general computing market.

If any vendor does pursue this idea, I would appreciate a couple of 
sample parts. :-)

--
(*) ~500 ops in 300 ns: forthcoming 200 MHz PPro with MMX: 60 clocks x 1 
8-byte MMX insn/clock = 480 byte ops

(**) ~4000 ops in 300 ns: hypothetical 400 MHz Alpha with each integer 
unit enhanced for bytewise SIMD: 120 clocks x 4 8-byte insns/clock = 
3840 byte ops

(***) "win": much cheaper/faster than simply adding a second processor
--

Acknowledgements: this posting is a spin-off of a discussion with Mark 
Shand, and the "way out speculative" question was suggested by Mike 
Butts.

Jan Gray