FPGA Array Supercomputers


Subject: FPGA array computers
Date: 15 Feb 1999 00:00:00 GMT
Newsgroups: comp.sys.super,comp.arch.fpga

Greg Pfister wrote in message <36C5D127.2096E5F7@usNOSPAM.ibm.com>...
>Anybody wanna bet that this sucker is SIMD?

If it is an array of FPGAs, it can be SIMD one moment, MIMD another, and
zero-instruction-multiple-data (parallel hardwired datapaths) the next.  Or
a hybrid: a SIMD or MIMD in which each datapath also has problem-specific
function units.  See www.optimagic.com/research.html for links to some
FPGA computing research.

How about some numbers?  What might one build from (say) 256 Xilinx XCV300
Virtex FPGAs, each of which has a 32x48 array of 4-bit configurable logic
blocks plus 16 256x16 dual-ported block SRAMs?  Let's consider some
examples -- blue-sky, back-of-the-envelope numbers based solely upon
available FPGA resources.

(Disclaimers: I build homebrew FPGA RISC uniprocessors, but my MP designs
are paper tigers.  All numbers are approximate, best-case, will-not-exceed
peak numbers.  Actual numbers may underwhelm.  Machines may be very
difficult to program.  These design sketches may prove unroutable.  Etc.)

1. 0IMD (hardwired datapaths): array of 16-bit adders:
256 * 32*48*4 / 16 = 256 * 384 = 98000 adders at 100 MHz = 10e12 adds/s

array of 16-bit adders + 16-word reg files:
256 * 32*48*4 / 32  = 256 * 192 = 49000 adders at 100 MHz = 5e12 adds/s
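
As a sanity check, here is that arithmetic as a few lines of Python -- a
back-of-the-envelope sketch whose constants merely restate the assumptions
above, nothing measured:

    # Peak 0IMD adds/s across 256 XCV300s (32x48 CLBs, 4 datapath bits each)
    FPGAS, CLB_BITS, F_CLK = 256, 32*48*4, 100e6
    adders = FPGAS * CLB_BITS // 16       # plain 16-bit adder: 16 CLB-bits
    adders_rf = FPGAS * CLB_BITS // 32    # adder + 16-word reg file: 32 CLB-bits
    print(adders, adders * F_CLK)         # 98304 adders, ~10e12 adds/s
    print(adders_rf, adders_rf * F_CLK)   # 49152 adders, ~5e12 adds/s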

2. SIMD: array of 16-bit datapaths.

Assume each 16-bit datapath has:
* 16 word register file
* add/sub, logic unit
* operand mux
* SIMD control logic (conditionally suppress result writeback, etc.)
* shared access to long-line operand/result broadcast bus
8R*3C=24 Virtex CLBs

Assume 80% of FPGA area is datapath tiles and 20% is interconnect, memory
interface, and control.

256 * 32*48 / 24 * 0.8 = 16384 * 0.8 = 13000 datapaths at 50 MHz
= 600e9 16-bit ops/s
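
The same estimate in Python (the 24-CLB datapath tile and the 80% area
fraction are the assumptions stated above):

    # Peak SIMD 16-bit ops/s: 24 CLBs per datapath, 80% of CLB area in tiles
    datapaths = int(256 * 32*48 / 24 * 0.8)   # ~13000 datapaths
    print(datapaths, datapaths * 50e6)        # ~6.6e11, i.e. ~600e9 ops/s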

3. MIMD: array of 32-bit RISC processors suitable for straightforward
targeting from a C or FORTRAN compiler:

Assume each 32-bit processor has:
* 4-stage pipeline (IF/RF/EX/WB)
* 16-bit instructions
* 2R-1W 16 word x 32-bit register file
* result forwarding (reg file bypass mux)
* immediate operand mux
* add/sub, logic unit, shifts << and >> by 1 and 4 (at least)
* PC, PC incrementer, relative branches, conditional branches
* jump, jump and link
* memory address mux and register
* pipelined control unit
* 32-entry x 8-halfword-line i-cache (i.e. a 256-instruction L1 i-cache)
* no d-cache and no floating point
16R*8C CLBs = 128 CLBs, + 1 256x16 block RAM

This gives 8 processors per XCV300 and leaves 1/3 of chip area (32*16 CLBs)
and half the block RAMs free for memory interface and interconnect.

256 FPGAs * 8 CPUs/FPGA = 2000 32-bit processors at 50 MHz = 100e9 MIMD
32-bit ops/s
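
Once more in sketch form (the 128-CLB, 1-block-RAM processor budget is the
assumption above):

    # Peak MIMD 32-bit ops/s: each RISC costs 16R*8C = 128 CLBs + 1 block RAM
    cpus_per_fpga = 8
    free_clbs = 32*48 - cpus_per_fpga * 128   # 512 CLBs free (1/3 of chip)
    free_brams = 16 - cpus_per_fpga           # 8 block RAMs free
    cpus = 256 * cpus_per_fpga                # 2048 processors
    print(cpus, cpus * 50e6)                  # ~100e9 MIMD 32-bit ops/s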

See also my old FPGA MP-on-chip discussion thread at
http://dejanews.com/getdoc.xp?AN=277216882.  (Note: the CLBs there are the
2-bit CLBs of the XC4000 family, not the 4-bit CLBs of the Virtex family.)

Other comments.

Interconnect?  Memory bandwidth?  Consider a hypothetical "XYZ" machine
using a simple 3D mesh of 16 boards of 4x4 XCV300s.  Give each FPGA a
128-bit-wide SDRAM interface -- 2 DIMM sockets with 64 MB each, for a total
of 256*2*64 MB = 32 GB.

Add a 17th XCV300+SDRAM per board for configuration and control.

Configure each FPGA with 6 (NSEWUD) 16-bit channels, for 400 MB/s/chan at
200 MHz.  (Virtex datasheet says 200 MHz chip-to-chip using "HSTL class IV"
signaling.)  The FPGA at (x,y,z) transmits E to (x+1,y,z), N to (x,y+1,z), U
to (x,y,z+1) and receives W from (x-1,y,z), S from (x,y-1,z) and D from
(x,y,z-1).  Assume the cross-board up/down channels only run at 50 MHz for
100 MB/s/chan.

Peak bisection bandwidth of 4*4*100 MB/s = 1.6 GB/s "sliced between boards"
and 4*16*400 MB/s = 25 GB/s sliced vertically.
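
To make the mesh wiring and the bisection arithmetic concrete, here is a
small Python sketch; the 4x4x16 coordinate layout (4x4 FPGAs per board, 16
boards stacked in z) is my reading of the text, not a fixed design:

    # Each FPGA at (x,y,z) drives its E, N, and U neighbors over 16-bit
    # channels (edge handling omitted)
    def tx_neighbors(x, y, z):
        return {'E': (x+1, y, z), 'N': (x, y+1, z), 'U': (x, y, z+1)}

    z_cut = 4*4 * 100e6     # between-boards cut: 16 links at 100 MB/s = 1.6 GB/s
    xy_cut = 4*16 * 400e6   # vertical cut: 64 on-board links at 400 MB/s = ~25 GB/s
    print(z_cut, xy_cut)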

Peak external memory bandwidth of 256 * 128/8 * 100 MHz = 400 GB/s.  Peak
internal memory bandwidth to block RAMs = 256 FPGAs * 16 blocks * 2-ports *
2B/port * 100 MHz = 1.6 TB/s.
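
And the memory bandwidth figures, in the same sketch style:

    # Peak external and internal memory bandwidth
    ext_bw = 256 * (128 // 8) * 100e6     # 128-bit SDRAM per FPGA: ~400 GB/s
    blk_bw = 256 * 16 * 2 * 2 * 100e6     # 16 BRAMs x 2 ports x 2 B: ~1.6 TB/s
    print(ext_bw, blk_bw)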

While these point-to-point meshes have excellent bandwidth, they have
relatively high latency and seem complex and expensive to implement if
communication is irregular.  For interconnecting a few hundred FPGAs in a
scalable shared-memory MIMD, I prefer a simpler 2D or 3D Meerkat-like
interconnect with multiple buses in the X, Y, and Z dimensions, such that
the FPGA at (x,y,z) connects to (*,y,z) on the X[y][z] bus, (x,*,z) on the
Y[x][z] bus, and (x,y,*) on the Z[x][y] bus.

(See "The Meerkat Multicomputer: Tradeoffs in Multicomputer Architecture",
Robert Bedichek Ph.D. thesis --
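
A minimal sketch of that bus membership (the bus naming follows the text;
the function is purely illustrative):

    # The FPGA at (x,y,z) sits on one bus per dimension
    def buses(x, y, z):
        return [('X', y, z), ('Y', x, z), ('Z', x, y)]
    # e.g. buses(2,1,3): X[1][3] shared with all (*,1,3), Y[2][3] with
    # (2,*,3), and Z[2][1] with (2,1,*)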

Latency: For the MIMD sketched above, each processor has a local
256-halfword i-cache.  I-cache misses and all data accesses go to uncached
RAM.  Local references to uncached RAM access local SDRAM in < 100 ns, much
less if the reference hits an open page.  Non-local load/store transactions
would issue through the interconnect to a distant FPGA.  Fortunately,
memory latency is less of an issue when each processor is single-issue and
has a slow 20 ns clock.

Cost?  The raw IC cost of this hypothetical machine is very approximately:

272 XCV300-4BG352C FPGAs at $344 each (per the Avnet web site, quantity-25
discount), and 544 64 MB (8Mx64) PC100 SDRAM DIMMs at $88 each (per a chip
merchant) -- that is, 272*$344 + 544*$88 = $93,568 + $47,872, or roughly
$140,000.

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001