CPUs vs. FPGAs


Emulating FPGAs >>
<< Generators vs. synthesis

Usenet Postings
  By Subject
  By Date

  Why FPGA CPUs?
  Homebuilt processors
  Altera, Xilinx Announce
  Soft cores
  Porting lcc
  32-bit RISC CPU
  Superscalar FPGA CPUs
  Java processors
  Forth processors
  Reimplementing Alto
  FPGA CPU Speeds
  Synthesized CPUs
  Register files
  Register files (2)
  Floating point
  Using block RAM
  Flex10K CPUs
  Flex10KE CPUs

  Multis and fast unis
  Inner loop datapaths

  SoC On-Chip Buses
  On-chip Memory
  VGA controller
  Small footprints

  CNets and Datapaths
  Generators vs. synthesis

FPGAs vs. Processors
  CPUs vs. FPGAs
  Emulating FPGAs
  FPGAs as coprocessors
  Regexps in FPGAs
  Life in an FPGA
  Maximum element

  Pushing on a rope
  Virtex speculation
  Rambus for FPGAs
  3-D rendering
  LFSR Design

Google SiteSearch
Subject: Re: advice request: CPUs vs. FPGAs again
Date: 20 Jan 1997 00:00:00 GMT
newsgroups: comp.arch.fpga

[This article is a repost of <01bc0692$176777e0$4e7dbacd@p5-166>,
cleaned up and with corrections regarding use of XC4000 H-muxes.]

Ben Chaffin <98bcc-@williams.edu> wrote in article
> Hello,
>         I am a computer science major at Williams College. I am working
> on a computational number theory problem for one of my professors. I have
> implemented it in C and optimized it to death and am now considering
> building a special purpose computer to speed it up. So, along with
> everyone else, I'm trying to find the best balance between a limited
> budget and speed. I have a design for discrete chips that should run
> about 5 times as fast as a P-120. The cost is $300-$400, which isn't bad,
> but the circuit involves nearly 200 chips, with about 3000 pins. That's a
> lot of wiring. The algorithm involves:
>         Two parallel 128-bit additions of internal-only data
>         Storage for the result (256 bits)
>         Two 48-bit conditional inversions using XOR gates
>         Two priority encodings across 48 bits
>         Some counters, flip-flops and simple control circuitry
>         Around 50 bits of output

Hooray for such questions, which help to give us insight
into the categorization of problems that are better done
on FPGAs than on general purpose computers.  In this
case, we compare a fast implementation on a general
purpose computer with a fast implementation on an
FPGA (a XC4010E-2) which seems well suited for the
problem.  I hope the discussion is of some indirect
use to Mr. Chaffin.

Let's consider how well this problem could run on a
superscalar 64-bit microprocessor.  The description
above would seem to require 2 adds, 2 adds-with-carry,
2 complements, 2 conditional moves, and 2 instances
of a priority encoder.  Each of the latter could be
implemented as two iterations of binary search followed
by a table lookup, e.g. as a shift, branch, shift, branch,
load byte (4KB table), add constant.  That's 30-40
clocks on a 2-integer issue machine, assuming all
four branches are mispredicted with a 5-cycle
misprediction penalty (as on a DEC Alpha 21164),
and mostly L1 cache hits.  With a 3 ns clock, that's
about 100 ns per iteration.

Now let's consider an FPGA implementation,
specifically a simpleminded, non-pipelined, no-sharing,
approach.  We'll use a Xilinx XC4010E-2 (20x20 CLBs),
$110 quantity one according to www.marshall.com.

128-bit adder:
Slow approach: 64 CLBs configured as a 128-bit fast
ripple carry adder (3+ columns of CLBs).
Latency approx. 3+64*.6 =42 ns.

Faster approach: 16+2*16+2*16+2*16 CLBs configured
as a 32+32+32+32-bit carry select adder.
Latency approx. 3+16*.6+5 gate delays+slop=25 ns.

[By the way, if a CLB's F-LUTs and fast carry
circuits are configured as a 2-bit fast ripple carry adder,
can you still use its H-LUT to multiplex one of its
F-LUT outputs with another signal, driving *that*
onto a X or Y output?]

Assuming H-muxes to multiplex the intermediate sums,
two such adders require about 240 CLBs, more than
half of the chip resources.  You can store the sums
in the flip-flops in the same CLBs.

(Note, we did not discuss the adder input values.
We may need to store an additional 256 or 512
bits of input state.)

Two 48-bit inversions require a total of 48 CLBs
if you can't fold them in elsewhere.
Latency: 2-3 ns of logic plus some interconnect delay.

48-bit priority encoder: 
A 4-bit encoder is 1.5 CLBs.
A 16-bit encoder is about 10 CLBs: 4 4-bit encoders
(6 CLBs) + about 4 CLBs to combine them.
A 48-bit encoder is about 36 CLBs: 3 16-bit encoders
(30 CLBs) + about 6 CLBs to combine them.
Therefore, two encoders require 72 CLBs.
Latency is about 4 gate delays (8 ns) plus
some interconnect delay => 15 ns.

Altogether, a non-pipelined implementation could
complete an iteration in 50 ns.  If there is a way to
overlap or pipeline the adder and the priority encoder,
you might get this down to 30 ns.

This is 2-3X faster than the general purpose CPU
approach.  However, it is also instructive to compare
the approximate latencies of the individual operations.

Adds: 6 ns (CPU); 25 ns (FPGA).
Inversions: 6 ns (CPU); <10 ns (FPGA).
Priority encoders: 90 ns (CPU); 15 ns (FPGA).

The CPU loses badly because of the expense of the
priority encoder implementation.  So, one approach
to the problem is to find a CPU with a priority
encoder (e.g. the Microunity Mediaprocessor),
or a not-yet-existent CPU with a little bit of
programmable logic in its datapath, with which
to implement a priority encoder...

This circuit just fits in the 400 CLBs of an XC4010E.
Two instances would fit in a XC4020E or XC4025E,
more in the promised XC4000EX family parts.
So, if this is not an inherently serial problem,
doing several iterations in parallel would show
a further performance advantage for the FPGA
approach.  Or, as Ray Andraka suggests in another
reply, if your problem tolerates a lot of parallelism,
a "massively" instanced bit serial approach will be
faster yet.

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001