Floating Point


Subject: Re: Floating point on fpga, and serial FP adders
Newsgroups: comp.arch.fpga
Date: Sat, 3 Jul 1999 10:43:33 -0700

Roland Paterson-Jones wrote in message <377DC508.D5F1D048@bigfoot.com>...
>It has been variously stated that fpga's are no good for floating point
>operations. Why? As I see it, floating point operations are typically
>just shifted integer operations. Is the bit-width problematic?

For 16-bit floats with (say) 10-bit mantissas, FPGAs should be *great* for
floating point.  Indeed, the problems start at wider bit-widths.  The
area-expensive (and worse than linear scaling) FP components are the barrel
shifters needed for pre-add mantissa operand alignment and post-add
normalization in the FP adder, and of course the FP multiplier array.
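To see why both shifters are needed, here is a toy software model of the FP add datapath (my own sketch, not from any of the papers below; hypothetical helper name `fp_add`, positive operands only, truncation instead of rounding, 11-bit mantissas with the hidden 1 included):

```python
# Toy model of the FP-add steps that require the two barrel shifts.
# Assumes 16-bit-float-like operands: 10 stored mantissa bits plus a
# hidden leading 1, unbiased integer exponents, positive values only.

MANT_BITS = 11  # 10 stored bits + hidden 1; normalized mant in [2^10, 2^11)

def fp_add(exp_a, mant_a, exp_b, mant_b):
    """Add two normalized positive floats; returns (exp, mant)."""
    # Pre-add alignment: right-shift the smaller operand's mantissa
    # by the exponent difference -- the first barrel shift.
    if exp_a < exp_b:
        exp_a, mant_a, exp_b, mant_b = exp_b, mant_b, exp_a, mant_a
    mant_b >>= (exp_a - exp_b)
    s = mant_a + mant_b
    exp = exp_a
    # Post-add normalization. A carry-out needs one right shift;
    # cancellation (in subtraction, not modeled here) can need a large
    # left shift -- that is the second barrel shift.
    while s >= (1 << MANT_BITS):
        s >>= 1
        exp += 1
    while s and s < (1 << (MANT_BITS - 1)):
        s <<= 1
        exp -= 1
    return exp, s

# 1.5 * 2^0 + 1.5 * 2^0 = 1.5 * 2^1
print(fp_add(0, 0b11000000000, 0, 0b11000000000))  # -> (1, 0b11000000000)
```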

The FCCM papers on this subject include:

Ligon et al, A Re-evaluation of the practicality of floating-point
operations on FPGAs, FCCM 1998

Louca et al, Implementation of IEEE single precision floating point addition
and multiplication on FPGAs, FCCM 1996

Shirazi et al, Quantitative analysis of floating point arithmetic on FPGA
based custom computing machines, FCCM 1995

and the neat Leong paper on avoiding the problem entirely,

Leong et al, Automating floating to fixed point translation and its
application to post-rendering 3D warping, FCCM 1999

See the Ligon paper for a nice presentation of speed-area tradeoffs of
various implementation choices.  Ligon estimates their single-precision FP
adder resource use at between 563 and 629 LUTs -- 36-40% of a XC4020E.  Note
this group used a synthesis tool; a hand-mapped design could be smaller.

Put another way, that single precision FP adder is almost twice the area of
a pipelined 32-bit RISC datapath.  Ouch.

The rest of this article explores ideas for slower-but-smaller FP adders.

The two FP add barrel shifters are the problem.  They each need many LUTs
and much interconnect.  For example, a w-bit-wide barrel shifter is often
implemented as lg w stages of w-bit 2-1 muxes, optionally pipelined.

Example 1: single-precision in << s, w=24
  m0 = s[0] ? in[22:0] << 1 : in;
  m1 = s[1] ? m0[21:0] << 2 : m0;
  m2 = s[2] ? m1[19:0] << 4 : m1;
  m3 = s[3] ? m2[15:0] << 8 : m2;  // 16 wires 8 high
  out = s[4] ? m3[7:0] << 16 : m3; // 8 wires 16 high
5*24 2-1 muxes = 120 LUTs

Example 2: double-precision in << s, w=53
  m0 = s[0] ? in[51:0] << 1 : in;
  m1 = s[1] ? m0[50:0] << 2 : m0;
  m2 = s[2] ? m1[48:0] << 4 : m1;
  m3 = s[3] ? m2[44:0] << 8 : m2; // 45 wires 8 high
  m4 = s[4] ? m3[36:0] << 16 : m3; // 37 wires 16 high
  out = s[5] ? m4[20:0] << 32 : m4; // 21 wires 32 high
6*53 2-1 muxes = 318 LUTs
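As a sanity check, the staged structure of both examples can be modeled in software (a sketch, not HDL; `barrel_shift_left` is my own name):

```python
# Software model of the staged barrel shifter above: lg w stages, each
# a row of w 2-1 muxes choosing between "pass" and "shift left by 2^k",
# with the result truncated to w bits.

def barrel_shift_left(x, s, w):
    """Shift a w-bit value x left by s (0 <= s < w), staged as in HW."""
    mask = (1 << w) - 1
    stages = (w - 1).bit_length()   # 5 stages for w=24, 6 for w=53
    for k in range(stages):
        if (s >> k) & 1:
            x = (x << (1 << k)) & mask  # one row of w 2-1 muxes
    return x

# The staged version matches a direct shift for every amount we try:
assert all(barrel_shift_left(x, s, 24) == (x << s) & ((1 << 24) - 1)
           for x in (1, 0x55AA55, 0xFFFFFF) for s in range(24))
```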

In a horizontally oriented datapath, the last few mux stages have many
vertical wires, each many LUTs high.  This is more vertical interconnect
than is available in one column of LUTs/CLBs, so the actual area can be much
worse than the LUT count indicates.

BUT we can of course avoid the barrel shifters, and do FP
denormalization/renormalization iteratively.

Idea #1: Replace the barrel shifters with early-out iterative shifters.  For
example, build a registered 4-1 mux: w = mux(in, w<<1, w<<3, w<<7).  Then an
arbitrary 24-bit shift can be done in 5 cycles or less in ~1/3 of the area.
For double precision, make it something like w = mux(in, w<<1, w<<4, w<<12),
giving an arbitrary 53-bit shift in 8 cycles.
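Idea #1 can be checked with a quick software model (my own sketch; a greedy controller always takes the largest step that doesn't overshoot):

```python
# Model of the early-out iterative shifter: each cycle a registered 4-1
# mux replaces w with one of {w, w<<1, w<<3, w<<7} (or {w, w<<1, w<<4,
# w<<12} for double), under greedy control.

def iterative_shift(x, s, w=24, steps=(7, 3, 1)):
    """Shift w-bit x left by s, one mux selection per cycle."""
    mask = (1 << w) - 1
    cycles = 0
    while s:
        step = next(k for k in steps if k <= s)  # greedy: largest step <= s
        x = (x << step) & mask
        s -= step
        cycles += 1
    return x, cycles

# Any 24-bit shift completes in at most 5 cycles with steps {1,3,7};
# any 53-bit shift in at most 8 cycles with steps {1,4,12}.
assert max(iterative_shift(1, s)[1] for s in range(24)) == 5
assert max(iterative_shift(1, s, 53, (12, 4, 1))[1] for s in range(53)) == 8
```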

Idea #2: (half-baked and sketchy) Do FP addition in a bit- or nibble-serial
fashion.

To add A+B, you

1) compare exponents A.exp and B.exp;
2) serialize A.mant and B.mant, LSB first;
3) swap (using two 2-1 muxes) lsb-serial(A.mant) and lsb-serial(B.mant) if
A.exp < B.exp;
4) delay lsb-serial(A.mant) in a w-bit FIFO for abs(A.exp-B.exp) cycles;
5) bit-serial-add delay(lsb-serial(A.mant)) + lsb-serial(B.mant) for w cycles;
6) collect the sum in a "sum.mant" shift register;
7) shift up to w-1 cycles (until the result mantissa is normalized).
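The seven steps above can be sketched cycle by cycle in software (my own model, with assumptions: positive operands, hidden-1 mantissas, truncation instead of rounding, and the normalization count done arithmetically rather than by actual shifting):

```python
W = 24  # mantissa width including the hidden 1 (single precision)

def serial_fp_add(exp_a, mant_a, exp_b, mant_b):
    """Bit-serial FP add per steps 1-7; returns (exp, mant)."""
    # Steps 1 and 3: compare exponents, swap so A has the larger one.
    if exp_a < exp_b:
        exp_a, mant_a, exp_b, mant_b = exp_b, mant_b, exp_a, mant_a
    d = exp_a - exp_b                  # step 4: FIFO depth

    # Step 2: serialize LSB first; step 4: delaying A by d cycles
    # multiplies it by 2^d, i.e. right-aligns B against A.  One extra
    # cycle flushes the final carry.
    n = W + d + 1
    stream_a = [(mant_a >> (i - d)) & 1 if d <= i < d + W else 0
                for i in range(n)]
    stream_b = [(mant_b >> i) & 1 if i < W else 0 for i in range(n)]

    # Step 5: bit-serial add; step 6: collect in a shift register.
    carry, shreg = 0, []
    for a, b in zip(stream_a, stream_b):
        s = a + b + carry
        shreg.append(s & 1)
        carry = s >> 1

    # Step 7: normalize -- keep the top W bits, adjust the exponent.
    total = sum(bit << i for i, bit in enumerate(shreg))
    shift = total.bit_length() - W
    return exp_b + shift, total >> shift

# 1.0*2^3 + 1.0*2^0 = 1.125*2^3 (i.e. 8 + 1 = 9):
one = 1 << (W - 1)                     # hidden 1 in the MSB
print(serial_fp_add(3, one, 0, one))
```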

It may be that steps 4 and 6 are quite cheap, using Virtex 4-LUTs in shift
register mode -- they're variable tap, right?

It is interesting to consider eliminating steps 2, 6, and 7, by keeping your
FP mantissa values in the serialized representation between operations,
counting clocks since last sum-1-bit seen, and then normalizing (exponent
adjustment only) and aligning *both* operands (via swap/delay) on input to
the next FP operation.  A big chained data computation might exploit many
serially interconnected serial FP adders and serial FP multipliers...

Is this approach better (throughput/area) than a traditional pipelined
word-oriented FP datapath?  Probably not, but I don't know.  If your FP
needs are modest (a few Mflops, not 100 Mflops), this approach should permit
quite compact FP hardware.

(Philip Freidin and I discussed this at FCCM99.  Thanks Philip.)

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001