fpgacpu.org - Finding Max Element in an FPGA

Finding Max Element in an FPGA

Home

Floorplanning >>
<< Life in an FPGA

Usenet Postings
  By Subject
  By Date

FPGA CPUs
  Why FPGA CPUs?
  Homebuilt processors
  Altera, Xilinx Announce
  Soft cores
  Porting lcc
  32-bit RISC CPU
  Superscalar FPGA CPUs
  Java processors
  Forth processors
  Reimplementing Alto
  Transputers
  FPGA CPU Speeds
  Synthesized CPUs
  Register files
  Register files (2)
  Floating point
  Using block RAM
  Flex10K CPUs
  Flex10KE CPUs

Multiprocessors
  Multis and fast unis
  Inner loop datapaths
  Supercomputers

Systems-on-a-Chip
  SoC On-Chip Buses
  On-chip Memory
  VGA controller
  Small footprints

CNets
  CNets and Datapaths
  Generators vs. synthesis

FPGAs vs. Processors
  CPUs vs. FPGAs
  Emulating FPGAs
  FPGAs as coprocessors
  Regexps in FPGAs
  Life in an FPGA
  Maximum element

Miscellaneous
  Floorplanning
  Pushing on a rope
  Virtex speculation
  Rambus for FPGAs
  3-D rendering
  LFSR Design

Subject: Re: Efficient max-function architecture?
Date: 22 Sep 1998 00:00:00 GMT
Newsgroups: comp.arch.fpga

Some comments.

Algorithm analysis 101 tells us that to determine the largest item of n
items requires at least n-1 2-input "greater than" comparisons.  You might
build other circuit structures, e.g. using 3-input or 4-input max-functions,
of course, but I doubt they are as area efficient -- although I can't prove
that off the top of my head.

In the worst case, each comparison must consider each bit of its two
arguments, although there is a possibility (a probability) of "early out" if
some MSBs are different.  If you wish to take advantage of early-out, you
will have to design for the worst case (no early-outs occurred) or for
failure in the improbable case.

As you suggest, you can do the n-1 comparisons sequentially, in parallel,
pipelined, etc., depending upon your performance requirements.

Consider using dedicated carry chains (if available) to make your
comparators small and fast.  For example, an n-bit comparator requires
approximately n/2 XC4000 CLBs + approx 1/2 CLB to establish carry-in[0] as
0.

If the data arrives a byte at a time you could perform 1 load and n-1
comparison cycles using a single comparator and load-enabled register (e.g.
4.5 XC4000 CLBs, 1+15 cycles at ~10 ns/cycle).

If the 16 bytes of data arrive all at once, the pipelined tree of
comparators and muxes should do nicely (15*(4.5+4) = ~130 CLBs, 10
ns/cycle).

Hybrids: e.g. if you have 50 ns you can reduce the problem to a first step
where you iteratively find 4 largest of 4 subsets of 4, followed by 3
comparators+muxes, in area 4*4.5 +3*(4.5+4) = ~45 CLBs).

Jan Gray

Subject: Re: Efficient max-function architecture? -- "parallel bitwise max"
Date: 22 Sep 1998 00:00:00 GMT
Newsgroups: comp.arch.fpga

I wrote in message <6u8s8g$l5s$1@news-1.news.gte.net>...
>Algorithm analysis 101 tells us that to determine the largest item of n
>items requires at least n-1 2-input "greater than" comparisons. ...

SEGUE

Algorithm analysis 101 also teaches us to carefully choose a model (e.g.
elements and a 2-element comparison function) and a goal (e.g. minimize no.
of comparisons to determine the largest element) to represent the problem.
Only then can we compare algorithms or study optimal lower bounds on
algorithms.

For example, using a comparison based model, it can be shown that any
sorting algorithm on n elements must perform at least ceil lg n!
comparisons.  But there are other sorting algorithms which do not perform
any elementwise comparisons.  They assume a different model with different
operations on the elements.  Consider radix sort, in which we distribute the
n elements into r buckets, according to increasingly significant digits in
their base r representation, and then regather them and repeat.  (Remember
the old IBM punch card sorters?)

So, as radix sort is to a comparison-based sort, is there an analogous
"radix max" to our comparison-based max?

"PARALLEL BITWISE MAX"

Yes.  Here's a 'digit'al method for finding the max of n k-bit words.

We scan the n inputs as n serial bit streams, all clocked together, msbs
first.  One bit stream is the maximum if it is never observed to be less
than some other bit stream.  For any two bit streams A and B,
  "A < B" iff ~msb(A)&msb(B) | (msb(A)==msb(B) & "lsbs(A) < lsbs(B)").

Design: we keep n candidate-for-max state bits, one for each bit stream.
All are initialized to 1, as each stream is initially a candidate to be the
maximum.  A candidate stream is still a candidate if it has a current 1
input bit.  A stream with a current 0 input bit loses its candidacy if there
is some other remaining candidate stream that has a current 1 bit.  After
all input bits have been clocked past, the remaining candidate is the
maximum bit stream.  If duplicate input bit streams are possible, there can
be more than one remaining candidate, each of which corresponds to the same
maximum value.

  // pidgin netlist generator source:

  input reset; // 1 during the reset cycle
  input stream[n]; // current bit of each of the n input streams
  reg cand[n]; // 1 if the i'th stream is still a candidate for maximum.
  net some1; // 1 if some remaining candidate stream input bit is currently 1
  output answer; // result, the index of max input stream

  some1 = cand[0]&stream[0] | cand[1]&stream[1] | ... | cand[n-1]&stream[n-1];
  for (i = 0; i < n; i++)
    cand[i] := reset | cand[i]&(stream[i] | ~stream[i]&~some1);

If we know the input words are never two alike, then bitcount(cand)=1 and we
can use
  answer = encode(cand);
otherwise,
  answer = priority_encode(cand);

Implemented in an XC4000 FPGA, for n=16, requires approximately
  4.5 CLBs for some1
  8 CLBs for cand[16]
  12.5 CLBs for a 16-to-4 priority encoder
----
~25 CLBs, result every 9 clocks for 8-bit input data, at perhaps 10
ns/clock.

Radix 4 (2 bits/clock) is also possible but probably unwieldly and slow.

The nice thing about this approach is it is readily scalable to many inputs;
the slowest part of the design would be the 'some1' or-reduction circuit,
and even with 36 input streams this is only three CLB delays and mostly
local interconnect.

Jan Gray

Subject: Re: Efficient max-function architecture? -- "parallel bitwise max"
Date: 30 Sep 1998 00:00:00 GMT
Newsgroups: comp.arch.fpga

Brad Taylor wrote in message <3611D95A.527FC010@emf.net>...
>                       /=====REG==\
>                       |   [MAX]  |
>               [MAX]   \==>|a o|==/
> [hi_byte ]===>|a o|==REG=>|b__|
> [lo_byte ]===>|b__|

I like it!  Certainly nicer than the 4-way hybrid in my first posting.

To recap the thread and add yet another approach, if you have n inputs each
m bits long, some choices are :-

1. simple m/2+1 CLB max accumulator in ~n clocks
2. ~nm/2 CLB max-mux tree in 1 clock
3. hybrid which takes ~nm/2k CLBs in k clocks
4. "parallel bit serial max" in ~n CLBs in ~m clocks
5. "serial bit serial max" in ~lg m + m/16 CLBs in m*n clocks

About 5: use a bit serial max state machine together with an m-bit FIFO
implemented using dual port RAM, to store the "current max".  Operate by
streaming in all the input words, serially, one bit at a time, into the max
machine.  Each m clocks it bit serially emits the max of all inputs so far.

Jan Gray

Subject: Re: Efficient max-function architecture? -- "parallel bitwise max"
Date: 30 Sep 1998 00:00:00 GMT
Newsgroups: comp.arch.fpga

Jan Gray wrote in message <6usptm$d1$1@news-2.news.gte.net>...
>5. "serial bit serial max" in ~lg m + m/16 CLBs in m*n clocks
>
>About 5: use a bit serial max state machine together with an m-bit FIFO
>implemented using dual port RAM, to store the "current max"...

Oops, we don't need an m-bit FIFO, but rather a simple m-bit shift register.
Looking to http://www.xilinx.com/xapp/xapp052.pdf for inspiration, the m-bit
SR can be implemented in just 2 + m/32 CLBs so I should rather have written:

5. "serial bit serial max" in ~4 + m/32 CLBs in m*n clocks

Jan Gray