A 32-bit RISC CPU


Newsgroups: comp.arch.fpga
Subject: Re: FPGA for a 20k gates micro-controller
Date: 27 Sep 1995 08:18:09 GMT

In <4454u5$l1n@marina.cinenet.net> kirani-@cinenet.net (kayvon irani)
>	In response to the person that asked about implementing a 40MHZ
>	controller in an FPGA; I have to say that's probably impossible
>	given the speed and gate count of current devices. Gate-wise AT&T
>	parts with FPGA gate counts around 40K (according to AT&T) may
>	translate to 20-25K ASIC gate count. With high gate utilization
>	your max. speed will probably fall below 10MHZ.
>	Kayvon Irani
>	Lear Astronics
>	Los Angeles

It depends upon what fraction of the gates are datapath vs. control. 
Assuming the majority implement datapath functions, and with careful
design, contemporary FPGAs should provide adequate capacity and perhaps
acceptable speed.

For instance, the datapath of a 32-register 32-bit pipelined RISC can
fit nicely in the "left half" (10 of 20 columns of CLBs) of a XC4010. 
In my (paper) design, I use ten columns of CLBs, each 16 CLBs "tall"
(FG denotes use of the F and G logic function generators; FF denotes
use of the flip-flops):

1. FG (as 32x1 SRAM): reg file copy "A", even bits
2. FG (as 32x1 SRAM): reg file copy "A", odd bits
3. FG: multiplexor (of reg file A value or result bus forward); FF:
latch of operand "A".
4. FG (as 32x1 SRAM): reg file copy "B", even bits
5. FG (as 32x1 SRAM): reg file copy "B", odd bits
6. FG: multiplexor (of reg file B value or sign extended 16-bit
immediate from instruction register); FF: latch of operand "B".
7. FG: logic unit (A&B, A|B, A^B, A&~B); FF: latch of "result bus". 
The latter "write back" value is fed to the data-in ports of the two
copies of the register file in columns 1-2 and 4-5.
8. FG: adder (A+B, A-B); FF: PC register
9. FG: multiplexor (of adder and incremented PC value, feeds PC
register and MAR (from an idea from Philip Fredin)); FF: instruction
register
10. FG: incrementor (of PC register); FF: memory address register
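
As a rough behavioral sketch of columns 1-8 (not the actual CLB-level
design), the following Python models the two register file copies, the
B-operand immediate mux, the logic unit, and the adder.  The names
RegFile2, alu, step, and the operation mnemonics are hypothetical,
chosen only for illustration; result forwarding and pipelining are
left out.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16):
    """Sign-extend a 16-bit immediate to 32 bits (column 6 mux input)."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if imm16 & 0x8000 else imm16


def alu(op, a, b):
    """Logic unit (column 7) and adder (column 8); op names are made up."""
    if op == "and":
        r = a & b
    elif op == "or":
        r = a | b
    elif op == "xor":
        r = a ^ b
    elif op == "andn":
        r = a & ~b
    elif op == "add":
        r = a + b
    elif op == "sub":
        r = a - b
    else:
        raise ValueError("unknown op: " + op)
    return r & MASK32


class RegFile2:
    """Two copies of a 32 x 32-bit register file (columns 1-2 and 4-5)."""

    def __init__(self):
        self.copy_a = [0] * 32
        self.copy_b = [0] * 32

    def write(self, rd, value):
        # The write-back value feeds the data-in ports of both copies.
        self.copy_a[rd] = value & MASK32
        self.copy_b[rd] = value & MASK32

    def read(self, ra, rb):
        # Each copy supplies one operand, so A and B are read together.
        return self.copy_a[ra], self.copy_b[rb]


def step(rf, op, rd, ra, rb=0, imm16=None):
    """Read operands, pick register B or the immediate, compute, write back."""
    a, b = rf.read(ra, rb)
    if imm16 is not None:
        b = sign_extend16(imm16)
    result = alu(op, a, b)
    rf.write(rd, result)
    return result
```

Keeping two copies costs twice the SRAM columns, but it gives two read
ports from single-port 32x1 SRAMs.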

The "result bus" is a 3-state bus driven by tbufs at the adder, logic
unit, MAR mux (PC), and "data in" (from RHS of the 4010) columns.  For
sign or zero extended byte and halfword loads, another column of tbufs
drives 0s or 1s appropriately.  In addition, shift/rotate left or right
1, 2, or 4 bits uses 6 other columns of tbufs.
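
The effect of the extension tbufs on a byte or halfword load can be
sketched in Python as below.  The function name extend_load and its
parameters are hypothetical; the sketch shows only the value that
should end up on the 32-bit result bus, not how the tbuf columns are
wired.

```python
def extend_load(raw, size_bytes, signed):
    """Return the 32-bit result of a byte (1) or halfword (2) load.

    The low bits come from memory; the upper bits are driven to all 0s
    (zero extension) or to copies of the sign bit (sign extension).
    """
    if size_bytes not in (1, 2):
        raise ValueError("size_bytes must be 1 or 2")
    bits = 8 * size_bytes
    low_mask = (1 << bits) - 1
    value = raw & low_mask
    if signed and (value >> (bits - 1)) & 1:
        value |= 0xFFFFFFFF & ~low_mask   # drive 1s onto the upper bits
    return value
```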

(This design also uses tbufs in the right half of the chip to do byte
and halfword extraction/alignment.  For instance, for a store-byte to
address 0x000003, the byte of data exits the datapath on bits D[7:0]
and is copied up to D[31:24] by tbufs.)
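
The store alignment can be sketched the same way.  This assumes the
byte lane is selected by the low two address bits with lane 0 on
D[7:0], which matches the address 0x000003 example above; the
byte-enable mask is an added assumption for illustration, and
align_store is a hypothetical name.

```python
def align_store(data, addr, size_bytes):
    """Place store data on the byte lane(s) selected by the address.

    Returns (bus, enables): the 32-bit value on D[31:0] and a 4-bit
    mask of the byte lanes being written (assumed, not from the post).
    """
    if size_bytes not in (1, 2, 4):
        raise ValueError("size_bytes must be 1, 2, or 4")
    if addr % size_bytes:
        raise ValueError("misaligned store")
    lane = addr & 3
    data_mask = (1 << (8 * size_bytes)) - 1
    bus = ((data & data_mask) << (8 * lane)) & 0xFFFFFFFF
    enables = ((1 << size_bytes) - 1) << lane
    return bus, enables
```

For the store-byte example, align_store(0xAB, 0x000003, 1) puts 0xAB on
D[31:24] and enables only the top byte lane.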

So, even half a "10,000 gate" 4010 has adequate capacity for a
respectable RISC datapath.

As for speed, 16 M instructions/s is doable in a 4010-5.  One critical
path in the above design is from A and B operand registers, through the
32-bit ripple carry adder, through tbufs onto the result bus, and
through the register forwarding multiplexor back to the A operand
register.  That's something like 3 ns + 33 ns + 10 ns + 5 ns + <10 ns
routing delay = ~60 ns, or roughly 16 MHz.
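
The arithmetic behind that estimate, as a small Python sketch.  The
pairing of each delay with a stage follows the order in which the
stages are listed above and is an assumption; routing is taken at its
10 ns upper bound.

```python
# Delay estimates in ns; the stage labels are assumed from the order
# in which the critical path is described.
CRITICAL_PATH_NS = [
    ("A/B operand registers", 3),
    ("32-bit ripple carry adder", 33),
    ("tbufs onto result bus", 10),
    ("register forwarding multiplexor", 5),
    ("routing delay (upper bound)", 10),
]


def max_clock_mhz(path):
    """Sum the stage delays and convert the period to a clock rate."""
    total_ns = sum(delay for _, delay in path)
    return total_ns, 1000.0 / total_ns


total_ns, f_mhz = max_clock_mhz(CRITICAL_PATH_NS)
print(f"{total_ns} ns -> {f_mhz:.1f} MHz")   # prints "61 ns -> 16.4 MHz"
```

The adder accounts for more than half of the path, which is why faster
carry logic matters so much for this design.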

In straight XC4010-5's, another speed issue concerns the register file
made from SRAMs.  At 16 MHz, there are 60 ns in which to write back a
result and read new operand(s).  The above design uses a glitch
generator approach to create the SRAM write pulse, a technique Xilinx
does not advocate.

However, with the new 4000E series parts, Xilinx introduces synchronous
SRAMs.  Based upon the 4000E spec sheet, the -3 parts are characterized
with a Twcts write cycle time of 14.4 ns, so you could do a write and a
read in 25 ns (as you require for 40 MHz) or, more straightforwardly,
30 ns/33 MHz.  If you need more speed and fewer registers, the dual
port configuration would permit concurrent register file write and read
access and thus 60+ MHz operation.
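
A quick check of those figures in Python.  Only the 14.4 ns Twcts
value and the 25 ns and 30 ns cycle times come from the text above;
the helper names are hypothetical, and the "read budget" is simply
what remains of the cycle after the write, not a spec-sheet number.

```python
T_WCTS_NS = 14.4   # 4000E-3 write cycle time quoted above


def period_to_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


def read_budget_ns(cycle_ns, write_ns=T_WCTS_NS):
    """Time left for the operand read when write and read share a cycle."""
    return cycle_ns - write_ns


for cycle_ns in (25.0, 30.0):
    print(f"{cycle_ns:.0f} ns cycle: {period_to_mhz(cycle_ns):.1f} MHz, "
          f"{read_budget_ns(cycle_ns):.1f} ns left for the read")

# Dual port: write and read overlap, so the write cycle alone sets the limit.
print(f"dual port: {period_to_mhz(T_WCTS_NS):.1f} MHz")
```

The dual-port figure comes out a little under 70 MHz, in line with the
"60+ MHz" estimate.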

The 4000E also seems to have doubled the speed of the fast carry logic
and so the above 4010-5 datapath could probably run at 30-40 MHz in a
4010E-3.  However, I don't yet have the 4000E design software; I'm
quoting from spec sheets, and I don't know if the 4000E parts are
generally available, etc., etc.  Your mileage may vary.

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001