Homebuilt Processors


Newsgroups: alt.comp.hardware.homebuilt,comp.arch.fpga
Subject: homebuilt processors using FPGAs (long)
Date: 11 Dec 1994 04:08:40 GMT

(Hope the crosspost to comp.arch.fpga is OK, the topic is amateur
processor implementations using FPGAs.)

In <3c6is4$d7k@gordon.enea.se> pefo-@enea.se (Per Fogelstrom) writes: 
>PDP11 Hacker ..... (ard-@siva.bris.ac.uk) wrote:
>: My main interest is in designing a CPU _from scratch_. OK, I know I'll get
>: poor performance from all those FPGAs wire-wrapped together (all that
>: capacitive loading for one thing), but with a good underlying design it should
>: be useable (heck.. The PERQ 1a was had a CPU built from 250 TTL chips, PALs 
>: PROMs, clocked at 5MHz, and still beats this 386DX33 for graphics performance
>: :-)). And there's the joy when a prompt appears on a machine that you even
>: designed the instruction set for.
>I've did a few bitslice designs many years ago. One was for my own amusement
>and was based on AMD2903 slices (32 bits, 8 chips). It was fun but very time-
>consuming. It was clocked with 5Mhz and executed reg-reg instructions in
>two clocks. I later redesigned it to fetch and decode in the same cycle as
>the previous execute. It never ran any serious software.

On homebrew computers: start simple and learn as you go.  When they work
they are *very* satisfying. I was encouraged by helpful U.Waterloo
hardware hacker friends (thanks Ashok and Mike and co., wherever you
are) into building my first homebrew 6809 system -- the "Gray-1", in
12th grade about 14 years ago.  It started with ROM, SRAM, and LEDs,
and gradually acquired serial ports, video, and a Votrax speech
synthesizer.  Eight bit micros and 1 MHz clock rates are easy to do:
easy to wire wrap, and easy to program.  Start with one of those; PICs
look like a good choice today.

On homebrew processors: I went into the software biz but my love for
hardware and computer architecture remains.  I've always been envious of
the engineers in industry and academia who get to design and build new
processors.  For a hobbyist, custom VLSI, gate arrays, and standard cell
all have hugely expensive barriers to entry.  And only the most
determined hobbyist would build a useful 32-bit CPU using bitslice

In the years since, the programmable logic industry has arrived!  These
days you can buy, quantity one, 5,000 gate field programmable gate
arrays (FPGAs) for ~$100, and 10,000 gate parts for ~$200.  The
beauty of these parts is they are adequately dense for implementing
processors and they abstract away a lot of the high speed circuit stuff
for you.  For instance, clock skew is of little concern.  If you stick
to fully synchronous designs (no async preset/clear, no gated clocks,
etc.), carefully floorplan your functional units, and stay on chip :-),
your designs have a good chance of working at 20-25 MHz.

In my copious spare time I am experimenting with homebrew RISC CPUs.
Right now I have a partially finished, partially functional 16-bit RISC
CPU and ambitions for a dual issue 32-bit CPU.  The former ("jr16") is
compiled for a Xilinx XC4005PC84C-5, the latter ("NVLIW1" -- "not very
long instruction word #1") will be for a XC4010PC84C-5.

jr16 is a pipelined, 3-operand, load/store RISC with 16 16-bit
registers.  The basic instruction formats are:
{  0, op: 3, rd: 4, ra: 4,  rb: 4  /* add/logic operations   */ },
{ 10, op: 2, rd: 4, ra: 4, imm: 4  /* load/store, EA=ra+imm4 */ }, and
{ 11, op: 2, rd: 4,        imm: 8  /* load immediate, branch */ }.
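A minimal sketch, in Python, of how the three formats above could be
packed into and unpacked from a 16-bit word.  The field widths come from
the list above; placing the first-listed field in the most significant
bits is an assumption, and the names encode_rrr, encode_mem, encode_imm,
and decode are hypothetical helpers, not part of jr16.

```python
def _check(value, bits):
    """Reject values that do not fit in an unsigned field of the given width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{value} does not fit in {bits} bits")
    return value

def encode_rrr(op, rd, ra, rb):
    """{0, op:3, rd:4, ra:4, rb:4}: add/logic operations."""
    return (_check(op, 3) << 12) | (_check(rd, 4) << 8) | (_check(ra, 4) << 4) | _check(rb, 4)

def encode_mem(op, rd, ra, imm4):
    """{10, op:2, rd:4, ra:4, imm:4}: load/store, EA = ra + imm4."""
    return (0b10 << 14) | (_check(op, 2) << 12) | (_check(rd, 4) << 8) | (_check(ra, 4) << 4) | _check(imm4, 4)

def encode_imm(op, rd, imm8):
    """{11, op:2, rd:4, imm:8}: load immediate, branch."""
    return (0b11 << 14) | (_check(op, 2) << 12) | (_check(rd, 4) << 8) | _check(imm8, 8)

def decode(word):
    """Classify a 16-bit word by its leading bits and return its fields."""
    rd = (word >> 8) & 0xF
    if word >> 15 == 0:
        return ("rrr", (word >> 12) & 0x7, rd, (word >> 4) & 0xF, word & 0xF)
    op = (word >> 12) & 0x3
    if word >> 14 == 0b10:
        return ("mem", op, rd, (word >> 4) & 0xF, word & 0xF)
    return ("imm", op, rd, word & 0xFF)
```

Each format adds up to exactly 16 bits (1+3+4+4+4, 2+2+4+4+4, and
2+2+4+8), so the leading one or two bits are enough to tell them apart.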

The instruction pipeline is the classic IF (insn fetch), RF (write back
previous result and reg fetch), and EX (execute add/logic/effective
address computation.)  If there's a load/store the pipeline stops until

The 16-bit datapath is 8 rows by 5 columns of CLBs (Xilinx Configurable
Logic Blocks) (only ~20% of an XC4005 which has an array of 14x14 CLBs).
The columns are: rfa (reg file read port A), rfb (reg file read port B),
mux (multiplex B or immediate data), adder, logic unit (and, or, xor,
xnor).  Results (add/logic/load data) are multiplexed into a write-back
register on long lines (LLs) using the XC4000's dedicated LL tristate
drivers.
For this first design I avoided a separate PC incrementor and associated
multiplexors and instead use r15 for a PC.  Thus the clock phases are:

phase  register file          exec. unit           load/store

1      write back result reg  add 2 to PC          latch insn, read another
2      read next A, B regs    add 2 to PC
3      write back PC          user insn add/logic
4      read PC                user insn add/logic

(The execution unit takes two clocks to add/mux result at (unproven) 40
MHz.)
A nice aspect of this design is the alternating inc-PC and user-insn
cycle means that the previous user insn finishes and any results are
written back to the reg file before the next user insn operands are
read, thus eliminating any need for bypass multiplexors in the operand
busses or ugly operation latencies in the programming model.

To date I have this design running using the 11 MHz Xilinx XChecker
circuit probe, incrementing PC, fetching instructions from an on-chip
16-word boot ROM, and performing ALU operations, but haven't yet
implemented condition codes, branch or load/store circuitry.  Soon!  (I
know it works as far as it does because I can verify internal state: the
XChecker probe allows you to examine the state of every function
generator and flip flop on the part.)

As for top speed, XDelay static timing analysis (I don't have the
simulator software) indicates I should be able to clock this at 40 MHz
(25 ns).  (I do have a critical path or two to better pipeline yet).
Thus it should do 10 peak MIPS, not too shabby for a first design.
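The arithmetic behind those figures, as a small Python sketch.  It
assumes one user instruction completes every four clock phases, as in
the phase table above; the function names are hypothetical.

```python
def cycle_time_ns(clock_mhz):
    """Clock period in nanoseconds for a clock rate given in MHz."""
    return 1000.0 / clock_mhz

def peak_mips(clock_mhz, clocks_per_insn):
    """Peak instruction rate in MIPS, ignoring load/store stalls."""
    return clock_mhz / clocks_per_insn

print(cycle_time_ns(40))   # 25.0 ns per clock
print(peak_mips(40, 4))    # 10.0 MIPS with four clocks per user insn
```

At the XChecker's 11 MHz the same formula gives 2.75 peak MIPS.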

One neat thing about the Xilinx XC4000 architecture (and I haven't
seriously looked at the other FPGA vendors' architectures to know if
this is unique or inferior or superior) is there are enough flip flops
mixed in with the function generators that you can make a RISC datapath
in as few as three columns of CLBs: one register file (that you have to
take two clocks to read two operands), one adder, and one logic unit, with
result multiplexing being done on the LLs using tristate drivers.  And using
the dedicated carry paths you can do 16-bit adders in 9 CLBs, delay
about 25 ns, and 32-bit adders in 17 CLBs, delay about 35 ns.
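The two adder sizes quoted above fit a simple rule: two sum bits per CLB
plus one extra CLB.  The sketch below encodes that rule in Python.  The
rule is inferred from the 16-bit and 32-bit figures in this post, not
taken from Xilinx documentation, and adder_clbs is a hypothetical name.

```python
def adder_clbs(bits):
    """Estimated CLB count for a carry-chain adder of the given width.

    Assumes two sum bits per CLB plus one extra CLB, which matches the
    9-CLB (16-bit) and 17-CLB (32-bit) figures quoted in the post.
    """
    if bits <= 0 or bits % 2:
        raise ValueError("width must be a positive even number of bits")
    return bits // 2 + 1

print(adder_clbs(16))  # 9
print(adder_clbs(32))  # 17
```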

As for the dual-issue 32-bit NVLIW1, my current plans are for a two-unit
implementation of a simple VLIW architecture.  Each "unit" has a separate
file of 16 32-bit registers and 3-operand instructions (rdest = ra op rb).
rdest and rb are local to the unit and specified using a 4-bit reg no., but
ra can be read from either unit and is spec'd using 1+4 bits.  Thus a
2-unit machine has a basic 34-bit insn word:
{ op0: 4, rd0: 4, ra0: 5, rb0: 4, op1: 4, rd1: 4, ra1: 5, rb1: 4 }.

(I'd obviously like to get that 34-bit word down to 32-bits but there
isn't much fluff left.  Any ideas out there?  32 - 2*(4+5+4) = 6, and
six bits doesn't encode two operations very well...)
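To make the bit budget concrete, here is a minimal Python sketch that
packs and unpacks the insn word from the field list above.  The field
names and widths come from the post; packing the first-listed field into
the most significant bits, and putting the unit-select bit above the
4-bit register number in ra, are assumptions.  pack, unpack, and ra_field
are hypothetical helpers.

```python
FIELDS = [("op0", 4), ("rd0", 4), ("ra0", 5), ("rb0", 4),
          ("op1", 4), ("rd1", 4), ("ra1", 5), ("rb1", 4)]

WORD_BITS = sum(width for _, width in FIELDS)                      # 34
OP_BITS_IF_32 = 32 - sum(w for n, w in FIELDS if n[:2] != "op")    # 6

def ra_field(unit, reg):
    """5-bit ra operand: 1 unit-select bit (assumed high) + 4-bit reg no."""
    if unit not in (0, 1) or not 0 <= reg < 16:
        raise ValueError("unit must be 0 or 1 and reg must be 0..15")
    return (unit << 4) | reg

def pack(**values):
    """Pack named fields into one word; unspecified fields default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word):
    """Split a packed word back into a dict of named fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

WORD_BITS comes out to 34, and OP_BITS_IF_32 to 6, which is the three
bits per operation that a 32-bit word would leave.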

With the above "modestly decoupled" architecture, a separate PC
incrementer, bypass result multiplexing, and VLIW-like limited access
between register files/functional units, it should peak at two
instructions every two 25 ns clocks, or 40 MIPS.  Here, the columns of
functional units in the data path floor plan will be something like
  (L=logic unit, A=adder, MM=4-way A-bus source mux,
  RRR=3-read 2-write register file)
with the two halves being placed such that splitting the
LL bus lets me mux the adder or logic unit results of each concurrently.

Thus the datapath of this 32-bit dual-issue machine should fit nicely in
14 columns X 17 rows of a 20x20 XC4010.  On a 4013 (24x24) I would add
a 16-entry 256-byte direct mapped cache (16 16-byte lines) whose cache
and data SRAMs would burn another 5 rows by 16 columns.  On a 4025,
(32x32) ...
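A quick Python sketch of the CLB budget implied by those numbers.  The
array sizes and block dimensions are the ones quoted above; the helper
names are hypothetical, and the sketch counts only the datapath and
cache blocks, not control logic or routing.

```python
def clbs(cols, rows):
    """Number of CLBs in a rectangular block."""
    return cols * rows

def utilization(used, total):
    """Fraction of a part's CLBs consumed by the given block(s)."""
    return used / total

XC4010 = clbs(20, 20)            # 400 CLBs
XC4013 = clbs(24, 24)            # 576 CLBs

datapath = clbs(14, 17)          # 238 CLBs for the dual-issue datapath
cache = clbs(16, 5)              # 80 CLBs for the cache SRAMs
cache_bytes = 16 * 16            # 16 lines of 16 bytes = 256 bytes

print(round(utilization(datapath, XC4010), 3))          # 0.595
print(round(utilization(datapath + cache, XC4013), 3))  # 0.552
```

So the datapath alone takes roughly 60% of a 4010, and the datapath plus
cache a little over half of a 4013.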

It is amazing what you can squeeze onto these parts if you design the
machine architecture carefully to exploit FPGA resources.  In contrast,
there was a very interesting article in a recent EE Times by a fellow
from VAutomation doing virtual 6502's in VHDL, then synthesizing them
down into arbitrary FPGA architectures.  Although the 6502 design used
only about 4000 "ASIC gates" it didn't quite fit in a XC4010, a so-
called "10,000 gate" FPGA.  That a dual-issue 32-bit RISC should fit,
and a 4 MHz 6502 does not, states a great deal about VHDL synthesis
vs. manual placement, about legacy architectures vs. custom ones, and
maybe even something about CISC vs. RISC...

Well, that serves as kind of a brain dump of work (play) in progress.
Please drop me a line if you have questions, advice, etc.

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001