FPGA SoC On-Chip Buses


Subject: Re: Microcomputer buses for use inside FPGA/ASIC devices?
Newsgroups: comp.arch.fpga
Date: Sat, 24 Jul 1999 21:36:20 -0700

Wade D. Peterson wrote in message <7ndpnl$pcu$1@news1.tc.umn.edu>...
>I'm working on a project where we're doing a microcomputer bus (kind of
>VMEbus or PCIbus) for use *INSIDE* of FPGAs and ASICs.  It's for hooking
>system-on-chip (SOC) components together.  If anyone has done this before,
>know of any references to this kind of project, I'd like to hear about it.

>If anybody knows of similar technology, I'd like to hear about it.  If there are
>more, then my intention is to start a FAQ database on our website for all

My 1995 J32 system had a 32-bit on-chip peripheral bus.  The left 60% of the
XC4010 was a 32-bit RISC processor, using a 32-bit long line bus to
multiplex amongst the various execution stage results (including add/sub,
logic, 1-, 2-, 4-bit shifts left and right, load data, sign extension data,
return address).  This used approximately 16x11=176 TBUFs.
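A minimal software sketch of the idea behind such a result bus, assuming each source drives the shared line through its own 3-state buffer and the control logic enables at most one source per cycle. The names Driver and resolve are hypothetical and are not part of CNets or of the J32 design.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// One 3-state driver on a shared line: a value plus an output enable.
struct Driver {
    std::uint32_t value;
    bool enabled;
};

// Resolve the shared bus for one cycle.
// Returns the driven value, or nullopt if no driver is enabled (bus floats).
// Throws if two or more drivers are enabled at once (bus contention).
std::optional<std::uint32_t> resolve(const std::vector<Driver>& drivers) {
    std::optional<std::uint32_t> result;
    for (const Driver& d : drivers) {
        if (!d.enabled) continue;
        if (result) throw std::logic_error("bus contention");
        result = d.value;
    }
    return result;
}
```

With eleven result sources, the control logic would enable exactly one Driver per cycle; a model like this makes a decode error show up as an exception instead of as a silently wrong result.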

The right half of the XC4010 was a 32-bit long line peripheral bus.  It had
4 byte-wide lanes.  The processor was byte addressable with byte, 16-bit
halfword, and 32-bit word data types.

Call the processor result bus P[31:0], the peripheral data bus D[31:0], and
the external RAM data bus XD[31:0].  I used these sets of TBUFs (approx.
144 TBUFs + 32 OBUFTs):

* store byte, halfword, word:
D[7:0] <- P[7:0],
D[15:8] <- P[15:8],
D[31:16] <- P[31:16]

* load byte, halfword, word:
P[7:0] <- D[7:0],
P[15:8] <- D[15:8],
P[31:16] <- D[31:16]

* store various byte lanes to external RAM (OBUFTs)
XD[7:0] <- D[7:0]
XD[15:8] <- D[15:8]
XD[23:16] <- D[23:16]
XD[31:24] <- D[31:24]

* load various byte lanes from external RAM
D[7:0] <- XD[7:0]
D[15:8] <- XD[15:8]
D[23:16] <- XD[23:16]
D[31:24] <- XD[31:24]

* copy bytes/halfwords to upper byte lanes
D[15:8] <- D[7:0]
D[23:16] <- D[7:0]
D[31:24] <- D[15:8]

* copy bytes from upper byte lanes
D[7:0] <- D[15:8]
D[7:0] <- D[23:16]
D[15:8] <- D[31:24]
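
The lane copies above can be modeled in ordinary C++. This is a sketch under stated assumptions: little-endian lane numbering (byte address i maps to lane i), naturally aligned halfword and word accesses, and zero-extended loads. The names Size, storeAlign, and loadAlign are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

enum class Size { Byte, Half, Word };

// Store: place the low bits of P on the byte lane(s) chosen by addr[1:0].
// Lanes that are not selected hold don't-care values.
std::uint32_t storeAlign(std::uint32_t p, Size size, unsigned addr) {
    const unsigned a = addr & 3;
    const bool byte = size == Size::Byte, half = size == Size::Half;
    std::uint8_t lane[4];
    for (int i = 0; i < 4; i++) lane[i] = (p >> (8 * i)) & 0xFF;  // D <- P
    if (byte && (a & 1))                     lane[1] = lane[0];   // D[15:8]  <- D[7:0]
    if ((byte || half) && a == 2)            lane[2] = lane[0];   // D[23:16] <- D[7:0]
    if ((byte && a == 3) || (half && (a & 2))) lane[3] = lane[1]; // D[31:24] <- D[15:8]
    return lane[0] | (lane[1] << 8) | (lane[2] << 16)
         | (static_cast<std::uint32_t>(lane[3]) << 24);
}

// Load: bring the addressed byte or halfword down to the low lanes of D.
std::uint32_t loadAlign(std::uint32_t d, Size size, unsigned addr) {
    const unsigned a = addr & 3;
    const bool byte = size == Size::Byte, half = size == Size::Half;
    std::uint8_t lane[4];
    for (int i = 0; i < 4; i++) lane[i] = (d >> (8 * i)) & 0xFF;
    if ((byte && a == 3) || (half && (a & 2))) lane[1] = lane[3]; // D[15:8] <- D[31:24]
    if (byte && (a & 1))                     lane[0] = lane[1];   // D[7:0]  <- D[15:8]
    if ((byte || half) && a == 2)            lane[0] = lane[2];   // D[7:0]  <- D[23:16]
    if (byte) return lane[0];
    if (half) return lane[0] | (lane[1] << 8);
    return d;
}
```

Note that a byte at address 3 takes two hops in each direction (lane 0 to lane 1 to lane 3 on a store, lane 3 to lane 1 to lane 0 on a load), which is why three copy paths per direction are enough.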

In case you are interested, here is some of the source code which generated
this.  It is my own "CNets HDL", a C++ class library for emitting XNF.  ff()
is a flip-flop, tbuf() is a tbuf.  Note the use of tlocs (LOCs for TBUFs).

void Mem::emit(Control& c) {
  net(zad24n) = adn(23,20) == 0U;
  net(zad20n) = adn(19,16) == 0U;
  ff(selROM, zad24n & zad20n, c.marce, _, init(1));
  ff(selRAM, ~adn[23] & ~(zad24n & zad20n), c.marce);

  ackROM = start & selROM;
  ack = ackROM | ackRAM | ackUART;

  for (unsigned i = 0; i < 4; i++)
    bytesel[i] = (byte & ad(1,0) == i) | (half & ad(1,1) == (i>>1)) | word;

  // processor to internal dbus interface
  ff(doutbytet, ~write, start, _, init(1));
  ff(douthalft, ~(write & (byte|half)), start, _, init(1));
  ff(doutwordt, ~(write & (byte|half|word)), start, _, init(1));

  // dbus internal/external interface:
  // emit 3state drivers to copy external dbus to/from internal dbus
  bus(dbusin, cbit);
  bus(dpads, cbit);
  for (i = 0; i < cbit; i++) {
    iopad(dpads[i], ploc(dpadlocs[i]));
    ibuf(dbusin[i], dpads[i]);
    unsigned t = 1 + even(i);
    tbuf(xd[i], dbusin[i], dinbyteextt[i / 8]);
    obuft(dpads[i], xd[i], doutextt);
  }

  // byte/halfword load/store alignment logic
  ff(b1b0t, ~( write & byte & ad[0]),                     start, _, init(1));
  ff(b2b0t, ~( write & (byte|half) & ad(1,0) == 2),       start, _, init(1));
  ff(b3b1t, ~( write & ((byte&(ad(1,0)==3))|(half&ad[1]))), start, _, init(1));
  ff(b0b1t, ~(~write & byte & ad[0]),                     start, _, init(1));
  ff(b0b2t, ~(~write & (byte|half) & ad(1,0) == 2),       start, _, init(1));
  ff(b1b3t, ~(~write & ((byte&(ad(1,0)==3))|(half&ad[1]))), start, _, init(1));
  for (i = 0; i < 8; i++) {
    unsigned t = 1 + even(i);
    tbuf(xd[i+ 8], xd[i   ], b1b0t, tloc(rowForBit(i+ 8),20,t));
    tbuf(xd[i+16], xd[i   ], b2b0t, tloc(rowForBit(i+16),20,t));
    tbuf(xd[i+24], xd[i+ 8], b3b1t, tloc(rowForBit(i+24),19,t));
    tbuf(xd[i   ], xd[i+ 8], b0b1t, tloc(rowForBit(i   ),19,t));
    tbuf(xd[i+ 8], xd[i+24], b1b3t, tloc(rowForBit(i+ 8),18,t));
    tbuf(xd[i   ], xd[i+16], b0b2t, tloc(rowForBit(i   ),17,t));
  }
}
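
For readers unfamiliar with the notation, here is a plain C++ sketch of what the address decode and the bytesel loop in the excerpt appear to compute. It assumes that ad(hi,lo) extracts an inclusive bit field and that adn is a 24-bit address; the names AccessSize, Region, decode, and byteSelects are hypothetical, and the excerpt does not show what is selected when address bit 23 is set.

```cpp
#include <cassert>
#include <cstdint>

enum class AccessSize { Byte, Half, Word };
enum class Region { Rom, Ram, Other };

// Mirrors selROM / selRAM: ROM when address bits 23:16 are all zero,
// RAM when bit 23 is clear and ROM is not selected.
Region decode(std::uint32_t addr) {
    const bool zad24 = ((addr >> 20) & 0xF) == 0;  // bits 23:20 all zero
    const bool zad20 = ((addr >> 16) & 0xF) == 0;  // bits 19:16 all zero
    const bool selRom = zad24 && zad20;
    const bool selRam = !((addr >> 23) & 1) && !selRom;
    return selRom ? Region::Rom : selRam ? Region::Ram : Region::Other;
}

// Mirrors the bytesel loop: bit i of the result is set when byte lane i
// takes part in the access.
unsigned byteSelects(AccessSize size, unsigned addr) {
    unsigned mask = 0;
    for (unsigned i = 0; i < 4; i++) {
        const bool sel =
            (size == AccessSize::Byte && (addr & 3) == i) ||
            (size == AccessSize::Half && ((addr >> 1) & 1) == (i >> 1)) ||
            size == AccessSize::Word;
        if (sel) mask |= 1u << i;
    }
    return mask;
}
```

For example, a halfword access at byte offset 2 selects lanes 2 and 3, and a word access selects all four lanes.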

The on-chip peripherals were a UART and on-chip RAM and ROM, enough to boot
and print a "hello world" message.  There was also an integrated DRAM

You can see a floorplan of this at

Old articles which touched on this subject:

Recently I designed another on-chip bus with particular
CPU-to-bus-controller and bus-controller-to-peripheral interfaces.  Please
write me for more information.

Jan Gray

Subject: Re: Microcomputer buses for use inside FPGA/ASIC devices?
Newsgroups: comp.arch.fpga
Date: Sat, 24 Jul 1999 21:48:41 -0700

I wrote:
>...The left 60% of the XC4010 was a 32-bit RISC processor.
>...This used approximately 16x11=176 TBUFs.

Sigh.  Rather, 32x11 = 352 TBUFs.

Jan Gray

Subject: Re: Microcomputer buses for use inside FPGA/ASIC devices?
Newsgroups: comp.arch.fpga
Date: Mon, 26 Jul 1999 21:34:30 -0700

Wade D. Peterson wrote in message <7nf1rv$5r$1@news1.tc.umn.edu>...
>1) When you say "on-chip peripheral bus" is this your terminology, or are
>referring to a so-called 'OPB' bus that I'm seeing on some cores?  For example, I
>believe that ARM processors use something called an 'OPB' bus.

My terminology, just a descriptive phrase.  (It hosted on-chip memory
elements and peripheral elements and interfaced to off-chip memory.)

>2) Do you think your peripheral bus is portable across multiple FPGA
>architectures, or is it limited to Xilinx?

It is port-able, but not especially so; portability was not a design goal.

1. design tool: the CNets C++ class library would need to be retargeted.
Easy for Orca or Virtex, somewhat less so for other families.

2. implementation: used generic logic expressions and flip-flops, but there
were lots of 3-state buffers, and the design was optimized using LOC
constraints that would not apply to a non-XC4000.

3. interfaces (signaling): would work unchanged across architectures.

(I do not propose the J32 bus for any purpose.  I thought it might be of
historical interest.)

>> Old articles which touched on this subject:
>I tried these links, but they appear to be dead.

Try again!

>> Recently I designed another on-chip bus with particular
>> CPU-to-bus-controller and bus-controller-to-peripheral interfaces. ...
>Do you have anything written up on these.

Sorry, the docs are not yet ready for publication.  But I think some of the
design space issues are:

* zero, one, or more processors? on-chip or off-chip processor? :-)
* clocking -- do CPU clocks equal bus clocks?  1-1? 2-1? 1-2?
* processor has one memory port or two (Harvard)?
* one bus (share processor result bus with on-chip data bus) or two?
* any access to processor resources (e.g. reg file ports)?
* byte addressing? byte/halfword/word types?  byte-lane shifting?
* is the on-chip bus connected to an off-chip I/O or memory bus?  same
width?  same clock discipline?
* wait state insertion?
* multi-master? arbitration?
* interrupt requests?
* DMA requests?
* pipelined bus transactions?
* split transactions?

In my current work-in-progress, the bus is: 1-1 with on-chip CPU's clock,
Harvard, one bus, byte addressable, byte/16-bit-word data types, attached to
a double-cycled external data bus, with arbitrary wait-states, interrupts,
DMA, and pipelined bus transactions.
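
One way to make the design space above concrete is to write it down as a configuration record. The sketch below is illustrative only: the type and field names are hypothetical, dimensions the post does not state are left unset, and the lane count assumes the bus is as wide as the widest data type.

```cpp
#include <cassert>
#include <optional>

enum class ClockRatio { OneToOne, TwoToOne, OneToTwo };  // CPU clocks : bus clocks

struct BusConfig {
    ClockRatio cpuToBusClock;
    bool harvard;                  // processor has two memory ports
    bool sharedResultBus;          // "one bus": result bus doubles as on-chip data bus
    bool byteAddressable;
    unsigned widestDataBits;       // widest supported data type
    bool doubleCycledExternalBus;  // recorded as stated, not interpreted here
    bool waitStates, interrupts, dma, pipelined;
    std::optional<bool> multiMaster;        // not stated in the post
    std::optional<bool> splitTransactions;  // not stated in the post
};

// Assumption: the data bus is exactly as wide as the widest data type.
inline unsigned byteLanes(const BusConfig& c) { return c.widestDataBits / 8; }

// The work-in-progress bus as described in the paragraph above.
inline BusConfig workInProgressBus() {
    BusConfig c{};
    c.cpuToBusClock = ClockRatio::OneToOne;
    c.harvard = true;
    c.sharedResultBus = true;
    c.byteAddressable = true;
    c.widestDataBits = 16;
    c.doubleCycledExternalBus = true;
    c.waitStates = c.interrupts = c.dma = c.pipelined = true;
    return c;  // multiMaster and splitTransactions stay unset
}
```

A generator parameterized on a record like this could emit only the logic that a given configuration needs, at the cost of a larger space of configurations to validate.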

Other comments.

FPGA Device Architects: this on-chip bus stuff is so much easier if you
follow the XC4000 lead and provide the abstraction of long, wide,
partitionable buses with *abundant* 3-state drivers -- one per logic cell is
good.  The bus control itself can be built in programmable logic.

Finally, in designing an on-chip bus with an eye on standardization, note
some interesting design tensions:

1. malleable or fixed bus topologies and clocking disciplines? -- why not
take advantage of FPGA flexibility and define a general bus architecture
space, making allowance for one or more 8-, 16-, 32-, even arbitrary k-bit
buses, and other dimensions of the design space I described above?  Then
customers can specialize designs to suit.  -- Oops, that adds complexity and
makes validation much harder.

2. lightweight or heavyweight?  My current bus has a control overhead of ~2
CLBs per peripheral.  At the opposite extreme, imagine an on-chip PCI bus.
The latter would offer many features, like configuration registers, but
these would be of little value in a cheap SOC in an XCS10XL or 20XL.

I can't wait to see an on-chip bus standard (or standards) for FPGAs -- then
we might finally see a marketplace of plug-and-play processor and
peripheral cores.

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001