Altera Flex10KE CPUs


Subject: Re: Dual port, new Altera FLEX 10KE EABs
Date: 22 Mar 1998 00:00:00 GMT
Newsgroups: sci.electronics.components

Ray Andraka wrote in message <>...
>Contrasted with the memories in other FPGAs, such as the EAB in the
>altera, this is a better set-up IMHO.  ALtera EABs get 'dual porting' by
>cycle splitting...that means that you get half the memory bandwidth, and
>more importantly no simultaneous read/write at different locations.

Ah, but Altera just announced the new 10KE family.  Whereas the 10K(A) family
had single ported RAM via 256x8 EABs, the 10KE family apparently _will_ have
("first device shipments beginning in June") dual ported RAM (one write
addr/data port, one read addr/data port) via 256x16 EABs.

(Perhaps someone from Altera can explain why their table shows different speeds
for 16x32, 32x32, ..., 256x32 FIFOs.  Presumably they are using two 256x16
EABs in each case, so shouldn't they all be 150 MHz?)

Based on their marketing info, it appears 10KE family EABs will provide
twice the storage and four times the dual port bandwidth of the 10KA
family -- true dual porting plus twice as many bits wide per EAB.  Of course
this assumes there will be adequate interconnect resources to get all these
address and data bits to/from the EAB from/to the LABs.
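
As a rough check of the "twice the storage, four times the bandwidth"
arithmetic, here is a small Python sketch.  It assumes a cycle-split 10K EAB
moves one 8-bit word per cycle in total, while a 10KE EAB moves one 16-bit
read plus one 16-bit write per cycle.  The class and field names are
illustrative only, not Altera's.

```python
from dataclasses import dataclass


@dataclass
class Eab:
    """Simplified model of one embedded array block (names are illustrative)."""
    depth: int               # number of words
    width: int               # bits per word
    accesses_per_cycle: int  # 1 = single port, 2 = one read + one write port

    def bits(self) -> int:
        return self.depth * self.width

    def bits_per_cycle(self) -> int:
        return self.width * self.accesses_per_cycle


# Figures taken from the post; the access model is an assumption.
flex10k = Eab(depth=256, width=8, accesses_per_cycle=1)
flex10ke = Eab(depth=256, width=16, accesses_per_cycle=2)

storage_ratio = flex10ke.bits() / flex10k.bits()
bandwidth_ratio = flex10ke.bits_per_cycle() / flex10k.bits_per_cycle()
print(f"storage x{storage_ratio:g}, bandwidth x{bandwidth_ratio:g}")
```

The model ignores interconnect limits, which is exactly the caveat raised
above.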

With this development I may have to reconsider some things I wrote last
fall, in my comparison of the suitability of the Xilinx XC4000 and Altera
FLEX 10K architectures to implement RISC processor datapaths (attached
below).  I am delighted to see Altera has addressed my concerns regarding
the need for faster multiple port access and the need for x16 organization
of the EABs.  I can't help but wonder if my posting contributed to this
product announcement...

I still prefer an array chock full of distributed select-RAMs, over large
central EAB RAM blocks, but better still would be to have both.

"This is a great time to be us".  With the new 10KE family, the new ORCA 3C
family, and, sooner or later, the new Xilinx Virtex family, we have many
interesting projects ahead of us...

Jan Gray
attachment: old comparison of Altera vs Xilinx architectures for CPU

Subject: Re: FPGA based CPU ideas, and novel extensions => distributed RAM
and Altera CPUs
Date: 14 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

David Atkins wrote in message ...
>Any of these kicking around for Altera, if not for a good reason, ?
>Somehting of an interest but not in aposition to find the time for the
>money to get into, we use 10k10's at present and the techniques would be
>intersting, any pointer greatfully recieved.

(Disclaimer: I have studied but never used Altera devices.)

FPGA RISC CPUs, e.g. CPUs with adequate register files, can certainly be
implemented in the Altera FLEX 10K family, which has many nice features.

However, in my opinion, the Xilinx XC4000 architecture seems a better
platform (higher performance) for this application because of its
distributed RAM feature.  In particular, a simple RISC datapath benefits
from a 2-read, 1-write port register file.  In an XC4000, these can (in
theory) be built and run at up to about 10 ns/cycle using two banks of dual
port mode distributed RAM.  [tWCTS=9.0, 8.4, 7.7 ns in XC4000XL-3, -2, -1].
Of course to take advantage of this 66-100 MHz operation you need the deeply
pipelined even/odd ALUs I described in another recent posting.
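
To make the "two banks" idea concrete, here is a behavioral Python sketch of
a 2-read, 1-write register file.  It reflects one plausible reading of the
scheme: every write goes to both banks, and each bank serves one read port.
The class name, the method signature, and the read-before-write ordering
within a cycle are assumptions for illustration, not a description of the
actual XC4000 timing.

```python
class TwoBankRegFile:
    """2-read, 1-write register file modeled as two identical banks."""

    def __init__(self, nregs: int = 32):
        self.bank_a = [0] * nregs  # serves read port A
        self.bank_b = [0] * nregs  # serves read port B

    def cycle(self, ra: int, rb: int, wa=None, wd: int = 0):
        """One clock: read registers ra and rb, optionally write wd to wa."""
        # Assumption: reads return the contents from before this cycle's write.
        a = self.bank_a[ra]
        b = self.bank_b[rb]
        if wa is not None:
            # The single write port updates both banks so they stay identical.
            self.bank_a[wa] = wd & 0xFFFFFFFF
            self.bank_b[wa] = wd & 0xFFFFFFFF
        return a, b


rf = TwoBankRegFile()
rf.cycle(0, 0, wa=5, wd=123)     # write r5
rf.cycle(0, 0, wa=7, wd=456)     # write r7
print(rf.cycle(5, 7))            # both operands read in one cycle
```

The cost is twice the RAM; the benefit is that both operands and the
write-back fit in a single cycle.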

In contrast, in a FLEX 10K device, you would use EABs (the 256x8 embedded
RAM blocks).  A 32x32 2-read 1-write register file would then require 3
cycles using 4 EABs, or 2 cycles using 8 EABs (two copies of the register
file), at (in theory) 10+ ns/cycle.  [tEAWRCREG and tEARCREG=11.6, 9.5 ns in
EPF10K50V-4, -3].  (Perhaps an Altera expert will provide more correct and
up-to-date information.)  Of course, an accumulator or stack oriented
instruction set architecture (with TOS in a register) could reduce the
average number of EAB accesses per cycle.
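
The EAB and cycle counts above follow from simple arithmetic, sketched here
in Python.  The function assumes each copy of the register file is single
ported (one access per cycle) and built from 8-bit-wide EABs; the function
and parameter names are hypothetical, and the 9.5 ns figure is the
EPF10K50V-3 number quoted above.

```python
from math import ceil


def eab_regfile(width_bits: int, reads: int, writes: int,
                copies: int, eab_width: int = 8):
    """Return (EABs needed, cycles per instruction) for a register file.

    Assumes single-ported EABs: each copy performs one access per cycle.
    Reads are spread across the copies; every write must go to all copies,
    so each write costs one cycle regardless of the number of copies.
    """
    eabs = copies * ceil(width_bits / eab_width)
    cycles = ceil(reads / copies) + writes
    return eabs, cycles


T_CYCLE_NS = 9.5  # tEARCREG, EPF10K50V-3, as quoted in the post

for copies in (1, 2):
    eabs, cycles = eab_regfile(32, reads=2, writes=1, copies=copies)
    print(f"{copies} copy: {eabs} EABs, {cycles} cycles, "
          f"about {cycles * T_CYCLE_NS} ns per register-register op")
```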

EABs could certainly excel at building LARGE register files (e.g. for vector
registers or multiple thread contexts or register windows), on-chip RAM,
ROM, caches, TLBs, cache tag RAMs for off-chip caches, etc.  Indeed an AMD
29000 style variable sized register window implementation might avoid enough
memory traffic to outperform a simpler 32-register RISC with half the cycle
time.  Might not.

Alas, compared to distributed RAM, EABs are often too narrow (256x8 instead
of 128x16) and coarse.  Take a simple I-cache design.  A (256 byte) 16-entry
by 4-word line by 32-bit I-cache in an XC4000 is one column of 16 CLBs for a
16x24 cache tag RAM, one column for a tag comparator and other control
logic, and four columns for a 4x16x32 cache data RAM.  Total approximately
6x16 CLBs, 10% of an XC4025E, 3% of an XC4085XL.  A (512 byte) 2-way
set-associative, 32-entry cache would be about 200 CLBs, still a small
percentage of a large device.  Whereas the smallest such 32-bit cache you can
build from EABs is 4 EABs (both tags and data in the same EABs) with two-cycle
cache access.  4 EABs
is 33% of the EAB resources in a 10K100.
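
The I-cache numbers can be reproduced with a short Python sketch.  It assumes
32-bit byte addresses (which yields the 24-bit tag above) and device totals
of 1024 CLBs for the XC4025E, 3136 CLBs for the XC4085XL, and 12 EABs for the
10K100.  The CLB totals are assumptions from memory of the data sheets, not
stated in the post; the EAB total is implied by the 33% figure.

```python
def cache_geometry(addr_bits: int, entries: int, words_per_line: int,
                   word_bits: int = 32):
    """Return (size in bytes, tag width in bits) of a direct-mapped cache."""
    line_bytes = words_per_line * word_bits // 8
    offset_bits = (line_bytes - 1).bit_length()  # byte-within-line bits
    index_bits = (entries - 1).bit_length()      # line-select bits
    tag_bits = addr_bits - index_bits - offset_bits
    return entries * line_bytes, tag_bits


size_bytes, tag_bits = cache_geometry(addr_bits=32, entries=16,
                                      words_per_line=4)

# One column of 16 CLBs each for tags and control, four columns for data.
clbs = (1 + 1 + 4) * 16

# Device totals: assumptions, see the note above.
XC4025E_CLBS = 1024
XC4085XL_CLBS = 3136
EPF10K100_EABS = 12

print(size_bytes, "bytes,", tag_bits, "tag bits,", clbs, "CLBs")
print(f"{100 * clbs / XC4025E_CLBS:.1f}% of XC4025E")
print(f"{100 * clbs / XC4085XL_CLBS:.1f}% of XC4085XL")
print(f"{100 * 4 / EPF10K100_EABS:.0f}% of 10K100 EABs")
```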

Another feature the XC4000 has but the FLEX 10K lacks is TBUFs (3-state
drivers).  These are very handy for sharing one wide bus across the chip.  In
the old J32 design, the processor half of the XC4010 uses almost every
available TBUF to drive many different results onto the "result bus",
destined for write-back into the register file:
* adder/subtractor
* logic unit
* operand A << 1, << 2, << 4, >> 1, >> 2, >> 4
* data-in (byte, halfword, word)
* sign or zero extension of halfword/byte data-in for lbu/lbs/lhu/lhs
* next-PC (for jal (jump-and-link)) to save the next-PC into a register
* data-out during the first cycle of store instructions (not written back)

and the 32-bit on-chip data bus half of the XC4010 uses TBUFs for:
* various peripherals and boot ROM to return read data
* driving off-chip data-in onto the on-chip bus
* bus byte-lane shifting -- for instance for "lbu r1,3(r0)" (load byte
unsigned from address 3), we move data on mem.d[31:24] down to mem.d[7:0]
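
The byte-lane shift for byte loads can be sketched in Python as follows.  It
assumes little-endian lane numbering, consistent with the example above where
address 3 sits on mem.d[31:24].  The function name and signature are
illustrative, not taken from the J32 design.

```python
def load_byte(word: int, addr: int, signed: bool = False) -> int:
    """Model lbu/lbs: pick one byte lane of a 32-bit word, then extend it.

    Assumes little-endian lanes: address bits [1:0] == 3 selects d[31:24].
    """
    lane = addr & 3
    byte = (word >> (8 * lane)) & 0xFF  # shift the lane down to d[7:0]
    if signed and (byte & 0x80):
        byte |= 0xFFFFFF00              # sign-extend into bits 31:8
    return byte


bus = 0xAABBCCDD
print(hex(load_byte(bus, 3)))               # lbu r1,3(r0)
print(hex(load_byte(bus, 3, signed=True)))  # lbs r1,3(r0)
```

In the TBUF version each lane-to-lane move is a set of 3-state drivers onto
the shared bus, rather than a multiplexer.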

On the other hand, even the 10K10 provides an astonishing 3x144 FastTrack
row channels, so it seems straightforward to deliver even eight or ten
possible 32-bit results to multiplexors implemented in LABs.

Assuming each EAB/row is responsible for 8 bits of the processor, a 10K10
might implement a splendid 16- or 24-bit RISC.  Furthermore you can always
implement a 32-bit processor with an 8- or 16-bit datapath, if you perform
several execute cycles per instruction.
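
As an illustration of several execute cycles per instruction, here is a
Python sketch of a 32-bit add performed on a 16-bit datapath in two passes,
with the carry saved between them.  The helper names are hypothetical.

```python
MASK16 = 0xFFFF


def add16(a: int, b: int, carry_in: int):
    """One pass through a 16-bit adder: returns (16-bit sum, carry out)."""
    total = a + b + carry_in
    return total & MASK16, total >> 16


def add32_on_16bit_datapath(a: int, b: int) -> int:
    """32-bit add as two execute cycles on a 16-bit ALU."""
    # Cycle 1: low halfwords, carry in = 0; the carry out is saved.
    lo, carry = add16(a & MASK16, b & MASK16, 0)
    # Cycle 2: high halfwords plus the saved carry; the final carry is dropped.
    hi, _ = add16((a >> 16) & MASK16, (b >> 16) & MASK16, carry)
    return (hi << 16) | lo


print(hex(add32_on_16bit_datapath(0x0000FFFF, 0x00000001)))
```

The same pattern would apply to an 8-bit datapath with four passes: a
narrower datapath in exchange for more cycles per instruction.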

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001