Environment extensibility and automatic services for component applications using contexts, policies, and activators.
Filed August 17, 1998, granted August 27, 2002.
"An object system provides composable object execution environment
extensions with an object model that defines a framework with contexts,
policies, policy makers and activators that act as object creation-time,
reference creation-time and call-time event sinks to provide processing of
effects specific to the environment extensions. At object creation time,
an object instantiation service of the object system delegates to the
activators to establish a context in which the object is created. The
context contains context properties that represent particular ones of the
composable environment extensions in which the object is to execute. The
context properties also can act as policy makers that contribute
policies to an optimized policy set for references that cross context
boundaries. The policies in such optimized sets are issued policy events
on calls across the context boundary to process effects of switching
between the environment extensions of the two contexts."
My last project at Microsoft. It's the extensibility architecture
that we added to COM
to make it extensible enough to host automatic COM+ Services
such as MTS (Transaction Server).
This first shipped in Windows 2000.
I am very pleased that a second generation, more performant
version of the Contexts architecture is an important part of the .NET Framework architecture
(no thanks to me).
Mike Woodring's Context slides from last summer's
Dharma Shukla, Simon Fell, and Chris Sells, in MSDN Magazine:
Aspect-Oriented Programming Enables Better Code Encapsulation and Reuse.
Joel on Software: Platforms.
"It's really, really important to figure out if your product is a platform
or not, because platforms need to be marketed in a very different way to
be successful. That's because a platform needs to appeal to developers
first and foremost, not end users."
Exercise: can you reconcile Spolsky's Platform with the term Platform FPGA?
In the latter case, who are developers and who are end users?
Smarter than you
I used to give technical interviews to candidates at Microsoft,
one or two each week. One of the standard "clever insight"
interview questions was:
"How would you detect a cycle in a singly linked list?"
So, I used to ask this one, and kept pushing -- "Per-node mark bits? Good
answer, but can you do without 'em?" -- exploring one solution after
another, until the solution space had been pruned down to solutions with just
O(n) time and O(1) storage. There are several approaches which require only
a couple of words of state (total).
Sometimes you interview people who can't think of a single workable
algorithm, even without any constraints on the solution. And then,
sometimes, you meet someone whose mind navigates mysterious higher planes of
thought, someone an order of magnitude smarter than you or anyone
else you know.
Once a colleague, whose initials are G.Y., heard this problem and said
immediately "Ah yes, that's almost the same problem as finding the cycle
length of a pseudo random number generator whose next value depends only
upon its last value. There are two algorithms in exercises in Knuth." He
pulls Seminumerical Algorithms (IIRC) from my shelf and turns straight to
the relevant page. "See?" And there they were, isomorphic to my two
preferred solutions. Damn!
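For the record, the best-known of those couple-of-words-of-state solutions is Floyd's tortoise-and-hare (one of the two Knuth exercises, isomorphic to the PRNG cycle-finding problem). A minimal sketch in Python; the `Node` class is just scaffolding for illustration:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def has_cycle(head):
    """Floyd's tortoise-and-hare: O(n) time, O(1) storage.

    The slow pointer advances one node per step, the fast pointer two.
    If the list is cyclic, the fast pointer eventually laps the slow
    pointer and they meet; if not, the fast pointer falls off the end.
    """
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False
```

Total extra state: two pointers, regardless of list length, and no per-node mark bits.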
Or as I prefer to think of it, five point oh.
As you read this piece, consider:
Those that can, do.
Xilinx announces World's Fastest Software for Programmable Logic and System Design: ISE Version 5.1i.
Those that can't, write weblog entries about what those that could, did.
Michael Santarini, EE Times:
Xilinx overhauls FPGA software design package.
(I haven't yet received my copy, so these comments are based only
upon the press release and related materials.)
These announcements do show that Xilinx is working hard to make
it easier and faster to target their devices.
There does seem to be some there there, and many new
features and tools to master/remaster.
Besides the usual 2X speedup and quality of results claims
some of the more interesting new features
include Macro Builder,
Architectural Wizards, and
support for partial reconfigurability.
I also noted some things that weren't there.
Most notable was the absence of any description of functionality
that rivals Altera's SOPC Builder.
A similar tool would seem to be the key enabler to help designers
get started exploiting PowerPC/MicroBlaze/CoreConnect-based SoC design.
Maybe it's in ISE 5.1i but not hyped in the marketing materials.
(Also missing: I wonder when and if Xilinx will productize the technologies they
purchased in the LavaLogic acquisition.)
"Throughout the remainder of 2002, Xilinx will roll out a series of
embedded and system-level ISE 5.1i family and EDA partner design
tools that enable on-demand architectural synthesis and flexible ..."
I assume that means these tools are not in this release.
Oh well, there's not that much left of 2002...
On the subject of architectural synthesis, this
press release is worth a read and a ponder.
Maybe I'm comparing apples and oranges, but this Xilinx vision
seems to be broader, yet less focused on SoC design capture,
than the Altera SOPC Builder reality.
"Macro Builder: ... making it possible for customers and IP vendors to capture physical implementations of internally-developed IP; preserve placement information for timing-critical blocks; and ensure repeatable high-performance in future designs. Changes in other parts of a design do not affect macro performance, further supporting change without risk."
Is this RPMs for designers too lazy to figure out RLOCs?
Or too busy to use them?
RLOCs give you full control of placement, plus repeatable results.
With RPMs, you control the initial placement. There is no
pushing on a rope. What you want is what you get.
LUTs and DFFs in your datapaths go where you put them, period.
On the other hand, particularly for control logic,
adding placement constraint attributes is slow and tedious;
if Macro Builder can make it easy to build repeatably
floorplanned control units, that would be a major advance.
(As I have noted before,
RPMs can "halve critical path net delays, and,
most importantly, make the delays predictable so you have terra firma
upon which to make methodical implementation decisions".)
The current floorplanner allows manual control of placement,
but when you change your source code, too often the synthesizer
scrambles up all your synthesized net and instance names and
the floorplanning process must be repeated.
I wonder how Macro Builder helps address this issue, which always
seemed to me to be an unnecessary synthesis induced problem.
There is no reason synthesis tools cannot fabricate synthesized
names, in a deterministic way, based upon the topology of the circuit
nearby. Certain synthesized instances, such as slices of registers,
can be given repeatable names derived from identifiers in the source code.
(The 7th bit of register pc is of course pc<7>.)
And a gate between register X and register PC could be named
$lut_X_PC<7> or $lut_hash("X_PC<7>"),
instead of $lut1234.
Under this regime, no matter what changes you make to other parts of
your design, this particular synthesized LUT is going to be repeatably named.
(Well, if it were my problem to solve, first I'd lobby the synthesis
tools vendors to synthesize to repeatable, canonicalized names.
I strongly suggested doing exactly this to Xilinx two years ago.
This helps with incremental design respins too.
Perhaps this has happened.
In lieu of that, I'd start applying some graph and subgraph
isomorphism algorithms. Perhaps the problem can be recast as a simple
subtree isomorphism, so good old linear-time CSE value-numbering suffices.)
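To make the repeatable-naming idea concrete, here is a toy sketch of my own (not any vendor's actual scheme): derive a synthesized instance's name from the identifiers of the registers it connects, falling back to a hash of that topological key when the name would be unwieldy. The naming convention and length cutoff are illustrative assumptions.

```python
import hashlib

def lut_instance_name(fanin_regs, dest_reg, bit, max_len=24):
    """Name a synthesized LUT from the circuit topology around it,
    not from a global instance counter, so the name survives edits
    to unrelated parts of the design.

    fanin_regs: source-register identifiers feeding the LUT
    dest_reg, bit: the register bit the LUT drives
    """
    key = "_".join(sorted(fanin_regs)) + f"_{dest_reg}<{bit}>"
    if len(key) <= max_len:
        return f"$lut_{key}"
    # long fanin lists: hash the key -- still deterministic
    return "$lut_" + hashlib.sha1(key.encode()).hexdigest()[:8]
```

The bit of glue between register X and bit 7 of PC gets the name `$lut_X_PC<7>` on every respin, no matter what changed elsewhere in the design.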
RLOCs or no, fundamentally, there is a skill set that tools like Macro Builder
cannot automatically apply. And that is
The Art of High Performance FPGA Design.
Elsewhere, in its description of PACE (Pinout & Area Constraint Editor),
Xilinx says it offers "design rule-driven floorplanning".
I wonder what that is.
Nothing in the Macro Builder literature mentions RPMs. Oh dear.
Please, Xilinx, please assure us that there is clean, smooth integration
and composability between RLOC-based RPMs and this new Macro Builder feature.
"For instance, the Digital Clock Managers (DCM) wizard and RocketIO
multi-gigabit transceivers (MGT) wizards let the user graphically
set DCM and MGT functions through dialog boxes available in the ISE
Project Navigator. ISE then writes editable source code directly into
the HDL source file to set and control these advanced capabilities. The
Architecture Wizards enable correct-by-construction HDL code, alleviating
the need to learn all of the programming attributes required to configure
these powerful flexible device features, thereby speeding the design
cycle. In a related announcement, Xilinx and Cadence delivered a kit
for designing with MGTs in Virtex-II Pro devices."
Earlier this year, on Virtex-II Pro,
"Xilinx and its partners are going to be challenged to tie a neat bow
around these technologies, using reference designs, using complexity
busting "easy IP", using tools like their forthcoming System Generator
for PowerPC, so that engineers new to SoC design, embedded processors, or
high speed interconnects, can successfully apply all this great silicon."
This seems to be a reasonable step in this direction. Of course, while
making it easy to author the HDL that correctly sets attribute bits
and wires the darn things up is very helpful, it is only one
small piece of the puzzle of using these advanced IP blocks.
"In a related announcement, Xilinx and Cadence delivered a kit
for designing with MGTs in Virtex-II Pro devices."
Perhaps the more significant announcement. Kits and reference designs
(including software), that's the ticket.
Partial reconfigurability FAQ.
Interesting, exotic, obscure, off-the-mainstream.
"Design communication between modules on TBUF bus macros".
TBUFs, the Rodney Dangerfield of Xilinx device primitives.
Maybe there's some hope for them after all.
And now for an essay. Ironically, I haven't been inspired to do
any new FPGA design work for several months now.
The Art of High Performance FPGA Design
The trick to getting best area and performance out of FPGA designs is to
not lose 50% quality of results here and there (and there again). In a
nutshell: first you have to acquire The Knowledge
so that you intuitively come to evaluate area and delay costs in terms
of FPGA primitives. Then you have to apply it through a set of
Best Practices (if you will forgive
the tired cliche).
If you want to be a cab driver in London, you first must earn The Knowledge.
Students study for many months to memorize the thousands of little streets
in London and learn the best routes from place to place. And they go out
every day on scooters to scout around and validate their book learning.
Similarly, if you want to be a great FPGA-optimized core designer, you
have to acquire The (Device) Knowledge. You have to know what the LUTs,
registers, slices, CLBs, block RAMs, DLLs, etc. can and can't do. You
have to learn exactly how much local, intermediate, and long routing
is available per bit height of the logic in your datapath and how wide
the input and output buses to the block RAMs are. You have to learn
about carry chain tricks, clock inversions, GSR nets, "bonus" CLB
and routing resources, TBUFs, and so forth.
You also need to know the limitations of the tools. What device features
PAR can and can't utilize. How to make PAR obey your placement and timing
constraints, and what things it can't handle. And how to "push on the
rope" of your synthesis tools to make them emit what you already know
the fabric can do.
The Knowledge isn't in any book, alas. Yes, you can read the 'street
maps', e.g. the datasheets and app notes, but that only goes so far. You
have to get out on your 'scooter' and explore, e.g. crank up your tools
and design some test circuits, and then open up the timing analyzer and
the FPGA editor and pore over what came out, what the latencies (logic
and routing) tend to be, etc.
A slow FPGA design is usually one with either too many logic levels
and/or a bad placement that runs nets (slow programmable interconnect)
back and forth across half the chip. Too often the designer writes the HDL
without even knowing ahead of time what the FPGA result is going to be.
In contrast, a fast, compact FPGA design is one where the final FPGA
implementation is in mind from the start; where the eventual area and
cycle time results are no surprise. Here the designer knows what he
or she wants and what the fabric is capable of; and their task at hand
is simply coding the design so that the tools emit the desired set of
primitives and constraints.
Thoughtful technology mapping is crucial. When you design your
architecture, or when you code your HDL, you must always be conscious of
how your design will map into FPGA primitives. (In a custom VLSI design
you think in gate delay levels, so it is in FPGAs with LUT levels.) Then
you may have to run stripped down test designs through the synthesis and
place-and-route tools to double-check your technology mapping assumptions
are valid. Sometimes you have to spend an hour figuring out how to stand
on your head (push on a rope) to make the tools emit what you know the
FPGA fabric is capable of. Even a detail as small as how the global
sync or async reset net is coded (in terms of Verilog always @() idioms)
can make a significant difference. Sometimes you completely override your
synthesis tool and provide an explicit LUT-at-a-time technology mapping
to achieve the crucial trick that saves you a whole column of LUTs.
Floorplanning is an essential tool for managing interconnect
delays. Xilinx: Use RLOC relative-placement-constraints to hierarchically
build up hard macro blocks of programmable logic that are constrained
to a fixed relative placement. This is called an RPM (relatively placed
macro). RPMs provide predictable and repeatable performance. This is
important to your customers, but is mandatory to make delays consistently
repeatable, providing terra firma upon which to make methodical iterative
implementation optimization decisions. Without floorplanning, you're just
playing whack a mole on the critical path.
As I like to design in Verilog, and as Verilog lacks GENERATE statements
for parameterized macros, of late I usually write Python programs that
emit parametric RLOC-annotated structural Verilog, which (usually)
passes through the synthesis tool unmolested; these constraint-bearing
primitive instantiations are then properly respected by the mapping,
placement, and routing phases.
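A minimal sketch of the sort of generator I mean (the FD primitive instantiation style and the synthesis-attribute form of the RLOC constraint are illustrative; a real generator must match your particular synthesis tool's conventions):

```python
def emit_rloc_reg(name, width):
    """Emit structural Verilog for a width-bit register as explicit FD
    primitives, each annotated with an RLOC constraint packing two bits
    per slice row, so the register always maps to the same column of
    slices regardless of changes elsewhere in the design."""
    lines = [f"module {name} (input clk, input [{width-1}:0] d,"
             f" output [{width-1}:0] q);"]
    for i in range(width):
        # synthesis-attribute comment form of the RLOC constraint
        lines.append(f'  // synthesis attribute RLOC of {name}_fd{i}'
                     f' is "R{i // 2}C0.S0"')
        lines.append(f"  FD {name}_fd{i} (.C(clk), .D(d[{i}]), .Q(q[{i}]));")
    lines.append("endmodule")
    return "\n".join(lines)
```

Running `emit_rloc_reg("pc_reg", 16)` yields a 16-bit register module of sixteen constraint-bearing FD instances, ready to instantiate from ordinary RTL.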
You must also pore over the interactive timing analyzer reports to
study where the critical paths are. Perhaps a blob of logic needs to be
rewritten. Perhaps some retiming is in order. Perhaps the RPM floorplan
must be rearranged to shorten some interconnect delay. Perhaps some
duplication of high fanout control registers will shave a half ns from a
cycle time. It is always necessary to add time constraints (cycle time
budgets) to the design, declaring cycle time, disclaiming false paths,
multicycle paths, and so forth, and to refine these constraints as the
design tunes up.
Above all is iteration, iteration, iteration. While an expert can land in
the vicinity of an optimal design in short order, there is no substitute
for the grunt work of running the design (or experimental subsets thereof)
through the tools over and over and over again (a hundred times) while
you make little tweaks, redesign parts of datapaths, constrain and
further constrain mapping, placement, and timing.
If you have the opportunity, it also helps to iteratively evaluate and
modify architecture and implementation together. Sometimes small changes
to the architecture can save both area and time in the implementation.
Lattice's new ispXP FPGA Line
A new commercial SRAM-based FPGA architecture is about as rare as
a total solar eclipse.
The two big companies hold such commanding positions and mindshare,
and patent portfolios, that new entrants to this
arena are few and far between.
But last month Lattice boldly jumped into the fray with the launch of
their ispXPGA family. (Recall
it was not so long ago that Lattice also acquired the ORCA 2, 3, and 4 families
from Agere/Lucent (nee AT&T Microelectronics).)
Lattice Semiconductor Introduces World's First Infinitely Reconfigurable Instant-On FPGA.
(I must note that Lattice apparently won't let you access
their data sheets and related literature online unless you register for a
Lattice Web Account.
What a clever way to drive potential customers away!
I suppose someone at Lattice decided it is better to capture the identities
of a determined few than to disseminate crucial information
on their newest products to as wide an (anonymous) engineering audience as possible.
Therefore, dear reader, in this instance, I have refrained from
"deep linking" to Lattice data sheets, white papers, app notes, and so forth.
If you want to know more, you will have to register for your
very own Lattice Web Account. Not that interested? I don't blame you.)
There you will find links to the data sheet and other literature.
The data sheet references a number of tech notes (TNxxx) but I was
unable to locate some of those.
Having reviewed the data sheet, the new ispXPGA looks to be a competent
4-LUT FPGA architecture. With 4 4-LUTs per PFU (programmable
function unit, think CLB), and some embedded RAM blocks, it is most
reminiscent of the Xilinx Virtex (not Virtex-II) family. Yet it also
echoes certain features of the Altera 10K family (PLLs).
Of course, it offers some unique new capabilities. Most notably, each
ispXPGA device has both SRAM-based active configuration memory, and
EEPROM non-volatile configuration memory. On power-up,
the device loads its SRAM configuration memory from its EEPROM.
This has several advantages. It is simpler. It uses less board real estate.
It is more secure (there is no off-chip configuration bitstream download
to capture). You can pre-program devices at your facility before
mounting them on PCBs or selling them to your customers.
You can of course still download a configuration to the SRAM.
You can also download a configuration to EEPROM without disturbing the current
configuration in SRAM. Loading a configuration from EEPROM is very fast,
and this is probably the first SRAM FPGA you can categorize as "instant on".
So in some ways it's rather like a 2-context FPGA, with one of the contexts held in non-volatile memory.
Perhaps someone could help me reconcile the marketing statement
"infinitely reconfigurable ... FPGA", and the device data sheet that
guarantees the EEPROM for a minimum of only 1000 programming cycles.
ispXPGAs will be available in four sizes, from the 1936 LUT (and 92 Kb BRAM)
ispXPGA125, to the 15376 LUT (414 Kb) ispXPGA1200.
The former is roughly comparable in size to the 1536 LUT (64 Kb) XCV50E,
the latter to the 13824 LUT (288 Kb) XCV600E. Size-wise, the ispXPGA family
is not competitive with some of the larger parts from Xilinx or Altera,
for example, the XCV3200E (~65,000 LUTs) and the XC2V8000 (~93,000 LUTs).
Let's review some of the more unusual elements of the architecture.
Overall, this looks to be a reasonable and plausibly competitive family of
devices, particularly if a few hundred thousand or one million
"system gates" of programmable logic is sufficient for your application.
The several conveniences, including VCC as high as 3.3V,
and the integrated EEPROM configuration memory with improved design
security and "instant on", are sufficiently attractive
that (if priced right), this family could certainly garner some design wins.
But the competition is intense, and in the smaller devices, Lattice must
strive to be price competitive with the formidable Spartan-IIE family
and comparable Altera offerings (taking into account the costs of
external FLASH config ROM).
The chip appears to accommodate a wide range of VCC voltages:
1.8V, 2.5V, and 3.3V. How?
Each PFU has 4 4-LUTs, a 4-bit carry-logic circuit, a wide-logic structure,
and 8 flip-flops. Each LUT also has an AND gate a la Xilinx's MULT_AND
for cheap and cheerful Booth multiplication. At each LUT, you can
register any two of the LUT output, the carry-logic-unit's sum output,
some LUT and SEL inputs, and/or the wide logic output. As Disman points
out, two registers per LUT may be a nice feature, but only if it is
exploited by your synthesis tool.
I don't know about the 2 'flops per LUT. My processor datapaths
are fully pipelined and they never need more than one flop per LUT.
On the other hand, sometimes in a floorplanned design you want to
register the datapath control signals, or even replicated control signals,
adjacent to, or better yet, embedded in, the datapath. This reduces
the interconnect delays. This architecture would accommodate that nicely.
On the other other hand, Virtex-II's buffered Active Interconnect
greatly reduces the need for careful control register placement
and replication (as fan out doesn't hurt nearly so much).
Also if a deeply pipelined datapath needs to delay certain results across more than
one clock cycle, then multiple 'flops per LUT might be just the thing.
The 4-LUTs can be configured as distributed RAM, up to 64 bits per
PFU (32 bits when dual ported). Details are not provided.
Since the ispXPGA family provides distributed RAM, apparently suitable for
building small fast register files, like Xilinx, it should be a good
platform for a small fast RISC processor soft core, or multiprocessor of same.
Each LUT can also be configured as an up to 8-bit shift register.
Speed seems competitive, with tLUT4 of 390 ps (-5 device) to 550 ps (-3),
as compared to Virtex-E data sheet tILO's of 350 ps (-8) to 470 ps (-6).
It's harder to compare adder, distributed RAM, or interconnect delays.
The "variable length" inter-PFU programmable interconnect sounds
Xilinx-ish, but details are not provided.
The horizontal and vertical long lines are "tri-statable" but details
are not provided.
There are numerous dual-ported embedded block RAMs, apparently in the center
and at the periphery of the device, but unlike Xilinx and Altera,
there does not seem to be the flexibility to use a wide
data bus on one port and a narrow one on another. In contrast,
you can configure a Virtex-II dual port block RAM with one 1-bit wide port
and one 32-bit wide port, which is quite useful for building high
speed SERDES FIFOs.
The Lattice BRAMs provide both synchronous and asynchronous read modes.
In async read mode, with WE deasserted, the DATA read changes tEBADDO
after ADDR changes, independent of CLK.
There are eight PLLs for clock multiplication and division (and
presumably, to eliminate clock delay (delay by 360 degrees)).
The sysIO I/O blocks seem roughly comparable to the programmable I/O
facilities provided in Xilinx and Altera FPGAs. There are 4-20
high speed differential serial blocks with integrated clock data
recovery, SERDES, and 8B/10B (and 10B/12B) coding.
(These days, the silicon is only half the story. Another factor in
the strong positions of both Altera and Xilinx is their FPGA development
tools products, which reflect over a decade of innovation, iteration,
and improvement. Thus Lattice must ship tools that reflect comparable maturity.)
Murray Disman, ChipCenter:
Lattice Introduces FPGA.
"An on-chip regulator allows the use of 1.8V, 2.5V, or 3.3V for the logic core's power supply."
Anthony Cataldo, EE Times:
Lattice lands programmable-logic combo punch.
Graham Prophet, EDN Europe:
High-density programmable logic uses dual-memory structure.
Applying FPGA SoCs
Jesse Kempa, Altera, at ChipCenter:
Maximizing Embedded System Performance in the Era of Programmable Logic (PDF).
A very nice article, based upon the task of speeding up a Nios
SoC-based HTTP server, illustrating that creative application of
programmable logic can deliver big speedups over a pure software approach.
"Implementing a microprocessor core in programmable logic offers many
ways to customize an embedded system to fit the performance goals of a
project that are not available in traditional design methodologies. The
performance boost of two simple optimization methods performed in the
above examples and combining these two in the final system has shown
that system performance is by no means solely dependant on clock speed
or Dhrystone MIPS. The continuing evolution in programmable logic devices
and tools carries with it the ability to rapidly create powerful systems
with close integration between hardware and software design."
Software defined radio
Some day, SDR might be a huge consumer of programmable logic silicon.
EE Times has a splendid
set of articles
that explores the technology and business considerations.
Loring Wirbel, EE Times:
Economics may rule out SDR, despite benefits.
Nice analysis, but ... what a bummer!
Chris Dick, Xilinx, in EE Times:
A case for using FPGAs in SDR PHY.
Nice survey of the current FPGA-based software defined radio space.
Additional articles from
Analog Devices, and
Anthony Cataldo, EE Times:
Actel pushes for better FPGA security safeguards.
"But when it comes to security, the SRAM-based FPGA has an Achilles'
heel. Such devices require a separate PROM memory, which stores
the configuration bits that are sent to the FPGA upon power-up. The
configuration bits are thus exposed en route to the FPGA, and can be
captured using a probe."
This should not be a problem in most cases, if the bitstream is
triple-DES encrypted (Virtex-II and Pro) or if the bitstream is
preloaded at the factory, with battery-backed up configuration memory,
assuming you can address the battery lifetime and field replacement strategy.
[07/24/02] Ken McElvain, Synplicity, in EE Times:
Future looks programmable.
A paean to FPGAs.
Creative accounting, marketing gates style
Xilinx says the XC2V8000 has "104,832" logic cells. Yet its data sheet states it is
112x104 CLBs, and each CLB has 8 LUTs. That's 93,184 LUTs; the "logic cell"
count is evidently the LUT count padded by a 1.125X marketing multiplier. Sigh.
More obituaries, alas
Today, Google sports a link,
"Edsger W. Dijkstra, 1930-2002."
How cool is that?
John Markoff, The New York Times: Edsger Dijkstra, 72, Physicist Who Shaped Computer Era, Dies.
Last week I posted an obituary for Ole-Johan Dahl,
one of the two designers of Simula, and fathers of object-oriented programming.
Unfortunately his colleague, Kristen Nygaard, has also just passed away.
Larry Tesler: Kristen Nygaard 1926-2002.
Home Page for [Hjemmeside for] Kristen Nygaard.
Dahl and Nygaard: How Object-Oriented Programming Started.
On 4/21-4/24, I attended FCCM'02, in Napa, CA.
Here's a rather belated, partial write-up.
Unfortunately I misplaced my Proceedings and
my written notes therein, so some of this is from memory;
please forgive my mistakes.
It was an interesting, not stunning, conference. For me,
the highlight was not any particular presentation,
but the Tuesday evening session on programmable
logic in nanoelectronics. More on that below.
Monday -- day one
K.H. Tsoi et al,
A Massively Parallel RC4 Key Search Engine.
"A total of 96 RC4 decryption engines were integrated on a single Xilinx Virtex XCV1000E ... The resulting design operates at a 50 MHz clock rate and achieves
a speedup of 58 over a 1.5 GHz Pentium 4."
The presenter describes an elegant mapping of the RC4 algorithm to
FPGA fabric, including efficiently embedding the S array and S lookups
in a block RAM. The resulting RC4 core is sufficiently compact that
it can be instantiated 96 times in a V1000E.
I appreciated that the designers used best practice FPGA design techniques,
including replacing LUT-based 5-1 muxes with TBUF based ones, and
RLOC-based floorplanning of their cores. That said, the design is
block RAM constrained and has lots of "white space".
The design uses the Pilchard
FPGA-in-an-SDRAM DIMM card
in a Linux PC platform to provide high bandwidth low latency interconnect
to the host.
This system searches 6M keys/s and can exhaustively search a 40-bit key
space in 50 hours.
The presenter notes that a further factor of six speed up would be possible
if they used a 2X faster clock (100 MHz) and an FPGA with 3X more block RAM
(such as the XCV812EM).
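The throughput claim checks out, as a quick back-of-the-envelope calculation shows:

```python
keys = 2 ** 40          # 40-bit key space
rate = 6e6              # keys searched per second, whole system
hours = keys / rate / 3600
# about 51 hours to exhaust the key space, matching the paper's claim
```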
T. Mitra, NU Singapore, et al, An FPGA Implementation of Triangle Mesh Decompression.
"The first hardware implementation of triangle mesh decompression."
In 3D rendering systems, a tesselated surface is represented
as a mesh of triangles. The triangles are sent from the geometry
engine (often a PC host) to the rendering engine. It is important to
compress the mesh data to reduce bus bandwidth requirements. If you walk
the triangles in a systematic order, it is easy to see how to typically
send only one vertex per new triangle (reusing the prior two vertices).
But you can do much better. By walking the triangles in a still more
clever order, for instance by systematically expanding a frontier
of triangles in a clockwise or counterclockwise walk from a seed triangle,
you can send fewer than one vertex per triangle, instead referring back
to previously sent vertices in a vertex "frontier buffer".
The presenter described some nice hardware optimizations so that
the frontier buffer can be efficiently implemented using an external RAM,
plus a small cache of left and right vertices adjacent to the current edge;
further, the implementation takes advantage of the perfect cache prefetching
possible due to the clever mapping of the buffer to hardware.
The presenter described an FCCM to implement this process in an FPGA,
specifically a PCI Pamette board. The system processes about 8 M triangles/s,
and reduces the triangle mesh bus bandwidth requirements by about 83%.
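The baseline scheme described above, one new vertex per triangle reusing the previous two, is just generalized triangle-strip decoding; a minimal sketch (the index convention and winding rule are the usual ones, assumed here, and this is only the baseline the paper improves upon):

```python
def decode_strip(indices):
    """Decode a triangle strip: after the first triangle, each new
    vertex index forms a triangle with the two before it, so an
    n-triangle strip costs only n + 2 vertex indices. Winding
    alternates so all triangles keep the same facing."""
    triangles = []
    for i in range(2, len(indices)):
        a, b, c = indices[i - 2], indices[i - 1], indices[i]
        triangles.append((a, c, b) if i % 2 else (a, b, c))
    return triangles
```

Five indices decode to three triangles; the frontier-buffer scheme does better still by referring back to earlier vertices beyond the last two.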
Nicholas Weaver, UC Berkeley,
The Effects of Datapath Placement and C-slow Retiming on Three Computational Benchmarks.
One of the benchmarks was a pipelined RISC datapath.
My recollection is that Weaver showed that using floorplanning and
3-slow retiming (3-threading) the datapath he was able to improve
speed from 50 MHz to 100 MHz.
Demo Night: JHDL xr16vx integrated development environment
I caught demo night presentations by BYU, CUHK, Altera, Xilinx,
Annapolis Micro, and others.
The demonstration given by the
Configurable Computing Laboratory of
Brigham Young University was wonderful.
Prof. Brent Nelson and his students, particularly Eric Roesler,
have taken Mike Butts'
reimplementation of the xr16 instruction
set architecture and run with it. They demonstrated an integrated
environment that included xr16vx, the xr16 compiler tools (ported to Linux,
and with some bug fixes and enhancements), and the JHDL framework.
They could do source level debugging, assembly level debugging,
single stepping, etc.
You could single step the processor and view in the JHDL environment
a generated schematic with all signals and buses annotated with current values.
The team also showed an xr16 state window which showed PC and next PC,
current and next instruction, the xr16 register file, immediate prefix,
and also the two inputs and output of the ALU.
(Looking back now, I told them, as I told Mike Butts long ago,
that I would split the XSOC Project Kit
into two pieces, relicensing the XSOC architectural
and compiler tool chain components (save
lcc which has its own license)
under some open source license.
Sorry folks. Please stay tuned just a little while longer.)
Tuesday -- day two
R.Franklin, et al, BYU,
Assisting Network Intrusion Detection with Reconfigurable Hardware.
Presenter Prof. Brad Hutchings, BYU, described how to accelerate SNORT network
intrusion detection using an FPGA to match the SNORT intrusion signature
regular expression database.
The results were quite good; for example for a regexp of
4,900 characters, the FPGA implementation scanned at 784 kB/s compared
to a software implementation (Pentium 3/750 MHz) of 1.72 kB/s,
for a cost of about 1.25 slices/character in the regexp.
This paper builds upon a very clever paper,
R. Sidhu and V. Prasanna:
Fast regular expression matching using FPGAs.
This subject was personally humbling: Mr. Sidhu had asked
about the practice of running NFAs in programmable logic,
and yours truly
responded with the conventional wisdom, which is that you
convert the regexp to an NFA and from that to a DFA. Bzzt!
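The Sidhu/Prasanna trick, as I understand it, is that in programmable logic you need not do the subset construction at all: dedicate one flip-flop per NFA state and update every active state in parallel each cycle, so area tracks the regexp size rather than the potentially exponential DFA. A rough Python model of that parallel update, for a hand-built NFA of my own choosing:

```python
# Sketch of NFA-style matching as done in logic: one bit ("flip-flop")
# per NFA state, and every active state is advanced in parallel on each
# input character. No subset construction to a DFA is required.
# NFA for the regexp "ab*c" (hand-built for illustration):
#   state 0 --a--> 1,  1 --b--> 1,  1 --c--> 2 (accept)
TRANS = {(0, 'a'): {1}, (1, 'b'): {1}, (1, 'c'): {2}}
ACCEPT = {2}

def matches(text):
    active = {0}                       # start-state bit set
    for ch in text:
        nxt = set()
        for s in active:               # all state bits update in parallel
            nxt |= TRANS.get((s, ch), set())
        active = nxt
    return bool(active & ACCEPT)

print(matches("abbbc"), matches("ac"), matches("abx"))
# True True False
```

In hardware, each set-membership bit is a flip-flop and the per-state update is a small AND/OR network keyed by the character decode, so all states step in a single cycle; software pays either for the DFA blow-up or for walking the active set.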
Tuesday evening -- Survey of Nanoscale Digital System Technology
Speakers: Mike Butts (Cadence Design Systems), Andre DeHon (CalTech),
Phil Kuekes (HP Labs).
FCCM'02 Nano-Technology Panel Session.
Please return later for my write-up of this session.
Press releases du jour
Xilinx: Xilinx Extends Speed And Density Leadership By Shipping Industry's Largest And Fastest Programmable IC.
Significantly, the first shipping FPGA with over 100,000 logic cells.
"0.15 micron" CMOS? According to UMC,
"The most advanced products within this series, the Virtex-II FPGAs,
are built at UMC's 300mm Fab 12A on the company's 130 nm (0.13 micron)
eight layer copper/low-k process." [emphasis added]
And Most Expensive? "The Xilinx XC2V8000 is immediately available. Second half 2003 pricing for the XC2V8000 device is $3960 in volumes of 10,000 units."
A $40M order! Phew, I wonder what the 3Q02 Q100 price is?
("If you have to ask, you can't afford it.")
Can you implement a 2V8000 design on a current PC?
I wonder how long a PAR run takes.
Altera Optimizes Leading-Edge IP Cores for Stratix FPGAs.
Charmed Labs' Xport
for the Nintendo Game Boy Advance.
Xport is an XCS10XL-TQ144-based prototyping kit for the Game Boy Advance,
which is itself a cool and inexpensive little machine
with an ARM7TDMI, 4 MB of RAM, 240x160 TFT LCD, for under $70.
The Xport board also has EEPROM for the FPGA configuration memory,
and flash for its GBA application memory.
The kit requires Foundation Student Edition 2.1i or the like,
but includes a GCC tool chain for building apps for the GBA,
and utilities for downloading your hardware and software designs
into the EEPROM and flash.
The kit is $129. I will order one and a GBA. It could be fun to port
XSOC/xr16 to it (time permitting).
At the movies
Yesterday we saw Spy Kids 2. Good fun, although the story
was inferior to "Episode 1".
About two minutes into the film there is a brief shot of the
innards of the control panel of the Juggler amusement park ride
at the fictional Troublemaker Studios Park.
Unlike most films and ads, which use some non-techie's conception
of electronics, this control panel was well grounded in reality,
and had both Altera and Xilinx Inside. Perhaps that helps to
explain the events which ensue...
Deconstructing the relationship
Here's more irreverent follow-up pure speculation to my
Xilinx/IBM eFPGA piece.
Maybe, taken in isolation, the upside for Xilinx of this announcement
is not so compelling, considering the numerous engineering, tools,
and business complexities involved.
But taking into account the Xilinx/IBM Microelectronics partnership
in toto, perhaps this work is the discharge of an obligation,
if you will, of the larger deal for IBM's embedded
PowerPC and SoC technologies first manifest in Virtex-II Pro.
That the two phases were not revealed simultaneously might simply
reflect real world engineering scheduling necessities.
Thinking along those lines, an "eFPGA for PPC/SoC (plus/minus royalties
and IBM fab capacity)" equitable deal analysis makes this
announcement almost predictable.
Otherwise it would have been -- what -- a
"royalties for PPC/SoC (plus fab capacity)" deal!?
Hardly an equitable intellectual property value exchange.
Big companies think big and deal big.
Edsger Dijkstra died.
He was one of the founding fathers of Computer Science.
His rigorous and very productive works spanned many subdisciplines of
CS and made widespread contributions to computing and to the Betterment
of Mankind, so much so that today we take many of his results for
granted: shortest path (Dijkstra's algorithm) and many algorithm results,
programming language (Algol 60) implementation techniques, including
stacks for recursive functions, mutual exclusion, semaphores
and critical sections (P() and V() -- passeren
and vrijgeven -- "to pass" and "to give free" in Dutch), cooperating
sequential processes, dining philosophers (deadly embrace, starvation, fairness),
structured programming, and the discipline of programming via stepwise
refinement, applying invariants, and proofs of correctness.
Go To Considered Harmful
(but see also the last paragraph of
EWD1308: What led to "Notes on Structured Programming");
EWD498: How Do We Tell Truths that Might Hurt?;
EWD1304: The end of Computing Science?.
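Of the results listed above, the shortest-path algorithm is perhaps the most widely taught. For the record, a standard heap-based rendering in Python (the example graph is mine):

```python
import heapq

def dijkstra(graph, src):
    """Single-source shortest paths; graph: node -> [(neighbor, weight)]."""
    dist = {src: 0}
    pq = [(0, src)]                  # min-heap of (distance, node)
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue                 # stale entry; a shorter path was found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

g = {'a': [('b', 1), ('c', 4)], 'b': [('c', 2), ('d', 6)], 'c': [('d', 3)]}
print(dijkstra(g, 'a'))
# {'a': 0, 'b': 1, 'c': 3, 'd': 6}
```

The loop invariant, fittingly, is pure Dijkstra: every node popped with an up-to-date distance has its final shortest-path distance.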
I had the privilege of hearing Prof. Dijkstra speak twice
in December, 1999. In fact, here is the University of Waterloo
CS Seminars Schedule for a few select days that week.
And some of my notes:
Wednesday, 1 December 1999
Timothy Chan: -- The Dynamic Planar Convex Hull Problem
Jan Gray: -- Homebrew Processors and Integrated Systems in FPGAs
Thursday, 2 December 1999
Edsger W. Dijkstra: -- Calculational Mathematics
Friday, 3 December 1999
Edsger W. Dijkstra: -- Proofs and Programs
"There were two Dijkstra talks, both theory talks. The first, he gave
20 quick 4-line proofs on properties of "under" (abstraction of <=) and
"up arrow" (abstraction of min): (omitted).
As this article relates:
"The second, he showed how the techniques derived to prove properties
of programs (applying invariants, demonstrating termination, etc) can
be used to prove mathematical conjectures. For example, the conjecture
"given n unique points, not collinear, there exists a line which passes
through exactly 2 points" took many decades to prove in a complex proof,
but Dijkstra proved it using undergrad proofs-of-algorithms techniques
by constructing an algorithm to find such a line and showing it preserves
its invariants and terminates."
"I enjoyed the talks. My brush with greatness was in the second talk.
After he concluded his proof, I asked 'don't you have to show your
construction there preserves your invariant there?' 'Oh, yes,
thank you very much, ... . '"
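The conjecture in that anecdote is the Sylvester-Gallai theorem. Dijkstra's argument was constructive; what follows is merely a brute-force illustration in Python, on a toy point set of my own, that such an "ordinary" two-point line exists:

```python
from itertools import combinations

def collinear(p, q, r):
    # zero cross product => the three points lie on one line
    return (q[0]-p[0])*(r[1]-p[1]) == (q[1]-p[1])*(r[0]-p[0])

def ordinary_line(points):
    """Return a pair of points whose line contains no third point."""
    for p, q in combinations(points, 2):
        if not any(collinear(p, q, r) for r in points if r != p and r != q):
            return p, q
    return None

# Five points, not all collinear; Sylvester-Gallai says an ordinary
# (exactly-two-point) line must exist.
pts = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1)]
print(ordinary_line(pts))
# ((0, 0), (0, 1))
```

Dijkstra's proof instead constructs such a line directly, maintaining an invariant and a decreasing measure to show termination, exactly as one argues a program correct.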
"Years from now, if you are doing something quick and dirty,
you imagine that I am looking over your shoulder and say to
yourself, "Dijkstra would not like this," well that would be
immortality for me."
Ole-Johan Dahl also died, on June 29, 2002. With Kristen Nygaard, he developed Simula,
which inspired object-oriented programming and Smalltalk, C++, Java, C#, etc.,
and much later also helped design Beta.
Along with Dijkstra and Sir Tony Hoare, he also co-wrote an influential book,
Structured Programming, in 1972.
Morgan Kaufmann have just published John Hennessy and David Patterson's
Computer Architecture: A Quantitative Approach, Third Edition.
I read the first edition (1990), skipped the second edition (1996),
and am now working my way through this new third edition.
It presents many of the same themes as the earlier editions but,
reflecting twelve years of Moore's Law at work and the phenomenon of the
internet, is completely overhauled with new data, new examples, and new topics.
Besides its traditional focus on price/performance-optimized desktop
processors, CA:AQA3e now also explores two other design points:
server big iron and embedded system processors.
An appendix presents answers to selected exercises.
Unfortunately, due to space considerations (1100 pages),
appendices C through I are only available online.
I encourage you to follow the above link and take a look
at some of the appendices. For example,
Appendix C - A Survey of RISC Architectures for Desktop, Server, and Embedded Computers
is the best survey I have seen of the evolution of features in
commercial RISC architectures.
If you are a serious student or practitioner of computer architecture,
then you have already read the first or second edition. So far
my experience has been that the third edition is sufficiently different
from the first that your time and money will be well spent.
On the other hand, this is not a book for beginners. If you're a newbie,
you'll probably be better off reading Patterson and Hennessy's other text,
Computer Organization and Design: The Hardware/Software Interface,
which explains in great detail how a simple RISC machine works.
Hmm, maybe we should try an online CA:AQA3e study group.
FPGA CPU News, Vol. 3, No. 8