fpgacpu.org - FPGA CPU News

Welcome. The purpose of this site is to share the lore of designing new processors and integrated systems-on-chips using FPGAs (field-programmable gate arrays).

Wednesday, February 5, 2003
Ron Wilson, EE Times: Avoidance proposed as solution to 90-nm problems. Very interesting.
"The notion that RTL must be a description of the wiring, not simply an expression of the logic, recurred during the panel. It has also been voiced frequently by design teams (not represented on the panel) that are working with 130-nm designs. ..."
"The notion of the predesigned, configurable platform is beginning to get serious notice at 90 nm."

Monday, January 20, 2003
Happy new year (belated).
Embrace change
Anthony Cataldo, EE Times: Altera to spin new FPGA for 90-nm production
Altera: Cyclone Devices ... Shipping Ahead of Schedule.
"With only 15 months from conception to shipment, the development of the Cyclone device family is the fastest in Altera's history."

Altera: ... Delivery of First Stratix GX Devices. Now sampling.
Impressive. Congratulations. Execute, execute, execute.
Xilinx: Enables Gibson Guitar's Best of Show Award. I saw this at CES. A guitar with an ethernet jack.
"Gibson will offer MaGIC, an acronym for Media-accelerated Global Information Carrier, in every Gibson guitar within the next 12-18 months. ..."
"MaGIC uses state-of-the-art technology to provide up to 32 channels of 32-bit bi-directional high-fidelity audio with sample rates up to 192 kHz. Data and control can be transported 30 to 30,000 times faster than MIDI."
Tom Hawkins of Launchbird Design Systems, Inc., announces Confluence 0.1.
"Confluence is a simple, yet amazingly powerful hardware design language. Its flexibility and high level of expression reduces code size and complexity of a design when compared with either Verilog or VHDL. Confluence also enforces clean RTL preventing common errors and bad design practices often introduced in traditional HDL coding."
"And unlike C based approaches, design engineers love Confluence because it still feels like coding in HDL. The language is implicitly parallel and very structural. ..."
"Confluence runs on Linux x86."
OK, but please let us know when you run on the volume platform. Does Confluence employ OCaml? Interesting if so. So far, details sketchy, but welcome, the more, the merrier.
Today's schedule of the SDRForum Symposium on Use of Reconfigurable Logic in Software Defined Radios.

Saturday, December 28, 2002
FPGA-FAQ has a nice fresh list of FPGA boards.
Peter Clarke, Semiconductor Business News: Former UK defense unit offers floating-point unit for FPGAs. For MicroBlaze and the Virtex-II Pro's PowerPC(s). QinetiQ [Quixilica].
'We're already seeing applications in image and signal processing systems, control, and support of legacy hardware, where the combination of an FPGA with an embedded microprocessor core and the FPU can provide the functionality and performance of an entire DSP subsystem, said Bill Smith, manager of QinetiQ's real-time systems laboratory, in statement.'
I've been to Malvern several times, lovely place.

Monday, December 23, 2002
Thanks, Joe Strummer.

Wednesday, December 18, 2002
Free Xilinx PicoBlaze Microcontroller Expands Support to Virtex-II Series FPGAs and CoolRunner-II CPLDs. PicoBlaze User Resources.
Earlier coverage.
Regarding PicoBlaze for CPLDs, e.g. CoolRunner-II: lacking any on-chip block RAM instruction memory, the PB for CR2 requires that you provide an external 16-bit wide instruction RAM. This may prove prohibitive in board area and cost. You can reduce the requirement to 8-bit external memory using a few more macrocells, of course, but in my opinion this application is a better fit for a device with embedded block memory (e.g. Spartan-IIE, etc.).
This does illustrate the utility and value of a modest amount of embedded RAM and/or FLASH in these larger CPLDs -- an idea whose time has come.

Monday, December 16, 2002
Xilinx: 90nm Process Technology Drives Down Costs.
IBM: IBM and Xilinx prepare for production of first 90nm chips on 300mm wafers.
UMC: UMC AND XILINX ON TRACK TO MANUFACTURE 90NM PROGRAMMABLE CHIPS ON 300MM WAFERS IN 2003.
Anthony Cataldo, EE Times: IBM, Xilinx tape out first 90-nm FPGAs.
Therese Poletti, San Jose Mercury News: IBM-Xilinx new chip moves to production.
John Blau, IDG News Service: IBM, UMC ready first 90-nanometer chips.
1.2V!

Monday, December 9, 2002
Susannah Martin, Xilinx, in EE Times: Speeding DSP solutions with FPGAs.

Tuesday, December 3, 2002
Xilinx: Tarari adopts Xilinx Technology for Reconfigurable Content Processor Solutions.
"Tarari content processors are hardware and software-based subsystem building blocks (silicon, boards, etc.) that snap into servers, appliances and network devices, allowing for the first time the inspection of application layer content at network speeds..."
Tarari.
Here, in March: Applications of racks full of FPGA multiprocessors:
"I suppose my pet hand-wavy application for these concept chip-MPs is lexing and parsing XML and filtering that (and/or parse table construction for same). Let me set the stage for you. "
"Imagine a future in which "web services" are ubiquitous -- the internet has evolved into a true distributed operating system, a cloud offering services to several billion connected devices. Imagine that the current leading transport candidate for internet RPC, namely SOAP -- (Simple Object Access Protocol, e.g. XML encoded RPC arguments and return values, on an HTTP transport, with interfaces described in WSDL (itself based upon XML Schema)) -- imagine SOAP indeed becomes the standard internet RPC. That's a ton of XML flying around. You will want your routers and firewalls, etc. of the future to filter, classify, route, etc. that XML at wire speed. That's a ton of ASCII lexing, parsing, and filtering. It's trivially parallelizable -- every second a thousand or a million separate HTTP sessions flash past your ports -- and therefore potentially a nice application for rack full of FPGAs, most FPGAs implementing a 100-way parsing and classification multiprocessor."

Friday, November 29, 2002
Lauro Rizzatti, in EEdesign: Gates, lies and common sense. Rizzatti revisits the marketing gates issue.
"Realistically, now there is a simple, practical way to compare the design capacity of two emulation solutions based on the Virtex-II components. By listing type and quantity of Virtex-II devices allocated to mapping the design-under-test, possibly augmented by one or more external memory banks, you can now truthfully and reliably evaluate two or more emulation systems."
Well that's not very helpful. Far better is to simply describe a capability vector of total resources. Then you can compare across families and across vendors.
The vector should include (#LUTs, tILO, amt. of each layer of memory hierarchy, external RAM). Thus a system with two XC2V6000-5's might be
(68 KLUT, 410 ps, 1056 Kb LUT RAM, 2.6 Mb BRAM, ?) * 2 =>
(135 KLUT, 410 ps, 2 Mb LUT RAM, 5.2 Mb BRAM, ?)
and a system with four EP1S60s might be something like
(57 KLUT, ? ps, 574 M512s, 292 M4096, 6 MegaRAM, ?) * 4 =>
(228 KLUT, ? ps, 1.1 Mb M512s, 4.6 Mb M4096s, 13.5 Mb MegaRAM, ?).
If your problem domain warrants it, by all means, grow the capability vector to include multiplier resources, embedded processors, high speed serial resources, etc.
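The capability vector idea can be sketched as a small data structure. This is only an illustration: the struct name, field names, and the `scale` helper are hypothetical, and the fields cover just the first few elements of the vector described above.

```cpp
#include <cassert>

// Hypothetical capability vector for one device. Field names are
// illustrative; extend with multipliers, processors, serial links, etc.
struct Capability {
    double kluts;       // thousands of LUTs
    double tilo_ps;     // LUT delay in ps
    double lut_ram_kb;  // Kbits of LUT RAM
    double bram_kb;     // Kbits of block RAM
};

// Model a system of n identical devices. Capacities add up;
// the LUT delay is a speed figure, so it is carried over unchanged.
Capability scale(const Capability& c, int n) {
    assert(n > 0);
    return {c.kluts * n, c.tilo_ps, c.lut_ram_kb * n, c.bram_kb * n};
}
```

With such a type, `scale(device, 2)` models a two-device system, and two systems from different vendors can be compared field by field.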
Congratulations to Altera for simply naming their new parts with the most important element of this capability vector, KLUTs.
See also these two articles.

Thursday, November 28, 2002
Ch-ch-ch-changes
I have returned full time to the software world; without discussing specifics, my aim is to significantly improve the lives of software developers and software users alike.
Fear not, I anticipate that this site will continue to report upon news, and muse aloud about ideas, in the FPGA CPU and SoC space. However, expect the reports to be more sporadic, and any musings to be less elaborate.
Thanks giving
To my wonderful family, thank you. How happy I am that we are here together to share life's rich pageant.
Thanks to my friends. I am so fortunate to share friendship with some most excellent kindred spirits who are so generous with their time, regard, insights, kindness, well wishes, and good cheer. Special thanks to those several of you whom I am privileged to count as close friends. Thank you for being one in a thousand.
I thank and remember those who have gone before, who lived and worked and fought and died to make the world a happier place for this ungrateful entitlement generation. Many of us here in the western world have never known want, disease, hunger, strife, nor war in our backyard. Let us remember those that still live with these hardships.
Apropos of this site, I also thank the vast legions of hard working engineers and scientists, and their collected and focused embodiments in corporations, for ceaselessly advancing the science and the processes and the devices and the platforms and the tools and the infrastructure so as to deliver, free, the miracle of modern programmable logic, that empowers even the little guy to turn ideas into tangible hardware.
And I thank you, dear reader, for frequenting this site, warts and all.
New Xilinx Spartan-IIE devices -- like manna from heaven
In September, Altera announced Cyclone, and last November, Xilinx announced Spartan-IIE. Back then I wrote,
"You might think that as Virtex-E is to Virtex, so is Spartan-IIE to Spartan-II."
"But you would be wrong. According to data sheets, whereas an XCV200 has 14 BRAMs (56 Kb) and the XCV200E has 28 BRAMs (112 Kb), in the Spartan-II/E family, both the XC2S200 and (alas) the XC2S200E have the same 14 BRAMs (56 Kb)."
"If your work is "BRAM bound", as is my multiprocessor research, this is a disappointment."

Now Xilinx announces two new, larger Spartan-IIE devices, the XC2S400E and XC2S600E. And lo and behold, unlike the BRAM deficient XC2S300E, the 2S400E and 2S600E have the same BRAM to LUT ratios as the original V400E and V600E. Thanks Xilinx!
A good thing too, for otherwise these parts would be seriously RAM poor vis-a-vis their Cyclone competition.
Xilinx: ... Extends World's Lowest Cost FPGA Product Line. FAQ. Data sheet (alas, no single PDF).
"In 2003, the company is on track to deliver a fifth generation of the Spartan Series, reaching even higher densities at significantly lower price points."

Here is the updated competitive landscape. Since Xilinx is making a big noise about the greater number of I/Os available with Spartan-IIE devices (an observation first noted by Rick "rickman" Collins), I thought I would oblige them and add a column for I/O.
(The concept of the Cyclone parts, as I understand it, is the pad ring limits determine the area of the device and hence the area for the programmable logic fabric. So what, then, does the higher ratio of I/O to logic in the Xilinx devices tell us?)

           BRAM               02    03    04           03
Device       Kb  KLUT   I/O   BAP   BAP   BAP   Ref    $/KLUT
XCS05XL       0   0.2    77  $2.5               [3]    $12.75
XC2S50E      32   1.5   182    $7               [2]     $4.67
EP1C3        52     3   104          $7    $4   [1]     $2.33
EP1C6        80     6   185         $17    $9   [1]     $2.83
XC2S300E     64     6   329   $18               [2]     $3.00
XC2S400E    160    10   410   $27               [4]     $2.70
EP1C12      208    12   249         $35   $25   [1]     $2.92
XC2S600E    288    14   514   $45               [4]     $3.26
EP1C20      256    20   301         $60   $40   [1]     $3.21
XC2V1000    640    10
EP1S10      752    11
EP1S20     1352    18
XC2V2000    896    22

BRAM Kb: Kbits of block RAM (excludes parity bits, LUT RAM, and "M512s")
KLUT: thousands of LUTs
I/O: maximum user I/O
BAP: approximate best announced price, any volume
$/KLUT: approximate 2003 BAP/KLUTs

References:
[1] Altera Cyclone Q&A: "High-volume pricing (250,000 units) in 2004 for the EP1C3, EP1C6, EP1C12, and EP1C20 devices in the smallest package and slowest speed grade will start at $4, $8.95, $25, and $40, respectively. ... Pricing for 50,000 units in mid-2003 for the EP1C3, EP1C6, EP1C12, and EP1C20 devices in the smallest package and slowest speed grade will start at $7, $17, $35, and $60, respectively."
[2] Xilinx Spartan-IIE press release: "Second half 2002 pricing ranges from $6.95 for the XC2S50E-TQ144 (50,000 system gates) to $17.95 for the XC2S300E-PQ208 (300,000 system gates) in volumes greater than 250,000 units."
[3] Xilinx Spartan press release: "Spartan pricing ranges from $2.55 for the XCS05XL-VQ100 (5,000 system gates) to $17.95 for the XC2S300E-PQ208 (300,000 system gates) in volumes greater than 250,000 units."
[4] Xilinx 2nd Spartan-IIE press release: "XC2S400E ... and XC2S600E ... and are priced at $27 and $45 respectively (250K volume)."
Other reports
Anthony Cataldo, EE Times: Xilinx packs more I/O into its top-selling FPGA line.
Crista Souza, EBN: Xilinx drives Spartan-IIE to high end.
Peter Clarke, Semiconductor Business News: Xilinx adds two FPGAs to Spartan family.
(I think it is interesting to note that no one else picked up on the much more generous servings of BRAM ports and bits in the newer devices.)
What XC2S600E means to me
Please refer back to this piece that sketches how in April '01 I PAR'd a multiprocessor of 12 clusters of 5 processors in a single V600E, using 1 1/5 BRAMs per processor.
At the time, the V600E was not inexpensive.
Now with the advent of the XC2S600E, we can see practical and inexpensive supercomputer scale meshes of simple processing elements implemented completely and cost effectively in programmable logic.
At 60 processors per $45 device (in huge volumes), that works out to just $0.75 per processing element. Loaded up with DRAM, this implies a total component cost of ~$1.50/PE, and a density of about 20-40 processors per square inch.

Wednesday, November 13, 2002
Xilinx: Free CoolRunner-II Design Kit.
"The Xilinx CoolRunner-II Design Kit is available free to qualified customers through the Xilinx worldwide distributor base. The kit can also be purchased direct from Xilinx for $49.99 through the online store ... "
The kit is apparently based upon the Digilab XC2, and includes an XC2C256-7TQ144.
Chris Edwards, EE Times: 300mm volume drive for Xilinx.

Tuesday, November 12, 2002
Ron Wilson, EE Times: Patents stir debate in configurable-processor arena.

Monday, November 11, 2002
For those of you with Windows Media Player, take a look at Jeff Bier of BDTi's talk for Stanford EE380, Comparing FPGAs and DSPs for Embedded Signal Processing (ASX). Highly recommended. Slides.
"Conclusions: High-end FPGAs can wallop DSPs on computation-intensive, highly parallelizable tasks ..."

Joel on Software: The Law of Leaky Abstractions.
A new book by Henry S. Warren, Jr.: Hacker's Delight, is chock full of arcane bit twiddling tricks and folklore. If you're the kind of person that knows what
((w-0x01010101)&~w&0x80808080) != 0
is good for, you'll love this book.
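For readers who would like the expression above spelled out: it is a well-known test for whether any byte of a 32-bit word is zero, handy for scanning C strings a word at a time. A minimal sketch follows; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True exactly when at least one of the four bytes of w is 0x00.
// Subtracting 1 from each byte sets its top bit if the byte was 0x00
// (it wraps to 0xFF) or was already above 0x80; "& ~w" rejects the latter.
// Borrows can set spurious flag bits above the lowest zero byte, so only
// the zero / nonzero outcome of the whole expression is relied on here.
bool has_zero_byte(std::uint32_t w) {
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

// Hypothetical helper: index of the first word that contains a zero byte,
// or n if no word does.
std::size_t find_word_with_zero(const std::uint32_t* words, std::size_t n) {
    assert(words != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        if (has_zero_byte(words[i])) return i;
    return n;
}
```

The appeal is that four byte comparisons collapse into one subtract and two ANDs, with no branches per byte.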
I picked up a copy at the OOPSLA bookstore, but over the weekend I saw it at the Bellevue, WA, Barnes and Noble -- filed in the Computer Security section, of course. I told the computer books clerk that the book was misfiled, but I was told there was little that could be done about it -- book shelving assignments come from on high.
"Finally, I should mention that the term 'hacker' in the title is meant in the original sense of an afficianado of computers ... If you're looking for tips on how to break into someone else's computer, you won't find them here." -- Preface

Sunday, November 10, 2002
The old ways hold us back
A long essay today: further reflections on OOPSLA, languages, and computer architecture.
One of the emergent themes of this year's OOPSLA, perhaps stirred up by James Noble and Robert Biddle's Onward! track paper, Notes on Postmodern Programming, was some sober reflection on where object-oriented programming was, and where it has gone. The future (i.e. now) is not what some thought it would be (e.g. ubiquity of very reflective and malleable and immediate environments such as Smalltalk-80 and Self). Instead C++ has won (or perhaps the winner is Visual Basic), with Java and C# catching up.
(I am unconvinced that languages and environments where a major tenet of reuse is implementation inheritance, and where there is poor support for arms-length-composition of separately authored, versioned, and deployed software components, ever stood a serious chance of scaling up into ubiquity, but that's another diatribe, for another day.)

(Incidentally, Noble said that when this Slashdot thread ran, the resulting internet traffic to fetch his paper slashdotted all of New Zealand!)
Several prominent attendees expressed the sentiment that our languages and platforms have been shaped by historical constraints that no longer apply. But their legacy lives on, and perhaps, holds us back. (Intel 386 marketing slogan: "Extended the Legacy of Leadership.")
I don't buy much of that, by the way. (I know too many grandmas who do stunning things with their computers.)
But the fact remains that C and even C++ have seen their day, and in many domains, it is time to let go of them, and move on.
Software archaeology uncovers the dominant paradigm
First let's do a gedanken experiment. Take your Windows PC, or your OS X Macintosh, or your Linux box, and freeze it in mid-computation, and save a snapshot of the entire memory image. Maybe you have a 128 MB dump. Now you put on your "software archaeologist" pith helmet, and you spend the next three years of your life poring over the dump, picking through the strata, cataloging the arcane bits of code and data that you uncover. If you do that, I assure you that you will find, amongst other things, hundreds or even thousands of separate, mediocre, linked list and hash table implementations. You will find tens of thousands of loops written in C/C++ that amount to
for (ListElement* p = head; p; p = p->next) { ... }
That is the dominant paradigm. It stinks. It is holding us back. This little idiom and its brethren, carved in diamond in untold billions of dollars worth of genuinely useful software intellectual property, is the carrot that leads Intel and AMD and the rest, to build 100 million and soon one billion transistor processors where the actual computing takes place in less than a million of those transistors.
What is the intention behind this code? To do something with a growable collection of values -- search it, map it, collect a subset.
On modern machines, this code stinks. A full cache miss wastes many hundreds of instruction issue slots. Your hundreds of millions of transistors sit around twiddling their thumbs. Until each preceding pointer completes its odyssey...
(Upon determining that the miscreant pointer value has gone AWOL from on-chip caches, the processor sends out a subpoena compelling its appearance; this writ is telegraphed out to the north bridge and thence to the DRAM; the value, and its collaborators in adjacent cells in the line, then wend their way, from DRAM cells, through sense amps, muxes, drivers, queueing for embarkation at the DRAM D/Q pins, then sailing across the PCB, then by dogsled across the north bridge, then by steamer across the PCB again, finally landing at the processor pins, and then, dashing across the processor die and into the waiting room of the L1 D-cache ...)
... until that happens, the poor processor can't make any significant progress on the next iteration of the computation.
This scenario is so bad and so common that the microprocessor vendors use 80% of their transistor budgets for on-chip caches -- Intel as glorified SRAM vendor.
Unfortunately it is really hard to make compilers and computer architectures transform this hopelessly serial pointer-following problem into something that can go faster.
The Sapir-Whorf hypothesis
The tragedy is that the intention behind the code -- do something with a collection of values -- can often be run in parallel in O(1) or O(lg n) time. But the language and idiom of the dominant paradigm moulds programmers' thoughts and actions and they produce this serial pointer-following junk.
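As one illustration of the O(lg n) claim, here is a sketch of summing a collection by pairwise combination. The function name is hypothetical, and the rounds run serially here; the point is only that the additions within a round do not depend on one another, so given enough processors each round could take a single step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a non-empty vector by pairwise combination.
// In each round, element i absorbs element i + stride; these updates touch
// disjoint pairs, so they could all execute at the same time.
// The number of rounds is ceil(lg n), reported through rounds if non-null.
long tree_sum(std::vector<long> v, int* rounds = nullptr) {
    assert(!v.empty());
    int r = 0;
    std::size_t stride = 1;
    while (stride < v.size()) {
        for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
            v[i] += v[i + stride];   // independent across i within this round
        stride *= 2;
        ++r;
    }
    if (rounds) *rounds = r;
    return v[0];
}
```

A linked list offers no such decomposition, because element i + stride cannot be located without first walking everything before it.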
Tim Budd, An Introduction to Object-Oriented Programming:
"Sapir and Whorf went further, and claimed that there were thoughts one could have in one language that could not ever occur, could not even be explained, to somebody thinking in a different language. This stronger form is what is known as the Sapir-Whorf hypothesis, and remains controversial. It is interesting to examine both of these forms in the area of artificial computer languages."
If as a C programmer, all you've ever seen or been taught is that you make variable sized collections using linked lists, and that you traverse them using pointer following, then it follows that all you're ever going to write is the same deathly serial for loop we saw above.
But you could do better. You could have learned the method of data abstraction, and reused an abstract data type (ADT) library to implement your collection.
Then you might have called some of
ListContains(list, element);
ListMap(list, pfnMapElement);
ListSelect(list, pfnPredicate);

That would be a great step forward, because then the implementation of the list collection could evolve without modifying all the client code. You could hire the world's expert on growable list collections, and her enhancements could benefit all clients of the ADT. Over time, you could also benefit from new data structure and algorithm discoveries, such as skip lists.
And over the years, as machine architectures change, the implementation could be retuned. For example, instead of a linked list of nodes, even today's cache oriented scalar machines would benefit from a new list structure that clusters nodes together into a cache line, perhaps by making multi-element supernodes, or by employing a specialized memory allocator. Similar attention to page locality could also pay off handsomely.
Of course, few C programmers write code this way, and no standard C collection classes have been adopted by the industry.
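To make the ADT argument concrete, here is a minimal sketch in which the three calls named above sit in front of a supernode implementation. Only the names ListContains, ListMap, and ListSelect come from the text. The int element type, the SuperNode layout, and the constant kPerNode are assumptions for illustration (13 ints plus a count and a next pointer fill one line, assuming 64-byte cache lines and 8-byte pointers).

```cpp
#include <cassert>
#include <cstddef>

typedef int Element;                   // element type chosen for illustration
typedef bool (*PredicateFn)(Element);
typedef void (*MapFn)(Element&);

const int kPerNode = 13;               // assumption: fills one 64-byte line

struct SuperNode {
    SuperNode* next;
    int count;
    Element values[kPerNode];
};

class List {
public:
    List() : head_(nullptr), tail_(nullptr) {}
    List(List&& other) : head_(other.head_), tail_(other.tail_) {
        other.head_ = other.tail_ = nullptr;
    }
    ~List() {
        while (head_) { SuperNode* next = head_->next; delete head_; head_ = next; }
    }
    void Add(Element e) {
        if (!tail_ || tail_->count == kPerNode) {   // start a new supernode
            SuperNode* node = new SuperNode();      // value-initialized: count 0, next null
            if (tail_) tail_->next = node; else head_ = node;
            tail_ = node;
        }
        assert(tail_->count < kPerNode);
        tail_->values[tail_->count++] = e;
    }
private:
    SuperNode* head_;   // hidden from clients, so free to change later
    SuperNode* tail_;
    friend bool ListContains(const List&, Element);
    friend void ListMap(List&, MapFn);
    friend List ListSelect(const List&, PredicateFn);
};

bool ListContains(const List& list, Element e) {
    for (const SuperNode* n = list.head_; n; n = n->next)   // one pointer chase per node
        for (int i = 0; i < n->count; ++i)                   // then sequential within it
            if (n->values[i] == e) return true;
    return false;
}

void ListMap(List& list, MapFn fn) {
    for (SuperNode* n = list.head_; n; n = n->next)
        for (int i = 0; i < n->count; ++i) fn(n->values[i]);
}

List ListSelect(const List& list, PredicateFn pred) {
    List out;
    for (const SuperNode* n = list.head_; n; n = n->next)
        for (int i = 0; i < n->count; ++i)
            if (pred(n->values[i])) out.Add(n->values[i]);
    return out;
}
```

Client code that calls only these three functions would not change if the private representation were later swapped for a skip list or a differently sized supernode.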
Not the dominant paradigm: Scheme and Smalltalk-80
The C programming community rarely takes notice of the lessons of the LISP, Scheme, and Smalltalk-80 communities.
Scheme is a beautiful language. A small, clean, lexically scoped Lisp, multiparadigm (you can write pure functional, imperative, and object-oriented programming styles, amongst others), Scheme provides powerful abstraction-building facilities, including lambdas (unnamed functions as values), higher order functions (functions on functions), closures, and continuations.
(As a dyed-in-the-wool C/C++ hacker, I wish I were fully fluent in Scheme.)
My first exposure to closures was in Smalltalk-80. Smalltalk has a construction called a block. It is an anonymous function of 0 or more arguments that you can directly use in expressions. The body of a Smalltalk block has direct access to its enclosing blocks' and method's variables. And significantly, the block is a first class object, that you can squirrel away in a data structure and call later, and again, and again. Using blocks, the designers of Smalltalk built a set of powerful collection class facilities that provided not only data abstraction, but also control abstraction.
For example, in Smalltalk, you can write:
us_residents <- addresses select: [:each | each country == #USA]
meaning, take the collection of Address objects, and a predicate which determines whether each address' country is USA, and return a new collection with just those addresses whose country field is USA.
(In Smalltalk speak, we would explain the above as: "send the addresses collection a message select: with an argument block that takes one argument each; to determine the value of the block, send each the country message; send the response to that, the == message with an argument being the Symbol #USA, and return the response as the value of the block.")
Here we didn't specify how to iterate over the collection, that is left abstract and up to the collection class. That's control abstraction.
In particular, it might be possible to concurrently evaluate the predicate for each element of the collection (in parallel). (Strictly speaking, in Smalltalk-80, there was an implicit understanding that this computation would proceed in a serial fashion, but the point remains.)
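Here is a sketch of what such a concurrent select might look like in C++. It is not how Smalltalk-80 works. The name parallel_select and the chunking scheme are assumptions, it relies on std::async from C++11 (which postdates this essay), and it assumes the predicate is safe to call from several threads at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Return the elements of items that satisfy pred, in their original order.
// The collection is cut into chunks; each chunk is handed to std::async,
// which is free to evaluate it on another thread.
template <typename T, typename Pred>
std::vector<T> parallel_select(const std::vector<T>& items, Pred pred,
                               std::size_t chunks = 4) {
    assert(chunks > 0);
    const std::size_t n = items.size();
    const std::size_t step = (n + chunks - 1) / chunks;   // ceil(n / chunks)
    std::vector<std::future<std::vector<T>>> parts;
    for (std::size_t begin = 0; begin < n; begin += step) {
        const std::size_t end = std::min(n, begin + step);
        parts.push_back(std::async([&items, pred, begin, end]() -> std::vector<T> {
            std::vector<T> out;
            for (std::size_t i = begin; i < end; ++i)
                if (pred(items[i])) out.push_back(items[i]);
            return out;
        }));
    }
    std::vector<T> result;
    for (auto& part : parts) {              // join in chunk order
        std::vector<T> chunk = part.get();
        result.insert(result.end(), chunk.begin(), chunk.end());
    }
    return result;
}
```

The caller writes the same thing it would for a serial select, namely a collection and a predicate. How the traversal is split up stays inside the library.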
Enter C++
C++ came on the scene in the late 1980s, and the mainstream moved from C to C++ by 1995 or so. C++'s built-in support for data abstraction and object-oriented programming made the practice of using ADTs more commonplace. By the late 1990s, the Standard Template Library promised to deliver powerful, efficient, reusable standardized collection classes to C++ programmers.
(STL's use of templates can be rather obscure. But for real mind-blowing use of template classes with template parameters that are themselves template classes, and with template member functions, and specializations, and macros, oh my! -- check out the Boost Libraries.)
Now I have not used STL much -- I used it in some prototypes of CNets2000 -- and it seems like a big step forward over not-invented-here roll-your-own collection classes that do not compose with each other -- but it is very iterator-centric. To the extent that programmers use iterators, compilers are obliged to generate code that exhibits the observable semantics. Serial semantics.
Still, we are making progress here. We can certainly evolve the implementation of our STL classes over time without having to modify the vast client code base.
Enter C#
C# has been shipping in Visual Studio.NET since this past February. It's a nice language. In my experience C# -- and its environment, the .NET Framework class libraries -- offer dramatically improved programmer productivity as compared to C/C++.
C# programmers have access to a rich set of collection class libraries. When you use these libraries in your code, not only do you benefit from not having to maintain the library code yourself, but you can arguably expect it to improve over time.
Now last Thursday, at OOPSLA, Anders Hejlsberg gave a keynote address on the present design of C#, and on four proposed new C# language features (more on those in the next section).
As he was explaining the many nice convenience features of C# (none of which add the horrible orthogonal-complexity problems that C++ had), Hejlsberg came to foreach and the IEnumerable and IEnumerator interfaces. In C#, if your class implements interface IEnumerable, you can use it in a foreach loop:
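using System;
using System.Collections;

class Range : IEnumerable { ... }

// The class cannot itself be named Main: a C# member may not share its enclosing type's name.
class Program {
    public static void Main(string[] args) {
        foreach (int i in new Range(10)) { ... }
    }
}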
Now when I first saw C# enumerators I thought "not bad, but they don't provide control abstraction or the opportunity to go more parallel over time -- pity".
Anonymous methods, or why the future looks bright indeed
Well, the talk just got better and better as Hejlsberg covered four new language features under consideration for C#. These include generics (parametric types), iterators, anonymous methods, and partial types.
New C# language features page. Hejlsberg's slides (PPT).
Now let us focus on the proposed anonymous methods, which seem just like lambda functions with closure semantics. When/if C# provides them, then it will be convenient and natural to write methods such as the earlier client of Collection>>select:. Hejlsberg's example:
delegate bool Filter(object obj);

public class ArrayList {
    public ArrayList Select(Filter matches) {
        ArrayList result = new ArrayList();
        foreach (object obj in this) {
            if (matches(obj)) result.Add(obj);
        }
        return result;
    }
}
(Here matches is the predicate function that determines whether to add each element to the result collection.) The example continues with an application of Select():
public class Bank {
    ArrayList accounts;

    ArrayList GetLargeAccounts(double minBalance) {
        return accounts.Select(
            new Filter(a) {
                return ((Account)a).Balance >= minBalance;
            });
    }
}
Beautiful! Control abstraction, with no visible "serial iteration" lines! Note here that the above anonymous method is able to reference the local variable minBalance from wherever it is called. For reference, here's the same thing in Smalltalk:
getLargeAccounts: minBalance
    ^accounts select: [:each | each balance >= minBalance]
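For C++ readers, roughly the same shape can be sketched with a lambda that captures minBalance. This assumes C++11 lambdas and std::function, neither of which was available when this was written, and the AccountList class with its Add, Size, and At members is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Account {
    std::string owner;
    double balance;
};

// Control abstraction: the collection decides how to traverse itself;
// the caller supplies only the predicate.
class AccountList {
public:
    void Add(const Account& a) { items_.push_back(a); }
    std::size_t Size() const { return items_.size(); }
    const Account& At(std::size_t i) const {
        assert(i < items_.size());
        return items_[i];
    }
    AccountList Select(const std::function<bool(const Account&)>& matches) const {
        AccountList result;
        for (const Account& a : items_)
            if (matches(a)) result.Add(a);
        return result;
    }
private:
    std::vector<Account> items_;
};

class Bank {
public:
    AccountList accounts;

    AccountList GetLargeAccounts(double minBalance) const {
        // The lambda captures the local minBalance by value, playing the
        // role of the anonymous method in the C# example.
        return accounts.Select([minBalance](const Account& a) {
            return a.balance >= minBalance;
        });
    }
};
```

As in the C# and Smalltalk versions, GetLargeAccounts contains no loop of its own. The only iteration is inside Select, where a library author could later change it.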

Implications for mainstream computer architecture
If you partake in the Kool-Aid ...
"... we are moving to a world that there are basically two places that code runs -- the JVM and the CLR. ..."
... you can see where this is going. First, both the .NET Framework, and the Java platform, embrace multithreaded programming and make it more manageable. Programmers are going to be more familiar and more comfortable with concurrency.
(See .NET Asynchronous Programming and/or the Asynchronous Method Invocation section in Don Box's Essential .NET Volume 1.)
Second, with this proposed anonymous method facility, C# programmers might find it much more convenient, natural even, to write code that is implicitly parallelism friendly.
These new developments may be just the thing to break the chicken-and-egg deadlock on chip multiprocessors. Once we have a body of important commercial software that can demonstrably take advantage of 4, 16, or 64 processors on a chip, then we can get back to riding a steep performance growth curve that is otherwise prone to level out.
I will be very disappointed if, ten years from now, the best use of a multibillion transistor substrate is a four or eight way chip multiprocessor. We can do much better than that, if we evolve our language and idioms and embrace parallelism.

"Parallelism: the new imperative." -- Jim Gray, Microsoft.
"Notation is a tool of thought." -- Ken Iverson.
"I don't know who discovered water, but it wasn't a fish." -- Marshall McLuhan.

Wednesday, November 6, 2002
All your bits and bobs
Yesterday, I attended an OOPSLA tutorial on Rotor Internals. Microsoft has just released Rotor 1.0, their shared source common language infrastructure. Of late I've been exploring the last Rotor beta, it's pretty interesting -- and vast.
Drinking the Kool-Aid:
"There's a flood coming and its going to wash away people who don't make this change. ... if you talk to people who really look at the trends in this industry ... it feels ... we are moving to a world that there are basically two places that code runs -- the JVM and the CLR. ..."
To be clear, this is not referring to embedded systems development. Yet.
Altera White Paper: Delivering RISC Processors in an FPGA for $2.00.
John Kent has some FPGA CPU and system experiments.
Loarant's AX1610. 16-bit RISC in ~360 slices in Virtex-derivative architectures. (C compiler?)
Why do we build scalar RISC processors? Because most software intellectual property is entombed in dusty deck C/C++.
Bernd Paysan: b16 A Forth Processor in an FPGA.
"Flex10K30E: About 600 LCs, the unit for logic cells in Altera. The logic to interface with the eval board needs another 100 LCs. The slowest model runs at up to 25MHz."
Like the gr00x0, it's a "literate Verilog design", i.e. the write-up is the source. I like the way he uses *WEB to permit arbitrary order of presentation of source.

Monday, November 4, 2002
Altera: Stratix GX Devices: Altera Integrates 3.125-Gbps Transceivers with World's Fastest FPGA. Altera's Stratix GX's 3.125 Gbps transceivers are a big step up from their Mercury family, and Stratix GX joins Virtex-II Pro in the ranks of fast, large FPGA fabrics with integrated 3.125 Gbps serial transceivers.
Architecture. Overview. Data sheet. Q&A. Stratix GX Devices & Nios Processor.
"The Stratix GX device's innovative MultiTrack interconnect structure improves overall system performance of the Nios processor to over 150 MHz."
What does that mean? Simply put, will Nios run at up to one instruction per clock at 150 MHz in Stratix GX?
Stratix & Stratix GX Device Architectural Differences. As you can see, the largest Stratix GX devices offer only ~1/3 the LUTs of the largest Stratix devices. This is in contrast with the Xilinx strategy, where the largest Virtex-II Pro device (2VP125, ~111000 LUTs) is comparable to the largest Virtex-II device (2V10000, ~123000 LUTs).
Some transceiver differentiators:

"At 75mW per channel and only 450mW per gigabit transceiver block, Stratix GX transceivers consume less than half the power of competing FPGA solutions."

"Dedicated circuitry for XAUI and SONET."

Hard dynamic phase alignment: "DPA simplifies high-speed board design and layout through the automatic elimination of skew introduced by unmatched trace lengths, jitter, and other skew-inducing effects".

(Note: I'm just quoting the press release here. I do not have enough insight into the problem to tell whether these dedicated features truly offer a better solution (shorter time to market, easier to design) than one fashioned out of programmable logic.)
Naive question: Will all these emerging 3+ Gbps serial transceivers directly interoperate?
Anthony Cataldo, EE Times: FPGA vendors position for serial I/O battle.
Anthony Cataldo, EE Times: Lattice FPGA integrates 3.7-Gbit/s serdes transceiver.

Thursday, October 31, 2002
Next week I'll be at OOPSLA'02, with a brief excursion to the Bay Area on Wednesday. Maybe I'll see you there.

Monday, October 28, 2002
Proactive service packs

Today I received an email notification from Xilinx that 5.1i SP2 is available for download.

Once you visit the Updates Center, you discover "SP3 is scheduled for release December 11, 2002".
Nice work.
Xilinx's system builder IDE
Xilinx: Xilinx Delivers New ISE Embedded Development Kit for the Fastest FPGA Processor Solution in the Industry. Embedded Development Kit. EDK IP Cores. Note that some cores are "Additional High Value" cores.
Xilinx makes good on their earlier ISE 5.1i statement to roll out additional tools by year end.
Welcome, Xilinx Platform Studio, and System Generator for Processors (two separate tools?), to the system builder space. The competition will be good for everyone.
Again, the question for (from) third party IP providers: how to package up our IP to make it available to customers composing systems with these system builder IDEs?
Xilinx's co-design platform
Xilinx Expands Programmable Systems Solution with Groundbreaking Co-Design Technology.
"The new technology expands the company's current solution for programmable systems by enabling customers to define an entire system in ANSI-C to obtain the most optimal implementation by rapidly partitioning and repartitioning between hardware and software. ..."
"The technology ... a library of hardware and software components, called Processing Elements (PE) optimized for particular functions. This capability enables the customer to use a best-in-class and domain-specific tool to create an optimized PE. A re-partition is a compile time switch, which enables one to profile, convert to a hardware/software implementation and debug in a matter of minutes rather than days or weeks. The hardware and software PEs come from a variety of sources, including Xilinx and third-party AllianceEDA, AllianceCORE, and Embedded Tools partners. ..."
"Commercial release of the tools is expected in mid-2003."
Earlier teaser press release.
This is an important development and is distinct from the system builder IDE discussion. For large systems, composed of a great many function blocks, it is important to have a way to explore the system partitioning -- what functions or tasks can migrate to software, what functions to hardware, how many CPUs do I drop into this system, should I use a hardware, software, or hybrid (software with coprocessor or special-purpose instructions and function units) implementation? For time to market reasons, amongst others, you need a platform that lets both your software developers and your hardware designers get cracking on the problem even as the "system architect" is still getting a grip on the entire solution space. This could be a platform that lets you make these trade-offs late in the development cycle; a platform that lets you make derivative products with different trade-offs without starting over from scratch.
Observe that Xilinx is partnering with many of the leading EDA vendors, who have considerable experience with this problem in the ASIC space. This will probably be a complex product, an expensive product, and an enabling product for high end designs, but will probably be of little interest to designers of relatively simple embedded systems.
This kind of product seems imperative for the FPGA vendors, who need to make it easier (or at least make it possible) for their customers to quickly come to market with designs that exploit the vendors' newest, largest, highest margin devices.
It seems challenging to design and build a usable, coherent, and effective tool in the face of multi-organizational and multi-disciplinary considerations. (Several of Xilinx's partners are themselves competitors...) And it will be interesting to see if and how these two sets of products (system builder IDEs and co-design/architectural synthesis platforms) will be integrated, and will interoperate.
As I wrote earlier,
"The end of monolithic microprocessors
The keynote speech by Henry Samueli of Broadcom aptly demonstrated that we have left the era of monolithic microprocessors (except perhaps for personal computers) and are now working into the era of highly integrated systems. Take for example, the Broadcom BCM3351 (PDF) VoIP Broadband Gateway, which integrates a myriad functions into a single chip. Oh yes, by the way, there is a tiny MIPS core down in there, somewhere."
That's a good example of a complex SoC that needs a hardware/software co-design environment.
What marketing copy writers hear
Recalling the old Far Side comic strip, What Dogs Hear:
"blah blah GROUNDBREAKING blah blah THE MOST OPTIMAL blah blah blah blah ALL-ENCOMPASSING blah blah FULLY EXPLOIT blah blah UNIQUE blah blah BEST-IN-CLASS blah blah MORE THAN 50 PERCENT MARKET SHARE blah blah DE FACTO STANDARD METHODOLOGY blah blah"
We know technology leadership when we see it, so it is unnecessary to beat us over the head.
Other coverage
Michael Santarini, EE Times: Xilinx pitches kit to embedded, software engineers.
"The announcements put Xilinx closer to its goal of making FPGAs available to the embedded-design market and, ultimately, to software engineers, said Per Holmberg, director of programmable-systems marketing. Combined, he said, those sectors represent hundreds of thousands of potential new users for programmable-logic devices."

Great quote
Anthony Cataldo, EE Times: QuickLogic puts hard cores into its FPGAs.
'"Xilinx has convinced the world that a lemon called volatility is lemonade called reprogrammability," Hart said.'
We appreciate reconfigurability -- but we also appreciate convenient, secure, low-cost non-volatile configuration solutions. See e.g. the new Lattice ispXPGA with integrated EEPROM configuration memory, and the new Altera Cyclone configuration devices ("Each configuration device costs on average 10 percent of its corresponding Cyclone device").

Friday, October 25, 2002
Xilinx demonstrates practical application of FPGA partial reconfiguration
Mike Butts brought to my attention that the new Xilinx Crossbar Switch uses partial reconfiguration of Virtex-II CLB switch matrices to build a remarkably large crossbar.
Xilinx Announces Industry's First Programmable Crossbar Switch Solution. Crossbar Switch. Partial Reconfiguration for the Crossbar Switch.
"Switching is achieved by dynamic partial reconfiguration through the FPGA configuration interface. By this mechanism, one or more changes to the input-output mapping can be made in less than 280 microseconds even in the largest FPGA devices."
White paper (registration required). The white paper describes the design of a 928x928 crossbar that runs at 155 MHz. Were it built using 928 copies of a 928-1 mux built out of LUTs, it would require ~232,000 slices (464,000 LUTs). However, this crossbar switch very cleverly uses the Virtex-II CLB switch box itself to implement a 33x8 crossbar tile, so the entire 928x928 crossbar requires only 58x58 CLBs (including apparently ~27000 pass-through LUTs), apparently a factor of 17X denser than could be accomplished using muxes built out of LUTs themselves.
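As a sanity check on that 17X figure, here is a minimal sketch of the arithmetic. It assumes the Virtex-II organization of four slices per CLB and two LUTs per slice, and it assumes ~500 LUTs per 928-1 mux, which is simply what 464,000 LUTs divided by 928 muxes implies. The function names are mine, not the white paper's.

```python
# Back-of-envelope area comparison using the figures quoted above.
LUTS_PER_SLICE = 2   # Virtex-II: two 4-input LUTs per slice
SLICES_PER_CLB = 4   # Virtex-II: four slices per CLB


def lut_mux_crossbar_slices(ports, luts_per_mux):
    """Slices needed for `ports` output muxes built purely from LUTs."""
    return ports * luts_per_mux // LUTS_PER_SLICE


def tiled_crossbar_slices(clb_rows, clb_cols):
    """Slices occupied by a rectangular array of CLB tiles."""
    return clb_rows * clb_cols * SLICES_PER_CLB


PORTS = 928
LUTS_PER_MUX = 500   # assumption: implied by 464,000 LUTs / 928 muxes

mux_slices = lut_mux_crossbar_slices(PORTS, LUTS_PER_MUX)
tile_slices = tiled_crossbar_slices(58, 58)
density_ratio = mux_slices / tile_slices

print(mux_slices, tile_slices, round(density_ratio, 1))
```

So the LUT-mux version needs 232,000 slices against 13,456 slices for the 58x58 CLB array, which is where the factor of roughly 17 comes from.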
The design flow is unusual, using XDL to build the initial design and jbits to configure the switches.
Not being a big user of big crossbar switches myself, :-), I cannot tell whether this system is practical and attractive for commercial use, or whether the peculiar design flow or the best-case latency to change one or more switches in a column (minimum of 240-280 us) is prohibitive.
I also wonder what the latency is to make random (not precomputed) switch changes. That is, if I suddenly want out[765]=in[321], how long does it take to compute the new reconfiguration bitstream frames before downloading them? Is that included in the 240-280 us?
P4 vs FPGA MPSoC
On the fpga-cpu list, John Campbell asked:
"A CPU programmed in a FPGA is always going to be handicapped in clock speed relative to a conventional microprocessor. Whats the best we can do currently? 50MHz or so ? Pretty dismal against 2GHz for a current high end pentium."
My reply:
The high end Pentium 4 approaches 3 GHz now. The ALUs are double pumped, so each one can do up to 6 Gops. There are two such ALUs, plus a third "slow" ALU. Since the pipeline can issue at most 3 uops per cycle, we have a maximum throughput of 3*3GHz = 9 Gops. In practice, you won't see anything like that, because of branch mispredicts, cache misses, and other perils. For example, for a single cache miss that goes all the way out to main memory and is an activated-page-miss in the DRAM, the latency could easily be 100 ns. That's 100000 ps / 333 ps = 300 clock cycles or nearly a thousand potential uop issue slots. They don't call it the "memory wall" for nothing.
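The memory wall arithmetic above can be written out as a tiny sketch. The function name and the choice of units (MHz and ns, to keep the numbers exact) are my own; the 3 GHz clock, 100 ns miss, and 3 uops per cycle are the figures from the paragraph above.

```python
def miss_cost(clock_mhz, miss_latency_ns, issue_width):
    """Return (clock cycles, lost issue slots) for one main-memory miss."""
    cycles = miss_latency_ns * clock_mhz / 1000  # ns * MHz / 1000 = cycles
    return cycles, cycles * issue_width


# 3 GHz clock, 100 ns miss, up to 3 uops issued per cycle
cycles, slots = miss_cost(clock_mhz=3000, miss_latency_ns=100, issue_width=3)
print(cycles, slots)
```

That gives 300 cycles and 900 issue slots lost to a single miss, hence "nearly a thousand".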
A high end FPGA CPU is only ~150 MHz. But you can multiply instantiate them. I have an unfinished 16-bit design in 4x8 V-II CLBs that does about 167 MHz and includes a pipelined single-cycle multiply-accumulate. You should be able to put 40 of them in a 2V1000 for a peak 16-bit computation rate (never to exceed) of 333 Mops * 40 = ~13 Gops.
(Actually, somewhat less. In an earlier MPSoC exploration I found that you must derate the single PE performance by 10-20% when you make a fabric of them, because the tools let routing spill over into adjacent PE tiles, even if the logic itself is fully floorplanned.)
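Here is a minimal sketch of that peak-rate estimate, including the derating. It takes the 333 Mops per PE figure as given (a multiply-accumulate counted as two ops at about 167 MHz) and applies the 10-20% derating as a simple multiplier. The function name and the multiplier model are my assumptions.

```python
def fabric_peak_gops(mops_per_pe, num_pes, derate=0.0):
    """Peak (never to exceed) rate of a fabric of identical PEs, in Gops."""
    return mops_per_pe * num_pes * (1.0 - derate) / 1000.0


ideal = fabric_peak_gops(333, 40)                # no derating
derated_10 = fabric_peak_gops(333, 40, 0.10)     # tools spill routing, -10%
derated_20 = fabric_peak_gops(333, 40, 0.20)     # pessimistic case, -20%
print(round(ideal, 1), round(derated_10, 1), round(derated_20, 1))
```

The ideal product works out to a little over 13 Gops; with the derating, something like 11 to 12 Gops is a more believable ceiling for 40 PEs in a 2V1000.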
In a monster 2VP100 or 2VP125 you're looking at up to 10X that -- perhaps 50 Gmacs (100 Gops). (Whether your problem can exploit that degree of parallelism, or whether the part can handle the power dissipation of such a design, I just don't know.)
When the Pentium 4 goes to main memory, it takes 50-150 ns. When the FPGA CPU multiprocessor goes to main memory, it also takes 50-150 ns. If the problem doesn't fit in cache, the P4 does not look so good.
Each P4 offers (with the help of a northbridge chipset) external bandwidth of 3.2 GB/s (64 bits at 100 MHz, quad-pumped). Each 2V1000 offers external bandwidth of at least 8 GB/s (e.g. go configure yourself four 133-MHz 64-bit (~105-pin) DDR-DRAM channels).
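The two bandwidth figures follow from bus width, clock, and transfers per clock. This is a rough sketch, reading the P4 bus as 64 bits wide at 100 MHz with four transfers per clock, and each FPGA DDR channel as 64 bits wide at 133 MHz with two transfers per clock. The function name is mine.

```python
def bandwidth_gb_per_s(bus_bits, clock_mhz, transfers_per_clock, channels=1):
    """Peak transfer rate in GB/s (decimal, 1 GB = 1000 MB)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock * channels / 1000


p4_bw = bandwidth_gb_per_s(64, 100, 4)                # quad-pumped front side bus
fpga_bw = bandwidth_gb_per_s(64, 133, 2, channels=4)  # four DDR channels
print(round(p4_bw, 1), round(fpga_bw, 1))
```

That is 3.2 GB/s for the P4 and about 8.5 GB/s for the four-channel FPGA configuration, consistent with "at least 8 GB/s".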
When the Pentium 4 mispredicts a branch, it takes many, many (up to ~20) cycles to recover. When the FPGA CPU core takes a branch (or not), it wastes 0 or 1 cycles. If you are spending cycles parsing text, the random nature of the data can eliminate many of the benefits of a deeeeeeeeeeeeeeeeeep pipeline.
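To put rough numbers on the branch penalty argument, here is a toy model. Only the ~20 cycle and 0-or-1 cycle penalties come from the text above; the branch fraction, the mispredict rate, and the taken rate are hypothetical values I picked for illustration, not measurements.

```python
def branch_stall_cycles_per_instr(branch_fraction, redirect_rate, penalty_cycles):
    """Average cycles lost to branches per instruction executed."""
    return branch_fraction * redirect_rate * penalty_cycles


# Hypothetical workload: 1 in 5 instructions is a branch.
BRANCH_FRACTION = 0.2

# P4: assume 10% of branches mispredict on random text data, ~20 cycles each.
p4_stall = branch_stall_cycles_per_instr(BRANCH_FRACTION, 0.10, 20)

# FPGA CPU: assume half of all branches are taken, 1 wasted cycle each.
fpga_stall = branch_stall_cycles_per_instr(BRANCH_FRACTION, 0.50, 1)

print(round(p4_stall, 2), round(fpga_stall, 2))
```

Under these made-up rates the P4 loses about 0.4 cycles per instruction to mispredicts and the short-pipeline core about 0.1. Note these are cycles, not time: a P4 cycle is far shorter, so this only says how much of each machine's own peak is lost.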
If I had to run Office, I'd rather have a P4.
If I had to classify XML data on the wire at wire speed, I'd rather have an FPGA MPSoC or a mesh of same.
I think most of you will enjoy this lecture.

Thursday, October 17, 2002
OT: Anne Eisenberg, NY Times: A Chip of Rubber, With Tiny Rivers Running Through It. Dr. Stephen R. Quake. The Quake Group.
Stephen R. Quake and Axel Scherer, in Science: From Micro- to Nanofabrication with Soft Materials.

Wednesday, October 16, 2002
Miscellaneous
Xilinx: Xilinx and Tensilica Announce Configurable Processor Support for Programmable Systems Design.
Xilinx: Xilinx and Crossbow Technologies Announce Parallel Processing Fabric IP for Programmable Systems Design. Crossbow. Flashback: Large-N Chip Multiprocessors in FPGAs.
An interesting FPGA CPU design thread on comp.arch.fpga.
EE Times: Microprocessor Forum 2002 coverage.
OmniWerks
The OmniWerks OmniStation802.11b (PDF) Wireless Development Kit is based upon the Nios-32 CPU and the uC/OS-II RTOS. Wireless connectivity is provided by an 802.11b compact flash card. FAQ.
The development board has an Altera APEX20K160E, and can be targeted by Quartus II Web Edition.
The founder and CEO of OmniWerks is Bryan Hoyer, who invented Nios and SOPC Builder while at Altera.
Bernard Cole, iApplianceWeb: First Look: OmniWerks Boards Add Security At Net Edge.
Last month, I wrote
"I believe there are opportunities for third parties to sell preconfigured FPGA SoC reference platforms for specific problem domains. Here the value-add is as likely to be in the software and business domains as in the hardware domain."
This is a perfect example.

Tuesday, October 8, 2002
Graham Seaman of (the excellent) Open Collector, in EEdesign: Open-source cores provide new paths to SoCs. Graham interviews Rudolf Usselmann of ASICS.ws and OpenCores.
"... So far we have ended up providing many, many hours of free tech support which almost crippled our company. Now we've started sending out friendly replies asking people to pay us for any support they might require."

Clive Maxfield, EEdesign: Reconfiguring chip design. More on the QuickSilver Adaptive Computing Machine.
"The solution is to use a heterogeneous architecture that fully addresses the heterogeneous nature of the algorithms it is required to implement. ..."
"... a scalar node can be used to execute legacy code ..."

QuickCores
QuickCores' Press Release: QuickCores Announces MUSKETEER IP Delivery System (PDF). Targets Actel's ProASIC^PLUS FLASH-based, non-volatile, reprogrammable FPGAs. "Re-programmable ASIC on a postage stamp features built-in JTAG real-time debugger, built-in boundary scan controller, built-in device programmer, and downloadable microcontroller IP.":
"Micros to Download?"
"Yes, times have changed. With the MUSKETEER, you can simply download your microcontroller from your favorite IP provider's web site and program it into the MUSKETEER as it downloads."

QuickCores' downloadable microcontroller cores. Products. Single-unit prices from $175.00.

Sunday, October 6, 2002
1, 2, 3, and so on
Oh my gosh -- this site is the #1 FPGA search result from Altavista, #2 from AllTheWeb, and #3 from Google!
To spiders and readers alike, thanks for visiting!
Another homebrew CPU
Bill Buzbee: Magic-1 Homebrew Computer. This terrific web site describes Mr. Buzbee's endeavor to build a new, microcoded, 16-bit (22-bit real address) processor, from scratch, wire-wrapped, in good old TTL (not FPGAs). Besides the hardware, the project also includes an lcc port and a microcode simulator. I encourage you to explore and appreciate the many interesting pages, particularly the diary.

Friday, October 4, 2002
FPGA Tcl CPU
AcroDesign Technologies' Tcl on Board! Embedded Platform.
"The prototype Tcl processor has been implemented on a Xilinx Spartan IIe development board. The processor can achieve speeds up to 36 MHz and occupies approximately 30k gates."
Scott Thibault, Green Mountain Computing Systems, Supporting Tcl in Hardware (PDF).
Announcement (comp.arch.fpga).
Down memory lane
MikeJ: FPGA Arcade.

FPGA CPU News, Vol. 3, No. 10

Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.