fpgacpu.org - 3-D Rendering Acceleration

3-D Rendering Acceleration

Newsgroups: comp.graphics.algorithms,comp.arch.fpga
Subject: Re: FPGA accelerated engines for volume rendering
Date: 22 Mar 1995 05:47:46 GMT

In <BENEDETT.95Mar20152030@caliban.dsi.unimo.it>
benedett-@caliban.dsi.unimo.it (Arrigo Benedetti) writes: 

>I'm looking for references to implementations of hardware accelerators
>for volume rendering algorithms (or other computationally intensive
>graphics algorithm) based on FPGA's.

I suspect this is not the volume rendering you mean, but maybe you'll find 
it interesting anyway, a kind of software/hardware practice and 
experience, if you will.

A while back, I did a design for a Gouraud-shaded, Z-buffered rendering 
accelerator, whose datapath is compiled into a Xilinx XC4003A.  Sure, it's 
probably the best understood graphics rendering problem, and my 
implementation is simple at best (e.g. no blending, no textures), but I 
wanted to see how far one could get, at home, on a hobbyist scale.  

The inner loop (one scan line) of this simple polygon rendering algorithm 
is:
	// interpolate left to right, in (r,g,b) and z, and update
	// pixels for which z is closer than zbuf[x]:
	... set up fixed point z, dz, r, dr, g, dg, b, db ...
	for (x = xleft; x < xright; x++) {
		if (z < zbuf[x]) {             // Z-buffer check
			zbuf[x] = z;           // update Z-buffer
			buf[x] = pixel(r,g,b); // update image
		}
		// advance interpolants
		z += dz; r += dr; g += dg; b += db;
	}
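
For experimenting in software, the loop above can be sketched as runnable C. This is only an illustration: the 32-bit xRGB layout in pixel(), the interp_t struct, and the name render_span are assumptions, not part of the original design. z is treated as 16.8 fixed point and r, g, b as 8.8, matching the widths given later in the post, and values are assumed non-negative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack 8-bit channels into a 32-bit xRGB pixel. */
static uint32_t pixel(uint8_t r, uint8_t g, uint8_t b) {
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
}

/* One interpolant: fixed-point value (8 fraction bits) and per-pixel step. */
typedef struct { int32_t v, dv; } interp_t;

/* Render pixels xleft..xright-1 of one scan line; smaller z is closer. */
static void render_span(uint16_t *zbuf, uint32_t *buf, int xleft, int xright,
                        interp_t z, interp_t r, interp_t g, interp_t b) {
    assert(xleft <= xright);
    for (int x = xleft; x < xright; x++) {
        uint16_t zi = (uint16_t)(z.v >> 8);      /* drop the fraction bits */
        if (zi < zbuf[x]) {                      /* Z-buffer check */
            zbuf[x] = zi;                        /* update Z-buffer */
            buf[x] = pixel((uint8_t)(r.v >> 8),  /* update image */
                           (uint8_t)(g.v >> 8),
                           (uint8_t)(b.v >> 8));
        }
        /* advance interpolants */
        z.v += z.dv; r.v += r.dv; g.v += g.dv; b.v += b.dv;
    }
}
```

The interpolants are passed by value so the caller's starting values survive the call, much as the hardware reloads its accumulators for each span.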

When attached to 32 bits of DRAM or VRAM, and assuming a 16-bit Z-buffer, 
this design required three passes, streaming over memory in fast page mode, 
to render a span of pixels across one scan line of a polygon.  That is, I 
implemented the above as three passes:

	bit closer[];
	// Pass 1: (check two Z-values per iteration)
	// initialize z0, z1, dz0, dz1
	for (x = xleft; x < xright; x += 2) {
		closer[x]   = (z0 < zbuf[x]);
		closer[x+1] = (z1 < zbuf[x+1]);
		z0 += dz0; z1 += dz1;
	}
	// Pass 2: (update up to two Z-values per iteration)
	// reinitialize z0, z1, dz0, dz1
	for (x = xleft; x < xright; x += 2) {
		if (closer[x])   zbuf[x] = z0;
		if (closer[x+1]) zbuf[x+1] = z1;
		z0 += dz0; z1 += dz1;
	}
	// Pass 3: (update zero or one pixel value per iteration)
	// initialize r, g, b, dr, dg, db
	for (x = xleft; x < xright; x++) {
		if (closer[x]) buf[x] = pixel(r,g,b);
		r += dr; g += dg; b += db;
	}

... in hardware, with each pass doing one loop iteration per clock (50 ns 
clock).
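
The three passes can also be modelled in software. The C sketch below is a hedged illustration, not the original datapath: it assumes z0 and z1 track the even and odd pixels of the span and each step by 2*dz, that the closer[] buffer holds up to 128 flags (one reading of the 64-by-2 bit SRAM mentioned below), and that pixels are 32-bit xRGB. The names render_span_3pass, interp_t, and MAX_SPAN are made up for the example. It also guards the second pixel of each pair so an odd-width span never touches zbuf[xright].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MAX_SPAN = 128 };  /* assumed: 64 words x 2 bits of closer[] storage */

/* Fixed-point interpolant (8 fraction bits): current value and per-pixel step. */
typedef struct { int32_t v, dv; } interp_t;

static void render_span_3pass(uint16_t *zbuf, uint32_t *buf, int xleft, int xright,
                              interp_t z, interp_t r, interp_t g, interp_t b) {
    bool closer[MAX_SPAN];
    int n = xright - xleft;
    assert(n >= 0 && n <= MAX_SPAN);

    /* Pass 1: read-only sweep of the Z-buffer, two compares per iteration. */
    int32_t z0 = z.v, z1 = z.v + z.dv, dz2 = 2 * z.dv;
    for (int i = 0; i < n; i += 2) {
        closer[i] = (uint16_t)(z0 >> 8) < zbuf[xleft + i];
        if (i + 1 < n)
            closer[i + 1] = (uint16_t)(z1 >> 8) < zbuf[xleft + i + 1];
        z0 += dz2; z1 += dz2;
    }

    /* Pass 2: write-only sweep of the Z-buffer, up to two updates per iteration. */
    z0 = z.v; z1 = z.v + z.dv;
    for (int i = 0; i < n; i += 2) {
        if (closer[i])
            zbuf[xleft + i] = (uint16_t)(z0 >> 8);
        if (i + 1 < n && closer[i + 1])
            zbuf[xleft + i + 1] = (uint16_t)(z1 >> 8);
        z0 += dz2; z1 += dz2;
    }

    /* Pass 3: zero or one pixel write per iteration. */
    for (int i = 0; i < n; i++) {
        if (closer[i])
            buf[xleft + i] = ((uint32_t)(uint8_t)(r.v >> 8) << 16)
                           | ((uint32_t)(uint8_t)(g.v >> 8) << 8)
                           |  (uint32_t)(uint8_t)(b.v >> 8);
        r.v += r.dv; g.v += g.dv; b.v += b.dv;
    }
}
```

For the same inputs this should leave zbuf and buf exactly as the single-pass loop would, since pass 1 records every comparison before pass 2 changes any Z value.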

((I separated passes 1 and 2 because I thought it would be easier to do 
separate read and write passes on the Z-buffer memory, pipelined, rather 
than one pass with lots of back to back read/modify/write traffic.))

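Amortized cost: 100 ns/pixel (25 ns/pixel for each of the two Z passes, 
which handle two pixels per 50 ns clock, plus 50 ns/pixel for the color 
pass), several times faster than an R4000 software approach, even assuming 
that approach packs several 8.8 bit fixed point interpolants into each 
64-bit register.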

Besides address sequencing and DRAM/VRAM control, the hardware to do the 
above is only two 24-bit accumulators (for the 16.8 bit fixed point 
interpolations of z0 and z1, and reused for 'r' and 'g' interpolation), one 
16-bit accumulator (for the 8.8 bit fixed point interpolation of 'b'), and 
two 16-bit magnitude comparators (for comparing zbuf[x] and zbuf[x+1] with 
z0 and z1), plus a 64-by-2 bit SRAM to buffer closer/farther values (wider 
polygons would be divided into abutting narrow ones).  All of which fits 
nicely in a "3000-gate" XC4003A.

((An "accumulator" in Xilinx-speak is an adder whose output is captured in 
a register "sum", and whose inputs are sum and another register "delta", so 
that "sum += delta" is formed each clock.))
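
As a software model of that description, here is a small C sketch of a fixed-width accumulator. The type accum_t and the functions accum_init, accum_clock, and accum_int are hypothetical names for illustration; the masking imitates a register of the stated width wrapping on overflow, which also lets a negative step be stored in two's complement.

```c
#include <assert.h>
#include <stdint.h>

/* A width-bit accumulator: register `sum`, register `delta`, one add per clock. */
typedef struct {
    uint32_t sum, delta, mask;
} accum_t;

static accum_t accum_init(unsigned width, uint32_t start, uint32_t delta) {
    assert(width >= 1 && width <= 32);
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    accum_t a = { start & mask, delta & mask, mask };
    return a;
}

/* One clock edge: sum += delta, wrapping at the register width. */
static void accum_clock(accum_t *a) {
    a->sum = (a->sum + a->delta) & a->mask;
}

/* Integer part of a value that carries 8 fraction bits (16.8 or 8.8). */
static uint32_t accum_int(const accum_t *a) {
    return a->sum >> 8;
}
```

A 24-bit instance holds a 16.8 value such as z0 or z1, and a 16-bit instance holds an 8.8 value such as 'b'.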

I also considered using 16 bits/pixel (565 RGB) and adding error 
distribution "dithering" to propagate the error at each pixel to later 
pixels on the same line.  This would require another adder at each 
accumulator.
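
The post does not spell out the dithering scheme, so the following C sketch shows just one simple possibility: carry the bits dropped by quantization forward to the next pixel on the same line. The function names dither_line and pack565, the clamp at 255, and the choice to work on 8-bit channel values are all assumptions for illustration. The `in[x] + err` addition plays the role of the extra adder.

```c
#include <assert.h>
#include <stdint.h>

/* Quantize 8-bit channel values along one scan line down to `bits` bits,
   carrying the quantization error of each pixel into the next one. */
static void dither_line(const uint8_t *in, uint8_t *out, int n, int bits) {
    assert(bits >= 1 && bits <= 8);
    int shift = 8 - bits;
    int err = 0;
    for (int x = 0; x < n; x++) {
        int v = in[x] + err;            /* the extra adder */
        if (v > 255) v = 255;           /* clamp so the result fits in 8 bits */
        out[x] = (uint8_t)(v >> shift); /* keep the top `bits` bits */
        err = v - (out[x] << shift);    /* the dropped bits become the new error */
    }
}

/* Pack 5-bit red, 6-bit green, and 5-bit blue into one 565 RGB pixel. */
static uint16_t pack565(uint8_t r5, uint8_t g6, uint8_t b5) {
    return (uint16_t)((r5 << 11) | (g6 << 5) | b5);
}
```

With a constant input of 4 and 5 output bits, the output alternates between 0 and 1, so the average brightness over the line is preserved even though no single 5-bit value can represent it.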

In my first couple of nights using ViewLogic, XBLOX, and XACT 
1.4-something, I was able to design and compile the datapath of the above.  
Unfortunately at that point I got stuck, trying to determine how to 
interface an R3081 and then an R4000 to the FPGA, and so never did get the 
darn rendering engine built.  (The R4000 bus protocols are nontrivial, 
especially when trying to interface to an FPGA with its own, nontrivial 
input setup/hold times and output delays.)  Now, when time permits, I am 
designing a 32-bit RISC in the left half of an XC4010, and I hope to use 
the right half for a rendering accelerator as described above.  Here 
"interpolate" (one iteration of one of the above passes) will be a machine 
instruction.

Jan Gray