
Theory of pipelining/parallelism


Tickstart

Question

I am prolific on this forum for asking the dumbest questions, but hey, at least I excel at something.

 

I was reading in an excellent book about computers (Digital Design and Computer Architecture) about an example involving baking cookies, which I can relate to. In it, Ben is putting dough on his only tray, and puts it in the oven and waits for it to bake. Once finished, he takes it out and puts more dough on the tray. In the improved version he has two trays, and while one is baking in the oven he prepares the other in the meantime. This is all jolly. However, when they actually start talking about circuits again, I'm confused.

To me, it's not obvious how adding flip-flops to everything is better. Surely no more work is being done. I can understand that in the first example (fig 3.58, or "nopipe") the total delay is 9.5 ns, and thus the clock cannot tick faster than this signal propagates, and that by sticking a flip-flop in between (thus "halving" the delay) the frequency can be increased. But it seems like you're fooling yourself thinking this will produce results faster; the flip-flop doesn't magically boost the speed of electricity, and the signals still have to pass through all the combinational logic? Can someone explain to me what is going on. I know it's gonna be one of those "oh yes of course, how stupid am I"-moments, but just give it to me, I can take it.

[Attached figures: nopipe.JPG, twopipe.JPG, threepipe.JPG]
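The numbers in the question can be played with directly. Below is a minimal Python sketch (my own illustration, not from the book); the even 4.75 ns stage split and the 0.5 ns flip-flop overhead are assumptions, but they show why cutting the 9.5 ns path in half raises the clock rate and throughput even though latency gets slightly worse.

```python
# Illustrative sketch: why pipelining helps throughput but not latency.
# The 0.5 ns flip-flop overhead is an assumed figure, not from the book.

def pipeline_timing(stage_delays_ns, flop_overhead_ns=0.5):
    """Clock period is set by the slowest stage plus register overhead."""
    period = max(stage_delays_ns) + flop_overhead_ns
    latency = period * len(stage_delays_ns)  # time until the first result
    throughput = 1.0 / period                # results per ns after that
    return period, latency, throughput

# Unpipelined: one 9.5 ns lump of combinational logic.
print(pipeline_timing([9.5]))          # (10.0, 10.0, 0.1)
# Two stages: the same logic split in half by one flip-flop.
print(pipeline_timing([4.75, 4.75]))   # (5.25, 10.5, ~0.19)
```

The first result actually takes slightly longer (10.5 ns vs 10.0 ns), but after that a new result emerges every 5.25 ns instead of every 10 ns: that gap is what pipelining buys, exactly like the second cookie tray.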


9 answers to this question


But yes, it does make sense, I guess. With respect to each flip-flop: in the first example, they have to wait for the rest of the chain before they can deliver their next unit. Think of it as a bucket brigade, when something is on fire and people stand in a row and hand each other buckets of water, passing them along the chain of people. Each bucket may actually take longer to get there compared to one person running with it to the fire, but by parallelizing it like this, many buckets can be on their way at once. Does that make sense?


I think it is best to view it as a car assembly line.

The quickest way to assemble a single car is for each team to do their step one after another; a car will take the combined time of all operations, but you are only building one car at a time. Fine if you are building a single McLaren F1 race car.

The quickest way to assemble lots of cars is a production line. Each team does their step, and then the car moves on to the next team for the next process. Making the first car takes (number of steps) * (length of longest step). But once you have your first car you get another car every (length of longest step).

But what if mounting a motor takes five minutes, and the next longest operation takes only three minutes? You can only make one car every five minutes, and the rest of the teams are idle for at least 40% of the time.

The solution might be to split mounting the engine into two steps, each taking up to three minutes. Then you can make a car every three minutes. In FPGA-speak, this is pipelining to get the highest clock speed. Big gains are easy at first, but then it gets harder and harder to get any faster, as more parts of the design become critical to the timing of the overall pipeline.

The other solution might be to merge short steps so that no combined step takes longer than five minutes. That way you need fewer teams, yet can still produce a car every five minutes... this is the "once you meet your FPGA's timing constraints, re-balancing the pipeline can save you resources" strategy.
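The assembly-line arithmetic above can be sketched in a few lines of Python (the step durations are made up to match the example, not real data):

```python
# Illustrative sketch of the production-line arithmetic (times in minutes).
# Step durations are invented to match the example above.

def cars_per_hour(step_minutes):
    # Once the line is full, a car leaves once per longest step.
    return 60.0 / max(step_minutes)

def idle_fraction(step, cycle):
    # Fraction of each cycle a team spends waiting.
    return 1.0 - step / cycle

unbalanced = [5, 3, 3, 3]           # engine mounting dominates the line
split      = [2.5, 2.5, 3, 3, 3]    # engine mounting split into two steps

print(cars_per_hour(unbalanced))    # 12.0
print(cars_per_hour(split))         # 20.0
print(idle_fraction(3, 5))          # 0.4 -> the "idle at least 40%" figure
```

The same function shows why splitting the longest stage is the whole game: the clock (cycle time) only improves when the *maximum* stage delay shrinks, which is why gains get harder as more stages approach the critical length.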

 


I've never thought about pipelining when designing a circuit. Does the synthesis tool (or compiler, or whatever it's called for VHDL) pipeline anything automatically? I guess parallelizing things is easier for a compiler to do than pipelining.

Is pipelining common in HDL designs? Thank you for an excellent explanation, ham.


Hi,

Your insight that inserting registers does not increase the amount of work done is basically correct. Pipelining does add overhead. However, as shown by the cookie-baking and car-building examples, using pipelines may increase throughput: the amount of work done per unit of time.

If you look at the design of speculative, out-of-order CPUs, you can see how extreme this can get in terms of using more energy per instruction in order to maximize instructions per second.

If you start with RTL written in Verilog, VHDL, Chisel, or Bluespec, pipelines are inserted by the developer and not by the tools. Especially for Verilog and VHDL, there is not enough information available for tools to have much hope of correctly inserting pipeline registers. Backend tools are sometimes able to retime circuits by moving combinational logic across pipeline boundaries.

High level synthesis is a different story. In this case, compilers take C/C++ and pragmas and then generate pipelined RTL. These tools can get very good results for algorithmic code.


@Tickstart,

You might find these articles valuable:

All go into the basics of how to handle the signaling associated with a pipeline: specifically, what to do if the pipeline doesn't operate on every clock, or if a later pipeline stage might be required to "stall" the whole pipeline.

Dan


>> To me, it's not obvious how adding flipflops to everything is better.

Simply chop the work into smaller pieces, then process them at a higher rate.

For example, take the Horner scheme for polynomial evaluation: c0 + x*(c1 + x*(c2 + x*c3)).

I could do that in a single clock cycle, maybe at 10 MHz. Or I can chop it into 14 operations (some to allow for hardware registers in the DSP48 blocks) and suddenly it runs at 160 MHz. I posted the code a while ago; if interested, I can look it up.
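Since the original VHDL isn't shown here, here is a hedged Python sketch of the same idea: the list `regs` plays the role of the pipeline registers, one multiply-add per "clock". The three-stage split is my own illustration, not the 14-operation DSP48 version described above.

```python
# Hedged sketch of pipelined Horner evaluation (illustration only; the
# stage split and register model are assumptions, not the posted VHDL).

def horner_combinational(x, c):
    """Evaluate c[0] + x*(c[1] + x*(...)) in one go: one long critical path."""
    acc = c[-1]
    for coeff in reversed(c[:-1]):
        acc = acc * x + coeff
    return acc

def horner_pipelined(xs, c):
    """One multiply-add per 'clock'; a new x can enter every cycle."""
    n_stages = len(c) - 1
    regs = [None] * n_stages         # pipeline registers: (x, partial result)
    results = []
    for x in list(xs) + [None] * n_stages:   # extra cycles flush the pipe
        out = regs[-1]               # value leaving the last register
        # All registers "clock" together: shift stage i-1 into stage i,
        # performing that stage's multiply-add on the way.
        for i in range(n_stages - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = None if prev is None else \
                (prev[0], prev[1] * prev[0] + c[n_stages - 1 - i])
        regs[0] = None if x is None else (x, c[-1] * x + c[-2])
        if out is not None:
            results.append(out[1])
    return results

c = [1, 2, 3, 4]   # c0 + c1*x + c2*x**2 + c3*x**3
print(horner_combinational(2, c))      # 49
print(horner_pipelined([2, 0, 1], c))  # [49, 1, 10]
```

Each result still passes through all three multiply-adds (the latency hasn't shrunk), but a new polynomial evaluation starts every cycle, which is exactly the trick that takes the hardware version from 10 MHz to 160 MHz.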


Continuing this subject, on a different note:

 

So, considering what we've learnt, how do you efficiently pipeline a design on an FPGA, i.e. in Vivado etc.? An initial thought is of course to meticulously plan your circuit; that's all well and fine. But can an acceptable approach be to write your specification, have a look at the synthesized circuit layout, and then go back and place flip-flops in the critical paths that you observe in the schematic? Perhaps Vivado even has functionality to facilitate this process.


I was going to ignore this post but today decided to read it anyway to see what has transpired.

Pipelining is not pixie dust to be dumped into a poorly thought-out design to make it better. It generally does relax timing and therefore improve overall throughput. It also introduces latency before the first result emerges from the last pipe, and that can cause confusion: figuring out what's where at any particular clock cycle requires careful bookkeeping. You can do this by hand or by sketch, or, if the overall system is pretty complicated, you may need to create your own tools to automate the analysis. Introduce storage with block RAM, or hard multipliers, or fractional math into the mix, and you can start finding clumps of scalp hair between your fingers.

Depending on the routing resources, pipelining may not have that much to offer. The DSP slices in Xilinx devices are great for systolic topologies, and when the output from one DSP slice feeds another close by. If you want to implement an iterative algorithm, you are likely to find that the lack of a feedback register and routing resources will make a very high-speed implementation difficult or impossible, even though the slice itself can run at very high frequencies.

A basic part of analyzing any design is to have an understanding of the possible delays in your combinational logic, so that you can introduce registers when the delay gets close to the clock period. If your combinational logic has feedback, this is even more important. No one likes to have a logic structure work at one clock rate in a particular device family or speed grade, only to find out on another project that it no longer works reliably. Generally, except for the simplest logic, having registers between combinational logic is better. Overall timing does, of course, depend on routing, so there is a bit of feedback here.

A design that appears to run fine at room temperature for one bitstream may fail or work unreliably on the next one, even though the design source changes seem insignificant. It depends on the decisions made by the synthesis and routing tools as to where things are placed, and those tools depend on having good constraints from the designer. Sometimes the designer has to place logic manually to achieve a level of performance; it can be a real problem even for large companies working with the most expensive parts and hundreds of engineers doing design work. Anything the designer can do to make the timing analysis easier for the synthesis and routing tools will be in the designer's favor. Most people visiting the Digilent Forum won't have to worry about pipelining or even timing constraints, but if you want to do FPGA development for a living, it will be one of a number of things to master.

For anyone interested, I suggest that you code up some test HDL using block RAM. Try using the IP with no or minimal registering on the address and data inputs and outputs, and then with maximum registering, and see how fast you can make it run. I will caution you that deciding whether or not your design works is not a trivial question, and possibly beyond your skill level. It will be a useful learning experience, I guarantee it. You will need some simulation skills. If you find that exercise boring, try using a registered block RAM as an instruction store for a programmable bit-slice processor. Anyone know what that is?


Archived

This topic is now archived and is closed to further replies.
