pipeline granularity

SmashedTransistors · August 23, 2019

Hello,

I am really enjoying my Basys3. I have already coded, simulated and tested a small SPI controller and a small I2S controller in VHDL (connected to an Adafruit DAC, but I will try it and compare it with the I2S2 soon).
Thanks to many threads on this forum (double flop...), clock domains were not such a big problem.

I am designing a formant/phase modulation synthesizer.
It will be based on 1024 "operators" (oscillators with phase modulation, phase hard sync and many other delicacies).

I use a dual-port BRAM connected to an SPI controller for input parameters (frequencies and gains) and another set of BRAMs for state variables (for example the phase of each operator).

I am designing the "operator" processor as a pipeline (so that it will calculate the equivalent of 1024 oscillators at the I2S 96kHz sample rate).

I'd like to have a rule of thumb for the granularity of the pipeline for a targeted clock frequency.
(for example the number of adders or multiplexers between two pipeline registers at 100MHz or 200MHz)

I browsed many documents and did not find such a rule of thumb, so I tend to over-pipeline my design, which makes it quite confusing.

Is there a Xilinx document that gives some advice or good practice?
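For reference, the clock budget implied by the numbers in this post (1024 operators, 96 kHz sample rate, 100 MHz or 200 MHz clock) can be worked out with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the pipeline accepts one operator per clock cycle, which the post does not state.

```python
def cycle_budget(clock_hz, sample_rate_hz, operators):
    """Return (clock cycles per audio sample, clock cycles per operator)."""
    cycles_per_sample = clock_hz / sample_rate_hz
    return cycles_per_sample, cycles_per_sample / operators


def min_clock_hz(sample_rate_hz, operators, cycles_per_operator=1):
    """Lowest clock that still visits every operator once per sample.

    Assumes the pipeline accepts a new operator every `cycles_per_operator`
    clocks; that issue rate is an assumption, not something from the post.
    """
    return sample_rate_hz * operators * cycles_per_operator


if __name__ == "__main__":
    for clock_hz in (100_000_000, 200_000_000):
        per_sample, per_operator = cycle_budget(clock_hz, 96_000, 1024)
        print(f"{clock_hz / 1e6:.0f} MHz: {per_sample:.1f} cycles/sample, "
              f"{per_operator:.3f} cycles/operator")
    print(f"minimum clock: {min_clock_hz(96_000, 1024) / 1e6:.3f} MHz")
```

Under that assumption, 100 MHz leaves about 1041 cycles per sample, only 17 more than the 1024 operator slots, and the lowest workable clock is 98.304 MHz. At 200 MHz there are roughly two cycles per operator.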

zygot · August 23, 2019

51 minutes ago, SmashedTransistors said:

I'd like to have a rule of thumb for the granularity of the pipeline for a targeted clock frequency.

The only rule of thumb that I know of for pipelining is that when the combinational logic and routing delays of a stage approach the period of your clock, you should add a clocked register between that stage and the next. And here is the dilemma: until the design is synthesized, placed and routed, there are a lot of unknowns. Even if each stage depends solely on the outputs of the previous stage, rather than on a lot of 'global' signals controlling a bunch of stages, there are differences in clock edges from LUT to LUT. This might be insignificant, but likely not. When trying to pipeline a very large design, as you are doing, things get messy.
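As a rough illustration of that rule of thumb, the sketch below estimates how many stages a combinational path has to be cut into for a given clock. Everything numeric here is a placeholder: the path delay would come from a timing report after place and route, `overhead_ns` is a hypothetical allowance for register setup, clock-to-output and skew, and the calculation optimistically assumes the logic can be split into equal pieces.

```python
import math


def stages_needed(path_delay_ns, clock_hz, overhead_ns=1.0):
    """Number of pipeline stages so each stage fits in one clock period.

    overhead_ns is an assumed per-stage allowance (setup time,
    clock-to-output, skew); take real values from the timing report.
    """
    period_ns = 1e9 / clock_hz
    usable_ns = period_ns - overhead_ns
    if usable_ns <= 0:
        raise ValueError("overhead leaves no time for logic at this clock")
    return math.ceil(path_delay_ns / usable_ns)


def registers_to_add(path_delay_ns, clock_hz, overhead_ns=1.0):
    """Registers to insert inside a path that currently has none."""
    return stages_needed(path_delay_ns, clock_hz, overhead_ns) - 1


if __name__ == "__main__":
    # 14 ns is a made-up path delay used purely for illustration.
    for clock_hz in (100e6, 200e6):
        print(clock_hz / 1e6, "MHz:", registers_to_add(14.0, clock_hz), "register(s)")
```

With these placeholder numbers, the same 14 ns path needs one extra register at 100 MHz and three at 200 MHz, which is why the answer depends so heavily on delays that are only known after a build.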

One option is to manually place logic rather than let the P&R tools do it. This optimizes delays and routing plus adds consistency from build to build.

I'd suggest starting off with a clock and a sub-set of the target design to see how things are going. Then scale up incrementally, addressing timing closure issues as you go. As you've no doubt already found out, adding registers improves performance but also adds latencies that can make identifying the scope (in time) of any signal relative to other stages of a design problematic. A bit of C coding might help. I would definitely suggest working out the data flow in advance rather than as a 'seat of the pants' exercise. Diagrams help to a point. A problem with using hard multipliers in the DSP blocks is that they are scattered throughout the device and can incur substantial path delays if your design needs a lot of them. It never hurts to pore over the complicated DSP48E literature to see what Vivado handles in the background for most usages. You can use macro instantiations, but be prepared to work hard. If that's what your design requires, then that's what you will have to do.

SmashedTransistors · August 23, 2019

Thanks Zygot,

I use an Axoloti board and code my algorithms in C to test them and evaluate the different options (quantization, polynomial waveshaping vs. sine in RAM, etc.). It really helps to know how it should sound.

I decomposed the pipeline into smaller entities. The architecture of each component already includes fine pipelining. I try to track the clock delays introduced by each architecture so that the result will be consistent.
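One way to keep that bookkeeping consistent is to record each component's latency in clock cycles and compute how many plain delay registers every parallel branch needs so that all signals arrive at a merge point in the same cycle. The sketch below assumes exactly that setup; the branch names and latencies are hypothetical and not taken from the actual design.

```python
def balance_latencies(branch_latencies):
    """Return the delay registers each branch needs to match the slowest one.

    branch_latencies maps a branch name to its latency in clock cycles.
    """
    target = max(branch_latencies.values())
    return {name: target - cycles for name, cycles in branch_latencies.items()}


if __name__ == "__main__":
    # Hypothetical branches feeding one adder; the numbers are illustrative.
    branches = {
        "phase_accumulator": 2,
        "modulation_multiply": 4,
        "gain_lookup": 3,
    }
    for name, padding in balance_latencies(branches).items():
        print(f"{name}: add {padding} delay register(s)")
```

Keeping such a table next to the VHDL means that changing the internal pipelining of one component only requires recomputing the padding, rather than re-deriving the timing of every signal by hand.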

With all the elements I plan, I would say that the pipeline will consist of 20 to 30 registers.

I think it will be possible to add extra pipeline registers between the components if the internal ones are not enough. Or should I implement different architectures for the same entities with different pipelining?

zygot · August 23, 2019

42 minutes ago, SmashedTransistors said:

Or should i implement different architectures for the same entities with different pipelining ?

First off, you are clearly working with a board that I am unfamiliar with. Regardless, I have no way to provide a decently helpful answer to your question. I have been in the situation where I had to make architectural changes to a design (including a completely different approach) to meet timing, especially for one that needs to be incorporated into a larger overall design that uses most of the logic and memory resources, runs at a high clock rate and has multiple clock domains. Big factors are how much of the device resources your design will use, how the logic interconnect works, how the clock routing works for your device, etc.

What I can say with some confidence is that you should experiment with a slower clock and a scaled-back design. Simulate it. Get a bitstream and look over the timing report to get an idea of path delays. Figure out how many LUTs are needed for the basic design elements. Then start scaling things up. Do timing simulation for each step as you go. It is not uncommon when working on an ambitious project to get to a conclusion faster by starting off with a few simpler preliminary design projects that help you get answers to unknowns. In engineering, the shortest path (in time) between two points (start and finish) is not necessarily a straight line.

It's not been my experience that you can 'fix' a design that doesn't run properly at a certain clock rate by adding a few registers here or there. Pipelining strategies, in my experience, are more of a holistic architectural effort that starts with the basic elements and continues as they are grouped into larger entities.

xc6lx45 · August 24, 2019

>>thus i have a tendency to over-pipeline my design

Read the warnings. If a DSP48 has pipeline registers it cannot utilize, it will complain. Similarly for BRAM: it needs to absorb some levels of registers to reach nominal performance. I'd check the timing report.

At 100 MHz you are maybe at 25..30 % of the nominal DSP performance of an Artix, but I wouldn't aim much higher without good reason (200 MHz may still be realistic, but the task gets much harder).

A typical number I'd expect could be four cycles for a multiplication in a loop (e.g. IIR).

Try to predict resource usage: if FFs are abundant, I'd make that "4" an "8" to leave some margin for register rebalancing. An "optimal" design will become problematic in P&R when utilization goes up (but obviously, FF count is only a small fraction of BRAM bits, so I wouldn't overdo it).
