Question on DSP slice timing constraints + primitives.


SigProcbro

Question

Hi,

I was wondering, for FPGA devices (like the Arty S7), how does one find the maximum time it takes for a primitive to settle into a stable state? For example, let's say I have a 10-tap FIR filter. In its simplest implementation there are 10 multiplies (which can be done in parallel) and 10 sums (which must be done sequentially, although there is a tree-based configuration which requires more summers but incurs lower total delay). How do I find the propagation delay for the 10 sums so I can figure out the fastest rate at which the FIR filter can process inputs?
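For a rough feel of the chain-versus-tree trade-off described above, here is a minimal Python sketch that counts adders on the longest combinational path. It is only a counting model: the function names are made up for illustration, it assumes two-input adders of equal delay, and it ignores routing delay entirely. By this count, summing ten products takes nine two-input adders in either arrangement; what changes is how many of them sit in series.

```python
def adder_count(n_terms: int) -> int:
    """Two-input adders needed to sum n_terms values (chain or tree)."""
    return max(n_terms - 1, 0)


def chain_depth(n_terms: int) -> int:
    """Adders in series when the terms are summed one after another."""
    return max(n_terms - 1, 0)


def tree_depth(n_terms: int) -> int:
    """Adder levels in series for a balanced binary adder tree.

    Equal to ceil(log2(n_terms)), computed with integer arithmetic.
    """
    return (n_terms - 1).bit_length() if n_terms > 1 else 0


taps = 10
print(adder_count(taps))  # 9
print(chain_depth(taps))  # 9
print(tree_depth(taps))   # 4
```

Multiplying the depth by a per-adder delay gives only a lower bound on the path delay, since the routing between the adders is not included.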

Also, I'm a little confused about the differences between using VHDL operations (e.g. *) versus manually setting up the DSP slices. Will the synthesis tool automatically configure the DSP slices for me if I use the multiply operation? In general, how does manually instantiating the primitives compare to using VHDL library operations?

6 answers to this question

The "DC and Switching characteristics" data sheet tells you the delays in the primitives, but can't tell you the routing delays. The only way to truly know is to build the design in Vivado, and then look at the timing report.
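Once a timing report is available, the achievable clock rate for a path can be estimated from the constrained clock period and the worst slack the report lists. The sketch below is a minimal Python illustration, not a Vivado API: the function name and the example numbers are made up, and it assumes slack is signed (positive when timing is met, negative when it fails).

```python
def fmax_mhz(clock_period_ns: float, worst_slack_ns: float) -> float:
    """Estimate the maximum clock frequency from a timing report.

    clock_period_ns: the period given in the clock constraint.
    worst_slack_ns:  worst setup slack (negative means timing failed).
    """
    min_period_ns = clock_period_ns - worst_slack_ns
    if min_period_ns <= 0:
        raise ValueError("slack cannot be as large as the clock period")
    return 1000.0 / min_period_ns


# Hypothetical 100 MHz constraint (10 ns period):
print(f"{fmax_mhz(10.0, 2.0):.1f}")   # 125.0 -> 2 ns of margin
print(f"{fmax_mhz(10.0, -2.5):.1f}")  # 80.0  -> failing by 2.5 ns
```

Treat the result as an estimate. The tools tend to stop optimizing once the constraint is met, so rebuilding with a tighter constraint may show that the design can run faster than this figure suggests.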

Inference of DSP blocks and features is pretty good as long as your design is structured to map onto the DSP slices. There are little gotchas, such as not trying to reset registers in the DSP slice in a way they don't support (the slice's registers have synchronous resets only).

Skim reading the DSP48 User Guide will pay off many times over in time saved from not having to redesign stuff over and over to help it map to the hardware. 

38 minutes ago, hamster said:

The "DC and Switching characteristics" data sheet tells you the delays in the primitives, but can't tell you the routing delays. The only way to truly know is to build the design in Vivado, and then look at the timing report.

Inference of DSP blocks and features is pretty good as long as your design is structured to map onto the DSP slices. There are little gotchas, such as not trying to reset registers in the DSP slice in a way they don't support (the slice's registers have synchronous resets only).

Skim reading the DSP48 User Guide will pay off many times over in time saved from not having to redesign stuff over and over to help it map to the hardware. 

So how would synthesis handle it if the logic couldn't complete in a single clock cycle?

For example if I run a process that adds up the output of the multipliers and have it output every clock cycle - will Vivado give some type of error? 

The procedural way to write this is to have the system loop over each coefficient in the FIR filter, multiply it by the input signal and accumulate (and I have seen some VHDL code which does this). Can the synthesizer take procedural code like this and turn it into parallel hardware on its own?

I kind of feel VHDL is a lot harder than regular programming, since it's not clear how the synthesizer will handle it, while it's pretty clear how it's done procedurally.

@SigProcbro,

There's a better way to implement an FIR than an adder tree. Basically, you apply the input to every coefficient in the FIR at the same time, and then add things together in a line, with FFs between them. I wrote about this method some time ago. Even better, DSPs are optimized for this kind of operation, so they can handle it at pretty high speeds. That said, there's a limitation associated with the bit widths you choose, so I would second @hamster's advice to read the fine manual.
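The structure described above (every coefficient sees the current input, with a register between each pair of adders) is commonly called the transposed form. Here is a minimal behavioural sketch in Python, not HDL, that checks it against the textbook delay-line form. The function names are made up for illustration, samples and coefficients are plain integers, and the output register a real design would likely add is left out.

```python
def fir_direct(coeffs, samples):
    """Delay-line form: y[n] = sum over k of coeffs[k] * x[n - k]."""
    delay = [0] * len(coeffs)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]  # shift the new sample in
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


def fir_transposed(coeffs, samples):
    """Transposed form: one multiply and one add per register stage."""
    regs = [0] * len(coeffs)  # regs[k]: registered output of adder k
    out = []
    for x in samples:
        # Adder k sums its own product with the registered result of
        # adder k + 1; the last adder has nothing upstream, so it gets 0.
        upstream = regs[1:] + [0]
        sums = [c * x + u for c, u in zip(coeffs, upstream)]
        out.append(sums[0])
        regs = sums  # clock edge: every register updates at once
    return out


coeffs = [1, 2, 3]
samples = [1, 0, 0, 4, 5]
print(fir_direct(coeffs, samples))      # [1, 2, 3, 4, 13]
print(fir_transposed(coeffs, samples))  # [1, 2, 3, 4, 13]
```

In the transposed form the path between any two registers contains one multiply and one add regardless of the number of taps, which is why the clock rate does not fall as the filter grows.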

As for knowing what the synthesizer does, there are lots of ways to visualize that. One common method is to run the synthesis engine and then open the synthesized design. You can then examine your logic as a group of connected logic blocks. This can be very informative for small designs. Generally, once the design gets large enough, this display gets too difficult to read, and so you'll likely start resorting to other methods. Still, it does have its place.

Dan

I've used DSP slices as primitives and it's no easy task. Do read the DSP User's Manual to see how it works. As has been pointed out, one unknown is the routing of outputs between slices and other resources like LUTs and BRAM. This depends on your data structure. One thing that would greatly improve DSP performance for iterative algorithms would be a small, say 4 deep, R/W register that feeds the multipliers and can be written to from the adder output; as if they need to be more complicated...

If you can have an ideal structure feeding one DSP slice to another you can make use of the very fast performance that DSP slices share with other hard logic like BRAMs and FIFOs.

You don't have to do math with DSP slices; I've seen them used as high speed counters for instance.

By and large the synthesis tools do a serviceable job using DSP slices for multiplies and multiply-accumulators from simple HDL code. If you need extreme throughput for a pipelined application then be prepared to spend some time understanding all of the side-effects for every design choice you make. These are complicated resources. If you let Vivado handle their usage the complexity is invisible; if you decide to take control over all of the details be prepared for some work.  Personally, I prefer Intel's DSP architecture choices over Xilinx's but haven't had many reasons to complain about either one in my designs, when I need them.  Basically, you only really need to take charge if you have to get the highest possible performance or have an odd application.

If you are doing fixed-point operations, and/or need to operate on data much wider than the standard 18x18 multiplier, the book-keeping can get serious.
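To give a sense of that book-keeping, here is a minimal Python sketch of a wide multiply built from narrow partial products. It is an arithmetic model only, with made-up names, and it assumes unsigned operands split into 17-bit pieces (the non-negative range of a signed 18-bit multiplier input). Signed operands and the mapping onto actual DSP slices need more care than is shown here.

```python
CHUNK_BITS = 17                # assumption: unsigned 17-bit pieces
MASK = (1 << CHUNK_BITS) - 1


def split(value: int):
    """Return (low, high) 17-bit pieces of an unsigned value."""
    return value & MASK, value >> CHUNK_BITS


def wide_multiply(a: int, b: int) -> int:
    """Multiply two unsigned values of up to 34 bits each using four
    17x17 partial products, each small enough for one narrow multiplier."""
    if not (0 <= a < 1 << 2 * CHUNK_BITS and 0 <= b < 1 << 2 * CHUNK_BITS):
        raise ValueError("operands must be unsigned and at most 34 bits")
    a_lo, a_hi = split(a)
    b_lo, b_hi = split(b)
    low = a_lo * b_lo                   # weight 2**0
    middle = a_lo * b_hi + a_hi * b_lo  # weight 2**17
    high = a_hi * b_hi                  # weight 2**34
    return low + (middle << CHUNK_BITS) + (high << 2 * CHUNK_BITS)


a = (1 << 34) - 1
b = 0x2ABCDEF12
print(wide_multiply(a, b) == a * b)  # True
```

Each partial product has a different weight, so a hardware version has to align them with shifts and keep the carries between them straight. That alignment is where most of the effort goes.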

On 12/21/2019 at 11:06 PM, SigProcbro said:

So how would synthesis handle it if the logic couldn't complete in a single clock cycle?

Timing closure will fail, but read the warnings: you'll get a notification when your design lacks the extra registers before and/or after the inferred DSP48 block that need to be absorbed into the DSP48 hardware unit for maximum speed.

And there's the phrase that makes FPGA developers nervous: 'timing closure'. The tools have general strategic options for synthesis and place and route. At the end of the day (or design cycle), timing closure is the responsibility of the designer. With a fixed-architecture solution like a DSP device or GPU you often have to consider how you are going to accomplish a task within size, power, thermal, and cost budgets. With a highly configurable device like an FPGA you add considerations like timing closure. This gets difficult to maintain throughout a product life cycle as features get tweaked and added. Here is where timing simulation is critical.

Fortunately you have options. You can prototype designs to see what the upper limits for clocking a design are for particular devices. You can go wide. You can go parallel processing. You often can restrict capability. Sometimes you can consider a higher tier device family. Not infrequently, a restructuring of your implementation approach is required to comfortably fit a design into a device. For a commercial application this usually means having a design that survives years of maturing or customization.

Companies new to programmable logic are usually shocked to discover that given the exact same tools version, the exact same sources, and the exact same strategy options, two configuration file builds won't result in identical bitstreams. This is the nature of the beast. Before committing to a solution one needs to do experiments and analysis to figure out what's practical in the long run.

Oh, and there can be subtle differences between DSP slices depending on the family.

Archived

This topic is now archived and is closed to further replies.
