Understand Resource Usage for FIR Compiler

aabbas02 · April 27, 2019

I am trying to investigate how sparsity (the number of zeroes in the filter coefficients) affects the implementation of a filter on the FPGA. Specifically, I am interested in the number of BRAM blocks and DSP slices used for a sparse filter versus a non-sparse filter of the same length. I assume the filter coefficients to be symmetric and the number of taps to be odd.

I have been experimenting with the FIR compiler GUI and have observed the following.

For one output per cycle (No overclocking),
The filter coefficients [1 2 3 4 0 1 2 3 4 ] use 9 DSP slices. Shouldn't this be 4 DSP slices (if we use the symmetry of the coefficients)?
The filter coefficients [1 0 0 0 0 0 0 0 1] use 5 DSP slices. We have just the two multipliers here? Why do we need 5 DSP slices?
For both of the filter coefficients, no BRAM units are used.

When I increase the clock frequency, the number of DSP slices used goes down (although, I do not see a clean division here i.e. increasing the clock frequency by a factor of 2 does not reduce the DSP slices by half), but the number of BRAM units used increases. For example, the filter coefficients [1 2 3 4 0 1 2 3 4] use 6 DSP slices, (down from 9), and 5 BRAM blocks (previously 0 in the not overclocked case) when overclocked by a factor of 2. Similarly, overclocking by a factor of 3 reduces the number of DSP slices to 4, and the BRAM count goes to 3. Is there some relation between the number of DSP slices and the BRAM count?

Note: I have also posted this question to Xilinx Forums, but have not received any feedback.

xc6lx45 · April 27, 2019

Hi,

[1 2 3 4 0 1 2 3 4] is not symmetric in a linear-phase sense. That would be e.g. [1 2 3 4 0 4 3 2 1]. You could exploit the shared coefficients manually, see e.g. Figure 3 for the general concept. But this case is so unusual that I doubt tools will take it into account.

The tool does nothing magical. If performance matters more than design time, you'll always get better results for one specific problem with manual design. One performance metric is multiplier utilization (e.g. assuming you design for a 200 MHz clock, one DSP delivering 200M operations / second performs at 100 %. Reaching 50+ % is a realistic goal for a simple / single rate structure).

For example, do I want to use an expensive BRAM at all, when I could use ring shift registers for delay line and coefficients. Then you only need a small controlling state machine around it that does a full circular shift for each sample, muxing in the new input sample every full cycle (the BRAM makes more sense when the filter serves many channels in parallel, then FF count becomes an issue).

zygot · April 27, 2019

2 hours ago, xc6lx45 said:

The tool does nothing magical.

Amen. And that's a good view for all Xilinx IP. The structure for a simple digital filter is not that complex; you can implement them in HDL. I've done that. Xilinx IP is convenient but usually not the best approach when you have concerns about using up limited resources. The issue for using very fast resources like BRAM and DSP slices is that they are placed in particular locations throughout the device with limited routing resources for the signals between them or other logic. You can let Xilinx balance throughput, resource usage, logic placement, and throughput or you can try to do that yourself. Trying to use 100% of every BRAM or DSP resource in order to minimize the number of BRAM or DSP resources used is not easy. In my experience FPGA vendors are content to have their IP wizards make the customer think that he needs a larger and more expensive device. So that's the trade-off; let the vendors' tools do the work to save time or write your own IP and be responsible for taking care of all the little details that the IP hides form you.

I've spent some time experimenting with DSP resources from various FPGA vendors. They are complicated with a lot of modes and depending on how you use them throughput can decline substantially from the ideal. Just read the user's guide and switching specs in the datasheet to get the idea. Generally the DSP slices are arranged to perform optimally with certain topologies but not all. Implementing designs that are iterative or have feedback can get ugly; especially when you try and fit that into a larger design using most of the devices resources.

As a general rule, in my experience, use vendor IP and don't ask a lot of questions or design your own IP and be prepared to learn how to handle a lot of details that aren't obvious. Time verses convenience.

aabbas02 · April 27, 2019

Thank you!

I think I understand what you guys mean. It's just that I have come across talk of the number of multipliers in a filter being a measure of the filter's complexity. I just expected the implementation of filters to take that into account; however, what I gather from reading your posts is that the IP still (even when there are a large number of zeros in the filter coefficients) uses the full number of DSP slices because it's easier to shift the input chain? <--- Is this an accurate conclusion?

But, I am still a little short on understanding why we end up using BRAM units when we overclock to reduce the number of DSP slices. I understand zygot's reasoning that one cannot force complete usage of any particular resource (BRAM block in this case). But what necessitates the usage of BRAM for overclocking in the first place? Is the BRAM being used to store the input chain?

Of course, I also see why it's important to move on and let the vendor IP do what it has to do, but a little more context would be really nice too.

Thanks!

zygot · April 28, 2019

Xilinx encrypts the FIR compiler code so it's difficult to figure out what's going on with your experiments. You should read PG149 which is the IP product Guide for the FIR Compiler. It does indicate when coefficient symmetry optimizations might or won't be made. It also has a link to a Resource Utilization "Spreadsheet" for some FPGA family devices depending on your customized specifications. Oddly, that bit of guidance doesn't show any usages where significant numbers of BRAMS are ever used. The guide does have a section titled "Resource Considerations" that has some interesting information but not enough to answer you questions. I suspect that you'll have to understand actual resource utilization for your application empirically. (SOP from my experience when I really need optimal performance)

It's always a good idea to be familiar with all device and IP documentation though, in my experience, the IP documentation rarely addresses all of my questions when I'm deep into a design and committed to IP that's not mine.

Again, my guess is that when you specify a clock rate, sample rate, data and coefficient widths, etc resource utilization increases with increasing throughput. The same thing happens if you pipeline your code to maximize data rates. But I'm only guessing. I don't use the FIR Compiler so my views are an extrapolation of experience with other IP from all FPGA vendors that I have used.

D@n · April 28, 2019

@aabbas02,

So let's start at the top. An N-point FIR filter, such as this one, requires N multiplies and N-1 adds. Let's just count multiplies, though, for now. Let's now look at some optimizations you might apply.

A complex multiply usually requires 4 multiplies. If you have complex input and taps, that's 4N real multiplies. There's a trick you can use to drop this to 3N multiplies.
If the filter is real, and the incoming signal is complex, you can drop this to 2N multiplies by filtering each of the real and imaginary channels separately.
If your filter taps are fixed, a good optimizer should be able to replace the zeros with a constant output, the ones with data pass through, etc. Implementing taps with two or three bits this way is possible. This only works, though, if the filter taps are fixed.
Many of the common FIR filter developments generate linear phase filters. These filters are symmetric about a common point. With a little bit of work, you can use that to reduce the number of multiplies (for dynamic taps) down to (N-1)/2.
There is a very significant class of filters called half band filters. In a half band filter, every other tap (other than the center tap) is zero. With a bit of work, you can then drop your usage down to (N-1)/4 multiplies. This optimization applies to Hilbert transforms as well.

In the given list above, I've assumed that you need to operate at one sample in and one sample out per clock. In that case, there's no time or room for memory, since all of the stages need their data on every clock.

I should also point out that I'm counting multiplies, not DSP slices. If your multiplies are smaller than 18x18 bits, you may be able to use a single DSP slice per multiply. If they are larger, you might find yourself using many DSP slices per multiply. It depends.

Let's now consider the case where you have an N sample filter but you are only going to provide it with one sample of data every N+ samples (some number more than N). You could then multiplex this multiply and implement an N-point filter with 1 multiply and 2^(ceil(log_2(N)) RAM elements. If your filter was symmetric, you could process a filter roughly twice as long, or a sample rate roughly twice as fast while still using a single multiply. The half-band and hilbert tricks can apply to (roughly) double your filter size or your data rate again. In these cases, however, you can't spare the multiply at all, since it is used to implement every coefficient.

That should give you both ends of the spectrum. Now, while I don't understand how Xilinx's FIR compiler works, I can say that if I were to make one I would allow a user to design crosses between these two ends of the spectrum. In the case of a such a cascaded filter, however, you may find it difficult to continue to implement the optimizations we've just discussed, simply because the cascaded structure becomes more difficult to deal with. (By cascade, I mean cascade in implementation and not filtering a stream and then filtering it again.)

Looking at your two designs, none of the optimizations I've mentioned would apply. In the high speed filters we started out with, sure, you might manage to remove a multiply or two by multiplying by zero. On the other hand, you can't share multiplies across elements if you do so. For example, you mentioned [1 2 3 4 0 1 2 3 4]. Sure, it's got repeating coefficients, but this is actually a really obscure filtering structure, and not likely one that I (were I Xilinx) would pay to support. Try [ 1 2 3 4 0 4 3 2 1] instead and see if things change. Similarly for [ 1 0 0 0 0 0 0 0 1]. Yeah, it's symmetric about the mid-point, but in all of my symmetric filtering applications I'm assuming the mid point has a value of 2^N-1 or similar so that it doesn't need a multiply. Your choice of filters offers no such optimization, so who knows how it might work.

Hope this helps to shed some light on things,

Dan

aabbas02 · April 28, 2019

Dan,

Thank you so much! Your post answers so many questions.

So, I never intended for the filter coefficients to be [1 2 3 4 0 1 2 3 4] - I understand that this is not the definition of symmetry. With the coefficients [1 2 3 4 0 4 3 2 1], the DSP slice usage goes down (as expected).

I was reading up on the DSP48E1 Slice element here . Turns out that the element comes with something called a pre-adder (Fig. 2-14 in the linked document) that is able to exploit symmetry. For example, see page 25,26 of PG 149 . The Xilinx document recognizes and optimizes 2 coefficient structures: even symmetric and odd symmetric. One would expect that this could easily be extended to coefficient structures that have a mixture of both even and odd symmetry (for eg. [1 2 3 4 0 -4 3 -2 1], but this does not appear to be so).

Is it possible to make the IP optimize mixed symmetry too (The presence of the pre-adder in the slice seems to suggest that this should be (easily) possible).

Thanks!

xc6lx45 · April 28, 2019

9 hours ago, aabbas02 said:

But what necessitates the usage of BRAM for overclocking in the first place? Is the BRAM being used to store the input chain?

The word "overclocking" may be even misleading - this architecture is used when the data rate is significantly lower than the multiplier's speed.

Inputs to one (expensive) multiplier are multiplexed, so it can do the work for several or all taps. The "multiplexing" itself will get fairly expensive in logic fabric due to the number of data bits and coefficients. To the rescue comes the BRAM, which is essentially a giant demultiplexer-multiplexer combo, with a football field of flipflops in-between.

You can find an example for this approach here. Out-of-the-box, it's unfortunately quite complicated because it does arbitrary rational resampling. Setting the rates equal for a standard FIR filter, you end up with only a few lines of remaining code.

BTW, multiplier count as design metric is probably overrated nowadays for several reasons (the IP tool resource usage is already more practical, e.g. BRAM count may become the bottleneck).

If you can get this book through your library, you might have a look e.g. at chapter 6 (background only, this is a very old book):

Keshab K. Parhi VLSI Digital Signal Processing Systems: Design and Implementation

zygot · April 28, 2019

So, if I am getting the point of the previous two posts from xclx45 and Dan, make your own filter in Verilog or VHDL and figure out the details ( signed, unsigned, fixed point, etc, .... etc ). You can instantiate DSP48E primitives (difficult) or just let the synthesis tool do it from your HDL code (easier). Debating how things should be verses how they are when using third party scripts to generate unreadable code seems like a waste of time to me... If you don't like what you get then design what you want. If you can make the time to write your own IP ( would be nice to not depend on a particular vendor's DSP architecture ) you'll learn a lot and save a lot of time later. If a vendor's IP doesn't make timing closure for your design a nightmare and you don't have the time to figure out all the details just let the IP handle it. I suspect that trying to optimize the FIR Compiler will be frustrating at best.

I once had to come up with pipelined code to calculate signal strength in dB at a minimum sustained data rate. My approach to converting 32-bit integers to logarithms used a combination of Taylor Series expansion and look-up tables. I had a few versions**. One was straight VHDL so that I could compare Altera and Xilinx DSP tiles. One instantiated DSP48 tile primitives for a particular Xilinx device. These were fixed point designs. There's theory and there's practical experience.. they are usually not the same.

** I played with a number of approaches based on extremely limited specifications so there were quite a few versions. Every time I presented one the requirements changed and so did the complexity and resource requirements.

I should mention that my intent for mentioning this experience is not to denigrate the information presented by others or to claim superiority in any way. When getting advice it's important to put that into context. A lot of times facts aren't necessarily relevant to solving a particular problem.

If I haven't made this clear I've never had the experience that vendor IP optimizes resource usage... in fact quite the opposite. This is why in a commercial setting companies are willing to pay to develop their own IP. Sometimes FPGA silicon costs overshadow development costs.

aabbas02 · April 28, 2019

Thank you!

So, I feel I should have been more clear about why I am looking into this question. I am going through some old papers on FIR design for specific applications and nearly all the papers suggest a "proposed filter structure", using which the number of multipliers comes down. Now, what I gather from this discussion here is that, while useful if one were laying out the actual hardware of the circuit, the 'proposed filter structure' is not readily understood by the IP (except for a few special cases like symmetry and antisymmetry of the coefficients about some central tap value).

With that limitation of the IP, it then becomes reasonable to think of designing a filter that has a structure which can be exploited by the IP. This would be desirable if one wanted to use off the shelf vendor IP but still have some degree of optimality in terms of resource consumption.

Thank you for your responses! I'll be going over them again to make sure I have the correct idea.

xc6lx45 · April 28, 2019

Just be aware that most of the "legacy" material on FIR filters limits itself to what can be presented conveniently. Numerical optimization is the tool of choice and there is no "cheating". Or, taking one step back to the filter specs, there is usually no need to specify a flat stopband and it can significantly reduce the required filter size (credits to Prof. Fred Harris) This only as.example where I can avoid unnecessary constraints from using a ready-made design process by writing my own solver. Which is actually not that hard, basing it on fminsolve or fminunc in Matlab / Octave.

BTW, one reference on this topic I found useful:

author = {Mathias Lang},
title = {Algorithms for the Constrained Design of Digital Filters with Arbitrary Magnitude and Phase Responses},
year = {1999}

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.9336

The title alone is interesting - "Arbitrary Magnitude and Phase Responses" - no one says a digital filter needs to be flat or linear phase (of course, at the expense of symmetry).

Sometimes I wonder, our thinking seems to often get stuck in patterns and "templates". Take analog filters, for example: Chebyshev, Butterworth or Bessel, which one do I pick? But those are just corners of the design space, and the best filter for a given application is most likely somewhere in-between, if I can only design it (which is, again, a job for a numerical optimizer, even though this one is more difficult).

aabbas02 · April 28, 2019

Ah! I do not think it could have been said better because it really describes my situation.

I have filter coefficients that have mixed symmetry ( [1 2 3 4 0 -4 3 -2 1] is an example of what I mean by mixed symmetry ).

I want to be able to argue that this is just as good as ( [1 2 3 4 0 -4 -3 -2 - 1] ) or ( [1 2 3 4 0 4 3 2 1] ), but the Xilinx (or Altera) tools do not seem to think so.

@D@n ,
I tried out a half-band filter, but it didn't reduce the number of DSP Slices consumed. I think the IP prefers to use the full number, i.e 25 DSP slices for 49 tap half-band filter, because of the input chain shift requirement, as @xc6lx45 suggested earlier. The Xilinx Document for FIR Compiler, PG149 , does not explicitly state any reduction in the number of DSP slices. It only mentions minimizing arithmetic requirements by exploiting coefficient symmetry on pages 25,26.

D@n · April 29, 2019

@aabbas02,

What's the data rate of the filter compared to the number of taps? As in, are you going to need to accept one sample per clock into this filter, or will you have reliable idle clock periods to work with?

I'm wondering, if you have to build this, how easy/hard it might be.

Dan

aabbas02 · April 29, 2019

Actually, I can do either. I just want to show that mixed symmetry also reduces to the symmetric or anti-symmetric case. I don't have any specific latency/throughput requirement. Thanks!

Understand Resource Usage for FIR Compiler

Question

Link to comment

Share on other sites

13 answers to this question

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Archived