• 0

# working of pipelined FFT architecture

## Question

how pipelined FFT is working from the xilinx FFT pdf if input data is loading and unloading data in memory then how stage1  radix-2 FFT is fed to stage-2 radix-2 FFT  and where latency is stored ?how it is working?Do let me know as it is not much briefly explained in xilinx FFT pdf.

in page:42 block diagram of pipelined streaming i/o as there is not much description about it do let me kow the working of each blocks of it.

## Recommended Posts

• 0

If you want to know "how" an FFT works, don't look at Xilinx's implementation.  That's a trade secret and you aren't likely to get that answer.

On the other hand, you might find the answers you are looking for by examining a similar open source FFT implementation, such as this one.  It's actually a full FFT core generator, so if you don't like the example 2k FFT found in the rtl/ directory, feel free to rebuild it for the size you want.

An FFT consists of a series of "stages" that implement "butterflies".  Then this implementation, based around a decimation in frequency approach, those stages operate on samples k and k+N/2, then on k and k+N/4, then on k and k+N/8, etc.  You can see the code for each stage here, or even the top-level FFT that connects the FFT stages together here.  The last two stages are special, but only because they can be implemented without any shifts or adds.

Now, to your question: Since each stage operates on two elements at a time, and since these elements are separated by 2^(stage_number-1) elements, each of these stages needs a memory equivalent to 2^(stage number -1).  Values can be initially stored into this memory.  Once the memory is full, the next 2^(stage_number-1) elements plus the memory saved value can go directly into the butterfly.  Hence, the butterfly starts operating at this point.  There's also a memory storage requirement at the output of the butterfly, since only the first of the two values can move forward immediately: the second value has to wait until all the first values have past, etc.

If you were to count this, there's be 1 register for the 2-pt FFT stage, 2 registers for the 4pt stage, 4 registers for the 8pt stage, 8 for the 16pt stage, and pretty soon Vivado will start using block rams: 16 for the 32-pt stage, 32 for the 64pt stage, etc.

If you want your FFT output in natural order, you'll also need to do a bit reversal stage.  The way I do this, there's one buffer filling and one emptying at every time step, hence you have 2N sample buffers for N points.

Now let's back up and talk about delay.

For an N-point FFT, there's a delay of N/2 clocks til fill the butterfly for the first stage, plus about 3-4 clocks for the butterfly (I'm not counting--could be a bit more).  The same would be true for the next stage, save that it would now be N/4 + (about) 4 clocks, then N/8 + 4 clocks, etc.

The delay through the bit reversal is likely to be a full N clocks.

All that said, I'd trust the FFT output to tell you when the first block was complete over these calculations above.  I know that for my own FFT, these delays can vary significantly from one set of FFT parameters, size, bit width, etc, to the next.

Dan

##### Share on other sites
• 0

@[email protected] thanks sir i would really appreciate your help now i would like to ask  how to calulate delays for pipelined FFT,  radix-4 FFT  and radiX-2 FFT .The above delay calulation you have mentioned for pipelined FFT so what are the delays of raidx-4 and radix-2 FFT. Do let me know as iam very much confused in these differences.

##### Share on other sites
• 0

The FFT I just pointed you at *is* a radix-2 FFT.  An FFT can be pipelined and either radix 2 or radix 4 (or radix 8 and higher--but no one does that).  It can also be a block FFT that is radix-2 or radix-4.

The big difference between radix-2 and radix-4 are the numbers of inputs (and output) to the butterfly.  A radix-2 FFT consumes two inputs and produces two outputs.  A radix-4 butterfly consumes 4 inputs and produces 4 outputs.  If you follow that math, for the first stage of a N-point FFT, using a decimation in frequency approach, a radix-4 algorithm will need to store the incoming values into memory until it has values k, k+N/4, k+N/2, and k+3*N/4, for k from 0 to N/4-1.  The butterflies will then only operate for 1/4 of of the time, and need to wait for inputs the other 3/4.  Similarly, the FFT will produce four outputs at once, while one can move on to the next stage, all the others will need to go into a memory.  Hence, your memory requirements for this stage will go up from N block RAM points to 2N, although this new stage will now accomplish the work of two of the radix-2 stages.

As for delays ... aside from filling memories, I'm not sure: I've never built a radix-4 FFT butterfly in HDL (yet).  I'm not sure how I'd go about handling the the three complex multiplies required.  Right now for my radix 2 FFT, I only have to deal with one complex multiply which I can then turn into three real multiplies.  With a radix-4 butterfly, does that mean I'd be using 12 real multiplies?  Or would those 12 somehow need to be multiplexed to share DSP hardware.  I'm not sure--I've never built one.

Normally, you just accept the delay of the FFT in your code.  Why are you so concerned about the delay?  May I ask what application you are trying to solve?

Dan

##### Share on other sites
• 0

@[email protected] sir thanks for your reply my aim of the project is i have to know the latency of radix-2 FFT with pipeline and without pipelined mode.As i know the latency of pipelined FFT is lesser than without pipelined FFT  but how to calculate it theoritically .I have to compare the latency timings of both 4k point pipelined and nonpipelined FFT which iam unable to do.Do let me know sir how to calculate latency between both pipelined and non pipelined radix-2 FFT.

##### Share on other sites
• 0

That task cannot be done without a reference implementation.  There is no "theoretical" latency value without a hardware implementation: one clock per memory access, one clock per multiply, this schedule for operations, etc.  However, if you use a reference implementation, then the result  is no longer "theoretical" but rather "as applied".

If you wish to use the Xilinx core as a reference implementation, then start a timer when the first sample is sent into the FFT and stop it when the first valid sample comes out of the FFT.

Dan