Dear Sir,
Thanks again for your suggestions. I have answered your questions first, then listed my data and assumptions regarding this design.
You mentioned that your FFT and FIFO are both running at 100MHz. May I assume that this is your system clock rate?
I have used this FIFO for clock domain crossing. The FIFO's input clock runs at 61 MHz and its output clock runs at 100 MHz. I pass the data from the FIFO output to the FFT core input. My FFT core runs at 100 MHz.
Looking at your image above, it appears as though you have a much lower data rate than 100MHz. Can you tell me what your data rate is?
Yes, my data rate is 61 MSPS. My FIFO asserts its valid signal at the data rate, so I connected this valid signal to the FFT's s_axis_data_tvalid signal. I also connected the FFT's s_axis_data_tready output to the FIFO's m_axis_data_tready signal. You can see in the screenshot that fft_s_axis_data_tready is asserted and util_fifo_m_axis_valid is toggling at 61 MSPS. (The ILA is running at 100 MHz.)
I notice that you are using a FIFO. Can you explain the purpose of this FIFO within your design? If the data rate going into the FFT is at 100MHz, then the FIFO really only makes sense if you have bursty data at a rate faster than 100MHz.
I used the FIFO for clock domain crossing. My data rate (61 MSPS) is slower than my system clock (100 MHz), so if I lowered my system clock to 61 MHz, I would hurt my system's performance.
Indeed, is your TLAST generation done at the rate of your incoming data? Or is your counter independent of incoming data samples?
Yes, TLAST is generated at the incoming data rate (61 MHz).
Here is how I'm calculating the bin size and indices. (Please correct me if I'm wrong.) My FFT size is 65536 and the FFT core is running at 100 MHz, so:
bin size --> 100 MHz / 65536 = 1525.87890625 Hz
-- Input: 1 MHz signal
expected index --> 1 MHz / 1525.87890625 Hz ≈ 655
actual index = 1068
offset = 413
-- Input: 2 MHz signal
expected index --> 2 MHz / 1525.87890625 Hz ≈ 1310 (2 * 655)
actual index = 2135 (2 * 1068)
offset = 825 (2 * 413)
This calculation shows that the output has a consistent offset that scales with the input frequency. I'm not taking the square root when calculating the magnitude. My multipliers and adders are all combinational, so they won't add extra latency.
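
As a cross-check, here is a minimal Python sketch of the arithmetic above. It assumes the bin size is the rate divided by the FFT size and truncates the expected index to an integer, as in my numbers. The function names (bin_size_hz, expected_index, implied_rate_hz) are only illustrative and are not part of any core or library. The last function inverts the same formula to show which rate each measured index would correspond to; I include it only as a sanity check, not as a conclusion.

```python
FFT_SIZE = 65536  # FFT length used in the design


def bin_size_hz(rate_hz, fft_size=FFT_SIZE):
    """Frequency spacing between adjacent FFT bins for a given rate."""
    return rate_hz / fft_size


def expected_index(tone_hz, rate_hz, fft_size=FFT_SIZE):
    """Bin index where a tone should land, truncated to an integer."""
    return int(tone_hz / bin_size_hz(rate_hz, fft_size))


def implied_rate_hz(tone_hz, actual_index, fft_size=FFT_SIZE):
    """Rate that would place tone_hz exactly at actual_index (inverse formula)."""
    return tone_hz * fft_size / actual_index


# (input tone in Hz, bin index actually observed at the FFT output)
measurements = [(1e6, 1068), (2e6, 2135)]

for tone, actual in measurements:
    expected = expected_index(tone, 100e6)  # assumes the 100 MHz clock as the rate
    offset = actual - expected
    implied = implied_rate_hz(tone, actual)
    print(f"tone={tone / 1e6:.0f} MHz expected={expected} actual={actual} "
          f"offset={offset} implied_rate={implied / 1e6:.3f} MHz")
```

For both measurements the implied rate comes out a little above 61 MHz, so it may be worth comparing the same calculation with the 61 MSPS data rate in place of the 100 MHz clock.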