AXI read rates

StiNKy · March 25, 2017

Hi everyone.

I'm fairly new to FPGAs. I grabbed myself a Zybo Zynq and am enjoying it thoroughly so far.

However, I've created my own AXI master and hooked it up to HP0 on the Zynq. I issue 3 read requests in quick succession, but receive the results about 14 clocks apart.

I'm wondering how I can get these results faster. I can deal with the latency between request and result, but one read result every 14 clocks is slower than I was hoping.

My AXI master is running off FCLK0 (100 MHz). I have tried changing the clock rate, but saw absolutely no change. Very confusing.

I realise if I issue burst reads I could get higher throughput, but unfortunately I have to do scatter reads.

I've attached my project if anyone has the time to check it out.

Any advice would be appreciated.

Thanks in advance!

test2.zip

D@n · March 25, 2017

@StiNKy,

14 cycles per operation? Let me guess, you were trying to access DDR3 memory? Still, 14 cycles seems kind of fast.

When I was working on something similar with the Arty, I came up with 24 clocks at 81.25 MHz (the memory controller wouldn't let me do 100 MHz clocks). I'm kind of curious where all the time is going, but I know some of the parts and pieces. The DDR3 on the Arty could only run the memory in its 9-9-9 configuration (the memory was faster, the Arty wasn't). In that configuration, it takes one controller clock (81.25 MHz) to precharge the last memory row it was using, another clock to activate the new memory row, and then another to issue the read command. Two more clocks were then required for the device to present the data back to the FPGA. If I recall correctly, Xilinx double-clocks the data to avoid metastability, so that's another two clocks. Hence, I can explain 7 of those fourteen clocks. Then, when I tried to build my own AXI decoder, I found things took about another 4-8 clocks just to process the AXI request(s). (The AXI requirement to support a burst mode is nice, but when I tried to implement it, it required a lot of logic. Likewise, Xilinx boasts of being able to reorder memory requests--that's going to cost some latency as well.) So ... that gets us to about 14 cycles.
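The arithmetic in that breakdown can be tallied in a few lines. This is only a sketch of the budget as described above: the stage names and the helper functions are made up for illustration, the counts are the ones quoted in the post, and the two boards use different clocks, so the nanosecond figures are a rough comparison at best.

```python
# Controller-clock budget for one DDR3 read, using the counts quoted above.
# Stage names are illustrative labels, not signal or register names.
DDR_STAGES = {
    "precharge previous row": 1,
    "activate new row": 1,
    "issue read command": 1,
    "device presents data": 2,
    "double-clocking the data": 2,
}
AXI_OVERHEAD = (4, 8)  # rough range quoted for processing the AXI request(s)


def latency_budget(stages=DDR_STAGES, axi=AXI_OVERHEAD):
    """Return (DDR clocks, total with low AXI estimate, total with high AXI estimate)."""
    ddr = sum(stages.values())
    return ddr, ddr + axi[0], ddr + axi[1]


def clocks_to_ns(clocks, mhz):
    """Convert a clock count at a given frequency in MHz to nanoseconds."""
    return clocks * 1000.0 / mhz


if __name__ == "__main__":
    ddr, low, high = latency_budget()
    print(f"DDR3 part: {ddr} clocks, with AXI handling: {low}-{high} clocks")
    print(f"14 clocks at 100 MHz   = {clocks_to_ns(14, 100.0):.1f} ns")
    print(f"24 clocks at 81.25 MHz = {clocks_to_ns(24, 81.25):.1f} ns")
```

The 7 explained clocks plus 4 to 8 clocks of AXI handling give a range of 11 to 15, which brackets the 14 clocks reported in the question.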

You can find my own rantings and ravings on this topic, together with a bit of a discussion about how DDR3 SDRAM works here. Sadly, only the discussion at the end is accurate--at least to my device. (There was a lot of learning that took place as I put that page together.)

In my case, it was 24 cycles, something I'd like to fix but I'm not sure how to. I stopped looking when I found a Xilinx document declaring their latency to be between twenty and thirty clocks, but ... I can't seem to find that document this morning.

Dan

Notarobot · March 25, 2017

@StiNKy

Recently I did some tests of BRAM and DDR transfer rates on the Zybo board, though using Xilinx IP modules and the C program example from Lab 6 of the Avnet training materials. The test was conducted in bare metal mode with CPU_CLK_FREQ_HZ 650000000. Changing the fabric clock from 50 MHz to 100 MHz did not show any noticeable effect. The results are quite similar to yours. Here they are:

-- Simple DMA Design Example -1
Above message printing took 2541 clock cycles
Sending 256 bytes
BRAM to BRAM transfer
Moving data through processor took 38717 clock cycles
Starting transfer through DMA in poll mode
Setting up interrupt system
Moving data through DMA in Interrupt mode took 2174 clock cycles
Transfered data verified
Improvement using Interrupt DMA = 17x improvement

-- Simple DMA Design Example - 2
Above message printing took 2626 clock cycles
Sending 256 bytes
BRAM to DDR3 transfer
Moving data through processor took 22493 clock cycles
Starting transfer through DMA in poll mode
Setting up interrupt system
Moving data through DMA in Interrupt mode took 2152 clock cycles
Transfered data verified
Improvement using Interrupt DMA = 10x improvement

-- Simple DMA Design Example - 3
Above message printing took 2621 clock cycles
Sending 256 bytes
DDR3 to DDR3 transfer
Moving data through processor took 2205 clock cycles
Starting transfer through DMA in poll mode
Setting up interrupt system
Moving data through DMA in Interrupt mode took 1460 clock cycles
Transfered data verified
Improvement using Interrupt DMA = 1x improvement

Please note that clock counts may vary +/-30

Hope you will find it useful.

P.S. The private timer used in the measurements is clocked at 1/2 of the processor clock.
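For anyone wanting wall-clock numbers out of that log, here is a small conversion sketch. It assumes the printed "clock cycles" are raw private-timer ticks at half of the 650 MHz CPU clock, which the post suggests but does not state outright. The function names are made up for illustration, and the integer division in `speedup` is a guess that happens to be consistent with the 17x, 10x and 1x lines in the log.

```python
CPU_CLK_HZ = 650_000_000      # CPU_CLK_FREQ_HZ from the post
TIMER_HZ = CPU_CLK_HZ / 2     # assumption: counts are private-timer ticks
PAYLOAD_BYTES = 256           # "Sending 256 bytes"

# (transfer, ticks through processor, ticks through interrupt-mode DMA)
RESULTS = [
    ("BRAM to BRAM", 38717, 2174),
    ("BRAM to DDR3", 22493, 2152),
    ("DDR3 to DDR3", 2205, 1460),
]


def ticks_to_us(ticks, timer_hz=TIMER_HZ):
    """Convert timer ticks to microseconds."""
    return ticks / timer_hz * 1e6


def throughput_mb_s(ticks, nbytes=PAYLOAD_BYTES, timer_hz=TIMER_HZ):
    """Bytes moved per second, in units of 10**6 bytes."""
    return nbytes / (ticks / timer_hz) / 1e6


def speedup(cpu_ticks, dma_ticks):
    """Integer ratio of processor ticks to DMA ticks."""
    return cpu_ticks // dma_ticks


if __name__ == "__main__":
    for name, cpu, dma in RESULTS:
        print(f"{name}: CPU {ticks_to_us(cpu):.2f} us, "
              f"DMA {ticks_to_us(dma):.2f} us "
              f"({throughput_mb_s(dma):.1f} MB/s), {speedup(cpu, dma)}x")
```

With only 256 bytes per transfer, the DMA figures are probably dominated by setup and interrupt overhead rather than by the memory itself, so they say little about sustained bandwidth.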

StiNKy · March 25, 2017

5 hours ago, D@n said:

14 cycles per operation? Let me guess, you were trying to access DDR3 memory? Still, 14 cycles seems kind of fast.

How did I forget that very important piece of information? Yes, I was reading from DDR3.

Good information, thanks D@n!

StiNKy · March 25, 2017

Problem 1 figured out: I was forgetting that changing the clock rate in the block design means reinitializing the PS through Tcl scripts. Whoops.

Now with my AXI master running at 50MHz, I get 10 clocks between reads. At 200MHz I get 23 clocks between reads. Now things start making sense.
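One way to see why those numbers make sense is to convert the gaps to nanoseconds and to fit a simple "fixed clocks plus fixed time" model. The sketch below assumes the original 14-clock measurement really was taken at 100 MHz, and the model itself is just one possible reading of three data points, not something measured. All names here are made up for illustration.

```python
# (fabric clock in MHz, clocks between read results), as reported in this thread.
# The 100 MHz point is the first post's figure, assumed to be genuinely at 100 MHz.
MEASUREMENTS = [(50, 10), (100, 14), (200, 23)]


def interval_ns(mhz, clocks):
    """Time between read results in nanoseconds."""
    return clocks * 1000.0 / mhz


def reads_per_second(mhz, clocks):
    """Read results delivered per second at this spacing."""
    return mhz * 1e6 / clocks


def fit_fixed_plus_clocked(points):
    """Least-squares fit of clocks = fixed_clocks + fixed_ns * f_GHz.

    fixed_clocks is the part that scales with the fabric clock;
    fixed_ns is the part that takes the same time at any fabric clock.
    """
    xs = [mhz / 1000.0 for mhz, _ in points]  # GHz, so the slope comes out in ns
    ys = [float(clocks) for _, clocks in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    fixed_ns = sxy / sxx
    fixed_clocks = mean_y - fixed_ns * mean_x
    return fixed_clocks, fixed_ns


if __name__ == "__main__":
    for mhz, clocks in MEASUREMENTS:
        print(f"{mhz} MHz: {interval_ns(mhz, clocks):.0f} ns between reads, "
              f"{reads_per_second(mhz, clocks) / 1e6:.2f} M reads/s")
    fixed_clocks, fixed_ns = fit_fixed_plus_clocked(MEASUREMENTS)
    print(f"fit: about {fixed_clocks:.1f} fabric clocks + {fixed_ns:.0f} ns")
```

The spacing works out to 200 ns, 140 ns and 115 ns, so a faster fabric clock does help, but with diminishing returns. The fit suggests roughly 5.5 fabric clocks plus about 87 ns that does not depend on the fabric clock at all, which would be consistent with a large share of the time being spent on the memory side.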
