A Guide to Using DDR in the All-HDL Design Flow



I've started a thread for people wanting to know how to use the DDR memory on their FPGA boards. I want this to be interactive, as it's not possible to provide a single demo project that works for all boards and all versions of Vivado. To get this started, I've provided a tutorial in the file XilinxDDR_Tutorial_Part_1.txt. As the name implies, this is just the beginning of the tutorial, but if you get through it, you will have a working DDR design running on your hardware.

Not everyone (perhaps no one?) will be happy with having to plow through a long text file, but there are reasons why I am presenting this material in this format. Perhaps I will try to pretty the content up at a later time, if the topic is popular enough.

[Update] I've posted part 2 of the Tutorial. You can follow the steps for creating a more useful DDR design by reading the file XilinxDDR_Tutorial_Part_2.txt. This isn't the end.

[Update] I've posted part 3 of the Tutorial in which we look at performance and simulation.

 

imp_top.v imp_top.xdc

XilinxDDR_Turorial_Part_1.txt

NexysVideoDdrDemo.vhd NexysVideoDdrDemo.xdc UART_DEBUGGER2.vhd XilinxDDR_Turorial_Part_2.txt YASUTX.vhd

 

NexysVideoDdrTest.vhd XilinxDDR_Turorial_Part_3.txt

Edited by zygot

The MIG IP doesn't allow you to assign the system_clock as the reference_clock input unless the Input Clock period is set to 5000 ps (200 MHz). Why is that?

According to the 7 Series SelectIO user guide, the reference clock for IDELAY can be 190-210 MHz or 290-310 MHz. According to the Artix datasheet, we should be able to use either a 200 MHz or 300 MHz IDELAY reference clock for the -1 speed grade. So why doesn't the IP allow for using a 300 MHz system clock as the reference clock for the Nexys Video? The answer is that I don't know.

Interestingly, for the Nexys Video board, if you go with the Vivado 2019.1 default 2:1 controller PHY clock period of 3225 ps, this translates to a PHY clock of 310.0775 MHz. The MMCM math starts to become a problem, especially if you want a high quality clock for your DDR; and you do.
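
The period-to-frequency conversion is easy to check by hand or with a few lines of Python. This is only a sketch; the helper name `period_ps_to_mhz` is mine and has nothing to do with the MIG tools:

```python
def period_ps_to_mhz(period_ps: float) -> float:
    """Convert a clock period in picoseconds to a frequency in MHz."""
    # A 1 MHz clock has a period of 1,000,000 ps.
    return 1_000_000.0 / period_ps


if __name__ == "__main__":
    # The two periods mentioned in this thread.
    for period in (5000, 3225):
        print(f"{period} ps -> {period_ps_to_mhz(period):.4f} MHz")
```

A period of 3225 ps does not land on a round frequency, which is part of why the MMCM settings get awkward.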


The tutorial doesn't mention why you checked the Use Internal Vref box though it comes up as unchecked by default.

If you look at the Nexys Video schematic you will see that on IO bank 35, where the DDR3 signals are connected, the two VREF pins are used as GPIO; in this case as a DQ and a DDR3 address pin. So you have to check this option. On a board like the Genesys2 the bank VREF pins are connected to the same net as the DDR3 VREF pins, so you would leave this option unchecked. The Genesys2 uses the Kintex and has the DDR3 signals connected to an HP bank with more features than HR banks, so the tutorial doesn't completely cover all of the details for setting up the MIG for that board.

If you are designing a board, there are a lot of ways to mess up the external memory interface. Copying a design from another vendor is not a particularly good way to cheat on doing your homework.


I've added material for Part 2 of the tutorial. Read both .txt files to follow along. Enjoy. Do post comments whether you find the tutorial useful or not, but especially if you are struggling with porting the code to your board.


Even if you aren't too interested in using the material in this tutorial, it might still be a valuable exercise to go through. FPGA board vendors like to headline the advertisements for their products by highlighting the most optimistic performance statistics, even if they don't have anything to do with actual performance for real-world applications. It's a good idea to have some sort of realistic expectation about what kinds of applications a particular FPGA platform can support. This is especially true if your application requires external memory. Looking at the peak DDR data rate is likely to be very misleading.

As for non-ZYNQ Xilinx FPGA boards with DDR3, there is a large difference between the more capable and less capable boards. The Genesys2 and XEM7320 have two 16-bit DDR3 devices that work at peak 1800 MT/s and 1600 MT/s respectively. These boards give you the flexibility of using one DDR3 controller with a 32-bit DQ bus or two DDR3 controllers, each with a 16-bit DQ bus. For FPGA boards sold by Digilent the most disappointing would be the Kintex based NetFPGA-1G-CML. It has one DDR3 device with an 8-bit DQ bus. The reference manual claims a 1600 MT/s DDR3 peak data rate but in actuality, it's less. There's no evidence that anyone, except me, has ever tested the external memory on this board. For a board with four 1 GbE ports, having a DDR3 that is almost useless for Ethernet applications is very disappointing. The Intel based Cyclone V GT at least has one 64-bit DDR3 memory connected to a hard external memory controller, as well as a smaller one that can be connected to a soft controller.

We can't stop vendors from claiming misleading or even incorrect performance numbers for their products but we certainly can do our own due diligence to work out if a particular FPGA platform will support our projects. I'd argue that this is a requirement.

Since the major FPGA vendors have removed hard external memory controllers from their devices it's not a bad idea to spend some time trying to work out how well they actually work for your projects. With all datasheets, never take the advertising on page 1 at face value. Sometimes you can figure out what you need to know deep within the datasheet. Sometimes, you have to do some experimenting.

[edit] For the Genesys 2, both DDR3 devices share address and control lines, so operating them independently isn't a possibility.

[another edit] Keep reading because I did execute a 1600 MT/s DDR3 design for the NetFPGA-1G-CML platform.


The MIG IP in Vivado has a number of bugs that vary from release to release, some of which are quite bad. For people who use the HDL design flow, the tools are getting more and more unfriendly with each release. Two issues with the MIG present a hardship to the HDL project design flow. One issue is that the MIG IP can't understand UCF or XDC files that have more than just the location constraint on one line. The second issue is that it is very picky about signal names in the constraints files. Digilent simply doesn't provide the constraints for using the MIG IP for their boards in a format that the MIG IP can use in Vivado. I've asked for this to be addressed. We'll have to wait to see a reply.

There's really no excuse or reason why the MIG IP wizard has to be so contrary. The fact is that the mig.prj file contains (almost) all of the information necessary to generate the IP output products; so really, the only thing that vendors need to supply in support of their boards is a working mig.prj file.

That isn't the end of Vivado issues and its IP bugs, though. Vivado simply can't keep track of all the mig.prj files that it creates and port them to a new project. The worst issue is that I've had Vivado unceremoniously terminate while having the Hardware Manager open and connected to a board while running a simulation. You might not think that this is so bad... until you discover that sometimes this leaves behind files that you can't remove, even as an administrator. I've only seen this on Windows hosts, but it's pretty disconcerting.


I've created a variant of the NexysVideoDdrTest.vhd design associated with part three of the tutorial that submits read or write commands as fast as the MIG DDR3 controller allows. For the Nexys Video board here is a test result that is representative:

    ui_clk is 100 MHz. I'm writing 16 bytes every 10 ns or 1600 million bytes/s peak rate
    000130DE  =    78046  --> wtimer  --> 0.00078046 seconds
    00010000  =    65536  --> wcount  --> 1048576 bytes --> 1,343,535,863 bytes/s to write 1 MB
    00011A3D  =    72253  --> rtimer  --> 0.00072253 seconds
    00010000  =    65536  --> rcount  --> 1048576 bytes --> 1,451,256,003 bytes/s to read 1 MB
    00000000  =        0  errors

For the Nexys Video 4:1 controller the peak data rate is 1600 million bytes/s. So, for 1 MB, I see an average data rate of about 84% of peak for writes and about 91% for reads.
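
For anyone who wants to reproduce the arithmetic from the raw hex counters, here is a minimal Python sketch. The function and variable names are mine; I'm assuming that wtimer/rtimer count ui_clk cycles and that wcount/rcount count UI data words, which is what the numbers above imply:

```python
def throughput(timer_hex: str, count_hex: str, ui_clk_hz: float, bytes_per_word: int):
    """Return (seconds, total_bytes, bytes_per_second) for one test run.

    timer_hex      -- elapsed ui_clk cycles, as reported in hex
    count_hex      -- number of UI data words transferred, as reported in hex
    ui_clk_hz      -- MIG user interface clock frequency in Hz
    bytes_per_word -- width of one UI data word in bytes
    """
    cycles = int(timer_hex, 16)
    words = int(count_hex, 16)
    seconds = cycles / ui_clk_hz
    total_bytes = words * bytes_per_word
    return seconds, total_bytes, total_bytes / seconds


if __name__ == "__main__":
    UI_CLK_HZ = 100e6       # Nexys Video, 4:1 controller
    BYTES_PER_WORD = 16     # 16 bytes per ui_clk cycle, per the result above
    peak = UI_CLK_HZ * BYTES_PER_WORD  # one UI word accepted every ui_clk cycle

    for label, timer in (("write", "000130DE"), ("read", "00011A3D")):
        secs, nbytes, rate = throughput(timer, "00010000", UI_CLK_HZ, BYTES_PER_WORD)
        print(f"{label}: {nbytes} bytes in {secs:.8f} s = {rate:,.0f} bytes/s "
              f"({100 * rate / peak:.1f}% of peak)")
```

Swap in the ui_clk frequency, word width, and counter values for another board to check its numbers the same way.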

I also ported the NexysVideoDdrTest to the Genesys 2 board using the MIG settings in the Reference Manual.

    ui_clk is 900/4 = 225 MHz. I'm writing 32 bytes every 4.444 ns** or 7200 million bytes/s peak rate
      00013472 =    78962  --> wtimer  --> 0.00035094222 seconds
      00010000 =    65536  --> wcount  --> 2097152 bytes --> 5,975,775,727 bytes/s to write 2 MB
      0001207C =    73852  --> rtimer  --> 0.00032823111 seconds
      00010000 =    65536  --> rcount  --> 2097152 bytes --> 6,389,254,206 bytes/s to read 2 MB
      00000000 =        0  errors

For the Genesys2 I'm seeing an average data rate of about 83% of peak for writes and about 89% for reads when transferring 2 MB. Timing closure is an issue for this board at the maximum possible data rate, however. That's likely why the Digilent demos run the DDR PHY at a more sedate 400 MHz.

I should mention that, due to the way that the MIG UI works, timing closure will always be a likely issue if you try to run the DDR controller this way. Also, if you have to write or read more than one word to the MIG UI to support the burst-of-8 data requirement, your data throughput will suffer appreciably. For instance, if you have to write 2 words per UI command, then your maximum peak data rate is 1/2 the PHY data rate.

There's good news. There's no significant difference in average data rates for blocks of 1 MB to 512 MB.

NOTE: Once you start getting into weird clock periods, the numbers that the MIG IP requires are somewhat fictional, but I guess that we don't have to worry about that as long as the hardware appears to function properly.. whatever that means.

 


I tested the NetFPGA-1G-CML:

  - For the 4:1 controller with an 800 MHz PHY clock the UI runs at 200 MHz and the peak data rate is 1600 million bytes/s
    04CCCD1D =    80530717  --> wtimer  --> 0.402653585 seconds
    04000000 =    67108864  --> wcount  --> 536870912 bytes --> 1,333,332,005 bytes/s to write 512 MB
    0475A8C2 =    74819778  --> rtimer  --> 0.37409889 seconds
    04000000 =    67108864  --> rcount  --> 536870912 bytes --> 1,435,104,263 bytes/s to read 512 MB
    00000000 =           0  errors

Again, the average UI data rate is close to 90% of the peak DDR3 PHY rate for reads, and about 83% for writes. I was wrong in my analysis that the board couldn't achieve a 1600 MT/s DDR3 data rate. This test was created using Vivado 2020.2. For unknown reasons the MIG controller doesn't provide the device_temp output that Vivado 2019.1 does.


Just for fun, I tested the venerable KC705 DDR3 SODIMM:

    the KC705 4:1 controller with an 800 MHz PHY clock and 64-bit DQ bus has a ui_clk of 200 MHz with 512-bit data
    00012C53 =    76883  --> wtimer  --> 0.000384415 seconds
    00010000 =    65536  --> wcount  --> 4194304 bytes --> 10,910,874,966 bytes/s to write 4 MB
    000117BC =    71612  --> rtimer  --> 0.00035806 seconds
    00010000 =    65536  --> rcount  --> 4194304 bytes --> 11,713,969,726 bytes/s to read 4 MB
    00000000 =        0  errors

Done in Vivado 2019.1. The KC705 is lacking a few amenities but certainly has an external memory worthy of the Kintex -2 part. Getting a bitstream with Vivado was an adventure due to a design flaw with the board, and poor support by Xilinx, its vendor.

Also, all of the Kintex boards that I've tested (KC705, Genesys2, and NetFPGA-1G-CML) have the global clock in the wrong half of the FPGA (relative to the DDR pin assignments), so you have to use a CLOCK_DEDICATED_ROUTE BACKBONE constraint and settle for less than ideal clock routing for designs using the DDR. Intel FPGA evaluation boards tend to have more external clock inputs... but their devices also have a more restrictive clocking infrastructure than the Series 7 devices. Still, for high performance boards like the Genesys2 it would seem prudent to add at least one extra, perhaps programmable, clock input. The difference in cost would be minimal but would greatly enhance usability.

Again, all of the test results that I've posted are for a design pushing the MIG DDR3 controller to its maximum data throughput and are not necessarily appropriate for all designs.


@reddish I appreciate the feedback. I still have (at least) one more part to complete, but haven't decided how to approach it yet. Hopefully, more to follow....

Since most FPGA boards have some external memory, it shouldn't be as daunting a task to use for projects as tools and board vendors make it out to be. Still, as with the KC705, it can be very difficult to resolve obstacles. I am unaware of any community effort to deal with exposing "hidden" information needed to complete an HDL project using the MIG. Vendor IP shouldn't be so difficult.

Thanks for overlooking the typos and rough presentation...

 


To complete the test results for the boards that I have access to, here's something that might be of interest.

There aren't too many ZYNQ based boards with external memory connected to the PL. The ZCU106 happens to be one. For some reason Xilinx has decided not to support it with recent versions of its tools....

    the ZCU106 4:1 controller with a 1200 MHz PHY clock and 64-bit DQ bus has a ui_clk of 300 MHz with 512-bit data
    013B5577 =    20665719  --> wtimer  --> 0.06888573 seconds
    01000000 =    16777216  --> wcount  --> 1073741824 bytes --> 15,587,289,617 bytes/s to write 1 GB
    011087E7 =    17860583  --> rtimer  --> 0.05953528 seconds
    01000000 =    16777216  --> rcount  --> 1073741824 bytes --> 18,035,387,152 bytes/s to read 1 GB
    00000000 =           0  --> errors

It's almost astonishing, to me at least, that the test runs at 300 MHz on hardware; but there it is. Again, you get an average data rate of about 81% (write) to 94% (read) of the 19200 million bytes/s peak.
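
To put all of the boards side by side, here is a small Python sketch that computes the fraction of peak for each result posted in this thread. The measured rates are copied from the posts above; the peak is taken as ui_clk times the UI word width, which assumes the UI could accept one word on every ui_clk cycle. The table layout and function names are my own:

```python
# (board, ui_clk in MHz, UI word in bytes, measured write bytes/s, measured read bytes/s)
RESULTS = [
    ("Nexys Video",    100, 16,  1_343_535_863,  1_451_256_003),
    ("Genesys 2",      225, 32,  5_975_775_727,  6_389_254_206),
    ("NetFPGA-1G-CML", 200,  8,  1_333_332_005,  1_435_104_263),
    ("KC705",          200, 64, 10_910_874_966, 11_713_969_726),
    ("ZCU106",         300, 64, 15_587_289_617, 18_035_387_152),
]


def peak_bytes_per_s(ui_clk_mhz: int, bytes_per_word: int) -> int:
    """Peak UI rate, assuming one UI word is transferred every ui_clk cycle."""
    return ui_clk_mhz * 1_000_000 * bytes_per_word


def efficiency(measured: int, ui_clk_mhz: int, bytes_per_word: int) -> float:
    """Measured rate as a fraction of the peak UI rate."""
    return measured / peak_bytes_per_s(ui_clk_mhz, bytes_per_word)


if __name__ == "__main__":
    for board, clk, width, wr, rd in RESULTS:
        print(f"{board:15s} peak {peak_bytes_per_s(clk, width) / 1e6:6.0f} MB/s  "
              f"write {100 * efficiency(wr, clk, width):4.1f}%  "
              f"read {100 * efficiency(rd, clk, width):4.1f}%")
```

Reads come out consistently a bit faster than writes on every board tested.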

The UltraScale+ uses the DDR4 SDRAM MIG 2.2 in Vivado 2019.1. pg150-ultrascale-memory-ip now covers this version. There's no flexibility in the IP design due, in part, to the extreme clocking and timing specifications for the DDR4 memory. The MIG constraints files now abstract pin constraints by using BOARD_PIN instead of an actual pin number, and don't define any of the other pin constraints, so it's very difficult to find, or more importantly manage, the source code. This is the way that Xilinx is moving: less control for you over your projects and source maintenance, and more obfuscation of the details. Like it or not, you get a MicroBlaze embedded in your design to accomplish the calibration. The only good news is that the Hardware Manager can find and report the calibration status and results. If, for some reason, it can't find the embedded processor, then you can't download code or run any application on the UltraScale+ processor(s)... and yes, I managed to do that briefly. But who can argue with 16 GB/s of data storage bandwidth?

The DDR4 core is a very complex IP and good luck to anyone having to debug it for a custom board.

I had to restart the project a few times. If you start with a board design and include the DDR4 block you HAVE to connect an AXI interface to the memory. Who wants to do that? The whole purpose of having the PL get access to external memory is for your HDL design to use it. But you can create a basic board design with just the ZYNQ and then create the DDR4 core as general purpose IP. Then your code can instantiate the ZYNQ and MIG core with the UI and run the design without doing any SW design of any kind.

  • 3 weeks later...

I really should get around to posting the DDR3 performance for the Opal Kelly XEM7320. It has 3 SYZYGY ports but no UART, so a port will be a bit more work than I can afford at the moment.

We can come up with a pretty good estimate for maximum real-world data rates for the XEM7320 though:

400 MHz DDR3 PHY clock * 2 transfers per clock * 4 byte DQ interface * 90% of peak = 400*4*2*0.9 = 2880 million bytes/s, or 2.88 billion bytes/s. This is certainly more than sufficient to capture 4 ADC1410 ZMOD channels, which happens to be something that I've done, at a lower data rate.
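
The same back-of-the-envelope estimate in Python, for anyone who wants to plug in numbers for another board. This is only a sketch: the function name is mine, the factor of 2 is the two data transfers per PHY clock cycle, and the 90% default is the rule of thumb from the measurements earlier in this thread, not a guaranteed figure:

```python
def estimated_bytes_per_s(phy_clk_mhz: float, dq_bytes: int, efficiency: float = 0.9) -> float:
    """Rough sustained data rate for a DDR interface.

    phy_clk_mhz -- DDR PHY clock in MHz
    dq_bytes    -- width of the DQ bus in bytes
    efficiency  -- assumed fraction of the peak rate actually achieved
    """
    transfers_per_clock = 2  # double data rate: data moves on both clock edges
    return phy_clk_mhz * 1e6 * transfers_per_clock * dq_bytes * efficiency


if __name__ == "__main__":
    rate = estimated_bytes_per_s(400, 4)
    print(f"XEM7320 estimate: {rate / 1e9:.2f} billion bytes/s")
```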

 
