zygot last won the day on August 23



About zygot

  • Rank
    Prolific Poster

  1. First off, you are clearly working with a board that I am unfamiliar with, so I have no way to provide a decently helpful answer to your question. I have been in the situation where I had to make architectural changes to a design (including a completely different approach) to meet timing; especially one that needs to be incorporated into a larger overall design that uses most of the logic and memory resources, runs at a high clock rate, and has multiple clock domains. A big factor is how much of the device resources your design will use, how the logic interconnect works, how the clock routing works for your device, etc. What I can say with some confidence is that you should experiment with a slower clock and a scaled-back design. Simulate it. Get a bitstream and look over the timing report to get an idea of path delays. Figure out how many LUTs are needed for the basic design elements. Then start scaling things up, doing timing simulation for each step as you go. It is not uncommon when working on an ambitious project to get to a conclusion faster by starting off with a few simpler preliminary design projects that help you get answers to unknowns. In engineering, the shortest path (in time) between two points (start and finish) is not necessarily a straight line. It has not been my experience that you can 'fix' a design that doesn't run properly at a certain clock rate by adding a few registers here or there. Pipelining strategies, in my experience, are more of a holistic architectural effort that starts with the basic elements and continues as they are grouped into larger entities.
  2. The only rule of thumb that I know of for pipelining is that when the delays associated with the combinational logic and routing for a path approach the period of your clock, you should add a clocked register between that stage and the next. And here is the dilemma: until the design is synthesized, placed, and routed there are a lot of unknowns. Even if stages depend solely on the outputs of the previous stage, and not on a lot of 'global' signals controlling a bunch of stages, there are differences in clock edge arrival from LUT to LUT. This might be insignificant, or likely not. When trying to pipeline a very large design, as you are doing, things get messy. One option is to manually place logic rather than let the P&R tools do it. This optimizes delays and routing, plus adds consistency from build to build. I'd suggest starting off with a slower clock and a sub-set of the target design to see how things are going. Then scale up incrementally, addressing timing closure issues as you go. As you've no doubt already found out, adding registers improves performance but also adds latencies that can make identifying the scope (in time) of any signal relative to other stages of a design problematic. A bit of C coding might help. I would definitely suggest working out the data flow in advance rather than as a 'seat of the pants' exercise. Diagrams help, to a point. A problem with using the hard multipliers in the DSP blocks is that they are scattered throughout the device and can incur substantial path delays if your design needs a lot of them. It never hurts to pore over the complicated DSP48E literature to see what Vivado handles in the background for most usages. You can use macro instantiations, but be prepared to work hard. If that's what your design requires then that's what you will have to do.
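To make the rule of thumb concrete, here is a minimal VHDL sketch (all names are invented for illustration, not from the post): a multiply-accumulate path where a pipeline register splits the multiply from the add, so neither stage's combinational delay has to fit inside the whole clock period. Note that it costs two cycles of latency, which is exactly the bookkeeping burden described in the post.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mac_pipe is
  port (
    clk  : in  std_logic;
    a, b : in  signed(17 downto 0);
    c    : in  signed(35 downto 0);
    y    : out signed(35 downto 0)
  );
end entity;

architecture rtl of mac_pipe is
  signal prod_r : signed(35 downto 0);  -- pipeline register between multiply and add
  signal c_r    : signed(35 downto 0);  -- delay-match c by one cycle
begin
  process (clk)
  begin
    if rising_edge(clk) then
      prod_r <= a * b;         -- stage 1: multiply, registered
      c_r    <= c;
      y      <= prod_r + c_r;  -- stage 2: add, registered (total latency: 2 clocks)
    end if;
  end process;
end architecture;
```

Without prod_r, the multiply and add delays would sum on one path; with it, each stage only has to close timing on its own.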
  3. Calculating data rates from clock frequencies for most PC interfaces is not as straightforward as it might seem. For USB 2.0 the peak transfer rate might be 60 million bytes per second (about 57.22 MiB/s), but you'll never achieve anything close to that as an average for any payload size. This is partly due to the nature of the USB protocol and OS overhead. Also, there are latency issues if you want to stream data at some minimal rate. If you look around, I've been posting lately on a project to use inexpensive single-board computers with FPGA boards like the CMOD. At best I can get around 42 MB/s data rates, if you ignore OS overhead, and this is with an FTDI device using the fastest synchronous 245 FIFO mode. The Cypress USB interface that is on the Atlys FPGA board is more capable than the FTDI devices and provides somewhat better average performance. For Ethernet there are the same OS latencies as well as packet format overhead, so even 1G Ethernet is likely to provide at most around 30 MB/s data rates, unless you do something very creative. UART data rates are the easiest to quantify. UARTs don't transfer bare 8-bit words: each byte carries at least 1 start and 1 stop bit, so actual data rates are at best 8/10 of the raw bit rate. It is possible to stream data at a rate that meets your requirements using USB 3.0. If using Digilent boards you need to have a board with an FMC connector and add a mezzanine board with a USB 3.0 interface. Fortunately, FTDI and Cypress both offer inexpensive development boards that work with the Nexys Video or Genesys2. Unfortunately, these FPGA boards are in a different price range than the one that you have. Other vendors, such as Opal Kelly, offer FPGA boards with a USB 3.0 interface, but again the price is usually much higher than Digilent boards and they offer a lot less in terms of interfaces.
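The 8/10 framing overhead above is easy to check in code. A minimal sketch (plain Python; the function name is mine):

```python
def uart_payload_rate(baud, start_bits=1, data_bits=8, stop_bits=1):
    """Best-case payload bytes per second for a UART link.

    Each data byte costs start_bits + data_bits + stop_bits on the wire,
    so only data_bits / frame_bits of the raw bit rate carries payload.
    """
    frame_bits = start_bits + data_bits + stop_bits
    return baud * data_bits / frame_bits / 8

# With 8N1 framing, the payload rate is 8/10 of the raw bit rate.
print(uart_payload_rate(921600))  # 92160.0 bytes/s
print(uart_payload_rate(256000))  # 25600.0 bytes/s
```

This is the best case; OS buffering and application stalls only push real averages lower.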
  4. Another update. I decided to try the DE0 Nano with the Jetson Nano since the code and application have evolved a bit since the CMOD-A7 versions.

                          Win7 PC              Up^2 N4200            Jetson Nano
  Data     Payload    Upload    Download    Upload     Download    Upload     Download
  Sectors  Bytes      MB/s      MB/s        MB/s       MB/s        MB/s       MB/s
  127      65024      43.1471   42.1225     42.853889  45.721241   35.106934  38.934151
  511      261632     42.1225   42.7433     42.979492  46.239983   35.065121  38.884438
  767      392704     42.0514   42.7561     42.898247  20.149193   33.739391  16.723478
  1023     523776     41.9084   42.7521     42.931019  12.217526   35.086201  10.197824
  2047     1048064    40.4817   42.7381     42.907658   9.832075   35.111324   7.109247
  4095     2096640    42.0431   37.316      --         --          --         --
  6143     3145216    41.3916   39.1262     --         --          --         --

As you can see, the problems previously reported for the Jetson Nano were likely due to compilation issues. I've made no attempt at a threaded application, which might help. The drop-off in data rates for higher payloads might be due to Linux driver behavior. The Jetson is aarch64; the UP Squared is x86_64. The data rates for the UP^2 at or below 256 KB payloads are unexpectedly high. All rates are based on the total time that data was being sent or received by the FPGA, measured in 60 MHz clock periods.
  5. As promised, I ported my PC test application to run on Linux Mint 18 64-bit. The platform is an Up Squared SBC with a Pentium N4200 and 8 GB of memory. This platform is still within a 15 W power envelope useful for embedded projects. My Windows PC development application was C++, though the only C++ functionality was using streams for file IO. I had a devilish time trying to port it to Mint, so I ended up just making it into a C program with no file IO. Using the DE0 Nano I was able to send data up and down error-free at average data transfer rates > 40 MB/s consistently. Curiously, past 256 KB data payloads I observed download rates fall below 15 MB/s, though upload rates stayed above 40 MB/s. This is definitely not consistent with my experience on PC platforms. As all application data storage is kept in memory, I assume that this is a bottleneck on this board. My FPGA design counts the 60 MHz clock periods that the state machine spends in the data download and upload states, so the rates are extremely accurate from the point of view of the FPGA. Timing and other instrumentation is reported to the host application in the Status Sector for every transaction. From the point of view of the SBC there are a lot of factors diminishing those rates. Anyway, I've shown that it's possible to cheaply combine an FPGA and an inexpensive SBC on at least one platform with a relatively high speed interface. Overall performance suffers from platform-dependent factors, so experimenters should not make assumptions about expected performance for a particular application.
  6. It would probably be more useful for you to know why you were having issues. As you found out, it is possible to create a Verilog module that can't be instantiated in VHDL. Specifically, try declaring your module port IO without wire or reg qualifiers. I know, it involves more writing, as you'll have to add the wire and reg declarations in the module body anyway. There is no std_logic_vector equivalent in VHDL for the Verilog reg type on a port. I suppose that you get points for finding a work-around, but then you also get deductions for claiming that you've solved your problem... Friendly advice: instead of seeing issues as obstacles to be avoided, see them as opportunities for experimentation and learning. You'll be much more productive and happier in the long run.
  7. @birca123 I was waiting with you for a valid link to the possible solution to your problem(s). First, let's see if I understand what it is that you are doing. You want to transfer an image from your FPGA to a PC. The image consists of 640x480 binary bytes, or 307200 8-bit unsigned chars. You selected a non-standard data rate of 256000 baud with no flow control. To send one image would take (307200 bytes x 10 bits/byte) / 256000 baud = 12 seconds, if your FPGA and the PC can keep up. One suggestion would be to add flow control, either in the form of hardware RTS/CTS or embedded XON/XOFF control characters. If you are sending binary data then 7-bit control characters would seem to be an unworkable option. Given the long time to send an image, I'm assuming that there is no minimum data rate required to make your project work. I typically use 921600 baud, without flow control, using an interpreted Python application, and haven't had issues. What I don't typically do is try to send very large blocks of data at a time, or use a ZYNQ PS UART. You can do the math to see what adding FIFO buffers, either on your FPGA or in your OS, will do to overcome expected down time in either your ZYNQ or PC application. It's not too hard to create a PC application that doesn't have enough time slots to keep up with even a pedestrian 256000 baud data rate. You wouldn't expect an application running on an OS to run without interruption for 12 seconds, would you? What I would suggest is that you make figuring out where things are going wrong a side debugging project. You are, after all, using a great platform (the ZYBO) for instrumenting your designs' short-term and long-term performance.
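The timing above can be worked out directly. A small sketch (the 640x480 image and the two baud rates are from the post; the function name is mine):

```python
IMAGE_BYTES = 640 * 480   # 307200 one-byte pixels
FRAME_BITS = 10           # 1 start + 8 data + 1 stop bit per byte

def transfer_seconds(baud, payload_bytes=IMAGE_BYTES, frame_bits=FRAME_BITS):
    # Every payload byte costs frame_bits on the wire at `baud` bits per second.
    return payload_bytes * frame_bits / baud

print(transfer_seconds(256000))  # 12.0 seconds per image
print(transfer_seconds(921600))  # about 3.3 seconds per image
```

Either way the PC application has to stay serviced for seconds at a time, which is why flow control or FIFO buffering matters more than the raw baud rate.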
  8. My (currently) favourite inexpensive prototyping FPGA board is the DE0 Nano, with a Cyclone IV device and a 32 MB SDR SDRAM. It is a little better suited to this project because of the external memory size. I was able to use the same basic code that I used for the CMOD-A7 to test larger transactions. Performing a transaction uploading 2047 data sectors (4192256 payload bytes) to the DE0 Nano SDRAM and then downloading it resulted in no errors. I only have a measurement for the download; it averaged 42 MB/s. I hope to try out a test on the UP Squared board later this week and will report. I still haven't figured out why the Jetson Nano couldn't transfer data sectors reliably. It certainly isn't an issue of available memory.
  9. For this design this statement is true. Strictly speaking, though, the IDELAYCTRL REFCLK does not have to be 200 MHz; that is just the only frequency that I've ever used.
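For reference, a minimal instantiation sketch in VHDL (signal names are mine; the exact legal REFCLK frequency ranges are given in the 7-series SelectIO documentation, 200 MHz simply being the usual choice):

```vhdl
-- Fragment: assumes clk_200mhz, idelay_rst, idelay_rdy are declared elsewhere
-- in the architecture; goes in the architecture body.
library unisim;
use unisim.vcomponents.all;

idelayctrl_inst : IDELAYCTRL
  port map (
    RDY    => idelay_rdy,   -- asserts once the delay taps are calibrated
    REFCLK => clk_200mhz,   -- reference clock; sets the IDELAY tap resolution
    RST    => idelay_rst    -- pulse after REFCLK is stable
  );
```

The REFCLK frequency determines the per-tap delay, which is why designs that assume 200 MHz need rechecking if a different reference is used.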
  10. Don't be afraid of incorporating Verilog modules into your VHDL code, as long as you don't intend to do extensive modifications to them. You can instantiate a Verilog module in your VHDL code just as you would a VHDL component. You just need to convert the module declaration into a form that VHDL understands. Only a very minimal understanding of Verilog concepts is needed to do this, with rare exceptions.
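As a minimal sketch of that conversion (module and signal names are invented for illustration): given a Verilog module declared with plain wire ports, the VHDL side only needs a component declaration that mirrors the port list, after which it instantiates like any VHDL component.

```vhdl
-- Assumed Verilog module, declared with plain ports (no reg on the outputs):
--   module blinker (input wire clk, input wire rst, output wire [7:0] count);

-- In the VHDL architecture declarative region, mirror the port list:
component blinker
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;
    count : out std_logic_vector(7 downto 0)
  );
end component;

-- In the architecture body, instantiate exactly like a VHDL component
-- (count_sig is a std_logic_vector(7 downto 0) signal declared elsewhere):
u_blink : blinker
  port map (
    clk   => clk,
    rst   => rst,
    count => count_sig
  );
```

Verilog vectors map to std_logic_vector and scalars to std_logic; the mixed-language elaboration is handled by the tool as long as the names and widths match.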
  11. Actually, I'm not sure what Digilent's policy is about questions that aren't specific to Xilinx or Digilent products. The various FPGA vendors are certainly competitors, but I have a hard time seeing non-commercial customers as 'competitors' regardless of which vendor's products they are using. I would agree that, even though some of the people who respond to questions posted to Digilent's Forum have recent experience with a variety of FPGA vendors' devices and tools, posting questions to a website dedicated to Xilinx-based products when your question is specific to Intel is a good way to get bad information, and probably unwise. Also, and this hasn't happened yet, I suspect that having a lot of questions about non-Xilinx devices and tools would be confusing to a lot of readers and would make reading posts on Digilent's forum less useful for many of them. Intel has a community forum, as does Xilinx. Neither is, in my experience, as helpful as Digilent's most of the time. Intel is, well, not Altera, and even Altera's community support wasn't that great. Digilent's Forum is a great place to ask about Digilent products and Xilinx tools. Even restricted to that, it must be hard for people to find answers that have already been posted, because a lot of questions keep getting repeated. I do heartily suggest that it would be more appropriate to seek out answers to questions like saif1's at forums where the people who hang out are very knowledgeable about the tools and devices for the platform that you are working on. There also must be vendor-agnostic forums out there somewhere dealing with FPGA development tools and devices. My last word is that an awful lot of questions would be answered if the poster only took the time to read through the vendors' literature. If there's any practice that's bad form, it's wasting other people's time because you can't be bothered or don't have the time to read readily available literature.
Everyone's time is as important to them as yours is to you.
  12. For anyone interested, here is an update on the Jetson Nano USB FPGA project. I connected a CMOD-A7 35T and an Adafruit FT232H breakout board using one of Adafruit's proto boards. It was pretty easy to do. The FT232H EEPROM was re-programmed to make the device usable in synchronous 245 FIFO mode in order to obtain the highest data throughput possible. On the OS side you have to make sure that the default VCP drivers aren't used, because we need the D2XX driver for this mode. On my ageing Win7 PC I was able to transfer 32.5 KB at data rates of 15-21 MB/s. For a relatively small amount of data transfer this was in the expected range; 40 MB/s would be considered the high end. On my PC I was able to get to the 40 MB/s target with 512 KB data payloads, though not symmetrically or consistently. Currently my FPGA design only has 64 KB for data storage. The FPGA timestamps activity so that I can get a very good idea of the elapsed time for all of the data transfers as the FPGA sees it. Timing on a PC is, of course, a more complicated analysis. The first unknown about whether or not the Jetson Nano could replicate these results was compiling an application in C++ using the FTDI D2XX driver API. Fortunately, FTDI does provide a number of ARM versions of this driver. After a bit of reworking to suit Linux C++ development, I was able to compile a version of my Win7 application and try some runs on the Jetson Nano. The way my application and FPGA work is that all transfers are transactions consisting of 1 control sector, n data sectors, and 1 status sector. A sector is 512 bytes. The control sector is always up to the FPGA and the status sector is always down to the PC application. My initial tests were a bit disappointing: transactions with 0 data sectors, up or down, always worked; 1 data sector, up or down, usually worked. When the amount of data exceeded the 1 KB FT232H internal FIFO (1 data sector plus either a control or status sector), the application failed consistently.
The first test involved plugging the FT232H module into one of the unused Jetson Nano USB 3.0 ports; a keyboard and mouse also occupied 2 of the 4 ports. One possible explanation for the disappointing results might be the extra power draw of the USB-attached devices on a system with a constrained power budget. I repeated the tests with all USB devices attached to a powered USB 3.0 hub. The results were the same. My expectation is that I will have better success on the UP^2 SBC, and will try that out.
  13. In a lot of ways the Atlys is still one of Digilent's better FPGA boards. The Spartan 6 on it has better IOSERDES performance than the newer Series 7 equipped Artix boards. Unfortunately, there are no add-on boards for it using the VHDC connector except for prototyping. Also unfortunate is that the differential IO to the VHDC connector has 50 series resistors complicating the usefulness of any differential IOSTANDARD options. The original Genesys is better in that regard but alas, not much in the way of add-on boards. You certainly could make your own ADC board but it wouldn't be a trivial undertaking. I do wish that I had a better answer to give you.
  14. I suppose that the first part of this sentence is a pass for the second part, but I still had to wince at reading it. You can have storage using combinational (un-clocked) logic. If a synthesis tool decides that your code requires storage it will infer a latch, and provide a warning that it did so. The warning exists because inferred latches are generally bad. One problem with learning your first HDL on your own from a book is that you might be tempted to use keywords without understanding the ramifications of your choices. There is a place for variables and := in VHDL, but not for assigning values to logic elements like std_logic or std_logic_vector. A problem with offering a critique of code in a venue like this forum is that it's easy to provide advice that, while addressing one aspect of poor coding, causes more confusion than it resolves. There's only so much space and time for posting coding critiques, and too often an attempt at making one concise and easy ends up being misleading. My point isn't to criticize hamster or anyone posting code; it's to suggest that there are better ways to learn how to use an HDL effectively than this format provides.
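To make the inferred-latch point concrete, here is a minimal VHDL sketch (signal names are invented; only one of the two processes would exist in a real design, since both drive q). The first leaves q unassigned when en = '0', so the tool must create storage and warns about an inferred latch; the second assigns q on every path and synthesizes to pure combinational logic.

```vhdl
-- Fragment: en, d, q are std_logic signals declared elsewhere.

-- Infers a latch: q must hold its old value whenever en = '0'.
latchy : process (en, d)
begin
  if en = '1' then
    q <= d;
  end if;  -- no else branch, so storage is implied
end process;

-- Latch-free version: every path through the process assigns q.
clean : process (en, d)
begin
  if en = '1' then
    q <= d;
  else
    q <= '0';
  end if;
end process;
```

The same trap exists in case statements missing a when others choice, or in any combinational process where some input combination leaves a signal unassigned.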
  15. Vivado and similar logic simulators like ModelSim can have an arbitrary step period. Want a 1 picosecond step time? They will do that. Want to simulate a synthesized and routed design in a system? They do that too. There are a number of tools in the simulation toolbox. Why would you want to drive nails with a really large pipe wrench when there's a hammer available? Cycle-accurate simulation is fine; it's just not the only tool available. Behavioral simulation of HDL source code is akin to evaluating wellness from a picture. Timing simulation, working with placed and routed elements with known delays, is a better way to see what's going on inside a design that is nearly ready for hardware. Simulator interpretations of HDL don't always agree with how the synthesis tool understands your code. In the end, it's good to know how to use them all, and to understand their limitations, to be effective in the simulation and verification portion of your design.