Everything posted by xc6lx45

  1. you might have a look at the HLS (high level synthesis) toolkit. Pay attention to tool license issues / cost.
  2. I have used OpenCL only on GPUs, but the first search turned up a link with this:

    >> Specifically, we use Intel FPGA SDK for OpenCL that allows modern Intel FPGAs to be programmed as an accelerator, similar to GPUs

    I suspect this architecture may not mix too well with network data processing: the architectural paradigm on GPUs is "one control path, many data paths", but I suspect you may need many parallel, independent control paths. GPUs win by massive parallelism and by memory bandwidth via wide buses. The clock speed is fairly low (so end-to-end latency is not their strength), just like on FPGAs.

    When I dabbled in OpenCL a long time ago, I think I started with FFT examples from Intel's MKL (Math Kernel Library). You may be able to find a simple example on the web by searching for the following functions: clCreateProgramWithSource, clBuildProgram, clCreateKernel, clSetKernelArg. In principle, a complete example can fit into one screen length, but some "simple" things like taking the maximum of a vector (reduction operations) turn out to be not simple at all...

    If anybody has experience with OpenCL specifically on FPGAs, please feel free to correct me...
  3. >> new to fpga and zybo >> i want to use open cv but i dont know where to start from

    Just thinking aloud: independently of making hardware work, it might be a good idea to forget everything about hardware and video. Spend some time with OpenCV and offline bitmap examples on a standard Linux machine, say a virtual Linux box or a Raspberry Pi. I can speak only for myself, but I'd rather fight my dragons one at a time, not all at once.
  4. It may be a good idea to get the cheapest FPGA board (e.g. CMOD A7 just to name one) - don't even bother with Ethernet, just use e.g. UARTs, to get a feeling for the technology. The problem I see is >> I am brand new to FPGA topic Maybe you can find an example project that needs only minor modifications. But even so, you may find out that the obviously shortest route leads through the deepest swamp... Example, my "minor modifications" cause the project to fail timing closure. What then? It's probably worse / "deeper swamp" than the same thing happening on my own design which I know line-by-line. Don't let it discourage you, take it merely as hints so that when you run into something red and flashy in your project planning you recognize it as a warning light 🙂
  5. Hi, I can tell you this much: estimating the required FPGA size is a challenging topic in general. Chances are high that an abstract analysis that's not based on experience will completely miss the point. For the PC, I think you are describing operating system overhead, not hardware limitations. Before I'd consider an FPGA, I'd have a look at a bare-metal implementation, which can probably be approximated by editing the network card driver in a Linux distribution (e.g. use one CPU core per inbound card and keep it spinning in a polling loop waiting for data, to avoid the slow context switch etc. on interrupt). If your code otherwise runs from cache, a modern e.g. 5 GHz CPU is a force to be reckoned with.
  6. Hi, you may have some luck with this. It may reach 30 MBit/s, which is the limit for one channel of the FTDI chip. It uses the MPSSE mode instead of UART to overcome the abovementioned UART limits using a dedicated clock signal. (busbridge3 link)

    I've used it for real-time data acquisition, in one case using 900+ kSPS for dual-channel 12 bit ADC data (which is about 24 MBit/s with some other overhead on the interface). However, writing "proper" code to shuffle the data from FIFO to PC is not completely trivial. One solution goes like this:
    - assuming the "busbridge3" interface, 32-bit address/data bus on the FPGA side
    - design a FIFO that collects the real time data
    - put a read-sensitive status "register" on the bus e.g. 0x80000000 that queries the fill level of the FIFO. On a read event, the same value is copied to a hidden register "A"
    - put a "FIFO pop" register on the bus e.g. 0x80000001 with this function:
      - if A is non-zero, decrease A and pop a value from the FIFO to the bus
      - otherwise, keep A at zero and return dummy data

    Your software then does this:
    - queue a single-word read from 0x80000000
    - queue an arbitrary length block read e.g. 100 times from 0x80000001 (with address increment 0 - reading the same address over and over)
    - fire the USB transaction
    - from the result, the first value is the number of valid words (from "A")
    - use as many data values from the readback block and discard the rest (dummy data)
    - as an optimization, you may read 0x80000000 again at the end of the block to know whether there is leftover data that would require a larger block size in the next round. This can make a huge difference in avoiding dropouts if your PC isn't tuned for real-time-ish operation.

    If you pack two consecutive 12 bit ADC frames into one 24-bit word, you don't waste capacity on padding (busbridge3 allows 8/16/24/32 bit data width).
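The read scheme above can be tried out in software before touching hardware. The following is only a sketch: `FifoBridgeModel`, `read_block`, `pack12` and `unpack12` are hypothetical names for illustration and are not part of busbridge3; the two addresses are the example values from the post, and the dummy value of 0 is an assumption.

```python
from collections import deque

STATUS_ADDR = 0x80000000  # fill-level register (example address from the post)
POP_ADDR = 0x80000001     # "FIFO pop" register (example address from the post)
DUMMY = 0                 # assumed dummy value once the hidden counter is zero


class FifoBridgeModel:
    """Hypothetical software model of the FPGA-side registers."""

    def __init__(self):
        self.fifo = deque()
        self.a = 0  # hidden register "A"

    def read(self, addr):
        if addr == STATUS_ADDR:
            # Reading the status register latches the fill level into A.
            self.a = len(self.fifo)
            return self.a
        if addr == POP_ADDR:
            if self.a > 0:
                self.a -= 1
                return self.fifo.popleft()
            return DUMMY  # A stays at zero, the FIFO is not touched
        raise ValueError("unmapped address")


def read_block(bridge, block_len=100):
    """One transaction: status read, fixed-length block read, status read."""
    n_valid = bridge.read(STATUS_ADDR)
    block = [bridge.read(POP_ADDR) for _ in range(block_len)]
    leftover = bridge.read(STATUS_ADDR)  # hint for sizing the next block
    return block[:min(n_valid, block_len)], leftover


def pack12(first, second):
    """Pack two 12-bit samples into one 24-bit word."""
    return ((first & 0xFFF) << 12) | (second & 0xFFF)


def unpack12(word):
    """Inverse of pack12."""
    return (word >> 12) & 0xFFF, word & 0xFFF
```

The point of the hidden counter is that the block length is fixed before the transaction is fired, so the host never needs a round trip to learn how much data is available; words that arrive in the FIFO after the status read simply wait for the next round.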
  7. xc6lx45

    Board for OpenCL

    Have you considered a graphics card? For getting started, even built-in graphics acceleration can be useful, with limitations (e.g. no double precision). You can also run OpenCL code on a standard PC. Performance won't be as wild as on dedicated hardware but functionality is the same (e.g. as a "software rendering" fallback option).
  8. Maybe one comment: In the ASIC world, "floorplanning" is an essential design phase, where you slice and dice the predicted silicon area and give each design team their own little box. The blocks are designed and simulated independently, and come together only at a fairly late phase. ASIC differs from FPGA in some major ways:
    - ASIC IOs have a physical placement e.g. along the pad ring. We don't want to run sensitive signals across the chip, need to minimize coupling for mixed-signal etc. In comparison, FPGAs are probably more robust (a complex design will definitely consider the layout, especially on larger devices. But on smaller eval boards, the first restrictions I'll probably run into are logical, e.g. which clock is available where, not geometrical).
    - For ASICs, we need the floorplan to design the power distribution network as its own sub-project (and many a bright-eyed startup has learned electromigration the hard way).
    - In the ASIC world, we need to worry about wide and fast data paths both regarding power and area - transistors are tiny but metal wires are not.

    You might have a look at "partial reconfiguration"; here the geometry of the layout plays some role.
  9. Hi, reading between the lines of your post, you're just "stepping up" one level in FPGA design. I don't do long answers but here's my pick of the "important stuff":
    - First, take one step back from the timing report and fix asynchronous inputs and outputs (e.g. LEDs and switches). Throw in a bunch of extra registers, or even "false-path" them. The problem (assuming this "beginner mistake") is that the design tries to sample them at the high clock rate, which creates a near-impossible problem. Don't move further before this is understood, fixed and verified.
    - Speaking of "verified": read the detailed timing analysis and understand it. It'll take a few working hours to make sense of it but this is where a large part of "serious" design work happens.
    - Once the obvious problems are fixed, I need to understand what the so-called "critical path" in the design is and improve it. For a feedforward-style design (no feedback loops) this can be done systematically by inserting delay registers. The output is generated e.g. one clock cycle later but the design is able to run at a higher clock, so overall performance improves.
    - Don't worry about floorplanning yet (if ever) - this comes in when the "automatic" intelligence of the tools fails. But they are very good.
    - Do not optimize on a P&R result that fails timing catastrophically (as in your example - there are almost 2000 paths that fail). It can lead into a "rabbit hole" where you optimize non-critical paths (which is usually a bad idea for long-term maintenance).
    - You may adjust your coding style based on the observations, e.g. throw in extra registers where they will "probably" make sense (even if those paths don't show up in the timing analysis, the extra registers allow the tools to essentially disregard them in optimization to focus on what is important).
    - There are a few tricks like forcing redundant registers to remain separate. Example: I have a dozen identical blocks that run on a common, fast 32-bit system clock and are critical to timing. Step 1, I sample the clock into a 32-bit register at each block's input to relax timing, and step 2, I declare these registers as DONT_TOUCH because the tools would otherwise notice they are logically equivalent and try to use one shared instance. This as an example.
    - For BRAMs and DSP blocks, check the documentation for where extra registers are needed (they get absorbed into the BRAM or DSP using a dedicated hardware register). This is the only way to reach the device's specified memory or DSP performance.
    - Read the warnings. Many relate to timing, e.g. when the design forces a BRAM or DSP to bypass a hardware register.
    - Finally, 260 MHz on Artix is already much harder than 130 MHz (very generally speaking). Usually feasible, but you need to pay attention to what you're doing and design for it (e.g. a Microblaze with the wrong settings will most likely not make it through timing).
    - You might also have a look at the options ("strategy") but don't expect any miracles on a bad design.

    Ooops, this almost qualifies as a "long" answer ...
  10. Hi, you might look at the open-source xc3sprog utility; it shows how it's done. Never mind the name, it also works with 7 series with minor modifications (such as IDCODE and flash ID). I remember there is a header in the .bit file that is quite obviously there for documentation purposes (open it in a text editor). AFAIK it does no harm, since the FPGA looks for a "magic" 32-bit word to recognize the start of the binary block. That is at least the case for JTAG-based upload (not sure about flash; I guess it's the same but I don't know). You might have a quick look into the configuration guide https://www.xilinx.com/support/documentation/user_guides/ug470_7Series_Config.pdf to see if it says anything about preparing a bitstream for flash.
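To see where the documentation header ends in a given .bit file, one can simply search for the "magic" word. This is a rough sketch, not how xc3sprog does it: `split_bit_file` is a hypothetical helper, and the sync word value 0xAA995566 is the one the 7 series configuration guide lists, so check it against UG470 for your device. Note that real loaders also send the padding words that precede the sync word, so the split point here is for inspection only.

```python
SYNC_WORD = bytes.fromhex("AA995566")  # 7 series sync word (verify against UG470)


def split_bit_file(data: bytes):
    """Return (leading_bytes, rest), where rest starts at the sync word."""
    pos = data.find(SYNC_WORD)
    if pos < 0:
        raise ValueError("sync word not found")
    return data[:pos], data[pos:]
```

Typical use would be `split_bit_file(open("design.bit", "rb").read())` (the file name is only an example) followed by printing the leading bytes, which contain the human-readable design name, part name and build date.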
  11. Hi, not sure if I understand this correctly but are you sure this can be done with this chip? It sounds like functionality internal to the microchip firmware.
  12. Hi, just a thought, looking at your diagram from a large distance. Most likely you have some power supplies with two pins (non-grounded). The problem is that the output is floating at half the mains voltage, set by a high resistance (megaohms) voltage divider. Ironically, this is to protect the power supply against ESD / charge buildup on the secondary side that could break the transformer's insulation. With such a supply, if you accidentally disconnect the ground connection to your circuit, you have half the AC voltage on the supply pin. I'd double-check all involved power supplies and make sure your connection scheme has a well-established ground even if some random cable comes loose.
  13. Hi, this may be a typo but "daddr" is not a "register". It's an input to the XADC.
    - wait for eoc
    - take the output from "chan" and put it into daddr (zero-padded with two high bits), raise den with dwe=0
    - when drdy goes up, get the result from dout
  14. Thinking of which... actually I do have a plain-Verilog FIFO around from an old design. It's not a showroom piece but I think it did work as expected (whatever that is...). For 131072 elements you'd set ADDRBITS to 17 and DATABITS to 18 for 18 bit width.

    module FIFO(i_clk, i_reset, i_push, i_pushData, i_pop, o_popAck, o_popData, o_empty, o_full, o_error, o_nItems, o_nFree);
       parameter DATABITS = -1;
       parameter ADDRBITS = -1;
       localparam ADDR_ZERO = {{(ADDRBITS){1'b0}}};
       localparam ADDR_ONE = {{(ADDRBITS-1){1'b0}}, 1'b1};
       localparam DATA_X = {{(DATABITS){1'bx}}};

       input wire i_clk;
       input wire i_push;
       input wire i_reset;
       input wire [DATABITS-1:0] i_pushData;
       input wire i_pop;
       output reg o_popAck = 1'b0;
       output wire [DATABITS-1:0] o_popData;
       output reg o_error = 1'b0;
       output wire [31:0] o_nItems;
       output wire [31:0] o_nFree;
       output wire o_empty;
       output wire o_full;

       reg popAckB = 1'b0;
       reg [DATABITS-1:0] mem[((1 << ADDRBITS)-1):0];
       reg [ADDRBITS-1:0] pushPtr = ADDR_ZERO;
       reg [ADDRBITS-1:0] popPtr = ADDR_ZERO;
       reg [DATABITS-1:0] readReg = DATA_X;
       reg [DATABITS-1:0] readRegB = DATA_X;
       wire [ADDRBITS-1:0] nextPushPtr = i_push ? pushPtr + ADDR_ONE : pushPtr;
       wire [ADDRBITS-1:0] nextPopPtr = i_pop ? popPtr + ADDR_ONE : popPtr;
       assign o_popData = o_popAck ? readReg : DATA_X;

       // === items counter ===
       // note: needs extra bit (e.g. 4 slots may hold [0, 1, 2, 3, 4] elements)
       // initial value added so that o_empty / o_full are defined before the first reset
       reg [ADDRBITS:0] nItems = 0;
       // zero-pad the (ADDRBITS+1)-bit counter to the full 32 bits
       assign o_nItems = {{(31-ADDRBITS){1'b0}}, nItems};
       assign o_nFree = (1 << ADDRBITS) - nItems;
       localparam NITEMS_ONE = {{(ADDRBITS){1'b0}}, 1'b1};
       assign o_empty = nItems == 0;
       assign o_full = nItems == {1'b1, {(ADDRBITS){1'b0}}};

       always @(posedge i_clk) begin
          // === preliminary assignments ===
          readRegB <= DATA_X;
          popAckB <= 1'b0;

          // simultaneous push and pop leaves the item count unchanged
          case ({i_push, i_pop})
            2'b10: nItems <= nItems + NITEMS_ONE;
            2'b01: nItems <= nItems - NITEMS_ONE;
            default: begin end
          endcase

          o_error <= (i_push && ~i_pop && o_full) || (i_pop && o_empty);

          // === output register (delay 1) ===
          o_popAck <= popAckB;
          readReg <= readRegB;

          pushPtr <= nextPushPtr;
          popPtr <= nextPopPtr;
          if (i_push)
            mem[pushPtr] <= i_pushData;
          if (i_pop) begin
             readRegB <= mem[popPtr];
             popAckB <= 1'b1;
          end

          // reset comes last so that it overrides the assignments above
          if (i_reset) begin
             pushPtr <= ADDR_ZERO;
             popPtr <= ADDR_ZERO;
             o_error <= 1'b0;
             o_popAck <= 1'b0;
             popAckB <= 1'b0;
             readReg <= DATA_X;
             readRegB <= DATA_X;
             nItems <= 0;
          end
       end
    endmodule
  15. Yes, you can combine more than one block RAM. There is more than one way to implement a FIFO. If I had to do it for myself, I'd write it in plain Verilog, it's about two or three screen lengths of code if the interface requirements are "clean" (such as, one clock and freedom to leave a few clock cycles of latency, before the first input appears at the output). I didn't check but I think there is an "IP block wizard" for FIFOs in Vivado that may do what you need. With "expensive" I meant just that, it costs a lot of money to use half an FPGA just for memory.
  16. Well, to be honest, I didn't read the datasheet for the high-capacity devices with 9M bits. So this one isn't even EOL. Well, it depends. Have a look at https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf table 6. There are 325 of those 36 Kb blocks on the FPGA (11700 Kb in total), so you need about 1/4 of the total FPGA for memory. Technically feasible and easiest to implement but a very expensive FIFO. Now, connecting this chip to realize its full performance potential (e.g. 225 MHz) is not straightforward. From between the lines ... >> Do I must use all pins of external FIFO? ... I read that you don't have hands-on experience with e.g. CMOS ICs (the answer is "you must drive any input at any time, unless the data sheet says explicitly otherwise", or strange things will happen. "Strange" in the sense that the circuit may respond to waving my hands over it, and that's not an exaggeration). If I'm correct in this, it may be a good idea to get e.g. a few CD4017 or some other standard CMOS chip with simple functionality for < $1 and use this to bring up your FPGA IOs. If the FIFO chip doesn't work because of IO issues, it will be near-impossible to debug.
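The block count behind such an estimate is quick to reproduce. The sketch below assumes the 131072 x 18 bit FIFO size mentioned elsewhere in this thread and the usual 7 series aspect ratios of a 36 Kb block; `blocks_needed` is a hypothetical helper, and the synthesis tool may map the memory differently.

```python
import math

# Depth x width configurations of one 36 Kb block (7 series, single port view)
ASPECT_RATIOS = [(32768, 1), (16384, 2), (8192, 4), (4096, 9), (2048, 18), (1024, 36)]
TOTAL_BLOCKS = 325  # device figure quoted in the post


def blocks_needed(depth, width):
    """Smallest number of 36 Kb blocks for a depth x width memory."""
    return min(
        math.ceil(depth / d) * math.ceil(width / w)
        for d, w in ASPECT_RATIOS
    )


n = blocks_needed(131072, 18)
print(n, "blocks,", round(100 * n / TOTAL_BLOCKS, 1), "% of the device's block RAM")
```

Under these assumptions the FIFO takes 64 blocks, which is roughly a fifth of the 325 available. The exact fraction matters less than the conclusion: a noticeable share of an expensive device ends up being used as plain memory.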
  17. you might give a bit more information to not be mistaken for a lazy student. My first thought is simply "do not". The component is EOL and you can have the same using the FPGA's BRAM with a LOT less hassle.
  18. https://www.xilinx.com/support/documentation/white_papers/wp389_Lowering_Power_at_28nm.pdf page 3
  19. >> is it possible to present DVFS on it. >> For now I now about clock wizard, DCM, PLL for different clock generation (frequency) but this is not frequency scaling mi right? you may have your own answer there. This is some university project? Have you done your own research? For example, this has all the right keywords: https://highlevel-synthesis.com/2017/04/12/voltage-scaling-on-xilinx-zynq
  20. If it helps, UARTs are extremely robust towards frequency error, in the order of percent (the protocol effectively wastes ~10% throughput on synchronization). The closest integer UART divider will probably work just fine.

    >> Synchronizer theory works only for clock domains that have totally independent sources.

    Not sure what you mean by that. A CDC needs to function at any possible phase delta between two clocks. If the two clocks are from the same source and co-periodic over some length, the random distribution of the phase looks different, but it's just a special case and should still work.
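How far off the closest integer divider actually is can be checked with a few lines of arithmetic. This is only a sketch; `uart_divider` is a hypothetical helper and the clock frequencies are example values.

```python
def uart_divider(f_clk_hz, baud):
    """Closest integer clock divider and the resulting relative baud rate error."""
    div = max(1, round(f_clk_hz / baud))
    actual_baud = f_clk_hz / div
    rel_error = (actual_baud - baud) / baud
    return div, rel_error


for f_clk in (100e6, 12e6):
    div, err = uart_divider(f_clk, 115200)
    print(f"{f_clk / 1e6:.0f} MHz: divider {div}, error {100 * err:+.3f} %")
```

Since the receiver re-synchronizes on every start bit, the error only accumulates over the roughly ten bit periods of one frame, which is why an error well below a percent, as in both examples, is harmless.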
  21. Just be aware that most of the "legacy" material on FIR filters limits itself to what can be presented conveniently. Numerical optimization is the tool of choice and there is no "cheating". Or, taking one step back to the filter specs, there is usually no need to specify a flat stopband, and dropping that requirement can significantly reduce the required filter size (credits to Prof. Fred Harris). This only as an example of where I can avoid unnecessary constraints from using a ready-made design process by writing my own solver. Which is actually not that hard, basing it on fminsearch or fminunc in Matlab / Octave.

    BTW, one reference on this topic I found useful: Mathias Lang, "Algorithms for the Constrained Design of Digital Filters with Arbitrary Magnitude and Phase Responses", 1999. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

    The title alone is interesting - "Arbitrary Magnitude and Phase Responses" - no one says a digital filter needs to be flat or linear phase (of course, at the expense of symmetry). Sometimes I wonder, our thinking seems to often get stuck in patterns and "templates". Take analog filters, for example: Chebyshev, Butterworth or Bessel, which one do I pick? But those are just corners of the design space, and the best filter for a given application is most likely somewhere in-between, if I can only design it (which is, again, a job for a numerical optimizer, even though this one is more difficult).
  22. The word "overclocking" may be even misleading - this architecture is used when the data rate is significantly lower than the multiplier's speed. Inputs to one (expensive) multiplier are multiplexed, so it can do the work for several or all taps. The "multiplexing" itself will get fairly expensive in logic fabric due to the number of data bits and coefficients. To the rescue comes the BRAM, which is essentially a giant demultiplexer-multiplexer combo, with a football field of flipflops in-between. You can find an example for this approach here. Out-of-the-box, it's unfortunately quite complicated because it does arbitrary rational resampling. Setting the rates equal for a standard FIR filter, you end up with only a few lines of remaining code. BTW, multiplier count as design metric is probably overrated nowadays for several reasons (the IP tool resource usage is already more practical, e.g. BRAM count may become the bottleneck). If you can get this book through your library, you might have a look e.g. at chapter 6 (background only, this is a very old book): Keshab K. Parhi VLSI Digital Signal Processing Systems: Design and Implementation
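The idea of one multiplier doing the work for all taps is easy to model in software. The following is a behavioural sketch only, not the linked implementation: `fir_serial_mac` and `fir_direct` are hypothetical names, the list `ram` stands in for the BRAM holding the delay line, and each pass of the inner loop corresponds to one fast clock cycle.

```python
def fir_serial_mac(samples, coeffs):
    """FIR filter that reuses a single multiply-accumulate for every tap."""
    n_taps = len(coeffs)
    ram = [0] * n_taps  # circular sample buffer, stands in for the BRAM
    wr = 0              # write pointer
    out = []
    for x in samples:
        ram[wr] = x               # newest sample overwrites the oldest
        acc = 0
        rd = wr                   # start reading at the newest sample
        for k in range(n_taps):   # one tap per fast clock cycle
            acc += coeffs[k] * ram[rd]  # the only multiplier in the design
            rd = (rd - 1) % n_taps      # step back in time
        out.append(acc)
        wr = (wr + 1) % n_taps
    return out


def fir_direct(samples, coeffs):
    """Reference: textbook convolution sum with zero initial state."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]
```

The memory addressing replaces the wide multiplexers: selecting a sample is just a read address, so the logic cost no longer grows with the number of taps, only the number of fast clock cycles per output sample does.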
  23. yes, and I'm sure there are nicer ones... if you pick another one, one easy way to bring up a UART is to wire the FPGA-side input to the FPGA-side output e.g. "plus one" (and the rx strobe / tx send wire). Then open teraterm, type an "a" and you should get a "b" etc.
  24. ... and this could be used as "UART" (minus the "R")

    module serialTx(clk, out_tx, in_byte, in_strobe, out_busy);
       parameter nBitCycles = 0; // set this according to the desired baudrate (e.g. 100 MHz clock, 9600 baud => use 100e6/9600)
       input clk;
       output out_tx;
       input [7:0] in_byte;
       input in_strobe;
       output out_busy;

       reg [31:0] count;
       reg [3:0] state = 0;
       reg [7:0] data;

       assign out_tx = (state == 0) ? 1'b1: // ready
                       (state == 1) ? 1'b0: // start bit
                       (state == 10) ? 1'b1: // stop bit
                       data[0]; // data bits
       assign out_busy = (state != 0);

       always @(posedge clk) begin
          if (in_strobe) begin
             count <= nBitCycles;
             state <= 1;
             data <= in_byte;
          end else if (state != 0) begin
             if (count == 0) begin
                count <= nBitCycles; // (non-final)
                state <= state + 1; // (non-final)
                case (state)
                  1: begin // startbit
                  end
                  default: begin // data bits
                     data <= {1'bx, data[7:1]};
                  end
                  10: begin // stop bit
                     state <= 0;
                     count <= 'bx;
                     data <= 'bx;
                  end
                endcase
             end else begin
                count <= count - 1;
             end
          end
       end
    endmodule
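When checking such a transmitter in simulation or on a scope, it helps to know the expected bit pattern. A small sketch, assuming the usual 8N1 framing that the state machine above appears to implement (start bit, eight data bits LSB first, one stop bit); `uart_frame_bits` is a hypothetical helper.

```python
def uart_frame_bits(byte):
    """Line levels for one 8N1 frame: start bit, data bits LSB first, stop bit."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


print(uart_frame_bits(ord("a")))  # 'a' is 0x61
```

As far as I can tell from the counter logic, the module holds each bit for nBitCycles + 1 clock cycles because the counter runs down to zero inclusive; at thousands of cycles per bit this offset is far below what a UART receiver cares about.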