xc6lx45

Everything posted by xc6lx45

  1. >> I never do get to the "All tests pass..." declaration that I'm expecting from the program.cs source. True. It's edited to run forever (it counts up to minus one, so the loop never terminates). Once you see repetitions in the console, it has passed all tests once. I sometimes leave it running overnight but haven't yet witnessed any random bit errors. The USER2 "demo" works the same way: it checks the correctness of the received data, and as long as there are no exceptions, it passes. It could be used as a full-duplex throughput test, since there is no protocol overhead.
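     In case it helps, here is a minimal sketch of that "counts up to minus one" loop pattern. This is an illustration only, not the literal Program.cs code; RunAllTestsOnce is a hypothetical stand-in for the actual test calls:

         using System;

         class EndlessTestLoop
         {
             // Hypothetical stand-in for the project's memory / register tests.
             static void RunAllTestsOnce() { /* ... */ }

             static void Main()
             {
                 // With an unsigned counter, "minus one" is the largest possible value,
                 // so in practice the loop never reaches its end condition.
                 for (ulong iter = 0; iter != ulong.MaxValue; ++iter)
                 {
                     RunAllTestsOnce();
                     Console.WriteLine("pass " + iter + ": all tests OK so far");
                 }
             }
         }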
  2. The "problem" with SharpDevelop was one of default settings: it had arithmetic overflow protection enabled, whereas my Visual Studio project disables it in the "advanced" options.
     I added a second, simplified example on GitHub. It runs independently in the same design, on the USER1 and USER2 JTAG opcodes. The new variant is "almost" like a UART, except that the data stream is continuous (e.g. the application FPGA code may be designed to send zero bytes if there is no return data).
  3. OK, that is exactly where I made the change for SharpDevelop. Could you please double-check that the line reads
     retVal |= (UInt32)this.jtag.io.readData[offset++] << 24;
     In the old version it was (note the brackets!)
     retVal |= (UInt32)(this.jtag.io.readData[offset++] << 24);
     The old version works in VS but not in SharpDevelop.
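     The effect is easy to reproduce in isolation. A standalone illustration (not the project code), with overflow checking forced on as in SharpDevelop's default build settings:

         using System;

         class CastShiftDemo
         {
             static void Main()
             {
                 byte b = 0xFF;   // a readback byte with the top bit set

                 // New version: cast to UInt32 first, then shift. Never overflows.
                 uint fixedVersion = (UInt32)b << 24;
                 Console.WriteLine(fixedVersion.ToString("X8"));   // FF000000

                 // Old version: b is promoted to int, 0xFF << 24 is a *negative* int,
                 // and converting that to UInt32 throws when overflow checking is on
                 // (it silently wraps when it is off, which is why VS never complained).
                 try
                 {
                     uint oldVersion = checked((UInt32)(b << 24));
                     Console.WriteLine(oldVersion.ToString("X8"));
                 }
                 catch (OverflowException)
                 {
                     Console.WriteLine("OverflowException, as seen in the SharpDevelop build");
                 }
             }
         }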
  4. Sorry, but it works on both my Windows 8 and Windows 7 machines (both 64-bit, though). Does one of the changed lines from this commit look familiar? https://github.com/mnentwig/busbridge3/commit/887d1d3984dbf69ab5a7b72398d90ddde98ab0a9 If so, could you please try a clean checkout? Otherwise, the error message would be helpful (memTest is almost "toplevel"; everything happens in there).
  5. Thanks. Please check GitHub again: I added a fourth folder "sharpDevelop_build" (SharpDevelop 5.1 from SourceForge). It should run out of the box if a CMOD A7-35 is found and no other Digilent USB interface is present. If more debugging is needed, my exception handler might get in the way; I'd comment out the try/catch. Alternatively, enable "pause on exception", as shown in the screenshot here: http://community.sharpdevelop.net/blogs/siegfried_pammer/archive/2014/08/27/customizing-quot-pause-on-handled-exceptions-quot-in-sharpdevelop-5.aspx
     When porting to SharpDevelop I ran into an issue with type casting and bit shifting: running the older code on SharpDevelop will cause an overflow error. I don't know why; probably different interpretations (or versions?) of the standard. Whatever, it is fixed now by stating the problematic cast explicitly.
     To get rid of the relative link to top.bit (it's ugly), change this line in Program.cs
     bitstreamFile = @"..\..\..\busBridge3_RTL\busBridge3_RTL.runs\impl_1\top.bit";
     Console.WriteLine("DEBUG: Trying to open bitstream from "+bitstreamFile);
     e.g. to
     bitstreamFile = "top.bit";
     and copy top.bit into the folder where the .exe file is generated.
  6. Yes, most likely this is the reason: your design fails timing and does not work. Please don't assume that small amounts of TNS fail "soft" (but don't assume either that functional failure is guaranteed). The attempt to place-and-route an impossible design may wreck other parts that would otherwise have been uncritical. You're trying to run SHA1 in a single clock cycle, which is a very inefficient use of the FPGA. I haven't tried this example myself, but I believe you can let register rebalancing do the work of turning it into a feasible design if you simply delay the output by e.g. 100 clock cycles via a shift register. The option should be available in the tools, if it isn't already enabled by default. Done correctly, you can have many independent calculations in flight at the same time ("pipelined") and get much better overall performance from the FPGA logic.
  7. Thanks - if you manage to get it running, the latency number should be printed to the console from this part of Program.cs:
     sw2.Reset(); sw2.Start();
     int nRep = 1000;
     m.memTest32(memSize: 1,baseAddr: 0x87654321,nIter: nRep);
     Console.WriteLine("roundtrip time "+((double)sw2.ElapsedMilliseconds/(double)nRep)+" ms (should be close to 0.125 ms from 8 kHz USB 2.0 microframe rate)");
     Throughput is another matter. I didn't profile it today, but if I recall correctly it reaches 20+ MBit/s out of the box. There is an FTDI whitepaper on optimal buffer settings; there may be some opportunity for tweaking at the FTDI driver level and in the device manager. Starting from Visual Studio disables some of the JIT optimizations (I think); for profiling, run the .exe straight from Windows Explorer. There was a missing code line in bb3_lvl2_io.cs that shows up only in low-level JTAG read access, e.g. if I use it as a plain JTAG interface to connect straight to BSCANE2, or something like that. It's fixed now.
  8. Hi, as I may not have time for FPGA work for a while - just started in a fascinating new role related to high-speed digital diaper changing - I decided to post this now. Here's the Github repo (MIT-licensed). The project provides a very fast (compared to UART) interface via the ubiquitous FTDI chip to a Xilinx FPGA via JTAG. Most importantly, it achieves 125 us response time (roundtrip latency), which is e.g. 20..1000x faster than a USB sound card. It also reaches significantly higher throughput than a UART, since it is based on the MPSSE mode of the FTDI chip. Finally, it comes with a built-in bitstream uploader, which may be useful on its own. I implemented only the JTAG state transitions that I need, but in principle this can be easily copy-pasted for custom JTAG interfacing.
     So what do you get:
     - On the software side, an API (C#) that bundles transactions, e.g. scattered reads and writes, executes them in bulk and returns readback data (a sketch of the idea follows below)
     - On the RTL side, a very basic 32-bit bus-style interface that outputs the write data and accepts readback data, which must be provided in time. See the caveats.
     - In general, a significant increase in complexity over a UART. The performance comes at a price. In other words, if a UART will do the job for you, DO NOT use this project.
     For more info, please see the repo's readme file. For CMOD A7-35 it should build right out of the box. For smaller FPGAs, comment out the block RAM and memory test routines, or reduce the memory size in top.v and Program.cs. I hope this is useful.
     When I talked to the FTDI guys at Electronica last week, I did not get the impression that USB 3.0 will make the FT2232H obsolete any time soon for FPGA use: they have newer chips and modules, but they didn't seem nearly as convenient, e.g. the modules are large and require high-density connectors. In FPGA-land, I think USB 2.0 is going to stay...
     Cheers Markus
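     A sketch of the "bundle transactions, execute in bulk" idea. The class and method names below are made up for illustration and are NOT the actual busbridge3 API (which is documented in the repo's readme); only the batching concept is taken from the post:

         using System;
         using System.Collections.Generic;

         // Illustrative only: a made-up class showing the batching concept.
         class BatchedBus
         {
             readonly List<string> queue = new List<string>();

             public void QueueWrite(uint addr, uint data) { queue.Add("W " + addr.ToString("X8") + " " + data.ToString("X8")); }
             public void QueueRead(uint addr) { queue.Add("R " + addr.ToString("X8")); }

             // In the real project, the queued operations are serialized into one
             // MPSSE/JTAG byte stream, sent in bulk over USB, and the readback
             // data is parsed from the reply.
             public void Exec()
             {
                 Console.WriteLine("executing " + queue.Count + " queued operations in one round trip");
                 queue.Clear();
             }
         }

         class Demo
         {
             static void Main()
             {
                 var bus = new BatchedBus();
                 bus.QueueWrite(0x00000100, 123);   // scattered writes...
                 bus.QueueWrite(0x00000200, 456);
                 bus.QueueRead(0x00000100);         // ...and reads are only queued here
                 bus.Exec();                        // one USB round trip instead of three
             }
         }

     The point of the bundling is that many small reads and writes share one USB round trip instead of each paying the roundtrip latency on its own.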
  9. well, impossibility is a matter of scale... yes, with Terahash rates from ASIC miners this may make sense.
  10. This is the worst good news I've heard in a while 🙂 pow(2, 64) is an impractically large number. The NSA might pull it off but even so, chances are good that by then the bitcoin price has gone below zero.
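      To put pow(2, 64) in perspective, a quick back-of-the-envelope estimate (the hash rates below are my own illustrative assumptions, not numbers from the thread):

          using System;

          class BruteForceEstimate
          {
              static void Main()
              {
                  double attempts   = Math.Pow(2, 64);   // ~1.8e19
                  double fpgaRate   = 100e6;             // assumed: ~100 MH/s on a small FPGA
                  double asicRate   = 100e12;            // assumed: ~100 TH/s ASIC miner
                  double secPerYear = 365.25 * 24 * 3600;

                  Console.WriteLine("FPGA @ 100 MH/s: " + (attempts / fpgaRate / secPerYear).ToString("F0") + " years");   // ~5800 years
                  Console.WriteLine("ASIC @ 100 TH/s: " + (attempts / asicRate / 86400.0).ToString("F1") + " days");       // ~2 days
              }
          }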
  11. Just to warn you, in case the money is important to you: the learning curve is much (much, much) steeper than most people anticipate. FPGA alone is a brain-twister: we're used to sequential programming, whereas an FPGA operates in parallel. Combine that with a state-of-the-art CPU - most people aren't used to situations where the documentation alone is measured in shelf-meters. Just be aware that I can spend months of work time just learning, and my mental "map" is still largely blank spots (for example, ARM security is a huge topic one needs to be at least aware of). If you have the patience, it may be the most efficient $100 you have ever invested in your career.
  12. >> Is that solution Will run? I suspect you mean cross-correlation, and no, it will most likely not work. Maybe you'll save yourself much pain if you prototype the algorithm first in software. It doesn't need to be real time; e.g. get the freeware Octave and use the audioread() function. Be sure to use two independent recordings for the reference and the simulated microphone input.
  13. But it cannot drive 1.8 V if the pin is located on a bank supplied by 1.2 V e.g. for the DRAM.
  14. Hi, this is just a general comment (I don't own the board; maybe someone else is more familiar with this particular issue). First, KU040 is the name of a chip, not a board; I assume you are referring to this one (manual). If so, bank 45 for DDR is supplied by 1.2 V. Apparently, the "gt_reset" pin in your toplevel design wants to drive 1.8 V. This is physically impossible (one bank has only a single positive rail voltage). Possible solutions:
      - change the IO voltage of gt_reset to 1.2 V (for the above-mentioned board, Push_SW[0:4] and DIP_SW[0:7] on bank 45 should be driven by 1.2 V on the other end of the switch, if the board is designed correctly)
      - double-check that gt_reset is an input (AFAIK, mixing IO voltages is OK for pins that are input-only)
      - or move gt_reset to a different bank that is supplied by 1.8 V
  15. Hi, beware of eval boards. They are not modules, and vendors often actively resist supplying them in quantity. One example from my own past, with the same vendor: years ago, a gigasample DAC board with an HSMC connector (similar to FMC) was listed at $499. It took (I think) a month to get it, and when it arrived, the new revision had simplified supporting circuitry, with some features missing.
  16. Hi, on the PL side you'll find the air gets thinner when you run out of resources, in the sense that P&R slows down and it gets harder to close timing. This is because lack of resources adds additional constraints, compared to a sparse design. Most likely, you won't run into space constraints if you write the logic from scratch: if you manage to use up tens of thousands of LUTs with code that was hand-written from scratch by a single person within a few weeks, chances are high that there is something fundamentally wrong with the architecture or methodology. This is meant just as something to think about; it's easy to come up with counterexamples (e.g. highly parallel designs like bitcoin mining or a DIY GPU). On the other hand, clicking through a few IP wizards will easily create something huge. This isn't surprising if I keep in mind that FPGA vendors sell silicon by the square meter...
      For me, the most likely bottleneck feature is BRAM, because keeping memory accesses within the FPGA, within a single clock domain and at a constant small latency may dramatically reduce complexity. It also provides crazy memory bandwidth, but you need to re-think the algorithm around a few corners, e.g. in the neural network you mentioned: dedicated BRAMs for the input / hidden / output layers, weights, biases, MACs for parallel computation of row-column products, and a mux or two for the data routing (mild understatement...). Thinking back, my first "serious" FPGA adventure was to put a Lua interpreter on a softcore processor. It felt like "wow - wasn't this too easy?". The next day I found this to be correct, realizing that I was using up most of the BRAM of a USD 7k part...
      On the other hand, an extra $50 for a low-end FPGA can be the difference between a one-day hack and something in need of a project plan. If you're serious about FPGAs, saving $50 isn't worth it, at any level of competence. On the other hand, if you're not sure and just want to try, there's nothing wrong with getting the cheapest board. Just consider it disposable (and having a cheaper board I'm not afraid to damage may sometimes even be the better choice).
  17. I'll just post a link here: https://ccrma.stanford.edu/~jos/filters/Minimum_Phase_Polynomials.html It's not necessary, but it's a common mathematical shortcut for cabinet-modeling-like applications: it yields a shorter impulse response with an identical magnitude response. It doesn't preserve time dispersion, so it doesn't work for reverb-type applications.
  18. ...speaking of which: a minimum-phase transform is easily done in e.g. Octave - simply roots(), then mirror selected zeros to the other side of the unit circle (mag := 1/mag). Just mentioning this because it may be a necessary pre-processing step to get the job done without a PC-sized power supply (= identical magnitude response with the shortest possible FIR length). Whether or not it sounds the same depends on the application (e.g. a cabinet: yes. A cathedral: probably not).
  19. Hi, I'm not aware of any such instructions. It's a valid concern if you fully use the MAC capabilities of a 100T, but burning an FPGA would be fairly low on my list of things to worry about. If in doubt, you can enable automatic thermal shutdown by instantiating an XADC (see page 68 of https://www.xilinx.com/support/documentation/user_guides/ug480_7Series_XADC.pdf). A convolution reverb is a pretty good fit for the chip; not necessarily commercially viable, but easy to implement. You'll have no difficulty muxing e.g. 1000 audio-sample MACs at 96 kHz onto a single hardware multiplier (of which there are 160, but the block RAMs will most likely be the bottleneck); see the rough budget below. A 25-bit data path and 18-bit coefficients seem the logical choice, especially since RAM comes in multiples of 9 bits, not 8 (the "parity bit"). My guess is that the supporting infrastructure (e.g. how to upload coefficients into RAM, unless you want one impulse response "hard-coded", and interfacing with the codec) is much more work than a brute-force convolution algorithm. A possibility is to assign one side of the dual-port RAM to the convolution algorithm and the other side to configuration.
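      A rough feasibility check of the time-multiplexed MAC claim (the 100 MHz DSP clock below is my own conservative assumption; tap count, sample rate and multiplier count are from the post):

          using System;

          class ConvolutionBudget
          {
              static void Main()
              {
                  double sampleRate = 96e3;    // audio sample rate from the post
                  double taps       = 1000;    // MACs per output sample, per the post
                  double dspClock   = 100e6;   // assumed DSP clock, conservative for a 7-series part

                  double requiredMacRate = sampleRate * taps;   // 96e6 MAC/s needed
                  double oneMultiplier   = dspClock;            // 100e6 MAC/s available

                  Console.WriteLine("required:       " + requiredMacRate / 1e6 + " MMAC/s");
                  Console.WriteLine("one multiplier: " + oneMultiplier / 1e6 + " MMAC/s");
                  // So ~1000 taps fit on a single time-multiplexed multiplier, and the post
                  // quotes 160 hardware multipliers on the chip; block RAM bandwidth for the
                  // delay line and coefficients is the more likely limit.
              }
          }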
  20. Laughing out loud... welcome to the wonderworld of 2018 test engineering. Exciting new bugs are waiting to be discovered by YOU - it's as simple as trying to use an instrument for its advertised purpose (sorry, couldn't resist...)
  21. >> remote data logging application, maybe running for months One RPI option is to drive an external ADC through SPI, if that's sufficient for the data acquisition job. I've used this approach in the past with an MBED LPC1768 as host, and this ADC module http://papilio.cc/index.php?n=Papilio.AnalogGroveWing. I'm sure similar PMOD modules can be found on this site.
  22. ... actually, as it's the deassertion edge of WE that counts for level-driven inputs, I think the write logic needs to be like this:
      - cycle 0: set address and data
      - cycle 1: hold address and data, assert WE (a stable address is necessary so that other addresses are not overwritten randomly)
      - cycle 2: hold address and data, deassert WE (stable address and data are necessary since this ends the write operation)
  23. My theory is that it's the deassertion of WE in the transition from wr0 to idle. Right now it's uncontrolled, in the sense that the address might change nanoseconds before WE returns to non-asserted. Try adding a wr1 state that only changes the enable lines and leaves address / data untouched.
  24. What I meant is that the old version updates address_input only on start_operation, but the new one updates it continuously. Thinking about it, I'm not sure if this really is the root cause. The version from your last post would seem the most logical, though (for readability, not functionality). I'm in a bit of a hurry right now, but I'd check all warnings: can you explain every one to yourself? Also review the data sheet / timing. I can only encourage you not to leave any unsolved problems behind. It's annoying and hard work, but the potential insights are too valuable.
  25. Hi, the schematic is here: https://reference.digilentinc.com/_media/reference/programmable-logic/cmod-a7/cmod_a7_sch.pdf Double-check your constraints file: GCLK should be pin L17 (sheet 3). You don't need to write to flash: a bitstream .bit file can be uploaded directly to the FPGA. Much faster.