Posts posted by hamster

  1. A reasonably sized FPGA should be able to absorb your CPU designs, and the Nexys Video is a pretty capable board.

    Is there any chance of including the output of "report_utilization -hierarchical" so we can review things?

    My feeling is that with a bit of thought a lot of FFs and LUTRAMs could be revised to use more efficient, denser resources.

    Have you done much reading of the 7-series user guides, and the style guides that help you to write code that infers the most effective resources?


  2. 26 minutes ago, zygot said:

    Mull over your sine LUT. BTW, why did you replicate 2 cycles instead of just using 1 in your LUT?


    Because I am quite happy to use a whole block of RAM rather than debugging indexing and sign-flipping code... :D It's actually one full cycle (half positive, half negative). It is just a column from a Google docs spreadsheet https://docs.google.com/spreadsheets/d/13srKHRNCD2dfbMglMvvCUESHR23kHWlzAJ24MJ_erzc/edit?usp=sharing

    I could have also got away with just one quadrant, but that would mean more 'active' code in a part I'm not interested in playing with.

  3. 7 hours ago, zygot said:

    I glanced at your code.

    The lines like this one bother me: new_val := dac1_accum + sample - 2048;

    I wonder how Vivado implemented this. Usually, in my experience a better way to do this is to pipeline so that each '+' operation is performed on a separate pipeline version of the signal. From a coding viewpoint it appears to be straightforward. From an implementation viewpoint it looks like you have an implied latch, at least. Regardless, you are trying to perform 2 adds or one add and 1 subtract in a clock cycle. Even for a quick prototype exercise this isn't a good idea. Sometimes our HDL code gets written by the parts of our brain wired for C. It's dangerous.

    I can't see the latching issues, but agree that if the +/-2048 was a separate signal, then the code could be simplified quite a bit... however, the optimizer should be doing that at the moment. "Premature optimization is the root of all evil" and so on.

    One other finer point. As currently written, a +full scale value will generate a stream of all ones, but a -full scale value won't generate all zeros; a zero value gives a perfect 50:50 mix of ones and zeros.

    Others might need -full scale to give all zeros and +full scale to give all ones, in which case a zero value will give slightly more zeros than ones.


  4. BASYS3 + PMOD Breadboard + Analog Discovery 2.

    It was just a hack, so the table was a quick formula in a spreadsheet. Yes, I assume the + and - sides are both rounding towards zero, causing some asymmetry, but with 11 significant bits that should be somewhere about -60dB at a guess.

    Most of the noise is just the shoddy physical implementation. Flying wires on a breadboard, on PMODs, just the shielded wires on the AD2 and so on. If I leave a wire hanging around it will pick up most of the noise too, maybe 6dB lower than on the channel that is under measurement.

    This was just a quick experiment, just using the 100MHz clock rate. If I use a slower clock (e.g. update only every 8 cycles, so 12.5MHz) the noise floor actually drops a lot.

    Also using the AD2 on a different laptop to the one programming/powering the FPGA removes a lot of noise too. I assume that this is due to voltage drops and noise on the USB cables.

    Plenty of room for experimentation and improvement.

  5. 41 minutes ago, CPerez10 said:

    Went through the code. It does indeed work, thank you. This raises a few questions though. Are VHDL logic vectors stored as little endian, or just iterated backwards? And is there a book/website on timing with VHDL? The button synchronization needs a bit of an explanation.

    I think of std_logic_vector the same way I would think of digits in a number... the rightmost digit is digit zero. It most likely isn't the best way to set things up for this example, but it avoids the need to swap the bit ordering in the ASCII characters.
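
    To make the "rightmost digit is digit zero" idea concrete, here is a small C model of one serial character, assuming the usual 8N1 framing (start bit low, eight data bits sent least significant bit first, stop bit high). The names uart_frame and uart_frame_bit are made up for this illustration and are not part of the design.

```c
#include <stdint.h>

/* Build a 10-bit frame for one character.
   Bit 0 is the start bit (0), bits 1..8 are the data bits with the
   character's bit 0 first, and bit 9 is the stop bit (1). */
uint16_t uart_frame(uint8_t c)
{
    return (uint16_t)((1u << 9) | ((unsigned)c << 1));
}

/* Value on the wire during bit period n (0 is sent first). */
int uart_frame_bit(uint16_t frame, unsigned n)
{
    return (frame >> n) & 1;
}
```

    Sending bit 0 of the frame first puts the start bit on the wire first, followed by the character LSB first, which is the order a serial receiver expects, so the ASCII constants can be written in their normal reading order.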

    Oh, for the button synchronization... signals take time to get across the chip (speed of light, capacitance and so on), so different parts of the design can see different values for the same signal as it changes. As you can't control when the user might press the button, you have to sample the value of the input signal on the clock edge, holding that in a register. That registered value is then used to drive the rest of your logic.

    There is a slight complication - if the signal from the button changes state *exactly* on the clock edge, the flip-flop might not be able to correctly register it as a 1 or a 0, but could be in some weird "metastable" state that takes a short while to become either a 1 or a 0. To stop this causing bugs in the operation of the logic deeper in the design, the output of that flip-flop is then sampled a second time to get a "known good, either 1 or 0" signal that can get to where it needs to within a clock cycle.

    Hence the design pattern... btn  gets sampled into btn_metastable (which is a bit dodgy if you use it), and then btn_metastable gets sampled into btn_synchronized, which is then used by the rest of the logic.

  6. Had a hack at it... tested working on BASYS3


    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    use IEEE.NUMERIC_STD.ALL; -- needed for the unsigned type and to_integer()
    entity msg_repeater is
        Port ( clk : in  STD_LOGIC;
               btn : in  STD_LOGIC;
               tx  : out STD_LOGIC;
               led : out STD_LOGIC_VECTOR (3 DOWNTO 0));
    end msg_repeater;
    architecture Behavioral of msg_repeater is
       constant char_t     : std_logic_vector (7 downto 0) := "01010100";
       constant char_e     : std_logic_vector (7 downto 0) := "01000101";
       constant char_s     : std_logic_vector (7 downto 0) := "01010011";
       constant char_space : std_logic_vector (7 downto 0) := "00100000";
       -- Note message is sent from bit 0 to the highest
       signal msg : std_logic_vector (49 downto 0) := 
             char_space & "0" &   -- Character 4, with start bit
             "1" & char_t & "0" & -- Character 3, wrapped in start and stop bit
             "1" & char_s & "0" & -- Character 2, wrapped in start and stop bit
             "1" & char_e & "0" & -- Character 1, wrapped in start and stop bit
             "1" & char_t & "0" & -- Character 0, wrapped in start and stop bit 
             "1";                 -- Idle symbol and stop bit of the last character
       signal msg_index : unsigned( 7 downto 0) := (others =>'0');
       -- Should we send a message?
       signal triggered        : std_logic := '0';
       -- for generating the baud tick rate
       constant clock_rate     : natural := 100000000;
       constant baud_rate      : natural := 9600;
       signal baud_counter     : unsigned(27 downto 0) := (others => '0');
       signal baud_tick        : std_logic := '0';
       -- For the button synchronizer
       signal btn_synchronized : std_logic := '0';
       signal btn_metastable   : std_logic := '0';
    begin
       process(clk)
       begin
          if rising_edge(clk) then
             -- Set the serial output bit
             tx  <= msg(to_integer(msg_index));
             led <= "0001";
             -- Controlling the message index
             if baud_tick = '1' then            
                if msg_index = 0 then
                   -- We are waiting to be triggered
                   if triggered = '1' then
                      msg_index <= msg_index + 1;
                   end if;
                elsif msg_index = msg'high then
                   -- We have finished the message
                   msg_index <= (others => '0');
                   triggered <= '0';
                else
                   -- We are sending bits
                   msg_index <= msg_index + 1;
                end if;
             end if;
             -- Generating the baud tick
             if baud_counter < baud_rate then
                baud_tick <= '1';
                baud_counter <=  baud_counter - baud_rate + clock_rate;
             else
                baud_tick <= '0';
                baud_counter <=  baud_counter - baud_rate;
             end if;
             -- Seeing if we are triggered
             if btn_synchronized  = '1' then
                triggered <= '1';
             end if;
             -- Synchronize the button with the clock domain
             btn_metastable <= btn;
             btn_synchronized <= btn_metastable;
          end if;
       end process;
    end Behavioral;


  7. Away from my laptop at the moment, but the way I would do this:

    A register sized to hold your clock rate (28 bits for 100MHz). 

    If it is less than the baud_rate, set 'baud_tick' to '1' and add (clock_rate - baud_rate) to the register. Otherwise set 'baud_tick' to '0' and subtract the baud rate from the register.

    That will give you a 'baud_tick' that is '1' for the right number of cycles per second, and allow you to keep everything in the design running in the same clock domain.
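
    As a rough sanity check of that scheme, here is a C model of the register, stepped once per clock cycle. The names baud_gen, baud_gen_step and count_baud_ticks are invented for this sketch; the real thing would be a clocked process in the design.

```c
#include <stdint.h>

/* The register from the description; 28 bits is enough for 100 MHz,
   so a 32-bit integer is plenty for the model. */
typedef struct {
    uint32_t counter;
} baud_gen;

/* Advance one clock cycle. Returns 1 on cycles where the tick is high. */
int baud_gen_step(baud_gen *g, uint32_t clock_rate, uint32_t baud_rate)
{
    if (g->counter < baud_rate) {
        g->counter += clock_rate - baud_rate;
        return 1;
    }
    g->counter -= baud_rate;
    return 0;
}

/* Count how many ticks are produced over a number of clock cycles. */
uint32_t count_baud_ticks(uint32_t clock_rate, uint32_t baud_rate,
                          uint32_t cycles)
{
    baud_gen g = { 0 };
    uint32_t ticks = 0;
    for (uint32_t n = 0; n < cycles; n++)
        ticks += (uint32_t)baud_gen_step(&g, clock_rate, baud_rate);
    return ticks;
}
```

    The register always stays between 0 and clock_rate - 1, and after clock_rate cycles it is back at its starting value, so exactly baud_rate ticks are produced per second even when the clock rate is not an exact multiple of the baud rate. Individual ticks can be one cycle early or late, which is harmless at these rates.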

    You also want to have a synchronizer on your button, to make it work reliably. 

    You also have a problem in that when the button is lifted you will stop sending data straight away, so will most likely send an incomplete message. 

    Due to the way that VHDL signal assignments work you will try to send out bit 50 of your message, so that might mess things up. 

    Would you like me to have a crack at rewriting it for you, so you can see the difference?

  8. IC9 is the 100MHz oscillator. It is on the bottom, just to the right of center. 

    See the bottom left of page 5 of the schematics (on the resources page). 

    Part code is DSC1033CC1-100.0000T

  9. Ok - here's how to drive the seven segments, from 1000 feet up.

    You need to have the constraints for the segments and the anodes for the display. See the board's reference manual and master XDC file for them.

    ##7 segment display
    #set_property -dict { PACKAGE_PIN T10   IOSTANDARD LVCMOS33 } [get_ports { CA }]; #IO_L24N_T3_A00_D16_14 Sch=ca
    #set_property -dict { PACKAGE_PIN R10   IOSTANDARD LVCMOS33 } [get_ports { CB }]; #IO_25_14 Sch=cb
    #set_property -dict { PACKAGE_PIN K16   IOSTANDARD LVCMOS33 } [get_ports { CC }]; #IO_25_15 Sch=cc
    #set_property -dict { PACKAGE_PIN K13   IOSTANDARD LVCMOS33 } [get_ports { CD }]; #IO_L17P_T2_A26_15 Sch=cd
    #set_property -dict { PACKAGE_PIN P15   IOSTANDARD LVCMOS33 } [get_ports { CE }]; #IO_L13P_T2_MRCC_14 Sch=ce
    #set_property -dict { PACKAGE_PIN T11   IOSTANDARD LVCMOS33 } [get_ports { CF }]; #IO_L19P_T3_A10_D26_14 Sch=cf
    #set_property -dict { PACKAGE_PIN L18   IOSTANDARD LVCMOS33 } [get_ports { CG }]; #IO_L4P_T0_D04_14 Sch=cg
    #set_property -dict { PACKAGE_PIN H15   IOSTANDARD LVCMOS33 } [get_ports { DP }]; #IO_L19N_T3_A21_VREF_15 Sch=dp
    #set_property -dict { PACKAGE_PIN J17   IOSTANDARD LVCMOS33 } [get_ports { AN[0] }]; #IO_L23P_T3_FOE_B_15 Sch=an[0]
    #set_property -dict { PACKAGE_PIN J18   IOSTANDARD LVCMOS33 } [get_ports { AN[1] }]; #IO_L23N_T3_FWE_B_15 Sch=an[1]
    #set_property -dict { PACKAGE_PIN T9    IOSTANDARD LVCMOS33 } [get_ports { AN[2] }]; #IO_L24P_T3_A01_D17_14 Sch=an[2]
    #set_property -dict { PACKAGE_PIN J14   IOSTANDARD LVCMOS33 } [get_ports { AN[3] }]; #IO_L19P_T3_A22_15 Sch=an[3]
    #set_property -dict { PACKAGE_PIN P14   IOSTANDARD LVCMOS33 } [get_ports { AN[4] }]; #IO_L8N_T1_D12_14 Sch=an[4]
    #set_property -dict { PACKAGE_PIN T14   IOSTANDARD LVCMOS33 } [get_ports { AN[5] }]; #IO_L14P_T2_SRCC_14 Sch=an[5]
    #set_property -dict { PACKAGE_PIN K2    IOSTANDARD LVCMOS33 } [get_ports { AN[6] }]; #IO_L23P_T3_35 Sch=an[6]
    #set_property -dict { PACKAGE_PIN U13   IOSTANDARD LVCMOS33 } [get_ports { AN[7] }]; #IO_L23N_T3_A02_D18_14 Sch=an[7]

    I suggest you update them so the first 8 are called "segments[0]" through "segments[7]" and the last 8 are "anodes[0]" through "anodes[7]".

    You will also need constraints for the other signals you are using - e.g. the clock signal.

    You will then need to add the outputs to the top level design:

       segments : out std_logic_vector(7 downto 0);
       anodes   : out std_logic_vector(7 downto 0);


    In your top level design, you can select which digit to light by setting one of the anodes low, and also setting the pattern you want on the segments (a '0' lights a segment). For example:

      anodes <= "11111110";
      segments <= "01010101";

    should light segments B, D, F and the decimal point on digit 0 of the display. See section 9.1 of the reference manual to see which segment is where on the display.

    With that going you will then want to add a new sub-module to your design that takes the 8 digits you want to display and converts them to a pattern of segments and anodes, stepping through the digits at a speed where they don't visibly flicker (e.g. show each digit for 1/400th of a second, then move on to the next).

    The interface to this component will be something like:


    entity  seven_seg_display is 
       port (
          clk  : in std_logic;
          digit0 : in std_logic_vector(3 downto 0);
          digit1 : in std_logic_vector(3 downto 0);
          digit2 : in std_logic_vector(3 downto 0);
          digit3 : in std_logic_vector(3 downto 0);
          digit4 : in std_logic_vector(3 downto 0);
          digit5 : in std_logic_vector(3 downto 0);
          digit6 : in std_logic_vector(3 downto 0);
          digit7 : in std_logic_vector(3 downto 0);
          segments : out std_logic_vector(7 downto 0);
          anodes   : out std_logic_vector(7 downto 0)
       );
    end seven_seg_display;

    Once you have that module connected, and with the body of it written (it is a counter and a few case statements), you should then do a bit of testing (e.g. connect the switches to the digits) and should be ready to display your data. This testing step is vital, to ensure that your LED patterns are correct, and you have got left and right correct for the anodes.

    Another test might be to create a 32-bit long counter, and then connect the digits to bit slices within that counter.
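
    To show what the "counter and a few case statements" have to compute, here is a behavioural model in C. It is only a sketch: it assumes segments[0] is segment A through segments[6] as G and segments[7] as the decimal point (the renaming suggested above), that both segments and anodes are active low, and that 'scan' comes from the top three bits of a free-running counter. The names seg_pattern, display_drive and seven_seg_scan are made up here.

```c
#include <stdint.h>

/* Active-low segment patterns for hex digits 0..F.
   Bit 0 = A, bit 1 = B, ... bit 6 = G, bit 7 = decimal point (off). */
static const uint8_t seg_pattern[16] = {
    0xC0, 0xF9, 0xA4, 0xB0, 0x99, 0x92, 0x82, 0xF8,
    0x80, 0x90, 0x88, 0x83, 0xC6, 0xA1, 0x86, 0x8E
};

typedef struct {
    uint8_t anodes;   /* one bit low selects the digit being lit */
    uint8_t segments; /* low bits light the matching segments */
} display_drive;

/* Outputs for one scan position; digits[0] is the rightmost digit. */
display_drive seven_seg_scan(const uint8_t digits[8], unsigned scan)
{
    display_drive d;
    unsigned pos = scan & 7u;                 /* wrap to 8 digits */
    d.anodes   = (uint8_t)~(1u << pos);       /* a single zero bit */
    d.segments = seg_pattern[digits[pos] & 0x0Fu];
    return d;
}
```

    In VHDL the table becomes a case statement on the selected 4-bit digit, and the anode selection becomes a second case statement on the counter bits. Check the patterns against the reference manual before trusting them on your board.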

  10. Hi!

    RC4 was designed so that it doesn't map well to dedicated hardware but needs only a small amount of compute power, so it can be implemented on all but the tiniest microcontrollers. Implementing it in an FPGA is therefore a great learning experience of which software problems are hard in hardware, and why.

    I looked into implementing RC4 a long while ago, and decided that a high performance implementation is pretty much impossible (where high performance is greater than one byte encoded/decoded per cycle).

    I've had a look at your code, and it looks like you are writing only for simulation. For example:

        -- Initialize and return an integer array 0 - 255
        function initialS256 return t_Integer_Array is
            variable S: t_Integer_Array(0 to 255);
            begin
                for i in 0 to 255 loop
                    S(i) := i;
                end loop;
                return S;
        end initialS256;

    This won't turn into useful hardware if you try to implement it in an FPGA.

    What you want to do is restructure it around some sort of block diagram style system, with a big focus on how you intend to store the array that holds the cipher state - is it all in a block RAM, held in distributed RAM, or just in 2048 flip-flops (256 entries of 8 bits)? Each has a different set of restrictions that limit how you can describe the hardware.
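
    For reference when drawing that block diagram, this is the standard RC4 algorithm written out in C (the type and function names are just for this sketch, and key_len must be at least 1). It is the software view, not a suggestion for how to structure the hardware.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t s[256]; /* the cipher state: a permutation of 0..255 */
    uint8_t i, j;   /* the two indexes used while generating output */
} rc4_state;

/* Key scheduling: 256 dependent read/swap steps over the state array. */
void rc4_init(rc4_state *st, const uint8_t *key, size_t key_len)
{
    for (unsigned n = 0; n < 256; n++)
        st->s[n] = (uint8_t)n;

    uint8_t j = 0;
    for (unsigned n = 0; n < 256; n++) {
        j = (uint8_t)(j + st->s[n] + key[n % key_len]);
        uint8_t tmp = st->s[n];
        st->s[n] = st->s[j];
        st->s[j] = tmp;
    }
    st->i = 0;
    st->j = 0;
}

/* One keystream byte: three reads and two writes, each address
   depending on data read just before it. */
uint8_t rc4_next(rc4_state *st)
{
    st->i = (uint8_t)(st->i + 1);
    st->j = (uint8_t)(st->j + st->s[st->i]);
    uint8_t tmp = st->s[st->i];
    st->s[st->i] = st->s[st->j];
    st->s[st->j] = tmp;
    return st->s[(uint8_t)(st->s[st->i] + st->s[st->j])];
}
```

    The thing to notice is the chain of data-dependent addresses in rc4_next: the address of the second read depends on the first, and the address of the last read depends on both. With a dual-port block RAM that is several cycles per output byte, which is why the choice of storage dominates the design.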

    What are you doing the project for? Is it for a course or self-directed learning? 

    As for the 7-seg, you should be able to find other posts covering that here and/or in the FPGA reference manual, or the Digilent GitHub repo.

  11. On this topic I've been making an audio DSP board using the CMOD A7, where additional noise is a real pain.

    My initial prototype board had some audio noise problems - I couldn't hear it but I could measure it. I initially thought it was due to the CMOD-A7 and could not be fixed, but eventually put it down to quite a few different causes:

    - I had nearly shorted the output of one of the DACs to GND, which was causing spikes on the power rail. Once fixed things were a lot better, but not perfect.

    - I had not made any real attempt to stitch the top fill to the ground plane on the bottom - after all it was a hack.

    - I didn't have any series resistors in the I2S lines. I added 50 ohm ones (just picked a random value out of the air - might look at this again)

    - I had a few capacitor bodges standing up in the air, which could only make things worse

    - I was measuring very close to the FPGA, with a high impedance scope probe

    So I addressed all of these in the next prototype, and made up a test jig allowing me to measure 30cm from the board and things are much better - to the point I can't reliably measure any additional noise in the audio band.

    I guess what I am trying to say is that even with just one GND pin the CMOD-A7 can be part of a low noise audio system, but you have to put some extra thinking and work in to make it happen. 

    This may or may not be of use to your use-case.

  12. The last of the parts came in and the new board is up and running.

    Here are the old and new boards side by side, and the spectrum of a 10kHz test tone going from the ADC, through the FPGA and then the DAC (top = new board, middle = old board, bottom = no board in the loop).

    The additional work I did on grounding on the PCB has paid off, with a very good noise floor - better than I can measure with the tools I have to hand.




  13. 3 hours ago, xc6lx45 said:

    For comparison, I got the following LUT counts for James Bowman's J1B (16 bit instruction, 32 bit ALU) CPU which I know quite well:

    * 673 LUTs = 3.3% utilization of A7-35 with 32 stack levels in distributed RAM (replacing the original shift register based stack which does not look efficient on Xilinx 7 series)
    * 526 LUTs if reducing the +/- 32 bit barrel shifter to +/- 1 bit, but the performance penalty is severe (e.g. IMM values need to be constructed from shifts).
    * 453 LUTs if further allowing one BRAM18 for each of the two stacks. This includes a UART and runs at slightly more than 100 MHz but memory/IO need two instructions / two cycles.

    So the RISC "overhead" does not seem that dramatic. It's slightly bigger, somewhat slower but has baseline opcodes (e.g. arithmetic shift and subtract, if I read it correctly) that J1B needs to emulate in SW).

    It would be interesting to know where the memory footprint goes when I use (soft) floats. I've done the experiment in the recent past with microblaze MCS, and did not like what I saw. On J1B I need about 320 bytes for (non IEEE 754) float + -  * / painfully slow without any hardware support but it keeps the boat afloat, so to speak.

    Using C instead of bare metal assembly would be tempting.... I just wonder how much effort it takes to install the toolchain.


    I just had a look at the J1b source, and saw something of interest (well, at least to weird old me):

            4'b1001: _st0 = st1 >> st0[3:0];
            4'b1101: _st0 = st1 << st0[3:0];

    A 32-bit shifter takes two and a half levels of 4-input, 2-select MUXes per input bit PER DIRECTION (left or right), and the final selection between the two takes another half a LUT, so about 160 LUTs in total (which agrees with the numbers above).

    However, if you optionally reverse the order of bits going in, and then also reverse them going out of the shifter, then the same shifter logic can do both left and right shifts.

    This needs only three and a half levels of LUT6s, and no output MUX is needed. That is somewhere between 96 and 128 LUTs, saving maybe up to 64 LUTs.

    It's a few more lines of quite ugly code, but might save ~10% of logic and may not hit performance (unless the shifter becomes the critical path...).
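
    The trick is easy to check in software before committing to the ugly HDL. Here is a C model, using made-up names (reverse_bits32, shared_shift), where one right-shifter serves both directions. It uses a 5-bit shift amount to cover the full 32-bit range, which is my assumption rather than something taken from the J1b source.

```c
#include <stdint.h>

/* Reverse the order of the 32 bits in x. In an FPGA this is only
   wiring, plus a 2:1 mux to choose reversed or not. */
uint32_t reverse_bits32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Logical shift in either direction using a single right-shifter. */
uint32_t shared_shift(uint32_t value, unsigned amount, int shift_left)
{
    amount &= 31u;
    uint32_t in      = shift_left ? reverse_bits32(value) : value;
    uint32_t shifted = in >> amount;          /* the one real shifter */
    return shift_left ? reverse_bits32(shifted) : shifted;
}
```

    Reversing, shifting right and reversing again moves every bit the same distance in the opposite direction, so the result matches a plain left shift. An arithmetic right shift would need extra handling for the sign fill, which this sketch leaves out.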

  14. The toolchain is pretty simple to build but takes a while - for me it was just clone https://github.com/riscv/riscv-gnu-toolchain, create /opt/riscv (and change ownership), then run './configure' with the correct options, then 'make'. There are a whole lot of different instruction set options and ABIs, so I definitely recommend building from source rather than downloading prebuilt images.

    At the moment I haven't included any of the stdlib or soft floating point. I'll add that to the "todo someday" list.

  15. I've just posted my holiday project to Github - Rudi-RV32I - https://github.com/hamsternz/Rudi-RV32I

    It is a 32-bit CPU, memory and peripherals for a simple RISC-V microcontroller-sized system for use in an FPGA.

    It is a very compact implementation that can use under 750 LUTs and as little as two block RAMs - < 10% of an Artix-7 15T.

    All instructions can run in a single cycle, at around 50MHz to 75MHz. Actual performance currently depends on the complexity of the system bus.

    It has full support for the RISC-V RV32I instructions, and has supporting files that allow you to use the RISC-V GNU toolchain (i.e. standard GCC C compiler) to compile programs and run them on your FPGA board.

    Here is an example of the sort of code I'm running on it - a simple echo test that counts characters on the GPIO port that I have connected to the LEDs.

    // These match the address of the peripherals on the system bus.
    volatile char *serial_tx        = (char *)0xE0000000;
    volatile char *serial_tx_full   = (char *)0xE0000004;
    volatile char *serial_rx        = (char *)0xE0000008;
    volatile char *serial_rx_empty  = (char *)0xE000000C;
    volatile int  *gpio_value       = (int  *)0xE0000010;
    volatile int  *gpio_direction   = (int  *)0xE0000014;
    int getchar(void) {
      // Wait until status is zero (a character has arrived)
      while(*serial_rx_empty) {
      }
      // Return the received character
      return *serial_rx;
    }

    int putchar(int c) {
      // Wait until status is zero (there is room to send)
      while(*serial_tx_full) {
      }
      // Output character
      *serial_tx = c;
      return c;
    }

    int puts(char *s) {
        int n = 0;
        // Send each character, counting as we go
        while(*s) {
          putchar(*s);
          s++;
          n++;
        }
        return n;
    }

    int test_program(void) {
      puts("System restart\r\n");
      /* Run a serial port echo */
      *gpio_direction = 0xFFFF;
      while(1) {
        putchar(getchar());
        // Count the characters on the LEDs
        *gpio_value = *gpio_value + 1;
      }
      return 0;
    }

    As it doesn't have interrupts it isn't really a general purpose CPU, but somebody might find it useful for command and control of a larger FPGA project (converting button presses or serial data into control signals). It is released under the MIT license, so you can do pretty much whatever you want with it.

    Oh, all resources are inferred, so it is easily ported to different vendor FPGAs (unlike vendor IP controllers)

  16. WAV files are the simplest to work with.

    1. A WAV file has a small header on it, then the rest is all raw sample data, usually stereo pairs of 16-bit signed numbers. Just write a small program in your favorite scripting language to print out the data after about 64 bytes.

    2. For phone-quality audio, you need a bandwidth of 300Hz to 3kHz - this needs around 8000 samples per second, and about 8-bit sample depth. You could use some u-law or a-law compression to increase dynamic range (https://en.wikipedia.org/wiki/Μ-law_algorithm)

    3. 8 kilobytes per second, if you play raw 8-bit samples.

    Oh, and to convert data from a WAV file to lower sample rates (e.g. from 48kS/s to 8kS/s) you can't just drop 5 out of six samples - you need to first filter off the frequencies greater than half the target sample rate. It's not that challenging to actually do in code (usually just a couple of 'for' loops around something like "out[x] += in[x+i] * filter[i]") but generating the magic values for the filter can be interesting.
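
    Here is roughly what those couple of 'for' loops look like in C, as a sketch only. The function name decimate_q15 and the Q15 fixed-point tap format (32768 represents 1.0) are my own choices for the example, and the filter taps are left to the caller because designing them properly is the interesting part.

```c
#include <stddef.h>
#include <stdint.h>

#define Q15_ONE 32768 /* tap value that represents 1.0 */

/* Filter and keep one sample out of every 'factor' input samples.
   Returns the number of samples written to 'out'. */
size_t decimate_q15(const int16_t *in, size_t n_in,
                    int16_t *out, size_t factor,
                    const int16_t *taps, size_t n_taps)
{
    size_t n_out = 0;
    if (factor == 0 || n_taps == 0)
        return 0;

    /* Outer loop: one pass per output sample. */
    for (size_t start = 0; start + n_taps <= n_in; start += factor) {
        int64_t acc = 0;
        /* Inner loop: weighted sum of the neighbouring input samples. */
        for (size_t i = 0; i < n_taps; i++)
            acc += (int64_t)in[start + i] * taps[i];
        out[n_out++] = (int16_t)(acc / Q15_ONE);
    }
    return n_out;
}
```

    For a quick experiment you could pass eight taps of 4096 each (a plain average, which is a very poor low-pass filter) with a factor of 6 to go from 48kS/s to 8kS/s. The taps should sum to about 32768 so that the output level matches the input.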


  17. The "DC and Switching characteristics" tells you the delays in the primitives, but can't tell you the routing delays. The only way to truly know is to build the design in Vivado, and then look at the timing report.

    Inference of DSP blocks and features is pretty good as long as your design is structured to map onto the DSP slices. There are little gotchas like not attempting to reset registers in the DSP slice that don't support it.

    Skim reading the DSP48 User Guide will pay off many times over in time saved from not having to redesign stuff over and over to help it map to the hardware. 

  18. My views - if you want to learn low-level stuff (e.g. VHDL/Verilog coding), buy a board with lots of buttons, LEDs, switches and different I/O over a more application specific development board. I think that the Basys3 is pretty good for this and better than the Arty. Once you have sharpened your skills, then look for a board that will support your projects.

    If you want to initially work at a systems level, using IP blocks and so on, then look for a board that has interfaces that support your area of interest. Debugging H/W when you are also debugging FPGA designs is no fun. A Zynq based board (e.g. Zybo) would be good, as it already has a CPU that is much better (faster, less power, better features) than a CPU you could implement in the FPGA fabric. Just be warned that with a Zynq system the SDRAM memory is usually on the far side of the processor system, so you don't get direct access to it - you need to access it over an AXI interface and compete with the CPU for bandwidth.

  19. 2 hours ago, skylape said:

    Would something like this work https://www.xilinx.com/support/documentation/ip_documentation/div_gen/v5_1/pg151-div-gen.pdf ? It is a IP wizard from xillinx. 

    It may well do, but not knowing *all* the details of what you are doing means I can't offer you useful advice. 

  20. 54 minutes ago, Andrew Touma said:

    Thank you very much for the assistance! I was able to make the necessary corrections and my project is running smoothly now. I also found some other errors in the logic of my code, which I have corrected as well. 

    Yay! Glad to have helped.