Piasa

  1. Addition, but not for numbers!

    For FPGAs, take the input and make a bit-reversed version. This can be done with a function in VHDL:

        -- reverses a std_logic_vector and returns a value with the same range as input.
        function reverse(x : std_logic_vector) return std_logic_vector is
          variable xnml   : std_logic_vector(x'length-1 downto 0) := x;
          variable rev    : std_logic_vector(0 to x'length-1);
          variable result : std_logic_vector(x'range);
        begin
          for i in xnml'range loop
            rev(i) := xnml(i);
          end loop;
          result := rev;
          return result;
        end function;

        -- find leftmost 1 and return a 1 hot version
        function leftmost(x : std_logic_vector) return std_logic_vector is
          variable xnml : std_logic_vector(x'length-1 downto 0) := x;
          variable rev  : signed(x'length-1 downto 0);
        begin
          rev := signed(reverse(xnml));
          -- bitwise "and" is needed here; in VHDL "&" is concatenation
          return reverse(std_logic_vector(rev and -rev));
        end function;

    This requires the unary "-" function, which can be imported from std_logic_signed, or by having a signed vector. If you want the index of the leftmost 1, that would be a priority encoder.
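The same reverse, isolate, reverse idea can be checked quickly in software. This is only a sketch in Python, assuming a fixed `width`; the names `reverse_bits` and `leftmost_one_hot` are made up for illustration and are not part of the VHDL above.

```python
def reverse_bits(x: int, width: int) -> int:
    """Return x with its low `width` bits in reversed order."""
    result = 0
    for i in range(width):
        if (x >> i) & 1:
            result |= 1 << (width - 1 - i)
    return result


def leftmost_one_hot(x: int, width: int) -> int:
    """One-hot vector of the leftmost 1 in x, or 0 if no bit is set."""
    mask = (1 << width) - 1
    rev = reverse_bits(x & mask, width)
    # rev & -rev isolates the rightmost 1 of the reversed value
    lowest = rev & (-rev & mask)
    # reversing again moves that bit back to the leftmost-1 position
    return reverse_bits(lowest, width)
```

For example, `leftmost_one_hot(0b00101100, 8)` isolates bit 5, and an all-zero input returns 0.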
  2. Addition, but not for numbers!

    This is a fun and short post on how addition/subtraction can be used for logic and not just numbers. The example I use is "x & -x". This expression takes a vector, finds the rightmost 1 and sets all left bits to 0. The right bits are already 0. If the input is 0, the expression returns an all 0 vector. Thus the expression either gives a 1-hot vector of the rightmost 1, or gives 0 if there is no bit set to 1.

    There are a lot of interesting expressions. But some are imperfect for use. For example "x | (x-1)" will find the rightmost 1 and set all bits to the right to 1. But if the input is 0 and a 1 isn't found, the result is all 1's.

    I've provided a short table of some of the expressions, what they do, and what they return in the all 0 or all 1 case. I think it is accurate but I haven't fully checked all of them.

    These are rarely used, but are useful for inductive logic. I've found this to be a useful interview question as it allows someone to show they can take an expression like "x & (x-1)" and describe what it does from a practical perspective. It also allows them to describe why the implementation might be good or bad vs other coding choices. The use of the carry chain means the logic will hotspot more than versions that don't use the carry chain. But if the logic is very local this isn't an issue.

    +-----------------------------------------+------------+-----------+
    | Task                                    | expression | not found |
    +-----------------------------------------+------------+-----------+
    | find the rightmost 1:                   |            |           |
    |   leave unchanged (1)                   |            |           |
    |     leave bits on left unchanged        |            |           |
    |       leave bits on right unchanged (0) | x          | 0 -> 0    |
    |       set bits on right to 1:           | x | (x-1)  | 0 -> -1   |
    |     set bits on left to 0:              |            |           |
    |       leave bits on right unchanged (0) | x & -x     | 0 -> 0    |
    |       set bits on right to 1:           | x ^ (x-1)  | 0 -> -1   |
    |     set bits on left to 1:              |            |           |
    |       leave bits on right unchanged (0) | x | -x     | 0 -> 0    |
    |       set bits on right to 1:           | -1         | 0 -> -1   |
    +-----------------------------------------+------------+-----------+
    | find the rightmost 1:                   |            |           |
    |   clear (0)                             |            |           |
    |     leave bits on left unchanged        |            |           |
    |       leave bits on right unchanged (0) | x & (x-1)  | 0 -> 0    |
    |       set bits on right to 1:           | x-1        | 0 -> -1   |
    |     set bits on left to 0:              |            |           |
    |       leave bits on right unchanged (0) | 0          | 0 -> 0    |
    |       set bits on right to 1:           | ~x & (x-1) | 0 -> -1   |
    |     set bits on left to 1:              |            |           |
    |       leave bits on right unchanged (0) | x ^ -x     | 0 -> 0    |
    |       set bits on right to 1:           | ~x | (x-1) | 0 -> -1   |
    +-----------------------------------------+------------+-----------+

    +-----------------------------------------+------------+-----------+
    | Task                                    | expression | not found |
    +-----------------------------------------+------------+-----------+
    | find the rightmost 0:                   |            |           |
    |   leave unchanged (0)                   |            |           |
    |     leave bits on left unchanged        |            |           |
    |       leave bits on right unchanged (1) | x          | -1 -> -1  |
    |       set bits on right to 0:           | x & (x+1)  | -1 -> 0   |
    |     set bits on left to 0:              |            |           |
    |       leave bits on right unchanged (1) | x & (~x-1) | -1 -> -1  |
    |       set bits on right to 0:           | 0          | -1 -> 0   |
    |     set bits on left to 1:              |            |           |
    |       leave bits on right unchanged (1) | x | (~x-1) | -1 -> -1  |
    |       set bits on right to 0:           | ~x ^ (x+1) | -1 -> 0   |
    +-----------------------------------------+------------+-----------+
    | find the rightmost 0:                   |            |           |
    |   set (1)                               |            |           |
    |     leave bits on left unchanged        |            |           |
    |       leave bits on right unchanged (1) | x | (x+1)  | -1 -> -1  |
    |       set bits on right to 0:           | x+1        | -1 -> 0   |
    |     set bits on left to 0:              |            |           |
    |       leave bits on right unchanged (1) | x ^ (x+1)  | -1 -> -1  |
    |       set bits on right to 0:           | ~x & (x+1) | -1 -> 0   |
    |     set bits on left to 1:              |            |           |
    |       leave bits on right unchanged (1) | -1         | -1 -> -1  |
    |       set bits on right to 0:           | ~x | (x+1) | -1 -> 0   |
    +-----------------------------------------+------------+-----------+

    For cases where the all 0 or all 1 case results in an undesired value, you might need to either do a compare and mux, or extend the vector by 1 bit and use the msb for a mux. Hopefully some of these expressions are useful or educational.
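A few of these expressions can be tried in software before committing them to HDL. The sketch below uses Python with an assumed 8-bit `WIDTH`; the function names are invented for illustration. `smear_right_guarded` shows one possible reading of the "extend the vector by 1 bit" suggestion, where the borrow out of `x-1` lands in the extra msb and selects the mux. It is not necessarily what the author had in mind.

```python
WIDTH = 8                       # assumed vector width for this sketch
MASK = (1 << WIDTH) - 1


def isolate_rightmost_one(x: int) -> int:
    """x & -x: one-hot of the rightmost 1, or 0 if x is 0."""
    return x & -x & MASK


def smear_right(x: int) -> int:
    """x | (x-1): set every bit right of the rightmost 1 (all 1's if x is 0)."""
    return (x | (x - 1)) & MASK


def clear_rightmost_one(x: int) -> int:
    """x & (x-1): clear the rightmost 1."""
    return x & (x - 1) & MASK


def isolate_rightmost_zero(x: int) -> int:
    """~x & (x+1): one-hot of the rightmost 0, or 0 if x is all 1's."""
    return ~x & (x + 1) & MASK


def smear_right_guarded(x: int) -> int:
    """x | (x-1), but returning 0 when no 1 is found."""
    # Compute x-1 in WIDTH+1 bits; the extra msb is set only when x == 0.
    diff = (x - 1) & ((1 << (WIDTH + 1)) - 1)
    not_found = diff >> WIDTH
    return 0 if not_found else (x | diff) & MASK
```

The guarded version costs one extra bit on the carry chain plus a mux, which is usually cheaper than a separate compare against zero.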
  3. XADC conversion rate

    You want to use an MMCM or clocking IP core of some form in order to get a 104MHz clock that can be used to get 1000 kSPS (1 MSPS) vs 25/26ths that rate.
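The 25/26 ratio falls out of simple arithmetic. The sketch below assumes the ADC clock is the input clock divided by 4 and that one conversion takes 26 ADC clock cycles; both numbers are my assumptions and should be checked against the XADC user guide. The function name is made up for illustration.

```python
CYCLES_PER_CONVERSION = 26   # assumed ADC clock cycles per conversion
CLOCK_DIVIDER = 4            # assumed divider from input clock to ADC clock


def xadc_sample_rate(dclk_hz: float, divider: int = CLOCK_DIVIDER) -> float:
    """Samples per second for a given input clock under the assumptions above."""
    adcclk_hz = dclk_hz / divider
    return adcclk_hz / CYCLES_PER_CONVERSION


rate_104 = xadc_sample_rate(104e6)   # 104 MHz input clock
rate_100 = xadc_sample_rate(100e6)   # 100 MHz input clock
ratio = rate_100 / rate_104          # compare against 25/26
```

Under these assumptions a 104MHz clock gives an even 1 MSPS, while a 100MHz clock lands at 25/26ths of that.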
  4. Feedback on a register file design?

    You can also import just "+" and "-" from std_logic_signed as well as the conversion functions from std_logic_arith. This way you still are required to specify signed/unsigned for "<", "*", etc... Importing "-" from std_logic_signed will give you the unary "-", which can be used for logical induction in expressions like (-x) and (x).
  5. FIFO CDC and Gray codes

    You might be re-inventing the tree. j/k. FPGAs have async FIFOs as hard IP. They also have vendor supplied IP generators. For Xilinx, the "primitives guide" has more info. There is also coregen. Someone also mentioned there is a new way to create generic FIFOs that isn't the very-limited unimacros.

    A properly constrained design will ensure that the skew between bits in this bus will be less than one other-clock cycle. In newer Vivado, there is a constraint for this specifically. In older versions of Vivado and all of ISE, the bus was simply constrained to arrive at the other clock domain within one other-clock cycle. This is a little more strict, but is safe. The constraints guide has more info on what your options are based on tool version.

    But again, for a first design I would avoid manually creating a dual-clock FIFO unless it has some specific use that can't be met using pre-verified options.

    As @zygot mentions, there can be some unexpected latency differences for various flags. The Xilinx memory resources guide can give you some information for common use cases. For coregen'd FIFOs, the relevant user guide (product guide?) can also be useful.

    I also suggest creating overflow/underflow error flags and having some way to access them. These can be useful in simulation and can be useful to determine issues that occur in difficult to simulate cases.
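Since the thread title mentions Gray codes, here is a small software sketch of why they pair well with a bounded-skew constraint: only one bit changes per pointer increment, so a receiver sampling mid-transition sees either the old or the new value. The function names are hypothetical, and this models only the encoding, not the synchronizers.

```python
def bin_to_gray(n: int) -> int:
    """Convert a binary count to its reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Convert a reflected Gray code back to binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions that differ between a and b."""
    return bin(a ^ b).count("1")
```

For a 4-bit pointer, `bits_changed(bin_to_gray(i), bin_to_gray((i + 1) % 16))` is 1 for every `i`, including the wrap from 15 back to 0.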
  6. Feedback on a register file design?

    Probably not. You just need to be aware that the register outputs have some delay from the logic that is in the register file. Registering outputs is "generally good" design. However, it isn't always needed or possible. It is up to you to decide the logical impact in this case and then compare against any performance benefits.
  7. Feedback on a register file design?

    "Last signal assignment wins". I'm not sure if you are pointing out that I missed a word or if you are questioning the LRM.
  8. Feedback on a register file design?

    Y There are several common things that should result in this. Linked-lists are one example, but there are probably a dozen others. Of course that is off topic from a code review of a register file for a cpu that doesn't currently have a C/C++ compiler.
  9. Feedback on a register file design?

    Just realized I didn't respond to this yesterday. By infrastructure, I mean clocks and general resets that are used at startup or otherwise not used during normal operating conditions. The distributed memory approach was presented a few posts up. It doesn't make sense for your goal of an easy to read, general implementation. In many designs, it is preferred to have output registers. This does add the 1 register delay. In some cases the unregistered outputs are unavoidable. When possible, having output registers is nice because the longest path won't be half in one file and half in another. For re-use, remember that you can't control how other people use your module. "last signal wins" is a reliable behavior. That said, it can be abused. There are structured uses where it can be very useful. However, it creates a bottom-to-top priority structure. For this reason, it should only be used in a manner that is unlikely to confuse a reader.
  10. Feedback on a register file design?

    BRAM does, but then you can't read in the same cycle.

    A Xilinx targeted register file could use DMEM. This requires four copies of each register for the 4-6 read/cycle case, or two copies for the 1-3 read/cycle case. The DMEM have a 3-read, 1-write config. To get 4-6 ports means two copies. To get the two writes again means two copies and the addition of a small tag ram which can be implemented using registers. It is debatable if this is that much better as both should be small for modern FPGAs. It removes the input muxes for the priority logic as portA now only writes to two DMEM and portB only writes to the others. It also removes the output muxes as these are built into the DMEM. The clock to out is higher than registers, but I'm not sure if it is higher than register + LUT6.

    The complexity is that the OS either needs to clear the registers at start, or accept the bootup values could be random. Also, the bits per slice is lower, but given the lack of extra muxes this is probably not a concern. There are also benefits if the design can use either 32 registers or can use two sets of registers as the DMEM is a 32b config. This can be used for fast interrupt context swaps or for barrel processors.

    In terms of coherency, that is based on if the CPU can ensure it never has a write-write conflict and also can avoid read-before-write where write could now be from two ports.

    In one design, there was a custom written 32b adder that was instantiated in the same file as a 40+ bit adder that ran at the same clock. (Neither adder was in the top 100 nets and the design met timing.) One of these took hours to write, the other took seconds.

    I also notice some people add lots of pipeline stages for simple calculations. This can be fine, but each pipeline stage increases the chance that a future modification will have a pipeline error. Because this might only show up in rare cases, I take steps to ensure the pipeline naming scheme and intent is clear with respect to cycle vs sample delays. This is especially true when simplified assumptions about the pipeline are no longer true when the module is ported to a new application. This commentary is probably best suited for another thread as it does not relate to the topic of a CPU register file.
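The two-write arrangement from the DMEM discussion (one set of memories per write port plus a small tag ram) can be modeled behaviorally. This is only a sketch: the class and method names are hypothetical, read-port replication is left out, and a same-address write on both ports is treated as an error because the post leaves that case to the CPU.

```python
class TwoWriteRegFile:
    """Behavioral model: one bank per write port plus a tag per register."""

    def __init__(self, depth: int = 32):
        self.bank_a = [0] * depth   # written only by port A
        self.bank_b = [0] * depth   # written only by port B
        self.tag = [0] * depth      # 0: bank_a holds the newest value, 1: bank_b

    def clock(self, wr_a=None, wr_b=None):
        """Apply one cycle of writes; each write is (addr, data) or None."""
        if wr_a is not None and wr_b is not None and wr_a[0] == wr_b[0]:
            raise ValueError("write-write conflict on address %d" % wr_a[0])
        if wr_a is not None:
            addr, data = wr_a
            self.bank_a[addr] = data
            self.tag[addr] = 0
        if wr_b is not None:
            addr, data = wr_b
            self.bank_b[addr] = data
            self.tag[addr] = 1

    def read(self, addr: int) -> int:
        """Return the most recently written value for addr."""
        return self.bank_b[addr] if self.tag[addr] else self.bank_a[addr]
```

Each write port only ever touches its own bank, so neither bank needs a write mux; the tag decides which bank a read should trust.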
  11. Feedback on a register file design?

    I think in this case the priority logic is hard to remove in a way that isn't worse. Some (or all?) synthesis tools will ignore 'x' and '-' and instead replace them with '0'. This can add some extra logic to do something you specifically didn't care about. Also, I agree that small sandbox designs can be really fun and informative. IMO, devs tend to underestimate the FPGA in some ways which results in excessive pre-optimization.
  12. Feedback on a register file design?

    It looks fine. I would normally have a different port order and have each on a different line for easy copy/paste. I prefer to have interfaces -- readA, doutA, readB, doutB, etc... I also place interfaces in an output, input, config, infrastructure order. In industry the order of infrastructure, input, output, config is more common.

    For implementation, this probably infers registers. It is possible to construct this with distributed memory, although it is more complex. It isn't clear to me if the added complexity results in a better design at this size.

    The design actually will have priority logic for data_inB if the same address is used. This is because the last reached assignment will be used. Also, the logic has unregistered outputs. Normally this isn't something that is desired. This means critical timing paths could be due to logic in multiple modules. Not sure if there is anything that can be done here though.

    You can also add asserts for the "write to same address" case. This can be helpful in simulation.
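The priority behavior and the suggested check can be illustrated with a tiny software model. This is a sketch only: apart from the port B priority described above, the argument names are hypothetical, and sequential Python statements stand in for the "last reached assignment" rule inside a clocked process.

```python
def write_cycle(regs, we_a, addr_a, data_a, we_b, addr_b, data_b):
    """Model one clock edge of a two-write-port register file.

    Returns True when both ports wrote the same address, which is the
    case a simulation assert would flag.
    """
    conflict = bool(we_a and we_b and addr_a == addr_b)
    if we_a:
        regs[addr_a] = data_a
    if we_b:
        regs[addr_b] = data_b   # reached last, so port B wins on a conflict
    return conflict
```

A testbench can assert that the returned flag is never set, or at least log it, so an accidental double write does not pass silently.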
  13. Audio processing

    It isn't ideal. This is a common case of half-generalizing code. The declared type for coefficients is generic but then restricted to a specific width in practice. My guess is that the original design had known sizes for coefs/inputs but also made them generics for "good practice". Many practical designs are never tested outside of the original use case. That the filter clearly uses signed values as "std_logic_vector" is also concerning. Also, the values seem to be scaled in a sub-optimal manner -- none are close to max magnitude. Like someone assumed intermediate stages had to have the same bit width as the output and also didn't know that FPGAs have had 18 bit multiplies for over a decade. In both Verilog and VHDL the coefs should either be loaded from a file or be dynamically loaded after the fpga is programmed. (also, this is clearly a symmetric FIR filter where someone also had a copy-paste error to make it asymmetric and then also not use an optimal implementation. This is also probably not ideal for audio as attack/decay is possibly a concern over timeless spectrum.) --edit: This comment might have been harsh. It might be that the original poster is also the person I criticized here. In that case I would have been critical of the person asking for help from the community. That was not my intent as my post was intended to help that person in solving their issues.
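Two of the points above, coefficient scaling and exploiting symmetry, can be sketched in software. The code assumes an even number of taps and 18-bit signed coefficients; the function names are made up for illustration and this is not the filter from the thread.

```python
def fir_direct(coefs, window):
    """Plain FIR: one multiply per tap."""
    return sum(c * x for c, x in zip(coefs, window))


def fir_folded(half_coefs, window):
    """Symmetric FIR with an even tap count.

    Samples that share a coefficient are added first, so only half
    as many multiplies are needed.
    """
    n = len(window)
    return sum(c * (window[i] + window[n - 1 - i])
               for i, c in enumerate(half_coefs))


def quantize_coefs(coefs, bits=18):
    """Scale coefficients so the largest magnitude uses the full signed range."""
    full_scale = (1 << (bits - 1)) - 1
    scale = full_scale / max(abs(c) for c in coefs)
    return [round(c * scale) for c in coefs], scale
```

In hardware the pre-add grows the sample width by one bit, and the scale factor returned by `quantize_coefs` has to be removed again at the output.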
  14. Non-clocked synchronous circuits

    Using posedge on fabric logic can have a few issues. If the fabric-generated clock doesn't come directly from a register -- eg if you have c = ( a == b ) -- then you can generate glitches. It might be that as new values of a,b are being propagated to the logic the condition is met one or more times within a normal cycle. This can generate short pulses which might trigger some registers but not others. This is also true for async set/reset logic.

    When a fabric generated clock comes from a register or doesn't have glitches, the clock might be ok to use. There are still some issues. First, this design style is more prone to generating a larger number of clocks, which might exhaust the clock routing for a given clock region. Second, the clock will have routing delays that change from build to build as well as over temperature. This means the clock must be treated as asynchronous to other clocks in the design.

    These are not insurmountable issues -- you can create directed routing constraints (DiRt) to ensure the same routing is used each build. You can ensure safe clock-domain-crossing logic. However, this requires extra effort in design/sim/constraints. This is another issue -- that the fabric clocks appear easier to use. Add to this that they often work fine and they teach novices bad practices.

    The fabric generated clocks also can have additional jitter, duty-cycle distortion, etc... This generally isn't an issue as these clocks tend to be run at Fmax/10 or lower.

    For the original post, the synthesis tool generally is allowed to optimize the circuit. It is possible the tools will decide to share adder logic or other logic when it can detect mutual exclusion. The tools might opt to place the majority of the ALU into a DSP48 slice for example.
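The glitch on c = ( a == b ) can be illustrated with a crude software model. The sketch assumes b is stable, that the changing bits of a may arrive in any order, and that the comparator reacts to every intermediate value; real timing is far messier, and the function names are hypothetical.

```python
from itertools import permutations


def transient_values(old: int, new: int, width: int) -> set:
    """All values a bus can show while its changing bits arrive in any order."""
    changing = [i for i in range(width) if (old ^ new) >> i & 1]
    seen = set()
    for order in permutations(changing):
        value = old
        for bit in order[:-1]:      # after the final bit the bus equals `new`
            value ^= 1 << bit
            seen.add(value)
    return seen


def can_glitch(a_old: int, a_new: int, b: int, width: int) -> bool:
    """True if (a == b) can pulse high although it is low before and after."""
    if a_old == b or a_new == b:
        return False
    return b in transient_values(a_old, a_new, width)
```

With a going from 0b01 to 0b10, the bus can briefly show 0b00 or 0b11, so a compare against either value can pulse even though a never settles on it.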
  15. Difference between BRAM, DRAm and DMA

    BRAM is "block ram" and is a fast and small, internal memory that can be accessed each cycle. DRAM is an external ram that is large, but has some overhead issues and also sends data back over multiple cycles. DMA is a scheme where a CPU can request the memory controller to move data from DRAM to/from another device in a short command. eg, if you need to send 1kB of data to a network card, the CPU issues only a few commands vs manually reading/writing every byte.