About CurtP

  • Rank
  • Birthday 05/25/1986

Profile Information

  • Gender
  • Location
    Milwaukee, WI, USA
  1. Very good to know! I feel like there are a million little things I don't know about VHDL, and the worst part is that I don't know what I don't know, haha.
  2. tempResult is a 33-bit variable, used for convenience within the process. Its use is appropriately expanded upon elaboration and synthesis. It isn't used to register a value between cycles. The result signal becomes tempResult(31 downto 0) and the carry flag bit on the flag signal output becomes tempResult(32). Using a simple container module for clocking and I/O, I have synthesized and verified its operation on a Spartan7 board.
  3. I view signed and unsigned as just different ways to look at a bucket of bits. My current practice is to favor std_logic_vector for moving and storing data, and to convert to unsigned for arithmetic and logical operations. I try to minimize the use of casting or conversion to types that don't preserve 9-level logic, as this can hide problems in simulation that might pop up in implementation. So if I need to perform arithmetic on a set of std_logic_vectors, I convert them to unsigned, perform the operation, and then store the results as std_logic_vectors.

     Wherever possible, I try to design modules that output correct results without respect to sign, leaving it up to the user or higher-level functions to interpret the values as signed or unsigned. However, I'm aware that some operations -must- be sign-aware (e.g. multiplication requiring sign extension).

     For example, the key line of code for the "add" pathway in my ALU is designed to perform sign-agnostic addition and subtraction based on a set of parameters:

         tempResult := unsigned("0" & operandA) + unsigned("0" & opB_adjusted) + unsigned'("" & (invert_opB xor tempCarry));

     Where:

         tempResult   -- self-explanatory
         operandA     -- self-explanatory
         opB_adjusted -- operand B, either pre-inverted (logical NOT) or passed through unchanged, based on the value of invert_opB
         invert_opB   -- a std_logic parameter where '1' specifies to invert operand B and '0' specifies to pass it through
         tempCarry    -- a std_logic value representing the carry-in ANDed with the carry_enable parameter

     Given that:

       - The carry-in bit is represented as '1' = there was a carry or a borrow, and '0' = there was no carry or borrow, and
       - Two's complement negation is effectively one's complement negation + 1, and
       - A - B = A + -B

     the line of code above provides results, with or without carry, that will be correct regardless of interpretation as signed or unsigned.

     XOR-ing invert_opB with tempCarry means that if carry/borrow is enabled, the carry bit (if set) will effectively be added during addition with carry, or subtracted during subtraction with borrow (by essentially withholding the "+ 1" normally used in two's complement negation). If carry/borrow isn't enabled, tempCarry is guaranteed to be '0' by the earlier AND operation, and invert_opB will be '1' only if subtraction is being performed (completing the two's complement negation). I'm sure this has been implemented in a better way by someone else, but it's the solution I've chosen for now at least.
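The add pathway described in the post above can be wrapped in a small combinational entity to make the surrounding assumptions explicit. This is only a sketch: the entity name `alu_add_path`, the ports `carry_in`, `carry_enable`, and `carry_out`, and the fixed 32-bit width are assumptions made for illustration. Only the `tempResult` assignment and the names `operandA`, `opB_adjusted`, `invert_opB`, and `tempCarry` come from the post.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical wrapper around the "add" pathway; names not quoted in the
-- post (entity name, carry_in, carry_enable, carry_out) are assumptions.
entity alu_add_path is
  port (
    operandA     : in  std_logic_vector(31 downto 0);
    operandB     : in  std_logic_vector(31 downto 0);
    invert_opB   : in  std_logic;  -- '1' = invert B (subtract), '0' = pass through (add)
    carry_in     : in  std_logic;  -- '1' = a carry or borrow occurred previously
    carry_enable : in  std_logic;  -- '1' for with-carry/with-borrow operations
    result       : out std_logic_vector(31 downto 0);
    carry_out    : out std_logic
  );
end entity alu_add_path;

architecture behavioral of alu_add_path is
begin
  add_path : process(operandA, operandB, invert_opB, carry_in, carry_enable)
    variable opB_adjusted : std_logic_vector(31 downto 0);
    variable tempCarry    : std_logic;
    variable tempResult   : unsigned(32 downto 0);  -- 33 bits: carry & result
  begin
    -- One's complement of B for subtraction, unchanged for addition.
    if invert_opB = '1' then
      opB_adjusted := not operandB;
    else
      opB_adjusted := operandB;
    end if;

    -- Carry-in only participates when carry/borrow is enabled.
    tempCarry := carry_in and carry_enable;

    -- The line quoted in the post. The third term is a 1-bit unsigned:
    --   add: 0    adc: carry    sub: 1    sbb: 1 xor borrow
    tempResult := unsigned("0" & operandA)
                + unsigned("0" & opB_adjusted)
                + unsigned'("" & (invert_opB xor tempCarry));

    result    <= std_logic_vector(tempResult(31 downto 0));
    -- Raw adder carry. After a subtraction a raw '1' here means that no
    -- borrow occurred, so a '1' = borrow flag convention would need this
    -- bit inverted for subtract operations (not shown in the posts).
    carry_out <= tempResult(32);
  end process;
end architecture behavioral;
```

As a quick hand check, subtracting 3 from 5 with carry disabled computes 5 + (not 3) + 1, which wraps to 2 in the low 32 bits and sets bit 32 of `tempResult`.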
  4. I'm going to heed the advice of pretty much everyone on this thread and tackle some simpler, self-contained projects to better learn the craft, and the quirks. It is important to understand the basic platform you're working on before developing something highly advanced on it. As for the advanced features and optimizations you mentioned, you're probably right to say that implementing them will be difficult, time-consuming and possibly not even necessary. But those very aspects are essentially the reason I'm doing this. The genesis of my fascination with CPU design comes from when I was much younger and found myself poring over Intel architectural manuals and pretty much anything I could absorb about machine design. So these things like speculative and out-of-order execution, branch prediction, ILP, caching policies, and so forth, are core to my interest in this. I'd honestly rather try to implement them and fail, than sacrifice them and succeed. Though that doesn't preclude me from building a much simpler design -first- and then getting more advanced from there. Thanks again to everyone in this thread for the conversation and advice!
  5. Only experience can teach what this truly is like, but I can say that this is pretty much what I was expecting. The people who succeed the most often fail the most, because they actually tried. Success is primarily a function of recovery from failure. But at least I have the benefit of being on my own timeline and own (non-)budget. Just me, my free time, an Arty S7, and my sanity (for now). That said, the prevailing advice seems to be to build smaller projects first. I initially had figured "well, the smaller functional units -are- smaller projects". But I can see why it would be advantageous to do something more fundamental first, since it'll give me a chance to learn more about completing a whole project, the subtleties of FPGA design, and working with I/O. I'll think of something neat to do (after all, I've got a bunch of PMODs sitting around waiting to be toyed with). If nothing else, I can build myself a debug framework over UART. Totally agree about the importance of small victories. Sometimes all it takes to keep your motivation going is to see your design blink some LEDs just like it was supposed to. Thanks for the words of advice!
  6. I get where you're coming from. If one isn't careful, they can extinguish their own enthusiasm for something by taking on way too big of a task, way too early. Another poster in this thread has strongly suggested that I build the peripherals for my CPU before building my CPU. This is probably the course of action I'll ultimately take (maybe building small parts of the CPU along the way). This approach could kill two birds with one stone: setting me up with a proper debug environment, and giving me some simpler projects to work on first.
  7. Thanks for getting back to me! Are there any specific pros/cons to registering the outputs of the register file that I should be aware of? This design is specifically targeted for use in a bespoke CPU design, not as a general purpose module to be re-purposed by others. So in the context of my own design, what should I worry about? All module outputs are sampled at the rising edge of the clock. Would leaving the outputs unregistered put my design at risk of sampling spurious/undefined values from the register file based on factors like clock skew or others? And thanks for the clarification on the 'last signal assignment wins' behavior. There are some cases where exploiting this trait is very useful, though you are correct that it may affect code clarity, especially for those not already well-acquainted with VHDL.
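For comparison with the combinational read ports discussed above, a registered-read variant might look like the sketch below. It is not the register file from this thread: the entity name `regfile_sync_read`, the single read and write port, and the write-first bypass are assumptions chosen to keep the example short.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical single-port-pair register file with a registered output.
entity regfile_sync_read is
  port (
    clk         : in  std_logic;
    writeEnable : in  std_logic;
    writeSel    : in  std_logic_vector(3 downto 0);
    readSel     : in  std_logic_vector(3 downto 0);
    data_in     : in  std_logic_vector(31 downto 0);
    data_out    : out std_logic_vector(31 downto 0)
  );
end entity regfile_sync_read;

architecture behavioral of regfile_sync_read is
  type regArray is array (0 to 15) of std_logic_vector(31 downto 0);
  signal registers : regArray := (others => (others => '0'));
begin
  main : process(clk)
  begin
    if rising_edge(clk) then
      if writeEnable = '1' then
        registers(to_integer(unsigned(writeSel))) <= data_in;
      end if;

      -- The read address is sampled on the clock edge, so data_out changes
      -- only after that edge. The bypass forwards data_in when the same
      -- register is written and read in one cycle; without it, data_out
      -- would show the old register contents for that read.
      if writeEnable = '1' and writeSel = readSel then
        data_out <= data_in;
      else
        data_out <= registers(to_integer(unsigned(readSel)));
      end if;
    end if;
  end process;
end architecture behavioral;
```

The trade-off this illustrates: a registered output cuts the combinational path from the register array to the consumer, at the cost of presenting the read address one clock edge earlier. A combinational read is still safe to sample on the rising edge in a synchronous design, provided static timing analysis shows the path meets setup and hold.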
  8. This gives me a few things to think about. One of the most difficult tasks is discerning the base level functionality you have to provide for an OS and a software toolchain to function. For example, I do want to implement an MMU with some manner of paging mechanism, as well as user/supervisor execution mode (which will of course require the paging mechanism to check page descriptors for security-related attributes).

     As for unaligned accesses, the whole topic sounds messy and like it'd add a lot of complexity (and latency). I would like to avoid implementing it if possible. If there is a strong case for supporting unaligned accesses, what is it? This is a question I was pondering just the other day.

     Looking at your questions one by one:

     1. The j?? instructions are indeed the conditional jump instructions. However, the ones I have planned for may not satisfy all of the comparisons you mentioned a compiler will want (at least not in a single conditional jump instruction). I have planned for the true/false versions testing each flag (flags are carry, zero, negative, overflow; so the conditional jump instructions are jc/jnc, jz/jnz, jn/jnn, jo/jno). I do not currently have a jsr/call type instruction planned. Correct me if I'm wrong here, but such an instruction struck me as unnecessarily CISC-y. Can't a compiler (or assembly programmer) just prepare the registers and stack according to their chosen calling convention, then execute a jump? I felt that it was unnecessary for my CPU to have to pipeline a series of operations just to set up a subroutine call for the programmer, especially considering that there are numerous calling conventions to choose from, with new ones being added periodically.

     2. Multiply and divide are tentatively (but not firmly) planned. See that empty light blue row in my opcode map? That's where they'd go. Of course, implementing these has its own set of challenges. Not only are these complex, multi-cycle operations, but you also have to make arrangements to take operands of machine word size, and write back to a register pair rather than a single register (though accommodating this was one of my motivations for dual write ports on the register file). I do like the approach you mentioned regarding how to handle the signed/unsigned cases.

     3. My 'nop' instruction is located right next to xchg at opcode 11010. I had tentatively planned to implement it in the old Intel convention of having it essentially encode to "xchg r0,r0". But I have also given thought to how I might want to implement various kinds of explicit and implicit no-ops of varying types (cycle count, serializing or non-serializing, etc.). Do you have any specific recommendations in this regard? Sticking nop next to xchg in the opcode table was sort of my default answer.

     4. My design is set up for 32-bit machine words operating on byte-addressable memory, but I hadn't planned for specifically loading only a byte out of memory. There are ways that I could go about implementing this within the structure of the current instruction format, but it will be more of a challenge to implement the halfword case than the byte case. If both are necessary though, I'll figure it out.

     5. I'm a little confused by the challenge you're describing here. As you mentioned, loading a value into a register is easily covered. But doesn't an unconditional jump instruction satisfy the requirement of loading a value to the program counter? I mean, effectively the unconditional jump is just a load to PC but with privilege checks.

     Also, it's very interesting you mentioned that you didn't have to implement push/pop. I assumed that a compiler would require these, at least for subroutine calls. You're saying that a compiler will manage a stack using only regular loads and stores? I'm inclined to implement push/pop if practical, because it's a very useful convenience at the assembler level if nothing else.

     As for int/iret, how did you manage to get a working design without these? I assumed that OSes would require the presence of software interrupt instructions, in order to facilitate user-mode code jumping to system subroutines. Absent this, what would an OS have the programmer do? It seems like the user-mode code would have to place parameters somewhere and then deliberately trigger an exception so that the OS could enter a service routine, read the parameters, and do its thing before returning.

     Thanks again for helping me answer some of the tougher questions early on, rather than later when making changes gets exponentially harder.

     - Curt
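On the point in item 1 that single-flag jumps may not cover every comparison a compiler wants: after a compare (a subtraction that only updates flags), signed and unsigned orderings are conventionally derived from combinations of the four flags. The package below is a sketch of those combinations, not part of the design in this thread. The package name, the record type `flags_t`, and the function names are all hypothetical, and it assumes the carry flag reads '1' when a borrow occurred, matching the carry convention stated in the ALU post.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical helper package: condition evaluation after "cmp a, b".
package cond_sketch_pkg is
  type flags_t is record
    carry    : std_logic;  -- assumed: '1' = borrow occurred in a - b
    zero     : std_logic;
    negative : std_logic;
    overflow : std_logic;
  end record;

  function is_signed_lt   (f : flags_t) return std_logic;  -- a <  b, signed
  function is_signed_le   (f : flags_t) return std_logic;  -- a <= b, signed
  function is_unsigned_lt (f : flags_t) return std_logic;  -- a <  b, unsigned
  function is_unsigned_le (f : flags_t) return std_logic;  -- a <= b, unsigned
end package cond_sketch_pkg;

package body cond_sketch_pkg is
  -- Signed less-than: the sign of the difference, corrected for overflow.
  function is_signed_lt (f : flags_t) return std_logic is
  begin
    return f.negative xor f.overflow;
  end function;

  function is_signed_le (f : flags_t) return std_logic is
  begin
    return f.zero or (f.negative xor f.overflow);
  end function;

  -- Unsigned less-than: the subtraction needed a borrow.
  function is_unsigned_lt (f : flags_t) return std_logic is
  begin
    return f.carry;
  end function;

  function is_unsigned_le (f : flags_t) return std_logic is
  begin
    return f.carry or f.zero;
  end function;
end package body cond_sketch_pkg;
```

With only jc/jnc, jz/jnz, jn/jnn, and jo/jno available, the unsigned less-than case maps onto a single jump, while the signed cases and the less-or-equal cases would each take a short sequence of jumps.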
  9. Hi Dan! Glad to meet you. I had already heard about the ZipCPU prior -- it's great to find the maker right here on the forum! Thanks for your insights and resources. It helps to hear from someone who has already gone through the CPU development process. You've certainly given me a lot to pore over!

     Especially helpful at this stage is your information about pipelining strategies. From my perspective, one of the most intellectually challenging tasks of the entire design is figuring out how to get the right data at the right stages of the pipeline at the right time, while minimizing stalls, maximizing utilization of functional units, and detecting data dependencies to avoid incorrect results. I have made this task (somewhat) easier by using a load/store architecture. All arithmetic/logic functions have register destinations, and register or immediate operands.

     I would like to implement instruction level parallelism, which of course means dependencies must be carefully tracked. I have not yet devised the method I will use to allow ILP while preserving program order data integrity. I have spent some time reading about ROBs, register renaming, and reservation station methodologies, but still have a lot more to learn. I imagine that I will first implement a simple, in-order pipeline that stalls completely while awaiting data, and once I have that working well, I can add complexity from there.

     At this stage, I have:

     - Designed the initial version of the ISA, including the instruction format, instruction set, and (most of) the opcode/subfunction mappings. There is room to expand it, but currently all planned instructions can be mapped meaningfully to 4- or 8-byte instruction words. The second 4-byte block is only used when a full 32-bit immediate is specified. The initial 4-byte instruction word sets aside a byte for a small immediate value, which can be used for whatever the instruction might use one for (byte-sized ALU operand, loading the low byte of a register, short relative jump, software interrupt vector, etc.). Therefore, using good coding technique, most instructions can fit into the machine word size of 32 bits. The format is split into simple bit fields, allowing most decoding to be done with relatively simple combinational logic.

     - Designed what will likely be at least very close to the final version of the ALU. Using only 4 fundamental functional pathways (add, and, or, xor), it is able to implement 13 planned arithmetic/logical functions (and, test, xor, not, or, sub, cmp, neg, inc, sbb, adc, add, dec).

     - Designed the register file you see above. The reason for having two write ports and four read ports is the instruction level parallelism I plan to implement. I imagine this design will probably change numerous times throughout the course of the project. If nothing else, I will have to add more physical registers, as I plan to implement ARM-style register context switching for interrupt routine servicing.

     - Designed the flags register (fairly straightforward).

     - Created a very early design for a (partial) instruction decoder. Currently, it can check validity and decode for any of the ALU-related instructions, outputting valid instructions in an internal format that would be useful for scheduling.

     In short, most of the work is still ahead of me. Especially when you count the validation / "FPGA Hell" phase. But hey, it's about the journey, right?

     I've included some of my early design notes, to give a little more clarity to the direction I've been heading. I'd be curious to hear your thoughts. Thanks again for your help and resources!

     - Curt
  10. Thank you for the kind words and encouragement! The first big project I'm working on is... (wait for it)... a general purpose CPU. Cliche, yes, I know. But I see it as a labor of love, an opportunity to learn many principles of machine design at once, and just something I've been really curious to do for a long time. I'm sure the world doesn't need yet another new ISA, but it's still fun to create one. And I'll be open-sourcing the final product, for whoever might find it useful. One of my design goals (or I suppose you could call it a meta-design goal) is to prioritize the use of easy-to-read, behavioral VHDL so that people who read through the source can intuitively learn the ins and outs of how a CPU really works. I've looked at a lot of other CPU designs and found that it's often difficult to quickly discern what a portion of the code is actually doing and why, because of heavy use of structural and combinational syntax, which, while generally more efficient, isn't as intuitive for a human to parse beyond a certain level of complexity (at least not for me). So will it be the best/fastest/most technically competent CPU? No. But you wouldn't get that without going the ASIC route anyway (and having a lot more resources and expertise than I do). But I do think I can make a CPU that also serves as a learning aid for others curious about the inner workings of CPUs. Anyhow, I digress. Thanks again! - Curt
  11. I have found the schematic views of elaboration/synthesis/implementation to be very helpful for improving my understanding of VHDL and my target FPGA. It's one thing to see simulation results on a scope, but it's another to see the actual hardware that the tools generate from your VHDL. I try to ask myself as I go "what hardware will this create?", and am picking up best practices bit by bit. For every functional unit of a design that I create, I make sure to walk through the elaborated schematic and understand what each part is doing. I find that making small changes to the VHDL and seeing how it impacts the elaborated design is very useful.
  12. Thanks for the advice! My journey into hardware design has been like drinking from a firehose of information so far, but that's part of the fun. You're correct that being accustomed to software development paradigms can be an impediment to learning hardware design. I think a lot of people see the superficial similarity between HDLs and C-style languages and assume that the process will be similar. It has been an interesting exercise to rework my thinking around describing the behavior of a circuit, rather than listing a series of sequential operations to be performed. I think one advantage I have going in is that digital logic and integrated circuits have been a fascination of mine since I was a kid. Long before I ever even knew about HDLs, I was poring over Intel technical docs, and reading about the theory of machine design. Of course, once you dive into actually designing a circuit, you quickly realize how much you -don't- know. But again, all part of the fun! I have spent some time reading through Xilinx's guidelines for synthesis, but I haven't invested in any actual books on hardware design. Are there any particular ones that you recommend? Thanks again, - Curt
  13. Thanks for the explanation!
  14. Thanks for all the info! My background is software, primarily C/C++, so I am still learning the stylistic conventions of VHDL and HDLs in general as I go. It helps to have other engineers point me in the right direction. A few questions, if you don't mind: When you say "infrastructure", are you referring to things like clock, reset, and other signals that propagate broadly through the design? Regarding distributed memory versus registers -- you're correct that this design infers registers upon elaboration. What kinds of tradeoffs are involved in choosing which design to pursue? This register file will be the main GP registers for a superscalar design, so I want to be able to write 2 registers and read 4 registers per clock (when made possible by the pipeline), and I would like to be able to read back a written register the cycle immediately after it was written, if possible. Of course, none of these things should come at the cost of potential data corruption. Regarding unregistered outputs -- is this because the outputs aren't in the clocked process? I had used this approach prior but noticed that this caused an extra cycle to elapse between when a register was written and when that same register's new value could be read back. Regarding the priority for data_inB if the same address is used -- is this behavior reliable under FPGA implementation? I had assumed that it would cause some sort of contention that would lead to undefined values. I've often heard the phrase "last signal assignment wins", but wasn't sure if that was something that merely happens in simulation, or if it was a reliable implemented behavior. Regarding the asserts -- thank you for the suggestion. Someone else also recommended this, and I will be implementing it for simulation. Thanks again for all the help! - Curt
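The "last signal assignment wins" behavior asked about above is defined by the language for assignments made within one execution of a process, so it can be checked with a small self-contained simulation. The test below is a sketch with hypothetical names (`last_assignment_demo`, `q`); it stands in for the two write ports by assigning the same signal twice in one clocked process.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical self-checking demo; no ports, simulation only.
entity last_assignment_demo is
end entity last_assignment_demo;

architecture sim of last_assignment_demo is
  signal clk : std_logic := '0';
  signal q   : std_logic_vector(7 downto 0) := (others => '0');
begin
  -- A few clock cycles, then stop so the simulation ends on its own.
  clock_gen : process
  begin
    for i in 1 to 4 loop
      clk <= '0'; wait for 5 ns;
      clk <= '1'; wait for 5 ns;
    end loop;
    wait;
  end process;

  writer : process(clk)
  begin
    if rising_edge(clk) then
      q <= x"AA";  -- stands in for the "port A" write
      q <= x"BB";  -- later assignment in the same process: overrides the first
    end if;
  end process;

  check : process
  begin
    wait until rising_edge(clk);
    wait for 1 ns;  -- let the scheduled update take effect
    assert q = x"BB"
      report "expected the later assignment to take effect"
      severity failure;
    report "q holds the value from the later assignment";
    wait;
  end process;
end architecture sim;
```

In the register file, the same rule means that when both ports target one register in the same cycle, the `data_inB` assignment (written later in the process) is the one that takes effect. Synthesis is expected to preserve this by building priority logic rather than contending drivers, since both assignments come from a single process and therefore a single driver.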
  15. Hey everyone, I've done the initial design of a register file (16x 32-bit registers, two write ports, four read ports) in VHDL as part of a larger project, but seeing as I am a relative newcomer to HDLs, I was hoping to get some feedback on my design, any errors I may have made, or any improvements I might want to make. Here is the VHDL:

         -- Register file
         -- Two write ports, four read ports.
         -- For performance reasons, this register file does not check for the same
         -- register being written on both write ports in the same cycle. CPU control
         -- circuitry is responsible for preventing this condition from happening.

         library IEEE;
         use IEEE.std_logic_1164.all;
         use IEEE.numeric_std.all;

         use work.cpu1_globals_1.all;
         use work.func_pkg.all;

         entity registerFile is
             port
             (
                 clk : in std_logic;
                 rst : in std_logic;
                 writeEnableA : in std_logic;
                 writeEnableB : in std_logic;
                 readSelA, readSelB, readSelC, readSelD, writeSelA, writeSelB : in std_logic_vector(3 downto 0);
                 data_inA, data_inB : in std_logic_vector(DATA_WIDTH - 1 downto 0);
                 data_outA, data_outB, data_outC, data_outD : out std_logic_vector(DATA_WIDTH - 1 downto 0)
             );
         end registerFile;

         architecture behavioral of registerFile is
             type regArray is array (0 to 15) of std_logic_vector(DATA_WIDTH - 1 downto 0);
             signal registers : regArray := (others => (others => '0'));
         begin
             data_outA <= registers(to_integer(unsigned(readSelA)));
             data_outB <= registers(to_integer(unsigned(readSelB)));
             data_outC <= registers(to_integer(unsigned(readSelC)));
             data_outD <= registers(to_integer(unsigned(readSelD)));

             registerFile_main : process(clk)
             begin
                 if(rising_edge(clk)) then
                     if(rst = '1') then
                         registers <= (others => (others => '0'));
                     else
                         if(writeEnableA = '1') then
                             registers(to_integer(unsigned(writeSelA))) <= data_inA;
                         end if;
                         if(writeEnableB = '1') then
                             registers(to_integer(unsigned(writeSelB))) <= data_inB;
                         end if;
                     end if;
                 end if;
             end process;
         end behavioral;

      This design is intended for use on FPGAs, hence the use of default values for the registers. I appreciate any feedback you might have! Thanks, - Curt
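Since the register file deliberately leaves the "both ports write the same register" case to the CPU control logic, a simulation-only checker can flag that condition without adding hardware to the design. The entity below is a sketch: the name `regfile_write_checker` is hypothetical, and it simply observes the same write-side signals that `registerFile` receives.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical simulation-only monitor; connect it in the testbench to the
-- same clk, write enable, and write select signals as registerFile.
entity regfile_write_checker is
  port (
    clk          : in std_logic;
    writeEnableA : in std_logic;
    writeEnableB : in std_logic;
    writeSelA    : in std_logic_vector(3 downto 0);
    writeSelB    : in std_logic_vector(3 downto 0)
  );
end entity regfile_write_checker;

architecture sim of regfile_write_checker is
begin
  check_collision : process(clk)
  begin
    if rising_edge(clk) then
      -- Fires when both write ports are enabled and select the same register.
      assert not (writeEnableA = '1' and writeEnableB = '1'
                  and writeSelA = writeSelB)
        report "registerFile: both write ports target the same register in one cycle"
        severity error;
    end if;
  end process;
end architecture sim;
```

Keeping the check in a separate entity, instantiated only from a testbench, leaves the synthesizable register file unchanged.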