
Feedback on a register file design?


CurtP

Question

Hey everyone,

I've done the initial design of a register file (16x 32-bit registers, two write ports, four read ports) in VHDL as part of a larger project, but seeing as I am a relative newcomer to HDLs, I was hoping to get some feedback on my design, any errors I may have made, or any improvements I might want to make.

Here is the VHDL:

-- Register file

-- Two write ports, four read ports.
-- For performance reasons, this register file does not check for the same 
-- register being written on both write ports in the same cycle. CPU control
-- circuitry is responsible for preventing this condition from happening.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

use work.cpu1_globals_1.all;
use work.func_pkg.all;

entity registerFile is
    port
    (
        clk : in std_logic;
        rst : in std_logic;
        writeEnableA : in std_logic;
        writeEnableB : in std_logic;
        readSelA, readSelB, readSelC, readSelD, writeSelA, writeSelB : in std_logic_vector(3 downto 0);
        data_inA, data_inB : in std_logic_vector(DATA_WIDTH - 1 downto 0);
        data_outA, data_outB, data_outC, data_outD : out std_logic_vector(DATA_WIDTH - 1 downto 0)
    );
end registerFile;

architecture behavioral of registerFile is

    type regArray is array (0 to 15) of std_logic_vector(DATA_WIDTH - 1 downto 0);
    signal registers : regArray := (others => (others => '0'));

begin

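    -- Combinational (unregistered) read ports: a read returns the value
    -- committed on the previous rising clock edge.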
    data_outA <= registers(to_integer(unsigned(readSelA)));
    data_outB <= registers(to_integer(unsigned(readSelB)));
    data_outC <= registers(to_integer(unsigned(readSelC)));
    data_outD <= registers(to_integer(unsigned(readSelD)));

    registerFile_main : process(clk)
    
    begin
    
        if(rising_edge(clk)) then
        
            if(rst = '1') then
        	
                registers <= (others => (others => '0'));
                
            else
                
                if(writeEnableA = '1') then
                    registers(to_integer(unsigned(writeSelA))) <= data_inA;
                end if;
            	
                if(writeEnableB = '1') then
                    registers(to_integer(unsigned(writeSelB))) <= data_inB;
                end if;
                
            end if;
            
        end if;
        
    end process;
    
end behavioral;

This design is intended for use on FPGAs, hence the use of default values for the registers.

I appreciate any feedback you might have!

Thanks,

 - Curt


Recommended Posts

29 minutes ago, D@n said:

assuming the block RAMs even supported it.

BRAM does, but then you can't read in the same cycle.  A Xilinx-targeted register file could use DMEM instead.  With two write ports, this requires four copies of each register for the 4-6 reads/cycle case, or two copies for the 1-3 reads/cycle case.  The DMEM primitives have a 3-read, 1-write configuration, so getting 4-6 read ports means two copies per write port.  Getting the two writes means giving each write port its own bank, plus a small tag RAM (implementable in registers) that records which bank holds the newest value of each register.

It is debatable whether this is much better, as both approaches should be small for modern FPGAs.  It removes the input muxes for the priority logic, as port A now writes only to its own DMEMs and port B only to the others.  It also removes the output muxes, as these are built into the DMEM.  The clock-to-out is higher than for registers, but I'm not sure if it is higher than register + LUT6.  The complication is that the OS either needs to clear the registers at startup, or accept that the power-up values could be random.  Also, the bits per slice is lower, but given the lack of extra muxes this is probably not a concern.

There are also benefits if the design can use either 32 registers, or two sets of 16, since the DMEM is a 32-deep configuration.  This can be used for fast interrupt context swaps or for barrel processors.

In terms of coherency, that depends on whether the CPU can ensure it never has a write-write conflict, and can also avoid read-before-write hazards now that writes can come from two ports.
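
For concreteness, here is a rough behavioral sketch of the two-bank-plus-tag-RAM arrangement (signal names chosen to match Curt's module; reset is omitted because LUT RAM can't be bulk-cleared, which is the start-up issue mentioned above):

-- Sketch only: 2 writes / 4 reads via two single-write banks plus a 1-bit
-- "live value table" (tag RAM) recording which bank holds the newest copy of
-- each register. Port A writes only bank A, port B only bank B, so no
-- write-priority mux is needed. As in Curt's version, simultaneous writes to
-- the same register are assumed to be prevented by the control logic.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity registerFile_lvt is
    generic ( DATA_WIDTH : positive := 32 );
    port
    (
        clk : in std_logic;
        writeEnableA, writeEnableB : in std_logic;
        readSelA, readSelB, readSelC, readSelD, writeSelA, writeSelB : in std_logic_vector(3 downto 0);
        data_inA, data_inB : in std_logic_vector(DATA_WIDTH - 1 downto 0);
        data_outA, data_outB, data_outC, data_outD : out std_logic_vector(DATA_WIDTH - 1 downto 0)
    );
end registerFile_lvt;

architecture rtl of registerFile_lvt is

    type regArray is array (0 to 15) of std_logic_vector(DATA_WIDTH - 1 downto 0);
    signal bankA, bankB : regArray := (others => (others => '0'));
    -- '0' = bank A holds the live value, '1' = bank B does
    signal lvt : std_logic_vector(0 to 15) := (others => '0');

begin

    writes : process(clk)
    begin
        if rising_edge(clk) then
            if writeEnableA = '1' then
                bankA(to_integer(unsigned(writeSelA))) <= data_inA;
                lvt(to_integer(unsigned(writeSelA)))   <= '0';
            end if;
            if writeEnableB = '1' then
                bankB(to_integer(unsigned(writeSelB))) <= data_inB;
                lvt(to_integer(unsigned(writeSelB)))   <= '1';
            end if;
        end if;
    end process;

    -- Each read port reads both banks; the tag bit selects the newer copy.
    data_outA <= bankB(to_integer(unsigned(readSelA))) when lvt(to_integer(unsigned(readSelA))) = '1'
                 else bankA(to_integer(unsigned(readSelA)));
    data_outB <= bankB(to_integer(unsigned(readSelB))) when lvt(to_integer(unsigned(readSelB))) = '1'
                 else bankA(to_integer(unsigned(readSelB)));
    data_outC <= bankB(to_integer(unsigned(readSelC))) when lvt(to_integer(unsigned(readSelC))) = '1'
                 else bankA(to_integer(unsigned(readSelC)));
    data_outD <= bankB(to_integer(unsigned(readSelD))) when lvt(to_integer(unsigned(readSelD))) = '1'
                 else bankA(to_integer(unsigned(readSelD)));

end rtl;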

3 minutes ago, D@n said:

Fascinating comment.

Care to elaborate?

In one design, there was a custom-written 32-bit adder instantiated in the same file as a 40+ bit adder running at the same clock (neither adder was in the top 100 nets, and the design met timing).  One of these took hours to write, the other took seconds.  I also notice some people add lots of pipeline stages for simple calculations.  This can be fine, but each pipeline stage increases the chance that a future modification will introduce a pipeline error.  Because this might only show up in rare cases, I take steps to ensure the pipeline naming scheme and intent are clear with respect to cycle vs. sample delays.  This is especially true when simplifying assumptions about the pipeline no longer hold once the module is ported to a new application.  This commentary is probably best suited for another thread, as it does not relate to the topic of a CPU register file.


All of the comments have been interesting and intriguing -- for those with a good base of experience. I've been avoiding commenting on specific concepts, as details like this might be more confusing noise than helpful to the (even well-prepared) beginner. There's plenty of time to hone one's expertise for those venturing into more complicated and sophisticated projects with high reliability requirements. As one becomes more knowledgeable about failure mechanisms, even seasoned professionals can lose sleep... I say this from experience. It seems to me... and I admit that my mind works in unusual ways... that newbies have enough to focus on just to get reasonably competent (there's the concepts, the tools, the device architecture, etc., etc.) without having their focus shifted to more esoteric concepts. Having a glimpse of the complexities is not bad... but...

It's easy to attempt to convey a valid idea and end up just peddling confusion to the mind of the intended audience. Perhaps I'm being a bit too self-reflective this week.


15 hours ago, D@n said:

@zygot,

Oh, I'm around, but you guys have given me a lot of reading material to go through.  Well, that and I haven't been waiting on the synthesizer as much so I haven't been hitting reload on the Digilent forum page as often.

@CurtP,

What @zygot is trying to point out is that I've built my own CPU, the ZipCPU, as a similar labor of love.  It's not a Forth machine, but a basic sixteen 32-bit register design, with two register sets of that size.  It's also small enough to fit on Digilent's Cmod S6 while running a small O/S.  I'd love to offer you an example register file module, however my own register "file" never managed to get separated out from the rest of the CPU the way you are trying to do.  I struggled to find a clean way to do so, and so didn't.  If you are curious, you can search the main CPU source and look for "regset" to see how I handled it.

In particular, the ZipCPU register file accepts only one write to the register file per clock--not two.  I was convinced that two writes per clock would leave the CPU vulnerable to coherency problems--assuming the block RAMs even supported it.  This register set supports one write and three reads per clock.  Two of those reads are for an instruction; the third supports the debug port.  (You are thinking about how to debug your CPU already, aren't you?)

I've also written several instructional blog posts on this and similar topics.  These cover my view that you should start building a CPU by first building its peripherals, then a debug port to access and test the peripherals, before starting on the CPU itself.  Further blog articles discuss how to build a debugging port into the CPU, and how to debug the CPU from that port when using a simulator as well as when online.  I've discussed pipelining strategies, and presented how the ZipCPU pipeline strategy works.  More recently, I've been working with formal methods.  I've therefore presented a demonstration of how formal methods can be used to verify that a bus component works as designed, and then offered a simple prefetch as an example.  I'm hoping to post again regarding how to build an instruction prefetch and cache, as well as how to formally verify that such a module works, but I haven't managed to clean up my code enough to present it, in spite of having presented why such a proof would be so valuable.

While I didn't use formal methods to build the CPU initially, I've been finding more bugs using formal methods than I had otherwise, so you might say that I've become a believer.

As a result, I'm right now in the process of formally verifying as many of the CPU's modules as I can.  I've managed to formally verify three separate prefetch modules (including the one with a cache), the memory access components, and the instruction decoder.  I've also managed to formally verify several CPU-related peripheral components, such as the (yet to be integrated) MMU, counters, timers, an interrupt controller, bus arbiters, bus delay components and more.  This has been my current focus with the CPU.  Once I finish it, I'm hoping to write about how to use the ZipCPU in case others are interested (and I know they are).

I know @zygot dislikes my blog, but you might find a lot of useful information available there to describe the things you've discussed above.

Dan

 

Hi Dan! Glad to meet you. I had already heard of the ZipCPU -- it's great to find its maker right here on the forum!

Thanks for your insights and resources. It helps to hear from someone who has already gone through the CPU development process. You've certainly given me a lot to pore over! :)

Especially helpful at this stage is your information about pipelining strategies. From my perspective, one of the most intellectually challenging tasks of the entire design is figuring out how to get the right data at the right stages of the pipeline at the right time, while minimizing stalls, maximizing utilization of functional units, and detecting data dependencies to avoid incorrect results. I have made this task (somewhat) easier by using a load/store architecture. All arithmetic/logic functions have register destinations, and register or immediate operands.

I would like to implement instruction-level parallelism, which of course means dependencies must be carefully tracked. I have not yet devised the method I will use to allow ILP while preserving program-order data integrity. I have spent some time reading about ROBs, register renaming, and reservation station methodologies, but still have a lot more to learn. I imagine that I will first implement a simple, in-order pipeline that stalls completely while awaiting data, and once I have that working well, I can add complexity from there.
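
Something like the following is the kind of stall check I have in mind for that first in-order version (just a sketch; the stage signal names are placeholders, not from my actual design):

library IEEE;
use IEEE.std_logic_1164.all;

entity hazard_unit is
    port
    (
        ex_writes_reg    : in  std_logic;                     -- instruction in EX will write a register
        ex_dest          : in  std_logic_vector(3 downto 0);  -- ...this one
        id_srcA, id_srcB : in  std_logic_vector(3 downto 0);  -- registers read by the instruction in decode
        stall            : out std_logic                      -- freeze fetch/decode, bubble EX
    );
end hazard_unit;

architecture rtl of hazard_unit is
begin
    -- Stall whenever the decode-stage instruction wants a register whose new
    -- value is still in flight. Forwarding paths would later relax this.
    stall <= '1' when ex_writes_reg = '1' and (ex_dest = id_srcA or ex_dest = id_srcB)
             else '0';
end rtl;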

At this stage, I have:

 - Designed the initial version of the ISA, including the instruction format, instruction set, and (most of) the opcode/subfunction mappings. There is room to expand it, but currently all planned instructions can be mapped meaningfully to 4- or 8-byte instruction words. The second 4-byte block is only used when a full 32-bit immediate is specified. The initial 4-byte instruction word sets aside a byte for a small immediate value, which can be used for whatever purpose the instruction needs (byte-sized ALU operand, loading the low byte of a register, short relative jump, software interrupt vector, etc.). Therefore, using good coding technique, most instructions can fit into the machine word size of 32 bits. The format is split into simple bit fields, allowing most decoding to be done with relatively simple combinational logic.

 - Designed what will likely be at least very close to the final version of the ALU. Using only 4 fundamental functional pathways (add, and, or, xor), it is able to implement the 13 planned arithmetic/logical functions (and, test, xor, not, or, sub, cmp, neg, inc, sbb, adc, add, dec); a sketch of how the adder pathway gets shared appears after this list.

- Designed the register file you see above. The reason behind having two write ports and four read ports is the instruction-level parallelism I plan to implement. I imagine this design will probably change numerous times throughout the course of the project. If nothing else, I will have to add more physical registers, as I plan to implement ARM-style register context switching for interrupt routine servicing.

 - Designed the flags register (fairly straightforward).

 - Created a very early design for a (partial) instruction decoder. Currently, it can check validity and decode any of the ALU-related instructions, outputting valid instructions in an internal format that would be useful for scheduling.
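
To illustrate the shared-pathway idea from the ALU item above (the opcode encodings here are placeholders, not my actual opcode map): the subtract-family operations reuse the one adder by inverting an operand and forcing the carry-in, the usual two's-complement trick.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity mini_alu is
    port
    (
        op        : in  std_logic_vector(3 downto 0);   -- placeholder encoding
        a, b      : in  std_logic_vector(31 downto 0);
        carry_in  : in  std_logic;
        result    : out std_logic_vector(31 downto 0);
        carry_out : out std_logic
    );
end mini_alu;

architecture rtl of mini_alu is
begin
    process(op, a, b, carry_in)
        variable lhs_v, rhs_v : unsigned(31 downto 0);
        variable cin_v        : integer range 0 to 1;
        variable sum          : unsigned(32 downto 0);
    begin
        lhs_v := unsigned(a);
        rhs_v := unsigned(b);
        cin_v := 0;
        carry_out <= '0';

        case op is
            when "0000" => result <= a and b;        -- AND (TEST = AND without writeback)
            when "0001" => result <= a or  b;        -- OR
            when "0010" => result <= a xor b;        -- XOR
            when "0011" => result <= not a;          -- NOT
            when others =>
                -- Everything else shares the single adder pathway; "0100" is plain ADD.
                if op = "0101" then                                 -- ADC: a + b + carry
                    if carry_in = '1' then cin_v := 1; end if;
                elsif op = "0110" then                              -- INC: a + 1
                    rhs_v := (others => '0'); cin_v := 1;
                elsif op = "0111" or op = "1000" then               -- SUB / CMP: a + not(b) + 1
                    rhs_v := not rhs_v; cin_v := 1;
                elsif op = "1001" then                              -- SBB: a + not(b) + (1 - borrow)
                    rhs_v := not rhs_v;
                    if carry_in = '0' then cin_v := 1; end if;
                elsif op = "1010" then                              -- DEC: a + x"FFFFFFFF"
                    rhs_v := (others => '1');
                elsif op = "1011" then                              -- NEG: 0 + not(a) + 1
                    rhs_v := not lhs_v; lhs_v := (others => '0'); cin_v := 1;
                end if;
                sum       := ('0' & lhs_v) + ('0' & rhs_v) + cin_v;
                result    <= std_logic_vector(sum(31 downto 0));
                carry_out <= sum(32);
        end case;
    end process;
end rtl;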

 

In short, most of the work is still ahead of me. Especially when you count the validation / "FPGA Hell" phase. But hey, it's about the journey, right?

I've included some of my early design notes, to give a little more clarity to the direction I've been heading. I'd be curious to hear your thoughts.

Thanks again for your help and resources!

 

- Curt

instr-format.PNG

opcode-alu-tables.PNG


@CurtP,

Simple pipelines aren't.  Indeed, debugging the pipeline with all of its corner cases has been a challenge for me, and I just wanted to build the simplest pipeline I could.  You might wish to start planning for this ahead of time, since I was perpetually surprised by little nuances I wasn't expecting.  I mean, seriously, who would ever load a register from a value pointed to by the same register?  "LOD (R0),R0" ... it doesn't make sense, why would you do that?  Well, GCC created code that did that, which my CPU then needed to accommodate.

If you are interested in register renaming and/or out of order execution and stuff ... think now, before you start, about how you wish to represent the state information from within your CPU as you debug it.  This will be important to you.  Without a good way to view and inspect the problem, you won't be able to move forward to working code.

Will you be supporting unaligned instructions?  Classical RISC ISAs don't, but it's something to consider.

When I was designing my own instruction set, the requirement of only writing one register to the register set at a time prevented me from implementing such instructions as push/pop or iret.  In hindsight, GCC handled the missing push/pop so well you'd hardly know they are missing.  Indeed, the CPU is probably faster as a result.

Oh, I should mention regarding flags ... GCC (or any C compiler for that matter) will want the ability to compare and branch off of any signed or unsigned comparison.  That's =, !=, <, >, <=, and >=.  In other words, you will need to support (somehow) 11 conditions.  The ZipCPU sort of cheats and supports only 7 of these, but it's something to remember.  Also, the flags can be a hassle to get the sign bit and overflow bit right.  Don't forget to adjust the sign bit to keep it correct in case of overflow, or your extreme comparisons won't work.
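
A sketch of what I mean (this isn't the ZipCPU's actual flag logic, and the names are invented): take the carry and the overflow from the widened subtraction itself, and the signed/unsigned conditions fall out of N, Z, C and V.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity flag_compare is
    port
    (
        a, b : in  std_logic_vector(31 downto 0);
        eq, ne, lt_u, ge_u, lt_s, ge_s : out std_logic
    );
end flag_compare;

architecture rtl of flag_compare is
    signal diff       : unsigned(32 downto 0);
    signal n, z, c, v : std_logic;
begin
    -- a - b, computed as a + not(b) + 1 and widened one bit so the carry is visible
    diff <= ('0' & unsigned(a)) + ('0' & (not unsigned(b))) + 1;

    n <= diff(31);                                      -- sign of the 32-bit result
    z <= '1' when diff(31 downto 0) = 0 else '0';       -- zero
    c <= diff(32);                                      -- carry out ("no borrow")
    v <= (a(31) xor b(31)) and (diff(31) xor a(31));    -- signed overflow of the subtract

    eq   <= z;
    ne   <= not z;
    lt_u <= not c;                                      -- unsigned a <  b
    ge_u <= c;                                          -- unsigned a >= b
    lt_s <= n xor v;                                    -- signed   a <  b
    ge_s <= not (n xor v);                              -- signed   a >= b
    -- <= and > in either signedness follow by also folding in Z
end rtl;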

Looking over your ISA, I noticed ...

  1. You don't seem to have any conditional branch instructions.  Are these the j?? instructions?  Do you have a JSR instruction?
  2. I don't see any multiply or divide instructions.  I didn't have multiply or divide instructions in my first iteration, and needed to come back and add them in.  The ones I now have are three 32x32-bit multiplies returning the top 32 bits if signed, the top 32 bits if unsigned, and the bottom 32 bits.  I've also got two 32x32-bit divide instructions, one signed and one unsigned.  The compiler would love me to have a remainder function, or even a 64x32 divide, but in the ZipCPU architecture those require some software to accomplish.
  3. I didn't see any NOOP instruction.  That was another afterthought instruction of mine.  Sure, you could move register A to register A, but such an instruction might stall waiting for A to become available, whereas the NOOP doesn't need to read any registers.
  4. How about that memory access: will your ISA allow 8-bit byte access to memory?  I had to come back and add byte and halfword instructions into my ISA as an afterthought, when I couldn't get the C-library to compile without them.
  5. While from your description it doesn't sound like you'll struggle with this, I had to wrestle with the realities of linking when I first discovered how a linker worked.  There are two basic instructions the linker wants to adjust: load a value into a register, and jump to a location.  The first one was fairly easy: with two instructions I could load any value into any general-purpose register.  The second one was harder, but I eventually wrote something similar to what you've described above.  I consider this the LOD (PC),PC instruction--or load the value at the next memory address in the instruction stream into the PC.  It's the only instruction I have like it, as all my other instructions fit into 32-bit words with no immediates following.

If you are interested, you can see my own instruction cheat sheet here, or a longer discussion of the ISA here.

Good luck!  Holler if you get stuck, or when you discover you can't get as far in VHDL as I did in Verilog ... :P

Dan

 


1 hour ago, D@n said:

Holler if you get stuck, or when you discover you can't get as far in VHDL as I did in Verilog

Really?? You just couldn't stop yourself?

I do like the idea of letting C or Python assist with the tedious bookkeeping for things like pipelines. Anyway, my work is done... two minds with a common affliction. Nice! I'll just stick to Zynq when required and state machines or a simple real-time controller where I can. After I designed my last controller, what stopped me from publishing it was that I couldn't figure out an application that justified its existence. Still, the exercise was both maddening (at times) and fun (most of the time).


On 1/16/2018 at 7:14 AM, CurtP said:

When you say "infrastructure", are you referring to things like clock, reset, and other signals that propagate broadly through the design?

Regarding distributed memory versus registers -- you're correct that this design infers registers upon elaboration. What kinds of tradeoffs are involved in choosing which design to pursue? This register file will be the main GP registers for a superscalar design, so I want to be able to write 2 registers and read 4 registers per clock (when made possible by the pipeline), and I would like to be able to read back a written register the cycle immediately after it was written, if possible. Of course, none of these things should come at the cost of potential data corruption.

Regarding unregistered outputs -- is this because the outputs aren't in the clocked process? I had used this approach before but noticed that it caused an extra cycle to elapse between when a register was written and when that same register's new value could be read back.

Regarding the priority for data_inB if the same address is used -- is this behavior reliable under FPGA implementation? I had assumed that it would cause some sort of contention that would lead to undefined values. I've often heard the phrase "last signal assignment wins", but wasn't sure if that was something that merely happens in simulation, or if it was a reliable implemented behavior.

Just realized I didn't respond to this yesterday.

By infrastructure, I mean clocks and general resets that are used at startup or otherwise not used during normal operating conditions.

The distributed memory approach was presented a few posts up.  It doesn't make sense for your goal of an easy to read, general implementation.

In many designs, it is preferred to have output registers.  This does add one register of delay.  In some cases unregistered outputs are unavoidable.  When possible, having output registers is nice because the longest path won't be half in one file and half in another.  For re-use, remember that you can't control how other people use your module.

"last signal wins" is a reliable behavior.  That said, it can be abused.  There are structured uses where it can be very useful.  However, it creates a bottom-to-top priority structure.  For this reason, it should only be used in a manner that is unlikely to confuse a reader. 


 

5 hours ago, D@n said:

...I mean, seriously, who would ever load a register from a value pointed to by the same register?  "LOD (R0),R0" ... it doesn't make sense, why would you do that?  Well, GCC created code that did that, which my CPU then needed to accommodate.

Not saying I would have thought of it (but then I don't design compilers & CPUs).

It might be just the following code:

int* a = (something);
int b = *a; // and from now on, a is never used again

The lifetime of b begins when a ends, so a and b get optimized into the same register.

 



1 hour ago, xc6lx45 said:

The lifetime of b begins when a ends, so a and b get optimized into the same register.

There are several common things that should result in this.  Linked lists are one example, but there are probably a dozen others.  Of course, that is off topic for a code review of a register file for a CPU that doesn't currently have a C/C++ compiler.


13 hours ago, D@n said:

[full reply quoted above]

This gives me a few things to think about. One of the most difficult tasks is discerning the base level functionality you have to provide for an OS and a software toolchain to function. For example, I do want to implement an MMU with some manner of paging mechanism, as well as user/supervisor execution mode (which will of course require the paging mechanism to check page descriptors for security-related attributes).

As for unaligned accesses, the whole topic sounds messy and like it'd add a lot of complexity (and latency). I would like to avoid implementing it if possible. If there is a strong case for supporting unaligned accesses, what is it? This is a question I was pondering just the other day.

Looking at your questions one by one:

1. the j?? instructions are indeed the conditional jump instructions. However, the ones I have planned for may not satisfy all of the comparisons you mentioned a compiler will want (at least not in a single conditional jump instruction). I have planned for the true/false versions testing each flag (flags are carry, zero, negative, overflow; so the conditional jump instructions are jc/jnc, jz/jnz, jn/jnn, jo/jno).

I do not currently have a jsr/call type instruction planned. Correct me if I'm wrong here, but such an instruction struck me as unnecessarily CISC-y. Can't a compiler (or assembly programmer) just prepare the registers and stack according to their chosen calling convention, then execute a jump? I felt that it was unnecessary for my CPU to have to pipeline a series of operations just to set up a subroutine call for the programmer, especially considering that there are numerous calling conventions to choose from, with new ones being added periodically.

2. Multiply and divide are tentatively (but not firmly) planned. See that empty light blue row in my opcode map? That's where they'd go. Of course, implementing these has its own set of challenges. Not only are these complex, multi-cycle operations, but you also have to make arrangements to take operands of machine word size, and write back to a register pair rather than a single register (though accommodating this was one of my motivations for dual write ports on the register file). I do like the approach you mentioned regarding how to handle the signed/unsigned cases.

3. My 'nop' instruction is located right next to xchg at opcode 11010. I had tentatively planned to implement it in the old Intel convention of having it essentially encode to "xchg r0,r0". But I have also given thought to how I might want to implement various kinds of explicit and implicit no-ops of varying types (cycle count, serializing or non-serializing, etc.). Do you have any specific recommendations in this regard? Sticking nop next to xchg in the opcode table was sort of my default answer.

4. My design is set up for 32-bit machine words operating on byte-addressable memory, but I hadn't currently planned for specifically loading only a byte out of memory. There are ways that I could go about implementing this within the structure of the current instruction format, but it will be more of a challenge to implement the halfword case than the byte case. If both are necessary though, I'll figure it out.

5. I'm a little confused by the challenge you're describing here. As you mentioned, the loading a value in to a register part is easily covered. But doesn't an unconditional jump instruction satisfy the requirement of loading a value to the program counter? I mean, effectively the unconditional jump is just a load to PC but with privilege checks.

Also, it's very interesting you mentioned that you didn't have to implement push/pop. I assumed that a compiler would require these, at least for subroutine calls. You're saying that a compiler will manage a stack using only regular loads and stores? I'm inclined to implement push/pop if practical, because it's a very useful convenience at the assembler level if nothing else. As for int/iret, how did you manage to get a working design without these? I assumed that OSes would require the presence of software interrupt instructions, in order to facilitate user-mode code jumping to system subroutines. Absent this, what would an OS have the programmer do? It seems like the user-mode code would have to place parameters somewhere and then deliberately trigger an exception so that the OS could enter a service routine, read the parameters, and do its thing before returning.

Thanks again for helping me answer some of the tougher questions early on, rather than later when making changes gets exponentially harder. :)

 

- Curt


@CurtP,

Let's see ...

The ZipCPU doesn't officially implement a JSR instruction either, even though the compiler *really* wants one.  To deal with this case, I taught the assembler and disassembler that a particular two instruction combination was the JSR instruction: MOV 2+PC,R0 followed by JMP <address>.  Typically, this was implemented as a long jump to the address, since the assembler never knew where the address would be 'til link time, and the linker wanted to place a 32-bit address into the instruction stream somewhere.  As I mentioned before, my long jumps were implemented by loading the value following the current instruction word into the PC, and woodenly encoded as LW (PC),PC.

Actually ... the ZipCPU doesn't even have jump instructions per se, but the assembler hides this lack.  The ADD instruction provides the other alternative: ADD.C <offset>,PC adds, if the condition C is true, the given offset to the PC.  The assembler will quietly turn BRA, BNZ, BLT, etc. into this instruction if the target fits, and the disassembler replaces these instructions with their Bxx equivalents.

The C-library will require sub-word addressable memory for its string operations.  Plan on needing arbitrary 16-bit and 8-bit load and store capability, or giving up on the C-library and implementing portable code.
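
To give a feel for what that costs (a sketch only, names invented, little-endian assumed): a byte load just muxes the addressed lane out of the 32-bit bus word and sign- or zero-extends it, while byte stores generate per-lane write enables instead of doing a read-modify-write.

library IEEE;
use IEEE.std_logic_1164.all;

entity load_byte is
    port
    (
        mem_rdata   : in  std_logic_vector(31 downto 0);
        addr_lo     : in  std_logic_vector(1 downto 0);   -- two LSBs of the byte address
        sign_extend : in  std_logic;
        load_result : out std_logic_vector(31 downto 0)
    );
end load_byte;

architecture rtl of load_byte is
begin
    process(mem_rdata, addr_lo, sign_extend)
        variable byte_v : std_logic_vector(7 downto 0);
    begin
        -- pick the addressed byte lane
        case addr_lo is
            when "00"   => byte_v := mem_rdata(7 downto 0);
            when "01"   => byte_v := mem_rdata(15 downto 8);
            when "10"   => byte_v := mem_rdata(23 downto 16);
            when others => byte_v := mem_rdata(31 downto 24);
        end case;

        -- sign- or zero-extend into the 32-bit result
        if sign_extend = '1' then
            load_result <= (others => byte_v(7));
        else
            load_result <= (others => '0');
        end if;
        load_result(7 downto 0) <= byte_v;
    end process;
end rtl;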

An unconditional jump does need the capability to load an arbitrary value into the PC, yes.  At issue, though, is how you will come back to your machine code and place that address into your instruction stream after compilation and assembly have both finished without knowing what the value should be.  GNU's binutils helps, but you'll still need to write the hooks for your own processor.

So, moving on to push and pop.  The most common case for these routines is when you want to add (or remove) an item from the stack.  In my case, GCC calculates the stack size ahead of time, and then subtracts the stack size for the whole routine upon startup.  Any register saves will be immediately placed into known positions on the  stack afterward.  Hence, the startup for a subroutine might look like:

subroutine:
  SUB 24,SP
  STO R0,(SP)
  STO R1,4(SP)
  STO R2,8(SP)
  STO R3,12(SP)
  ... compiler generated user code goes here
  LOD R0,(SP)
  LOD R1,4(SP)
  LOD R2,8(SP)
  LOD R3,12(SP)
  ADD 24,SP ; release the stack frame
  JMP R0 ; This is the ZipCPU's return instruction

The neat thing about how I've set up the bus is that only the first of these loads or stores will cost any bus delays.  The second and subsequent (in any string of them) will cost only one additional clock--depending, of course, on the speed of the memory at the other end.

For INT/IRET instructions ... the ZipCPU supports two modes: a user mode (where interrupts are enabled) and a supervisor mode (where interrupts are disabled).  On an interrupt or an exception, the CPU just switches register sets in order to switch modes.  The actual mode is kept in the flags register, so any write that changes this mode will cause the CPU to switch modes and hence register sets.  Incidentally, this makes it *really* easy to write interrupt routines: they are just written in "C" as part of the supervisor code.  When the supervisor is ready to switch to the user mode, it just issues a zip_rtu() command.  This turns into an OR 0x100,CC instruction which turns on the interrupt-enabled bit and the CPU switches modes.  Incidentally ... getting the pipeline working for this, including all of the corner cases, was a real pain in the bitstream.

To implement a system call, I'd just call a function.  That function would contain the one assembler instruction, "LDI 0,CC", which would then disable interrupts, switching the CPU to supervisor mode--leaving all the user registers intact as though the function were actually called.  From supervisor mode, the software can do what it then likes with those register values.  There are other possibilities for entering supervisor mode as well.  For example, a division by zero error, hitting a debugging break point, at the conclusion of a single-stepped instruction, on a bus error, after hitting an illegal instruction, trying to execute an instruction from non-existent memory, etc.

When the supervisor code has dealt with whatever the exception was, it just calls zip_rtu(), which executes a built-in RTU (return-to-userspace) instruction.  There are other built-ins to help out as well, such as zip_save_context(contextp), which stores the user registers into the array pointed to by contextp, and zip_restore_context(contextp), which does the reverse.  Hence, to swap tasks, you set a timer interrupt.  When that interrupt goes off, you save the registers into an array associated with the current task, and then load the registers from the task you want to switch to.  Once you then return to userspace, the task swap is complete.

Still, the "tough" question early on is: how will you simulate your design, how will you visualize your pipeline, and how will you debug your software (and CPU) once you move to the actual hardware.  These are the real questions you need to answer up front and immediately.  Everything else follows from the answers you give to these questions.

Dan


10 hours ago, Piasa said:

[full reply quoted above]

Thanks for getting back to me!

Are there any specific pros/cons to registering the outputs of the register file that I should be aware of? This design is specifically targeted for use in a bespoke CPU design, not as a general purpose module to be re-purposed by others. So in the context of my own design, what should I worry about? All module outputs are sampled at the rising edge of the clock. Would leaving the outputs unregistered put my design at risk of sampling spurious/undefined values from the register file based on factors like clock skew or others?

And thanks for the clarification on the 'last signal assignment wins' behavior. There are some cases where exploiting this trait is very useful, though you are correct that it may affect code clarity, especially for those not already well-acquainted with VHDL.

