
Feedback on a register file design?


Go to solution: Solved by Piasa

Question

Hey everyone,

I've done the initial design of a register file (16x 32-bit registers, two write ports, four read ports) in VHDL as part of a larger project, but seeing as I am a relative newcomer to HDLs, I was hoping to get some feedback on my design, any errors I may have made, or any improvements I might want to make.

Here is the VHDL:

-- Register file

-- Two write ports, four read ports.
-- For performance reasons, this register file does not check for the same 
-- register being written on both write ports in the same cycle. CPU control
-- circuitry is responsible for preventing this condition from happening.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

use work.cpu1_globals_1.all;
use work.func_pkg.all;

entity registerFile is
    port
    (
        clk : in std_logic;
        rst : in std_logic;
        writeEnableA : in std_logic;
        writeEnableB : in std_logic;
        readSelA, readSelB, readSelC, readSelD, writeSelA, writeSelB : in std_logic_vector(3 downto 0);
        data_inA, data_inB : in std_logic_vector(DATA_WIDTH - 1 downto 0);
        data_outA, data_outB, data_outC, data_outD : out std_logic_vector(DATA_WIDTH - 1 downto 0)
    );
end registerFile;

architecture behavioral of registerFile is

    type regArray is array (0 to 15) of std_logic_vector(DATA_WIDTH - 1 downto 0);
    signal registers : regArray := (others => (others => '0'));

begin

    data_outA <= registers(to_integer(unsigned(readSelA)));
    data_outB <= registers(to_integer(unsigned(readSelB)));
    data_outC <= registers(to_integer(unsigned(readSelC)));
    data_outD <= registers(to_integer(unsigned(readSelD)));

    registerFile_main : process(clk)
    begin

        if rising_edge(clk) then

            if rst = '1' then

                registers <= (others => (others => '0'));

            else

                if writeEnableA = '1' then
                    registers(to_integer(unsigned(writeSelA))) <= data_inA;
                end if;

                if writeEnableB = '1' then
                    registers(to_integer(unsigned(writeSelB))) <= data_inB;
                end if;

            end if;

        end if;

    end process;
    
end behavioral;

This design is intended for use on FPGAs, hence the use of default values for the registers.
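Since the header comment pushes the dual-write check out to the CPU's control logic, one hedged suggestion: a simulation-only assertion inside the architecture can catch violations during testbench runs without costing any hardware. This is a sketch against the entity above, not part of Curt's posted code; synthesis tools generally ignore asserts:

registerFile_checkDualWrite : process(clk)
begin
    if rising_edge(clk) then
        -- Flag the condition the control circuitry is supposed to prevent:
        -- both write ports targeting the same register in the same cycle.
        assert not (writeEnableA = '1' and writeEnableB = '1'
                    and writeSelA = writeSelB)
            report "registerFile: both write ports hit the same register"
            severity error;
    end if;
end process;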

I appreciate any feedback you might have!

Thanks,

 - Curt


Recommended Posts

15 hours ago, [email protected] said:

@zygot,

Oh, I'm around, but you guys have given me a lot of reading material to go through.  Well, that and I haven't been waiting on the synthesizer as much so I haven't been hitting reload on the Digilent forum page as often.

@CurtP,

What @zygot is trying to point out is that I've built my own CPU, the ZipCPU, as a similar labor of love.  It's not a Forth machine, but a basic sixteen-register, 32-bit design, with two register sets of that size.  It's also small enough to fit on Digilent's CMod S6 while running a small O/S.  I'd love to offer you an example register file module; however, my own register "file" never managed to get separated out from the rest of the CPU the way you are trying to do.  I struggled to find a clean way to do so, and so didn't.  If you are curious, you can search the main CPU source for "regset" and see how I handled it.

In particular, the ZipCPU register file accepts only one write to the register file per clock--not two.  I was convinced that two writes per clock would leave the CPU vulnerable to coherency problems--assuming the block RAMs even supported it.  This register set supports one write and three reads per clock.  Two of those reads are for an instruction; the third is to support the debug port.  (You are thinking about how to debug your CPU already, aren't you?)

I've also written several instructional blog posts on this and similar topics.  These cover my view that you should start building a CPU by first building its peripherals, then a debug port to access and test the peripherals, before starting on the CPU itself.  Further articles discuss how to build a debugging port into the CPU, and how to debug the CPU from that port both in a simulator and online.  I've discussed pipelining strategies, and presented how the ZipCPU pipeline strategy works.  More recently, I've been working with formal methods.  I've therefore presented a demonstration of how formal methods can be used to verify that a bus component works as designed, and then offered a simple prefetch as an example.  I'm hoping to post again about how to build an instruction prefetch and cache, as well as how to formally verify that such a module works, but I haven't managed to clean up my code enough to present it, in spite of having presented why such a proof would be so valuable.

While I didn't use formal methods to build the CPU initially, I've been finding more bugs using formal methods than I had otherwise, so you might say that I've become a believer.

As a result, I'm right now in the process of formally verifying as much of the CPU's modules as I can.  I've managed to formally verify three separate prefetch modules (including the one with a cache), the memory access components, and the instruction decoder.  I've also managed to formally verify several CPU-related peripheral components, such as the (yet to be integrated) MMU, counters, timers, an interrupt controller, bus arbiters, bus delay components and more.  This has been my current focus with the CPU.  Once I finish it, I'm hoping to write about how to use the ZipCPU in case others are interested (and I know they are).

I know @zygot dislikes my blog, but you might find a lot of useful information available there to describe the things you've discussed above.

Dan

 

Hi Dan! Glad to meet you. I had already heard about the ZipCPU prior -- it's great to find the maker right here on the forum!

Thanks for your insights and resources. It helps to hear from someone who has already gone through the CPU development process. You've certainly given me a lot to pore over! :)

Especially helpful at this stage is your information about pipelining strategies. From my perspective, one of the most intellectually challenging tasks of the entire design is figuring out how to get the right data at the right stages of the pipeline at the right time, while minimizing stalls, maximizing utilization of functional units, and detecting data dependencies to avoid incorrect results. I have made this task (somewhat) easier by using a load/store architecture. All arithmetic/logic functions have register destinations, and register or immediate operands.

I would like to implement instruction level parallelism, which of course means dependencies must be carefully tracked. I have not yet devised the method I will use to allow ILP, while preserving program order data integrity. I have spent some time reading about ROBs, register renaming, and reservation station methodologies, but still have a lot more to learn. I imagine that I will first implement a simple, in-order pipeline that stalls completely while awaiting data, and once I have that working well, I can add complexity from there.
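The "stall completely while awaiting data" scheme can be sketched with a single comparison per source operand. All names here (ex_writesReg, ex_destSel, id_srcA, id_srcB, stall) are hypothetical placeholders, not signals from the design above:

-- Hypothetical RAW-hazard check for a simple in-order pipeline:
-- hold the decode stage whenever the execute stage is about to write
-- a register that decode wants to read.
stall <= '1' when ex_writesReg = '1' and
                  (ex_destSel = id_srcA or ex_destSel = id_srcB)
         else '0';

Forwarding paths can later relax this so the pipeline only stalls when the result genuinely isn't ready, but the bare comparison is a workable first cut.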

At this stage, I have:

 - Designed the initial version of the ISA, including the instruction format, instruction set, and (most of) the opcode/subfunction mappings. There is room to expand it, but currently all planned instructions can be mapped meaningfully to 4- or 8-byte instruction words. The second 4-byte block is only used when a full 32-bit immediate is specified. The initial 4-byte instruction word sets aside a byte for a small immediate value, which can be used for whatever the instruction might need one for (byte-sized ALU operand, loading the low byte of a register, short relative jump, software interrupt vector, etc.). Therefore, with good coding technique, most instructions fit into the machine word size of 32 bits. The format is split into simple bit fields, allowing most decoding to be done with relatively simple combinational logic.

 - Designed what will likely be at least very close to the final version of the ALU. Using only 4 fundamental functional pathways (add, and, or, xor), it is able to implement the 13 planned arithmetic/logical functions (and, test, xor, not, or, sub, cmp, neg, inc, sbb, adc, add, dec).

- Designed the register file you see above. The reason behind having two write ports and four read ports is because of the instruction level parallelism I plan to implement. I imagine this design will probably change numerous times throughout the course of the project. If nothing else, I will have to add more physical registers, as I plan to implement ARM-style register context switching for interrupt routine servicing.

 - Designed the flags register (fairly straightforward).

 - Created a very early design for a (partial) instruction decoder. Currently, it can check validity and decode for any of the ALU-related instructions, outputting valid instructions in an internal format that would be useful for scheduling.
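The ALU consolidation Curt describes (13 functions from 4 pathways) typically hinges on the two's-complement identity a - b = a + (not b) + 1, which lets sub, cmp, sbb, neg, and dec share the one adder. A hypothetical illustration, not taken from Curt's actual ALU:

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical sketch (all names invented): add and subtract sharing
-- one carry chain via operand inversion plus carry-in.
entity addSubSketch is
    port
    (
        opA, opB : in  std_logic_vector(31 downto 0);
        isSub    : in  std_logic;  -- '1' selects subtraction
        result   : out std_logic_vector(31 downto 0)
    );
end addSubSketch;

architecture rtl of addSubSketch is
    signal b2  : unsigned(31 downto 0);
    signal cin : unsigned(0 downto 0);
begin
    -- a - b = a + (not b) + 1: invert the second operand and inject
    -- a carry-in of one. cmp is a subtract that only updates flags;
    -- neg is 0 - b; dec is a - 1; sbb swaps the constant carry for the flag.
    b2  <= not unsigned(opB) when isSub = '1' else unsigned(opB);
    cin <= "1" when isSub = '1' else "0";
    result <= std_logic_vector(unsigned(opA) + b2 + cin);
end rtl;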

 

In short, most of the work is still ahead of me. Especially when you count the validation / "FPGA Hell" phase. But hey, it's about the journey, right?

I've included some of my early design notes, to give a little more clarity to the direction I've been heading. I'd be curious to hear your thoughts.

Thanks again for your help and resources!

 

- Curt

instr-format.PNG

opcode-alu-tables.PNG

1 hour ago, [email protected] said:

Holler if you get stuck, or when you discover you can't get as far in VHDL as I did in Verilog

Really?? You just couldn't stop yourself?

I do like the idea of letting C or Python assist with the tedious bookkeeping for things like pipelines. Anyway, my work is done... two minds with a common affliction. Nice! I'll just stick to Zynq when required, and state machines or a simple real-time controller where I can. After I designed my last controller, what stopped me from publishing it was that I couldn't figure out an application that justified its existence. Still, the exercise was both maddening (at times) and fun (most of the time).


 

5 hours ago, [email protected] said:

...I mean, seriously, who would ever load a register from a value pointed to by the same register?  "LOD (R0),R0" ... it doesn't make sense, why would you do that?  Well, GCC created code that did that which my CPU then needed to accommodate.

Not saying I would have thought of it (but then I don't design compilers & CPUs).

It might be just the following code:

int* a = (something);
int b = *a; // and from now on, a is never used again

The lifetime of b begins when a ends, so a and b get optimized into the same register.

 



1 hour ago, xc6lx45 said:

The lifetime of b begins when a ends, so a and b get optimized into the same register.

There are several common things that should result in this.  Linked lists are one example, but there are probably a dozen others.  Of course, that is off topic from a code review of a register file for a CPU that doesn't currently have a C/C++ compiler.

13 hours ago, [email protected] said:

@CurtP,

Simple pipelines aren't.  Indeed, debugging the pipeline with all of its corner cases has been a challenge for me, and I just wanted to build the simplest pipeline I could.  You might wish to start planning for this ahead of time, since I was perpetually surprised by little nuances I wasn't expecting.  I mean, seriously, who would ever load a register from a value pointed to by the same register?  "LOD (R0),R0" ... it doesn't make sense, why would you do that?  Well, GCC created code that did that, which my CPU then needed to accommodate.

If you are interested in register renaming and/or out of order execution and stuff ... think now, before you start, about how you wish to represent the state information from within your CPU as you debug it.  This will be important to you.  Without a good way to view and inspect the problem, you won't be able to move forward to working code.

Will you be supporting unaligned instructions?  Classical RISC ISAs don't, but it's something to consider.

When I was designing my own instruction set, the requirement of only writing one register to the register set at a time prevented me from implementing instructions such as push/pop or iret.  In hindsight, GCC handled the missing push/pop so well you'd hardly know they are missing.  Indeed, the CPU is probably faster as a result.

Oh, I should mention regarding flags ... GCC (or any C compiler for that matter) will want the ability to compare and branch off of any signed or unsigned comparison.  That's =, !=, <, >, <=, and >=.  In other words, you will need to support (somehow) 11 conditions.  The ZipCPU sort of cheats and supports only 7 of these, but it's something to remember.  Also, the flags can be a hassle: getting the sign bit and the overflow bit right is tricky.  Don't forget to adjust the sign bit to keep it correct in case of overflow, or your extreme comparisons won't work.

Looking over your ISA, I noticed ...

  1. You don't seem to have any conditional branch instructions.  Are these the j?? instructions?  Do you have a JSR instruction?
  2. I don't see any multiply or divide instructions.  I didn't have multiply or divide instructions in my first iteration, and needed to come back and add them in.  The ones I now have are three 32x32-bit multiplies returning the top 32 bits if signed, the top 32 bits if unsigned, and the bottom 32 bits.  I've also got two 32x32-bit divide instructions, one signed and one unsigned.  The compiler would love me to have a remainder function, or even a 64x32 divide, but in the ZipCPU architecture those require some software to accomplish.
  3. I didn't see any NOOP instruction.  That was another afterthought instruction of mine.  Sure, you could move register A to register A, but such an instruction might stall waiting for A to become available, whereas the NOOP doesn't need to read any registers.
  4. How about that memory access: will your ISA allow 8-bit byte access to memory?  I had to come back and add byte and halfword instructions into my ISA as an afterthought, when I couldn't get the C-library to compile without them.
  5. While from your description it doesn't sound like you'll struggle with this, I had to wrestle with the realities of linking when I first discovered how a linker worked.  There are two basic instructions the linker wants to adjust: load a value into a register, and jump to a location.  The first one was fairly easy: it took two instructions, and I could load any value into any general purpose register.  The second one was harder, but I eventually wrote something similar to what you've described above.  I consider this the LOD (PC),PC instruction--or load the value at the next memory address in the instruction stream into the PC.  It's the only instruction I have like it, as all my other instructions fit into 32-bit words with no immediates following.
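Regarding point 4, a load-byte path mostly amounts to a byte-select mux plus sign- or zero-extension. A hypothetical VHDL fragment (all names invented for illustration; little-endian byte numbering assumed):

-- Select one byte of a 32-bit memory word and extend it to 32 bits.
loadByte : process(memWord, byteSel, signExtend)
    variable b   : std_logic_vector(7 downto 0);
    variable ext : std_logic_vector(31 downto 0);
begin
    case byteSel is
        when "00"   => b := memWord(7 downto 0);
        when "01"   => b := memWord(15 downto 8);
        when "10"   => b := memWord(23 downto 16);
        when others => b := memWord(31 downto 24);
    end case;
    if signExtend = '1' then
        ext := (others => b(7));   -- replicate the sign bit
    else
        ext := (others => '0');    -- zero-extend
    end if;
    ext(7 downto 0) := b;
    loadResult <= ext;
end process;

The halfword case is the same shape with a 2:1 mux and a 16-bit extension, which is why it usually comes almost for free once the byte path exists.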

If you are interested, you can see my own instruction cheat sheet here, or a longer discussion of the ISA here.

Good luck!  Holler if you get stuck, or when you discover you can't get as far in VHDL as I did in Verilog ... :P

Dan

 

This gives me a few things to think about. One of the most difficult tasks is discerning the base level of functionality you have to provide for an OS and a software toolchain to function. For example, I do want to implement an MMU with some manner of paging mechanism, as well as user/supervisor execution modes (which will of course require the paging mechanism to check page descriptors for security-related attributes).

As for unaligned accesses, the whole topic sounds messy and like it'd add a lot of complexity (and latency). I would like to avoid implementing it if possible. If there is a strong case for supporting unaligned accesses, what is it? This is a question I was pondering just the other day.

Looking at your questions one by one:

1. the j?? instructions are indeed the conditional jump instructions. However, the ones I have planned for may not satisfy all of the comparisons you mentioned a compiler will want (at least not in a single conditional jump instruction). I have planned for the true/false versions testing each flag (flags are carry, zero, negative, overflow; so the conditional jump instructions are jc/jnc, jz/jnz, jn/jnn, jo/jno).

I do not currently have a jsr/call-type instruction planned. Correct me if I'm wrong here, but such an instruction struck me as unnecessarily CISC-y. Can't a compiler (or assembly programmer) just prepare the registers and stack according to their chosen calling convention, then execute a jump? I felt that it was unnecessary for my CPU to have to pipeline a series of operations just to set up a subroutine call for the programmer, especially considering that there are numerous calling conventions to choose from, with new ones being added periodically.

2. Multiply and divide are tentatively (but not firmly) planned. See that empty light blue row in my opcode map? That's where they'd go. Of course, implementing these has its own set of challenges. Not only are these complex, multi-cycle operations, but you also have to make arrangements to take operands of machine word size and write back to a register pair rather than a single register (though accommodating this was one of my motivations for dual write ports on the register file). I do like the approach you mentioned regarding how to handle the signed/unsigned cases.

3. My 'nop' instruction is located right next to xchg at opcode 11010. I had tentatively planned to implement it in the old Intel convention of having it essentially encode to "xchg r0,r0". But I have also given thought to how I might want to implement various kinds of explicit and implicit no-ops of varying types (cycle count, serializing or non-serializing, etc.). Do you have any specific recommendations in this regard? Sticking nop next to xchg in the opcode table was sort of my default answer.

4. My design is set up for 32-bit machine words operating on byte-addressable memory, but I hadn't currently planned for specifically loading only a byte out of memory. There are ways that I could go about implementing this within the structure of the current instruction format, but it will be more of a challenge to implement the halfword case than the byte case. If both are necessary though, I'll figure it out.

5. I'm a little confused by the challenge you're describing here. As you mentioned, the load-a-value-into-a-register part is easily covered. But doesn't an unconditional jump instruction satisfy the requirement of loading a value into the program counter? I mean, effectively an unconditional jump is just a load to PC, but with privilege checks.
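Going back to Dan's point about comparisons: the signed and unsigned relations a compiler wants can all be decoded from those four flags after a compare. This sketch assumes an x86-style borrow convention (carry set on unsigned borrow) and invented signal names:

-- Condition decode from C (carry), Z (zero), N (negative), V (overflow)
-- following "cmp a, b", where C = '1' means an unsigned borrow occurred.
uns_lt <= C;                     -- a <  b (unsigned)
uns_le <= C or Z;                -- a <= b (unsigned)
uns_gt <= not (C or Z);          -- a >  b (unsigned)
sgn_lt <= N xor V;               -- a <  b (signed)
sgn_le <= (N xor V) or Z;        -- a <= b (signed)
sgn_gt <= not ((N xor V) or Z);  -- a >  b (signed)

The single-flag jumps (jc, jz, jn, jo and their negations) cover the unsigned cases and equality, but the signed relations need the N-xor-V combination, which is worth deciding on before the conditional-jump encodings are frozen.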

Also, it's very interesting you mentioned that you didn't have to implement push/pop. I assumed that a compiler would require these, at least for subroutine calls. You're saying that a compiler will manage a stack using only regular loads and stores? I'm inclined to implement push/pop if practical, because it's a very useful convenience at the assembler level if nothing else. As for int/iret, how did you manage to get a working design without these? I assumed that OSes would require the presence of software interrupt instructions, in order to facilitate user-mode code jumping to system subroutines. Absent this, what would an OS have the programmer do? It seems like the user-mode code would have to place parameters somewhere and then deliberately trigger an exception so that the OS could enter a service routine, read the parameters, and do its thing before returning.

Thanks again for helping me answer some of the tougher questions early on, rather than later when making changes gets exponentially harder. :)

 

- Curt

10 hours ago, Piasa said:

Just realized I didn't respond to this yesterday.

By infrastructure, I mean clocks and general resets that are used at startup or otherwise not used during normal operating conditions.

The distributed memory approach was presented a few posts up.  It doesn't make sense for your goal of an easy to read, general implementation.

In many designs, it is preferred to have output registers.  This does add the 1 register delay.  In some cases the unregistered outputs are unavoidable.  When possible, having output registers is nice because the longest path won't be half in one file and half in another.  For re-use, remember that you can't control how other people use your module.

"last signal wins" is a reliable behavior.  That said, it can be abused.  There are structured uses where it can be very useful.  However, it creates a bottom-to-top priority structure.  For this reason, it should only be used in a manner that is unlikely to confuse a reader. 

Thanks for getting back to me!

Are there any specific pros/cons to registering the outputs of the register file that I should be aware of? This design is specifically targeted for use in a bespoke CPU design, not as a general-purpose module to be re-purposed by others. So in the context of my own design, what should I worry about? All module outputs are sampled at the rising edge of the clock. Would leaving the outputs unregistered put my design at risk of sampling spurious/undefined values from the register file due to factors like clock skew?

And thanks for the clarification on the 'last signal assignment wins' behavior. There are some cases where exploiting this trait is very useful, though you are correct that it may affect code clarity, especially for those not already well-acquainted with VHDL.


@CurtP,

I've been following this thread much the same way as people who watch soaps... or fireworks.... the thread is throbbing with excitement.

So, I don't know if you're a genius or agog with ideas that will never be built and debugged, or someone who just wants to get to the part where tomatoes get plucked from the garden. I understand; inquiring minds want to know. I've got my own afflictions in that regard. The following comments have nothing to do with CPU design, or the excitement of discussing interesting aspects of any particular project. I'm just putting the discussion into the context of a guy who started a thread wanting feedback on a rather simple logical structure.

If I were going to build a personal manned rocket I'd want to attempt and succeed at smaller projects before strapping myself into a piece of home-built hardware and pushing the big red button. But then I'm just a pedestrian engineer. Along with succeeding at smaller but increasingly complex projects you gain a lot of skill at understanding the basic but peripheral knowledge and skills in using the tools that are needed to accomplish a complex project. This includes the Vivado tools, the languages, the bugs in the tools, best practice in implementing complex logic elements, timing, constraints, etc. I do realize that your goals and choice as to how you get there has to be your decision alone.

Were you to start a project vault project with an end goal of achieving a unique CPU with unique objectives this might be very popular and instructive to a wide audience. I'm thinking of a project consisting of a series of smaller projects culminating in one big flourish. It would produce not just code and techniques but convey the development complexities in an incremental and natural manner. I'm not saying that this is necessarily a good idea or something that you ever thought about doing.... just interesting and it sure would expose a lot of those peripheral issues and strategies. My suggestion admittedly is asking a lot of you. You may have no intention of publishing any of your hard work. It would be a unique project, interesting to a lot of people and instructive to many more.

Mostly, don't let anything that I say restrain your ambition or enthusiasm.

5 minutes ago, zygot said:

@CurtP,

I've been following this thread much the same way as people who watch soaps... or fireworks.... the thread is throbbing with excitement.

...

I get where you're coming from. If one isn't careful, they can extinguish their own enthusiasm for something by taking on way too big of a task, way too early.

 

@[email protected] has strongly suggested that I build the peripherals for my CPU before building my CPU. This is probably the course of action I'll ultimately take (maybe building small parts of the CPU along the way). This approach could kill two birds with one stone: setting me up with a proper debug environment, and giving me some more simple projects to work on first.

8 minutes ago, CurtP said:

I get where you're coming from. If one isn't careful, they can extinguish their own enthusiasm for something by taking on way too big of a task, way too early.

Boy, I'd love to hear @kc5tja's comments on this line.  It'd be fun to hear a status on his project too, since he was last set back.  Judging from his project log since then, though, it looks like he's managed to recover from his set back.  However, as a lesson for new CPU developers, you might wish to look at the date stamps on his log.  Things like this take time.  They can also be a test of patience.

Dan

1 hour ago, CurtP said:

Are there any specific pros/cons to registering the outputs of the register file that I should be aware of?

Probably not.  You just need to be aware that the register outputs have some delay from the logic that is in the register file.  Registering outputs is "generally good" design.  However, it isn't always needed or possible.  It is up to you to decide the logical impact in this case and then compare against any performance benefits.
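Concretely, registering the outputs of the register file from the opening post would mean replacing the four concurrent read assignments with a clocked process along these lines (an untested sketch; note it adds one cycle of read latency that the pipeline must account for):

-- Registered read ports: the read mux now ends at a flip-flop inside
-- this module, and the structure maps naturally onto block RAM.
registerFile_read : process(clk)
begin
    if rising_edge(clk) then
        data_outA <= registers(to_integer(unsigned(readSelA)));
        data_outB <= registers(to_integer(unsigned(readSelB)));
        data_outC <= registers(to_integer(unsigned(readSelC)));
        data_outD <= registers(to_integer(unsigned(readSelD)));
    end if;
end process;

One behavioral difference worth noting: with registered reads, a read of a register written in the same cycle returns the old value, so any same-cycle write-to-read bypass has to be added explicitly if the pipeline needs it.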


Regarding pipelining and registered outputs ( or inputs ).

If your logic were implemented in basic gates, as in the "old days" (and, or, not, etc.), then the more complex the logic, the more levels of delay you would inherit. Once the combinatorial delay exceeds the target clock period, you need to pipeline. Pipelining introduces a number of complexities that have to be accounted for, debugging not being the least important. Of course your Series 7 FPGA uses LUTs instead of discrete gates, but a similar fate transpires.

To achieve the highest repeatable performance from your synthesis tool (as the routing resources get used up, so that small changes don't trigger a major placement change), the solution is adding registers that separate less complex combinatorial logic structures. If you are using an ASIC or gate array you might have other strategies. At the minimum, this helps the synthesis and place-and-route tools figure out how to implement your design.

An interesting experiment is to do a few designs with block memory, selecting various input and output registering strategies within a somewhat complex design, and see what happens to your timing and the placement of related logic. Try using block memory as asynchronous RAM. If you try creating a simple controller that can be scaled hierarchically with generic assignments, and want a high clock rate, you will see what I mean. I've done this. Just inserting a small high-data-rate pipelined structure into a very complex design and having it maintain the required timing can be difficult. This is a basic concept to master before embarking on complex projects.

As to when this is necessary, there is a bit of an art (experience) to making the correct decision at the beginning of a project. It is not uncommon to get 80% of the way to completing a project only to find that the fundamental strategy is flawed, and the way forward involves a lot more redesign and restructuring than you care to do. SOP in a deadline-driven commercial setting (that is, projects that take 8 months that were supposed to take 2, when getting it right the first time would have gotten you finished in 4).
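As a minimal illustration of the register separation zygot describes, here is a hypothetical three-input sum split across two register stages, so each stage carries only one adder's worth of combinational delay (names and widths invented, not from any design in this thread):

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical two-stage pipelined sum: result = a + b + c, with each
-- clock period spanning only a single 32-bit adder.
entity pipeAddSketch is
    port
    (
        clk     : in  std_logic;
        a, b, c : in  std_logic_vector(31 downto 0);
        result  : out std_logic_vector(31 downto 0)
    );
end pipeAddSketch;

architecture rtl of pipeAddSketch is
    signal abSum : unsigned(31 downto 0);
    signal cDly  : unsigned(31 downto 0);  -- delay c to stay aligned with abSum
begin
    process(clk)
    begin
        if rising_edge(clk) then
            abSum  <= unsigned(a) + unsigned(b);      -- stage 1
            cDly   <= unsigned(c);
            result <= std_logic_vector(abSum + cDly); -- stage 2
        end if;
    end process;
end rtl;

The cDly register is the part newcomers tend to forget: every operand that bypasses a pipeline stage must be delayed to match, or the stages operate on data from different cycles.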

3 hours ago, [email protected] said:

Boy, I'd love to hear @kc5tja's comments on this line.  It'd be fun to hear a status on his project too, since he was last set back.  Judging from his project log since then, though, it looks like he's managed to recover from his set back.  However, as a lesson for new CPU developers, you might wish to look at the date stamps on his log.  Things like this take time.  They can also be a test of patience.

Dan

In order of mention...

Status on the Kestrel project: I went back to working on the Kestrel-2, creating a refinement of this architecture.  Instead of the 16-bit stack-architecture CPU, however, I replaced the core with my KCP53000 CPU, an M-mode-only RV64I RISC-V processor.  This has allowed me to expand the design of the computer rather significantly relative to the original design.  The Kestrel-2's address space was laid out like so:

$0000 - $7FFF   Program RAM
$8000 - $BFFF   I/O space
$C000 - $FFFF   Video RAM

The block RAMs were pre-loaded with software to run at synthesis time.  There is no ROM, and the video display was driven at 640x200 monochrome (bitmapped).

The Kestrel-2DX, the modern incarnation of the basic concept, is substantially renovated.  As indicated above, the CPU is now a 64-bit RISC-V core, with a memory map as shown here: http://chiselapp.com/user/kc5tja/repository/kestrel-2dx/wiki/Memory Map

It has a proper ROM (which is implemented in Verilog as a giant case-statement because I don't have enough block RAMs to use as a ROM) which holds a very minimal BIOS-like thing.  This frees up quite a bit of space from RAM, where I am currently writing a dialect of Forth to serve as its system software.
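
As a rough illustration of the ROM-from-logic idea (kc5tja's is in Verilog; this is a VHDL analog in the style of the thread's original code, with made-up contents and names), a ROM can be described as a selected assignment that synthesis implements entirely in LUTs rather than block RAM:

```vhdl
-- "ROM as a giant case statement" sketch: purely combinational,
-- so it consumes LUTs instead of block RAM. Contents are
-- placeholder words, not real firmware.
library IEEE;
use IEEE.std_logic_1164.all;

entity lut_rom is
    port
    (
        addr : in  std_logic_vector(1 downto 0);
        data : out std_logic_vector(31 downto 0)
    );
end lut_rom;

architecture behavioral of lut_rom is
begin
    -- One entry per address; a real ROM would have its contents
    -- generated from an assembled firmware image.
    with addr select data <=
        x"00000013" when "00",
        x"00000093" when "01",
        x"00000113" when "10",
        x"00000193" when others;
end behavioral;
```

Scaled up to a full firmware image, a structure like this explains the timing burden described below: every output bit becomes a wide LUT tree on the read path.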

This design is, however, pushing the limits of the Digilent Nexys-2 FPGA board.  Although I have plenty of logic left, the fact that the ROM is synthesized from LUTs is enough of a burden to drop the maximum clock speed to just above 26MHz, which is dangerously close to the 25MHz it's designed to run at.

Of all the computer designs I've made, I've been especially happy with this one.  Despite not being finished yet, I'm having a total blast with it, which is exactly what I wanted from my neo-retro computer designs.  It looks, feels, and behaves like a classic computer, despite having a modern 64-bit core.  I've won.  (I just need to finish Forth for it!)

The Kestrel-3 will be a new computer design with somewhat more modern capabilities.  First and foremost, it'll be my first design based around the Chisel-3 DSL.  I've finally learned enough to feel comfortable with it.  (Another personal victory!)  The K3 will be built using only open-source FPGA boards though (e.g., BlackIce and/or icoBoard Gamma), which can be targeted with the Yosys development chain.  There are several reasons for this, not the least of which is because I want to support that community.  I'm planning on a computer with two boards: one comprising the CPU and RAM, and another comprising "the chipset" of the rest of the computer (e.g., video, SD card, keyboard, sound, etc.).

Originally, I wanted to target the Altera/Terasic DE-1 FPGA board (since it's available for dirt cheap these days), but I've received enough feedback from my friends and followers of the project that they wanted to follow along but were hesitant to install Altera's ginormous IDE on their box.  They wanted something that could run reliably on a Raspberry Pi, and right now, that means Yosys.  This fundamentally changes my plans for this computer, and it's not clear I have a good design for it yet.

One thing is clear though -- the Kestrel-2DX will end up being an early development terminal for the Kestrel-3.  I eat my own dogfood.

The Setback.  This problem still exists.  The Nexys-2's PSRAM chip remains dead to the world for me.  I've long since given up on this chip.  Near as I can tell, the *only* project that reports success with it is the Nexys-2 BIST bitstream, which leads me to simply not trust this BIST.  I *have*, however, written designs to access the SRAM on the icoBoard and have successfully confirmed my ability to read and write that board's SRAM chip.  So I'll be going that route.  Another reason to use these boards instead of the DE-1: anything more complex than basic SRAM is straight-up frightening to me.  I've been burned enough to never want to use them again.

Once I get a working platform that boots on its own with SRAM but without SDRAM, then I have a basis on which I can tweak the design and run software to exercise the SDRAM chips.  With luck, things will work.  But I want a known-good platform first and foremost.

The Future.  I never made progress with my original Kestrel-3 design or intentions.  Reverting to working on the Kestrel-2 and upgrading it to the new Kestrel-2DX design has restored my interest and faith in my abilities as a hobby hardware designer.  While I still have plans for the Kestrel-3 (see http://chiselapp.com/user/kc5tja/repository/kestrel-3/wiki/Base Specs), it's not clear how I'll achieve these goals just yet.

My current plans are to perform the following broad steps for development:

  1. Develop a dumb GPIO adapter.  If I stick with Wishbone B.4/Pipelined, this is already done.  I've been strongly considering switching to TileLink TL-UL though.  This might give me wider access to parts written by others for the RISC-V ecosystem.
  2. Develop a debug controller to which I can send read/write byte/half-word/word/double-word requests.  Since I have access to raw GPIO on the Kestrel-2DX, this is not likely to use RS-232 framing or anything.  It'll probably be bit-banged, for simplicity's sake.  A few PMODs will be needed for this.  This will serve as a surrogate for the final CPU design that I intend.
  3. Make sure I can toggle LEDs using the debug port interactively from the Kestrel-2DX.
  4. Port my Serial Interface Adapter core to the Kestrel-2DX.  Confirm it works in loop-back mode.
  5. Port my Serial Interface Adapter to both the Kestrel-3 designs.
  6. Interactively confirm that the serial link works on the Kestrel-3 in loop-back mode.
  7. Interactively confirm that the serial link works between the 2DX and the 3.
  8. Develop final SRAM interface.
  9. Make sure I can perform basic RAM tests interactively from the Kestrel-2DX.
  10. Develop a "ROM" system using block RAMs.  (from CPU's perspective, it's ROM; from debug interface, it's RAM.)
  11. Make sure I can write to and read back from the "ROM" interactively from the Kestrel-2DX.
  12. Port the KCP53000 to run on the new platform.
  13. Write first-boot firmware that writes "Hello world" to the SIA or something.  Upload it from the Kestrel-2DX.
  14. Boot the Kestrel-3 for the first time, and hope for the best.

This will likely change as I learn more about the design.  Note how none of this even concerns itself with the graphics, sound, or other goodies I've been looking for.  Unlike the Kestrel-2DX, it doesn't even have the MGIA to fall back on.  This is because the CPU will consume the overwhelming majority of the iCE40HX8K part; I'll probably need to off-load the niceties to a slave peripheral that's PMOD-accessible.

20 minutes ago, [email protected] said:

@kc5tja,

Wow!  That's a nice status update, and I'm glad to hear you are moving along!

Would you offer any words of wisdom to someone just starting out with their own CPU design?

Dan

Yes.

You are going to fail.  You are going to fail hard.  You are going to fail so hard, you'll want to flip your table, walk away, curse everything as a waste of time, and never look back.

Do all of these things; except, I'd recommend not flipping that table.  I find the cursing to be cathartic, and the walking to be mind-clearing.  Maaaaaaaybeee try not to be as public about the cursing as *I* have been.  I have a reputation.  You might not, and it could damage yours.  But if you must, curse into an empty room.  Scream loud if you must.  Get it off your chest; then, get back on the wagon.

Walk away; walk far, far away.  Never look back; if you do, you'll tag some of that baggage along with you.  Drop it like a moldy sack of hot potatoes.  However, as I said before, don't flip that table!  Even though you might not look back, that doesn't mean you won't *be* back.  Life finds a way.  It always does.  It just takes longer than you'd like sometimes.

Instead, strive for small victories.  Remember where things last worked.  You are exploring a multi-faceted design *space*, not a single path on a 2-dimensional map.  My 14-step development plan I wrote above?  It's just my current vision.  It WILL change.  And so will yours.  Accept this as normal.  Frustrating!!  Absolutely!  But definitely normal!

Because after you walk away, eventually, you'll want to return.  And when you do, you can wipe the table clean, and go back to the last thing you know worked.  Pick up the pieces from there and build upon your successes.  Your progress will be a slog, but eventually, you'll find a way towards your goal.

I'll let you know when I've found mine.

 

2 minutes ago, [email protected] said:

Thank you, @kc5tja!

@CurtP, I asked @kc5tja's perspective because I think it might help you put things in perspective.  Building a CPU is fun, @kc5tja describes it as addictive ;), but it will also be quite a long and frustrating journey.

Dan

Talk of pipelines is poignant for me, as one of the biggest differences between the Kestrel-2DX's KCP53000 and the Kestrel-3's KCP53010 will, in fact, be that the latter has a 5-stage (maybe 6-stage, not sure yet) pipeline.  They should otherwise be software compatible with each other.  (The other difference being that some form of memory protection will be introduced, probably in the form of software-managed TLBs.)


Tell me that I'm an idiot if you want (I really don't mind) but.... I predict a lot less cursing and frustration if you develop the FPGA craft skills before pouring hours into the implementation stage where the initial product is supposed to rival current state of the art processors. The last 20 years have given us hardware optimizations like out-of-order execution, speculative branching and the like, and it's only recently that we've been served the bill (from a security perspective). I get the passion. I like it. I don't get masochism. I don't get wanting an end product without wanting to understand the process to achieve it. So I'll channel your moms... "well dear, as long as it makes you happy.."

1 hour ago, zygot said:

Tell me that I'm an idiot if you want (I really don't mind) but.... I predict a lot less cursing and frustration if you develop the FPGA craft skills before pouring hours into the implementation stage where the initial product is supposed to rival current state of the art processors. ...

The bill came due because of laziness.  Speculation that respects permission boundaries would have been perfectly fine.  The fact that CPUs speculate without respect for permissions is what led directly to Spectre (at least the variant that lets you read into kernel memory).

That said, my plans are not to go hog-wild with runtime optimizations.  An in-order pipeline is a natural, relatively inexpensive performance boost.  As I indicated elsewhere, I already have a CPU that runs at 25MHz, but it needs a minimum of 3 and a maximum of 7 cycles per instruction.  I'd like to drop that as much as I can.  I really enjoy estimating performance by counting instructions and treating them as single-cycle abstractions.  It's also a great help when bit-banging I/O.

Out-of-order and/or speculation are necessary only to compensate for ultra-deep pipelines.  Keep the pipes short, and you simply don't need speculation to meet your performance goals.  Much of the performance gains you'd expect to come from superscalar execution can be had with macro-op fusion.

Edited by kc5tja
17 hours ago, kc5tja said:

Yes.

You are going to fail.  You are going to fail hard.  You are going to fail so hard, you'll want to flip your table, walk away, curse everything as a waste of time, and never look back.

...

Only experience can teach what this truly is like, but I can say that this is pretty much what I was expecting. The people who succeed the most often fail the most, because they actually tried. Success is primarily a function of recovery from failure. But at least I have the benefit of being on my own timeline and own (non-)budget. Just me, my free time, an Arty S7, and my sanity (for now).

That said, the prevailing advice seems to be to build smaller projects first. I initially had figured "well, the smaller functional units -are- smaller projects". But I can see why it would be advantageous to do something more fundamental first, since it'll give me a chance to learn more about completing a whole project, the subtleties of FPGA design, and working with I/O. I'll think of something neat to do (after all, I've got a bunch of PMODs sitting around waiting to be toyed with). If nothing else, I can build myself a debug framework over UART.

Totally agree about the importance of small victories. Sometimes all it takes to keep your motivation going is to see your design blink some LEDs just like it was supposed to.

Thanks for the words of advice!

Edited by CurtP
15 hours ago, kc5tja said:

The bill came due because of laziness. 

My how this thread has evolved into a life of its own... and I can't resist egging it on.

In my experience, stupidity and malfeasance at the corporate level rarely have much to do with laziness. In order of precedence it would be: ego; attaining or defending status; greed; the fact that most corporations mimic the armed-services structure, that is, decision making goes down the chain of command and facts rarely travel up the chain; and, in large companies, departmental tribalism. In my experience, merely holding beliefs that run counter to prevailing commands (or failing to cheer on views supporting those commands) is viewed as mutiny and a threat.

1 minute ago, CurtP said:

That said, the prevailing advice seems to be to build smaller projects first. I initially had figured "well, the smaller functional units -are- smaller projects". ...

Just my (obvious) opinion but I heartily approve...

If you were doing this for a living you'd have lots of guidance and known good code to help with this process. On your own it's a bit tougher row to plow. If you look around you can find well-written code to help. But nothing, nothing circumvents learning for yourself through experience.
