
RISC-V RV32I CPU/controller


hamster


I've just posted my holiday project to Github - Rudi-RV32I - https://github.com/hamsternz/Rudi-RV32I

It is a 32-bit CPU, memory and peripherals for a simple RISC-V microcontroller-sized system for use in an FPGA.

It is a very compact implementation: it can use under 750 LUTs and as few as two block RAMs, which is less than 10% of an Artix-7 15T.

All instructions can run in a single cycle, at around 50MHz to 75MHz. Actual performance currently depends on the complexity of the system bus.

It has full support for the RISC-V RV32I instructions, and has supporting files that allow you to use the RISC-V GNU toolchain (i.e. standard GCC C compiler) to compile programs and run them on your FPGA board. 

Here is an example of the sort of code I'm running on it: a simple serial echo test that also counts characters on the GPIO port, which I have connected to the LEDs.

// These match the address of the peripherals on the system bus.
volatile char *serial_tx        = (char *)0xE0000000;
volatile char *serial_tx_full   = (char *)0xE0000004;
volatile char *serial_rx        = (char *)0xE0000008;
volatile char *serial_rx_empty  = (char *)0xE000000C;
volatile int  *gpio_value       = (int  *)0xE0000010;
volatile int  *gpio_direction   = (int  *)0xE0000014;

int getchar(void) {
  // Wait until the "receive empty" status is zero (a character is waiting)
  while(*serial_rx_empty) {
  }
  // Read the received character
  return *serial_rx;
}

int putchar(int c) {
  // Wait until status is zero 
  while(*serial_tx_full) {
  }
  // Output character
  *serial_tx = c;
  return c;
}

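// Send a NUL-terminated string; returns the number of characters sent.
// Unlike the standard puts(), this does not append a newline.
int puts(const char *s) {
  int n = 0;
  while(*s) {
    putchar(*s);
    s++;
    n++;
  }
  return n;
}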

int test_program(void) {
  puts("System restart\r\n");  

  /* Run a serial port echo */
  *gpio_direction = 0xFFFF;
  while(1) {
    putchar(getchar());
    *gpio_value = *gpio_value + 1;
  }
  return 0;
}

As it doesn't have interrupts, it isn't really a general purpose CPU, but somebody might find it useful for command and control of a larger FPGA project (converting button presses or serial data into control signals). It is released under the MIT license, so you can do pretty much whatever you want with it.
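As a rough sketch of what "serial data into control signals" could look like in C: the function below maps single command characters onto bits of a control word. The bit assignments (CTRL_RUN, CTRL_RESET), the command letters and the function name are all made up for illustration; none of them come from the Rudi-RV32I repository.

```c
#include <stdint.h>

/* Hypothetical control-bit assignments, chosen only for this example. */
#define CTRL_RUN    (1u << 0)
#define CTRL_RESET  (1u << 1)

/* Return the new control word after applying one received command byte. */
uint32_t apply_command(uint32_t ctrl, char cmd) {
  switch (cmd) {
    case 'r': return ctrl | CTRL_RUN;    /* set the run bit       */
    case 's': return ctrl & ~CTRL_RUN;   /* clear the run bit     */
    case 'x': return ctrl ^ CTRL_RESET;  /* toggle the reset bit  */
    default:  return ctrl;               /* ignore unknown bytes  */
  }
}
```

In the polling loop of the echo test this would be used along the lines of `*gpio_value = apply_command(*gpio_value, getchar());`. Keeping the decode step as a pure function means it can be checked on a PC before it goes anywhere near the FPGA.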

Oh, and all resources are inferred, so it is easily ported to different vendors' FPGAs (unlike vendor IP controllers).


For comparison, I got the following LUT counts for James Bowman's J1B (16 bit instruction, 32 bit ALU) CPU which I know quite well:

* 673 LUTs = 3.3% utilization of A7-35 with 32 stack levels in distributed RAM (replacing the original shift register based stack which does not look efficient on Xilinx 7 series)
* 526 LUTs if reducing the +/- 32 bit barrel shifter to +/- 1 bit, but the performance penalty is severe (e.g. IMM values need to be constructed from shifts).
* 453 LUTs if further allowing one BRAM18 for each of the two stacks. This includes a UART and runs at slightly more than 100 MHz but memory/IO need two instructions / two cycles.
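To illustrate the penalty in the second configuration, here is a small C model (not J1B code, and the function name is invented) of a core whose only shift primitive moves one bit position at a time. A shift by n then costs roughly n shift operations instead of one.

```c
#include <stdint.h>

/* Model of a left shift built only from 1-bit shifts.
 * 'steps' (optional) receives the number of 1-bit shifts performed. */
uint32_t shl_by_steps(uint32_t value, unsigned amount, unsigned *steps) {
  unsigned count = 0;
  for (amount &= 31u; amount > 0; amount--) {
    value <<= 1;   /* one primitive 1-bit shift */
    count++;
  }
  if (steps) {
    *steps = count;
  }
  return value;
}
```

Placing a small constant into the upper bits of a word this way takes one step per bit position, which is the kind of overhead meant by the IMM remark.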

So the RISC "overhead" does not seem that dramatic. It's slightly bigger and somewhat slower, but has baseline opcodes (e.g. arithmetic shift and subtract, if I read it correctly) that J1B needs to emulate in SW.

It would be interesting to know where the memory footprint goes when I use (soft) floats. I've done the experiment in the recent past with MicroBlaze MCS, and did not like what I saw. On J1B I need about 320 bytes for (non IEEE 754) float + - * /. It is painfully slow without any hardware support, but it keeps the boat afloat, so to speak.

Using C instead of bare metal assembly would be tempting.... I just wonder how much effort it takes to install the toolchain.

 


The toolchain is pretty simple to build, but it takes a while. For me it was just: clone https://github.com/riscv/riscv-gnu-toolchain, create /opt/riscv (and change its ownership), run './configure' with the correct options, then 'make'. There are a whole lot of different instruction set options and ABIs, so I definitely recommend building from source rather than downloading prebuilt images.

At the moment I haven't included any of the stdlib or soft floating point. I'll add that to the "todo someday" list.


Replying to xc6lx45's post above (3 hours earlier):

I just had a look at the J1b source, and saw something of interest (well, at least to weird old me):

        4'b1001: _st0 = st1 >> st0[3:0];
        ....
        4'b1101: _st0 = st1 << st0[3:0];

A 32-bit shifter takes two and a half levels of 4-input, 2-select MUXes per input bit PER DIRECTION (left or right), and the final selection between the two takes another half a LUT, so about 160 LUTs in total (which agrees with the numbers above).

However, if you optionally reverse the order of bits going in, and then also reverse them going out of the shifter, then the same shifter logic can do both left and right shifts.

This needs only three and a half levels of LUT6s, and no output MUX is needed. That is somewhere between 96 and 128 LUTs, saving maybe up to 64 LUTs.

It's a few more lines of quite ugly code, but might save ~10% of logic and may not hit performance (unless the shifter becomes the critical path...).
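A minimal C model of the idea, just to show that it works (function names are made up, and this is a software sketch rather than the Verilog change itself). One logical right shifter is shared; for a left shift the bits are reversed on the way in and again on the way out. It only covers logical shifts, not arithmetic ones.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word (bit 0 <-> bit 31, and so on). */
static uint32_t bit_reverse32(uint32_t x) {
  uint32_t r = 0;
  for (int i = 0; i < 32; i++) {
    r = (r << 1) | (x & 1u);
    x >>= 1;
  }
  return r;
}

/* One shared right shifter; 'left' selects the reversal before and after. */
uint32_t shift_shared(uint32_t value, unsigned amount, int left) {
  amount &= 31u;
  uint32_t in  = left ? bit_reverse32(value) : value;
  uint32_t out = in >> amount;               /* the only real shifter */
  return left ? bit_reverse32(out) : out;
}
```

In hardware the reversal itself is only wiring. What costs logic is choosing between the reversed and non-reversed word, which is where the extra half levels on the input and output sides come from.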


  • 1 month later...

A follow-up thought on the resource comparison I posted earlier:

The J1 uses combinational (not synchronous) memory read: the instruction word is available for decoding immediately after the PC/instruction ROM address changes.
This makes for a very simple architecture, but might be considered a non-standard hardware feature. For example, it won't work on Lattice ICE40 because the BRAMs there only have synchronous read, so the instruction ROM needs to go into LUTs. In that sense, the comparison isn't fair.


  • 3 weeks later...

... and one random thought, just something to be aware of: I think RISC CPUs work pretty smoothly on Xilinx because the huge (31x32?) register file basically comes for free via distributed LUT RAM that is automatically inferred. I haven't tried it, but I suspect that if I synthesized this on e.g. a Lattice ICE40, I'd get one FF with an associated LUT per register bit (or have to sacrifice two BRAMs for the dual read ports).

Maybe something to consider regarding portability of soft-core CPUs in general. On Xilinx, it makes sense as it uses an available hardware feature efficiently.
