D@n last won the day on January 18

D@n had the most liked content!

About D@n

  • Rank
    Prolific Poster

Profile Information

  • Gender
    Not Telling
  • Interests
    Building a resource efficient CPU, the ZipCPU!
  1. FIFO CDC and Gray codes

    @zygot, You might consider reading the paper I referenced. It's worth the read. Dan
  2. FIFO CDC and Gray codes

    @RedMercury, As I recall, Cliff Cummings deals with these issues in his paper on the topic, Dan
  3. Feedback on a register file design?

    Thank you, @kc5tja! @CurtP, I asked for @kc5tja's perspective because I think it might help you put things in perspective. Building a CPU is fun (@kc5tja describes it as addictive), but it will also be quite a long and frustrating journey. Dan
  4. Feedback on a register file design?

    @kc5tja, Wow! That's a nice status update, and I'm glad to hear you are moving along! Would you offer any words of wisdom to someone just starting out with their own CPU design? Dan
  5. Feedback on a register file design?

    Boy, I'd love to hear @kc5tja's comments on this line. It'd be fun to hear a status on his project too, since he was last set back. Judging from his project log since then, though, it looks like he's managed to recover from his setback. However, as a lesson for new CPU developers, you might wish to look at the date stamps on his log. Things like this take time. They can also be a test of patience. Dan
  6. Feedback on a register file design?

    @CurtP,

    Let's see ... The ZipCPU doesn't officially implement a JSR instruction either, even though the compiler *really* wants one. To deal with this case, I taught the assembler and disassembler that a particular two-instruction combination was the JSR instruction: MOV 2+PC,R0 followed by JMP <address>. Typically, this was implemented as a long jump to the address, since the assembler never knew where the address would be 'til link time, and the linker wanted to place a 32-bit address into the instruction stream somewhere. As I mentioned before, my long jumps were implemented by loading the value following the current instruction word into the PC, and woodenly encoded as LW (PC),PC.

    Actually ... the ZipCPU doesn't even have jump instructions per se, but the assembler hides this lack. The ADD instruction provides the other alternative: ADD.C <offset>,PC adds, if the condition C is true, the given offset to the PC. The assembler will quietly turn BRA, BNZ, BLT, etc. into this instruction if the target fits, and the disassembler replaces these instructions with their Bxx equivalents.

    The C-library will require sub-word addressable memory for its string operations. Plan on needing arbitrary 16-bit and 8-bit load and store capability, or on giving up on the C-library and implementing portable code.

    An unconditional jump does need the capability to load an arbitrary value into the PC, yes. At issue, though, is how you will come back to your machine code and place that address into your instruction stream after compilation and assembly have both finished without knowing what the value should be. GNU's binutils helps, but you'll still need to write the hooks for your own processor.

    So, moving on to push and pop. The most common case for these routines is when you want to add (or remove) an item from the stack. In my case, GCC calculates the stack size ahead of time, and then subtracts the stack size for the whole routine upon startup. Any register saves will be immediately placed into known positions on the stack afterward. Hence, the startup for a subroutine might look like:

        subroutine:
            SUB 24,SP
            STO R0,(SP)
            STO R1,4(SP)
            STO R2,8(SP)
            STO R3,12(SP)
            ... compiler generated user code goes here
            LOD R0,(SP)
            LOD R1,4(SP)
            LOD R2,8(SP)
            LOD R3,12(SP)
            JMP R0 ; This is the ZipCPU's return instruction

    The neat thing about how I've set up the bus is that only the first of these loads or stores will cost any bus delays. The second and subsequent (in any string of them) will cost only one additional clock--depending, of course, on the speed of the memory at the other end.

    For INT/IRET instructions ... the ZipCPU supports two modes: a user mode (where interrupts are enabled) and a supervisor mode (where interrupts are disabled). On an interrupt or an exception, the CPU just switches register sets in order to switch modes. The actual mode is kept in the flags register, so any write that changes this mode will cause the CPU to switch modes and hence register sets. Incidentally, this makes it *really* easy to write interrupt routines: they are just written in "C" as part of the supervisor code. When the supervisor is ready to switch to the user mode, it just issues a zip_rtu() command. This turns into an OR 0x100,CC instruction which turns on the interrupt enabled bit, and the CPU switches modes. Incidentally ... getting the pipeline working for this, including all of the corner cases, was a real pain in the bitstream.

    To implement a system call, I'd just call a function. That function would contain the one assembler instruction, "LDI 0,CC", which would then disable interrupts, switching the CPU to supervisor mode--leaving all the user registers intact as though the function were actually called. From supervisor mode, the software can do what it then likes with those register values.

    There are other possibilities for entering supervisor mode as well: for example, a division by zero error, hitting a debugging break point, the conclusion of a single-stepped instruction, a bus error, hitting an illegal instruction, trying to execute an instruction from non-existent memory, etc. When the supervisor code has dealt with whatever the exception was, it just calls zip_rtu(), which executes a built-in RTU (return-to-userspace) instruction. There are other built-ins to help out as well, such as zip_save_context(contextp), which stores the user registers into the array pointed to by contextp, and zip_restore_context(contextp), which does the reverse, etc. Hence, to swap tasks, you set a timer interrupt. When that interrupt goes off, you save the registers into an array associated with the current task, and then load the registers from the task you want to switch to. Once you then return to userspace, the task swap is complete.

    Still, the "tough" question early on is: how will you simulate your design, how will you visualize your pipeline, and how will you debug your software (and CPU) once you move to the actual hardware? These are the real questions you need to answer up front and immediately. Everything else follows from the answers you give to these questions.

    Dan
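
    The timer-driven task swap described above can be modelled in a few lines. What follows is only a rough sketch in Python, not ZipCPU code: the Cpu and Scheduler classes, the round-robin policy, and the Python stand-ins save_context and restore_context (loosely mirroring zip_save_context and zip_restore_context) are all assumptions made for illustration.

```python
NUM_REGS = 16  # assumed size of the user register set


class Cpu:
    """Toy model holding only the user register set."""

    def __init__(self):
        self.user_regs = [0] * NUM_REGS


def save_context(cpu, context):
    """Hypothetical stand-in: copy the user registers into `context`."""
    context[:] = cpu.user_regs


def restore_context(cpu, context):
    """Hypothetical stand-in: load the user registers from `context`."""
    cpu.user_regs[:] = context


class Scheduler:
    """Round-robin task swapper run from supervisor mode (assumed policy)."""

    def __init__(self, cpu, num_tasks):
        self.cpu = cpu
        self.contexts = [[0] * NUM_REGS for _ in range(num_tasks)]
        self.current = 0

    def on_timer_interrupt(self):
        # Save the registers of the task that was just interrupted ...
        save_context(self.cpu, self.contexts[self.current])
        # ... pick the next task ...
        self.current = (self.current + 1) % len(self.contexts)
        # ... and load its registers before returning to user mode.
        restore_context(self.cpu, self.contexts[self.current])
```

    Because each task owns its own saved array, the supervisor never has to know what a task was doing when the timer fired. It only has to copy sixteen words out and sixteen words in.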
  7. Feedback on a register file design?

    @CurtP,

    Simple pipelines aren't. Indeed, debugging the pipeline with all of its corner cases has been a challenge for me, and I just wanted to build the simplest pipeline I could. You might wish to start planning for this ahead of time, since I was perpetually surprised by little nuances I wasn't expecting. I mean, seriously, who would ever load a register from a value pointed to by the same register? "LOD (R0),R0" ... it doesn't make sense, why would you do that? Well, GCC created code that did that, which my CPU then needed to accommodate.

    If you are interested in register renaming and/or out of order execution and stuff ... think now, before you start, about how you wish to represent the state information from within your CPU as you debug it. This will be important to you. Without a good way to view and inspect the problem, you won't be able to move forward to working code.

    Will you be supporting unaligned instructions? Classical RISC ISAs don't, but it's something to consider.

    When I was designing my own instruction set, the requirement of only writing one register to the register set at a time prevented me from implementing such instructions as push/pop or iret. In hindsight, GCC handled the missing push/pop so well you'd hardly know they are missing. Indeed, the CPU is probably faster as a result.

    Oh, I should mention regarding flags ... GCC (or any C compiler for that matter) will want the ability to compare and branch off of any signed or unsigned comparison. That's =, !=, <, >, <=, and >=. In other words, you will need to support (somehow) 11 conditions. The ZipCPU sort of cheats and supports only 7 of these, but it's something to remember. Also, it can be a hassle to get the sign and overflow flags right. Don't forget to adjust the sign bit to keep it correct in case of overflow, or your extreme comparisons won't work.

    Looking over your ISA, I noticed ...

    You don't seem to have any conditional branch instructions. Are these the j?? instructions?

    Do you have a JSR instruction?

    I don't see any multiply or divide instructions. I didn't have multiply or divide instructions in my first iteration, and needed to come back and add them in. The ones I now have are three 32x32-bit multiplies returning the top 32 bits if signed, the top 32 bits if unsigned, and the bottom 32 bits. I've also got two 32x32-bit divide instructions, one signed and one unsigned. The compiler would love me to have a remainder function, or even a 64x32 divide, but in the ZipCPU architecture those require some software to accomplish.

    I didn't see any NOOP instruction. That was another afterthought instruction of mine. Sure, you could move register A to register A, but such an instruction might stall waiting for A to become available, whereas the NOOP doesn't need to read any registers.

    How about that memory access: will your ISA allow 8-bit byte access to memory? I had to come back and add byte and halfword instructions into my ISA as an afterthought, when I couldn't get the C-library to compile without them.

    While from your description it doesn't sound like you'll struggle with this, I had to wrestle with the realities of linking when I first discovered how a linker worked. There are two basic instructions the linker wants to adjust: load a value into a register, and jump to a location. The first one was fairly easy: I took two instructions and I could load any value into any general purpose register. The second one was harder, but I eventually wrote something similar to what you've described above. I consider this the LOD (PC),PC instruction--or load the value at the next memory address in the instruction stream into the PC. It's the only instruction I have like it, as all my other instructions fit into 32-bit words with no immediates following.

    If you are interested, you can see my own instruction cheat sheet here, or a longer discussion of the ISA here.

    Good luck! Holler if you get stuck, or when you discover you can't get as far in VHDL as I did in Verilog ...

    Dan
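
    The advice about the sign and overflow flags is easier to see with a small model. The following is a minimal sketch in Python, not the ZipCPU's actual logic: the flag names (Z, C, N, V), the function names, and the "N differs from V" rule for signed less-than are a common textbook convention assumed here purely for illustration.

```python
MASK32 = 0xFFFFFFFF


def compare_flags(a, b):
    """Flags from the 32-bit subtraction a - b, returned as (Z, C, N, V)."""
    a &= MASK32
    b &= MASK32
    diff = (a - b) & MASK32
    z = diff == 0                 # zero: the operands are equal
    c = a < b                     # borrow: a is below b when unsigned
    n = bool(diff >> 31)          # raw sign bit of the truncated result
    # Signed overflow: the operand signs differ and the result's sign
    # differs from the sign of a.
    v = bool(((a ^ b) & (a ^ diff)) >> 31)
    return z, c, n, v


def conditions(a, b):
    """Derive signed and unsigned branch conditions from the four flags."""
    z, c, n, v = compare_flags(a, b)
    lt = n != v                   # sign bit corrected for overflow
    return {
        "eq": z, "ne": not z,
        "lt": lt, "ge": not lt,
        "gt": (not lt) and (not z), "le": lt or z,
        "ltu": c, "geu": not c,
        "gtu": (not c) and (not z), "leu": c or z,
    }
```

    For example, comparing 0x80000000 (the most negative value) against 1 leaves the raw sign bit clear even though the signed result should be "less than". Only the overflow correction makes conditions(0x80000000, 1)["lt"] come out true, which is the "extreme comparison" case mentioned above.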
  8. simple wishbone demo to read switches write leds

    @toastedcpu, Here's a link to what I call a "special-purpose I/O" controller. It's wishbone controlled. I've used it in many of my designs, adjusting the number of LEDs, buttons, or switches as necessary. It references a debouncer to handle the buttons, as well as a "Knight Rider" demo to move a lit LED back and forth just like "KITT" from the old series. If you are interested in the Arty, you might find this controller interesting. It's wishbone controlled, and controls the LEDs of the Arty while also allowing you to read switches and buttons. Further, it will control the color LEDs of the Arty as well via a color LED controller also in the distro. If you are just looking for a simple description of a wishbone slave, you might find this blog post valuable. Dan
  9. FFT issue on ARTY Board

    @train04,

    Ahm, yeah, you do have some problems with your setup.

    If your sample rate and your clock rate are the same, then you can hold the data valid line high. This is the easiest to debug. If the sample rate is slower than the clock rate, then you are telling the FFT generator that it can have (clock rate / sample rate) clocks to process each sample. This allows the FFT generator to do things like reusing those expensive multiplies. The problem, though, is if you provide data to the core on every clock after telling it that your sample rate is slower than your clock rate. If you do that, you are likely to corrupt the internals of the FFT. In other words, if you don't know what you are doing, then set the sample rate equal to your clock rate to get things to work. (We can adjust this later--once you have the confidence that the FFT works.)

    The bit-reversed output is something I wouldn't expect most people to use. As a bit of background, most FFTs produce a bit-reversed output naturally. This means that (were this an 8-sample FFT) you'd get samples out in the order: 0 4 2 6 1 5 3 7. If that order looks confusing, then look at it in binary: 3'b000, 3'b100, 3'b010, 3'b110, 3'b001, 3'b101, 3'b011, 3'b111 and read the bits from right to left instead of left to right. That's what bit reversal is about. The problem is that most algorithms people develop want the output in its natural order: 0 1 2 3 4 5 6 7. While swapping the order isn't difficult, it does require a buffer the size of two FFT lengths and a bit of decoding logic. While there are a few use cases for a bit-reversed output, such as when you only wish to examine a known bin or when you wish to modify the coefficients and then apply an IFFT that accepts a bit-reversed input, the difficulty of using the bit-reversed output really keeps it from being useful.

    Finally, if you wish to post a picture of a trace ... please label the trace lines something meaningful, rather than probe_0.

    Thanks,

    Dan
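
    The bit-reversed ordering described above can be reproduced with a few lines of Python. This is just an illustrative sketch of the index arithmetic, not how any particular FFT core does it. The function names are made up for this example, and a software reorder like this says nothing about the hardware buffering a real core would need.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(fft_size):
    """Bin numbers in the order a bit-reversed FFT output presents them."""
    bits = fft_size.bit_length() - 1
    if fft_size < 1 or (1 << bits) != fft_size:
        raise ValueError("fft_size must be a power of two")
    return [bit_reverse(i, bits) for i in range(fft_size)]


def to_natural_order(frame):
    """Reorder one bit-reversed output frame into natural bin order."""
    order = bit_reversed_order(len(frame))
    natural = [None] * len(frame)
    for position, bin_index in enumerate(order):
        natural[bin_index] = frame[position]
    return natural
```

    Calling bit_reversed_order(8) gives the same sequence listed in the post, 0 4 2 6 1 5 3 7, and to_natural_order() undoes it for a single frame.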
  10. Feedback on a register file design?

    @Piasa, Fascinating comment. Care to elaborate? Dan
  11. Feedback on a register file design?

    @xc6lx45, I had never heard of anyone actually using this practice before. Some time ago, I was given this article which appears to describe the practice you are recommending above. The individual who had given it to me suggested that ARM got themselves in a lot of trouble (i.e. stuff not working that should have) by using this practice. While I'm not familiar with all the details, I did find the article fascinating--and now more so in light of your suggestion. Dan
  12. Feedback on a register file design?

    @zygot,

    Oh, I'm around, but you guys have given me a lot of reading material to go through. Well, that and I haven't been waiting on the synthesizer as much, so I haven't been hitting reload on the Digilent forum page as often.

    @CurtP,

    What @zygot is trying to point out is that I've built my own CPU, the ZipCPU, as a similar labor of love. It's not a Forth machine, but a basic sixteen 32-bit register design, with two register sets of that size. It's also small enough to fit on Digilent's CMod S6 while running a small O/S.

    I'd love to offer you an example register file module; however, my own register "file" never managed to get separated out from the rest of the CPU such as you are trying to do. I struggled to find a clean way to do so, and so didn't. If you are curious, you can search the main CPU source and look for "regset" and see how I handled it. In particular, the ZipCPU register file accepts only one write to the register file per clock--not two. I was convinced that two writes per clock would leave the CPU vulnerable to coherency problems--assuming the block RAMs even supported it. This register set supports one write and three reads per clock. Two of those reads are for an instruction; the third is to support the debug port. (You are thinking about how to debug your CPU already, aren't you?)

    I've also written several instructional blog articles on this and similar topics. These cover my view that you should start building a CPU by first building its peripherals, then a debug port to access and test the peripherals, before starting on the CPU itself. Further blog articles discuss how to build a debugging port into the CPU, and how to debug the CPU from that port when using a simulator as well as when online. I've discussed pipelining strategies, and presented how the ZipCPU pipeline strategy works.

    More recently, I've been working with formal methods. I've therefore presented a demonstration of how formal methods can be used to verify that a bus component works as designed, and then offered a simple prefetch as an example. I'm hoping to post again regarding how to build an instruction prefetch and cache, as well as how to formally verify that such a module works, but I haven't managed to clean up my code enough to present it, in spite of presenting why such a proof would be so valuable.

    While I didn't use formal methods to build the CPU initially, I've been finding more bugs using formal methods than I had otherwise, so you might say that I've become a believer. As a result, I'm right now in the process of formally verifying as many of the CPU's modules as I can. I've managed to formally verify three separate prefetch modules (including the one with a cache), the memory access components, and the instruction decoder. I've also managed to formally verify several CPU-related peripheral components, such as the (yet to be integrated) MMU, counters, timers, an interrupt controller, bus arbiters, bus delay components and more. This has been my current focus with the CPU. Once I finish it, I'm hoping to write about how to use the ZipCPU in case others are interested (and I know they are).

    I know @zygot dislikes my blog, but you might find a lot of useful information available there to describe the things you've discussed above.

    Dan
  13. Spartan 3E don't turn on the LCD

    @Juan José, Just to make things clear, you have no problems with the LEDs, but rather with the two-line LCD display, right? Dan
  14. FFT issue on ARTY Board

    @train04,

    I can't seem to find a link for the FFT 6.0 core user's guide. If that's the core you are using, then could you provide a link to the user's guide? I found v7.1 and v9.0 (AXI based). If you are using the Artix-7, I would expect you would be using v9.0, not v6.0.

    Next, if you aren't using any buffering, then you'll want the FFT set up in "pipeline" mode. This is the mode that accepts a sampled stream as an input and produces a sampled FFT stream as an output. My guess is that this is your problem right now, and the reason why the trace looks as it does.

    There is a delay going through the FFT--I think it's about 3 FFT lengths or so, but it's been so long that I'm not sure anymore. Be prepared for that.

    When you work with a DDS input, you can set the valid line to one. Be aware that you'll get one sample out per clock as well. If you are going to be working with an A/D that will be providing samples at less than full rate (~100 MHz or so), then you'll want to do something other than connect the valid line to one. Instead, set it to one for one clock period whenever you expect your A/D will have a valid sample.

    In a similar manner, if the valid signal is perpetually one going into the core, it will also be perpetually one coming out of the core. You'll need to be able to deal with that data rate. You'll probably also want to look into how the core signals the start of a new FFT in its output--that'll be important for you as well.

    Dan
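
    To picture what "set it to one for one clock period" looks like, here is a small Python model of a valid strobe. It is only a sketch under stated assumptions: the function name is invented, the clock rate is assumed to be an exact integer multiple of the sample rate, and a real design would do this with a counter in HDL rather than in software.

```python
def valid_strobe(clocks_per_sample, num_clocks):
    """Return the valid line, one entry per clock, as a list of 0s and 1s.

    The line is high for exactly one clock out of every
    `clocks_per_sample` clocks (assumed integer ratio).
    """
    if clocks_per_sample < 1:
        raise ValueError("clocks_per_sample must be at least 1")
    strobe = []
    counter = 0
    for _ in range(num_clocks):
        # Valid is asserted only on the clock where the counter wraps.
        strobe.append(1 if counter == 0 else 0)
        counter = counter + 1 if counter < clocks_per_sample - 1 else 0
    return strobe
```

    With clocks_per_sample set to 1, the line is held high on every clock, which matches the DDS case above. With a value of 4, the line is high on one clock in four.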
  15. FFT issue on ARTY Board

    @train04, It might help if you told me what I was looking at. None of the plots are marked. However, judging from the output alone, it looks like you are doing something with the valid signal that doesn't make sense. It's hard to be conclusive, though, since I don't know what the traces are. Dan