• 0
inflector

Non-clocked synchronous circuits

Question

I was reading Dan Gisselquist's blog (aka @D@n), in particular the part of one post that goes into some detail about the ALU for his ZipCPU:

https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/cpuops.v#L317-L343

always @(posedge i_clk)
if (i_ce)
begin
	c <= 1'b0;
	casez(i_op)
	4'b0000:{c,o_c } <= {1'b0,i_a}-{1'b0,i_b};// CMP/SUB
	4'b0001:   o_c   <= i_a & i_b;		// BTST/And
	4'b0010:{c,o_c } <= i_a + i_b;		// Add
	4'b0011:   o_c   <= i_a | i_b;		// Or
	4'b0100:   o_c   <= i_a ^ i_b;		// Xor
	4'b0101:{o_c,c } <= w_lsr_result[32:0];	// LSR
	4'b0110:{c,o_c } <= w_lsl_result[32:0]; // LSL
	4'b0111:{o_c,c } <= w_asr_result[32:0];	// ASR
	4'b1000:   o_c   <= w_brev_result;	// BREV
	4'b1001:   o_c   <= { i_a[31:16], i_b[15:0] }; // LODILO
	4'b1010:   o_c   <= mpy_result[63:32];	// MPYHU
	4'b1011:   o_c   <= mpy_result[63:32];	// MPYHS
	4'b1100:   o_c   <= mpy_result[31:0];	// MPY
	default:   o_c   <= i_b;		// MOV, LDI
	endcase
end else // if (mpydone)
	// set the carry based upon a multiply result
	o_c <= (mpyhi)?mpy_result[63:32]:mpy_result[31:0];

Dan makes the comment:  "Each of the blocks in this figure takes up logic when implemented within hardware. As a result, even if i_op requests that the two values be subtracted, all of the other operations (addition, and, or, xor, etc.) will still be calculated. These other results, though, are just ignored. Thus, on the final clock of the ALU, all of the operations have been calculated, but only the result of the selected operation is stored into the output register." (bold emphasis mine)

I found this a very interesting comment.  Dan shows how there is effectively a multiplexer based on the opcode on the output of each of the logic chains. It strikes me that this is quite a waste of power, in general, so I wondered how, or even if it is possible to do things differently.

Using one of the ZipCPU ops as an example, could one reliably implement something like the following?

always @(posedge i_clk)
if (i_ce)
begin
    do_cmp_sub_op <= (i_op == 4'b0000) ? 1'b1 : 1'b0;

... rest of the op codes go here ...

always @(posedge do_cmp_sub_op)
begin
    {c,o_c } <= {1'b0,i_a}-{1'b0,i_b};
    do_cmp_sub_op <= 1'b0;
end

There are many reasons why this is not something you would do in real life in a CPU. Among other downsides, it would add extra logic and at least two clocks of latency: the rising edge of the various do_xxxx registers would be one cycle behind, and you'd need another cycle to bring each register low again so that you could catch the next rising edge. So clearly this isn't something one would do for CPU ops that only take 1 cycle.

1) What are the various ways to only have some of the gates / logic working in a system while most of it is quiescent and only run when needed?

2) Does an if statement behave the same way as the case in this respect? That is, does the logic for both i_ce == 1 and i_ce == 0 also get run, with one of the results discarded on the input side of a multiplexer?

3) What are the options and tradeoffs involved in deciding what to use as the triggers for logic?
 


8 answers to this question


  • 1

@inflector,

I see I've gotten you thinking ... that's quite a compliment, so thank you.

In general, my purpose with the ZipCPU has not been to reduce the power usage.  I'm not sure I could compete with such a giant as TI's MSP430 which, if I understand correctly, was rebuilt using special technology within its gates and logic just to keep it low power.  On an FPGA, you're sort of stuck there.  So ... where else might you look?

As I recall, power is spent every time a logic level changes within your design.  (I'm not the expert, so perhaps someone might correct me here.)  I think, last I looked, it was something like a constant related to the capacitance of the wire, times the number of logic levels transitioning per second, times the core voltage squared, or some such.  This helps to highlight that the faster your clock rate (i.e. the faster things transition) the higher your power will be.  Hence, if your goal is lower power, you will need to 1) fix the inputs so the logic doesn't cause any changes, or 2) drop the clock rate.
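For reference, the relationship recalled above is usually written in the following textbook form. The symbols are the conventional ones and are not taken from this thread: $\alpha$ is the fraction of nodes that switch per clock cycle, $C$ is the total switched capacitance, $V_{dd}$ is the core voltage, and $f$ is the clock frequency.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f
```

Holding inputs fixed lowers $\alpha$, and dropping the clock rate lowers $f$, which are the two options listed above.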

I've thought about fixing the inputs for the wishbone bus.  This would have a lot of consequences downstream, as many of the peripherals (memory included) respond in the fashion you notice above--even when they aren't selected.  Fixing inputs would minimize transitions of any of this logic--especially when the bus isn't being referenced.  You can see how I've drafted this change here.  (Look for the ZERO_ON_IDLE define)
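As a rough illustration of the idea of fixing inputs, here is a minimal sketch that forces bus outputs to zero whenever no request is active. It is not the ZERO_ON_IDLE code referred to above; the module name, port names, and widths are all hypothetical.

```verilog
// Hypothetical sketch: hold bus outputs at a constant while the bus is idle,
// so that downstream peripheral logic sees no transitions between requests.
// None of these names come from the ZipCPU sources.
module zero_on_idle #(
	parameter AW = 30,
	parameter DW = 32
) (
	input  wire          i_clk,
	input  wire          i_stb,	// High while a bus request is active
	input  wire [AW-1:0] i_addr,
	input  wire [DW-1:0] i_data,
	output reg           o_stb,
	output reg  [AW-1:0] o_addr,
	output reg  [DW-1:0] o_data
);
	initial o_stb  = 1'b0;
	initial o_addr = {AW{1'b0}};
	initial o_data = {DW{1'b0}};

	always @(posedge i_clk)
	begin
		o_stb <= i_stb;
		if (i_stb)
		begin
			// Active request: pass address and data through
			o_addr <= i_addr;
			o_data <= i_data;
		end else begin
			// Idle: drive constants instead of whatever the
			// master happens to be presenting
			o_addr <= {AW{1'b0}};
			o_data <= {DW{1'b0}};
		end
	end
endmodule
```

Forcing zero costs one extra transition at the start and end of each idle period. Holding the last value instead would avoid that, at the price of leaving stale data visible on the bus.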

I've also thought about creating a sleep state for the CPU that actually suspends the clock and so also lowers power usage.  The CPU would then enter this sleep state any time a user-space program issued a WAIT (for interrupt) instruction.  However, you'd have to be careful about how you did this, since the peripherals that then generate the interrupt would need to continue to be clocked.  Further, you'd want to guarantee that the clock would always run if there was an interrupt pending, etc.  (I think modern PCs do this with some of their co-processors: SSE, MMX, etc., when these aren't used, in order to keep their overall power down.  They might even actually power down the voltage rails for those portions of the chips as well.)
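A minimal sketch of the bookkeeping such a sleep state would need, under stated assumptions: it produces an enable signal rather than gating the clock itself, and every signal name is hypothetical. On a Xilinx part, an enable like this could in principle drive the CE pin of a clock buffer such as BUFGCE, but that step is not shown here.

```verilog
// Hypothetical sketch: track a sleep state and produce a CPU enable.
// The peripherals keep running on i_clk so they can still raise interrupts.
module sleep_ce (
	input  wire i_clk,
	input  wire i_reset,
	input  wire i_wait_insn,	// CPU has just issued a WAIT instruction
	input  wire i_interrupt,	// At least one interrupt is pending
	output wire o_cpu_ce		// CPU logic may advance when high
);
	reg sleeping;
	initial sleeping = 1'b0;

	always @(posedge i_clk)
	if (i_reset || i_interrupt)
		sleeping <= 1'b0;	// A pending interrupt always wakes the CPU
	else if (i_wait_insn)
		sleeping <= 1'b1;

	// Never hold the CPU while an interrupt is pending,
	// even on the cycle before `sleeping` has cleared
	assign o_cpu_ce = !sleeping || i_interrupt;
endmodule
```

Used purely as a flip-flop enable, this reduces switching in the CPU's logic but leaves the clock tree toggling; the larger saving comes from stopping the clock itself, which is where the care described above is needed.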

I think that answers your first question.

To your second, well .... sort of.  The logic within the ALU is always executed, but the results are only stored if i_ce is high.  You might argue this means that the logic is being calculated for both i_ce and !i_ce, but the !i_ce wires are already present and their values already calculated (with the one exception).

If you are interested in low power, Michael Keating, et al., wrote a textbook on the topic titled the "Low Power Methodology Manual: for System on a Chip Design."  You might wish to look this up.  However, having looked over it, my initial impression was that it applies more to ASIC chip development than to FPGA design--but I think you may still find the principles valuable for understanding what is going on.

Dan

  • 1

Using posedge on fabric logic can have a few issues.

If the fabric-generated clock doesn't come directly from a register -- e.g., if you have c = ( a == b ) -- then you can generate glitches.  It might be that, as new values of a and b are being propagated to the logic, the condition is met one or more times within a normal cycle.  This can generate short pulses which might trigger some registers but not others.  This is also true for async set/reset logic.
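To make the glitch problem concrete, here is a minimal sketch of the usual alternative: keep the comparison as ordinary logic and use it as an enable on the existing clock. The module, port names, and widths are hypothetical.

```verilog
// Hypothetical sketch: use a comparison as an enable, not as a clock.
module match_capture (
	input  wire       i_clk,
	input  wire [7:0] i_a,
	input  wire [7:0] i_b,
	input  wire [7:0] i_data,
	output reg  [7:0] o_data
);
	// Glitch-prone form, shown only for contrast (do not use):
	//   wire c = (i_a == i_b);
	//   always @(posedge c) o_data <= i_data;
	// Any momentary match while i_a and i_b settle can clock o_data.

	initial o_data = 8'h00;

	// Safer form: the comparison is only sampled at the edge of i_clk,
	// after it has had the rest of the clock period to settle
	always @(posedge i_clk)
	if (i_a == i_b)
		o_data <= i_data;
endmodule
```

The comparator still glitches internally in the second form, but the glitches are harmless as long as the design meets timing, because nothing looks at the result between clock edges.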

When a fabric-generated clock comes from a register or doesn't have glitches, the clock might be ok to use.  There are still some issues.  First, this design style is more prone to generating a larger number of clocks, which might exhaust the clock routing for a given clock region.  Second, the clock will have routing delays that change from build to build as well as over temperature.  This means the clock must be treated as asynchronous to other clocks in the design.

These are not insurmountable issues -- you can create directed routing constraints (DiRt) to ensure the same routing is used each build.  You can ensure safe clock-domain-crossing logic.  However, this requires extra effort in design/sim/constraints.  This is another issue -- that the fabric clocks appear easier to use than they really are.  Add to this that they often work fine, and so they teach novices bad practices.

The fabric generated clocks also can have additional jitter, duty-cycle distortion, etc...  This generally isn't an issue as these clocks tend to be run at Fmax/10 or lower.

 

For the original post, the synthesis tool generally is allowed to optimize the circuit.  It is possible the tools will decide to share adder logic or other logic when it can detect mutual exclusion.  The tools might opt to place the majority of the ALU into a DSP48 slice for example.  

  • 1

@inflector,

My apologies, it looks like I missed the thrust of your article initially.

A logic "clock" is defined as the result of some logic calculation being used as the posedge or negedge in an always block declaration. Further, the "logic" generated by clocked logic will be valid some time *after* the clock edge.  This means that the timing of this "logic clock" will be separate and distinct from the clock used to generate it--rendering the always block a part of a new clock domain.

My rule for beginners has always been never to use logic clocks.  Several responses to this post included a hearty discussion of both logic clocks and synchronous vs asynchronous resets.  You can read those comments here.

I've looked long and hard to find Xilinx's advice on this.  I think the last advice I found from their staff was, "We're all adults here.  If you know what you are doing, and you know what that means in hardware, go for it."

@Piasa has actually done a nice job of summarizing some of the problems.  1) While flip-flops might only change on a clock, the logic between flip-flops may take some time to settle.  If you use any of this "in-between" logic as a clock to a flip flop, you may find the clock switches erratically, that some of your following logic may see these extraneous clock flips, some might not.  In the end, you get unpredictable results.  2) Logic clocks are not synchronized with the clock of the logic that creates them.  They will always be delayed, and this delay will change across operating conditions.  As a result, you'll need to cross clock domains when moving from the logic clock back to your main clock.  This usually costs several cycles, so it would slow down the CPU.
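For a single-bit signal, the usual way to cross back into the main clock domain is a two flip-flop synchronizer. The sketch below is a generic version with hypothetical names; it is not taken from any of the posts or papers mentioned in this thread.

```verilog
// Hypothetical sketch: two flip-flop synchronizer for one slow-changing bit.
module sync2ff (
	input  wire i_dst_clk,	// Clock of the domain receiving the signal
	input  wire i_async,	// Signal generated in some other clock domain
	output reg  o_sync	// Version of i_async that is safe to use on i_dst_clk
);
	reg meta;	// First stage: may go metastable, so never use it directly

	initial { o_sync, meta } = 2'b00;

	always @(posedge i_dst_clk)
		{ o_sync, meta } <= { meta, i_async };
endmodule
```

The two register stages are where the cycles of delay come from. This form only suits a single bit that changes slowly relative to i_dst_clk; multi-bit values need a handshake or an asynchronous FIFO instead.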

If you'd like to read some more on the issue, Wikipedia has a nice discussion of metastability, and Clifford Cummings has written a nice paper about what clock domains are, why they are important, and how to cross between them.  While I'm quite proud of my own post on the topic, it's not nearly as extensive a discussion as Cummings provides.

Hope this helps.  Please write back if not.

Dan

  • 0

Thanks for the reply, Dan. I've heard the comments about edges on non-clocks, but I've also seen plenty of example code where this is done on derived clocks. For example, the sample code for the Digilent PmodCLP (which I reference in a separate post) includes a count-based microsecond clock rather than a 1 MHz clock generated from some IP core. The code then goes on to use the posedge of this clock to drive a fair bit of the state machine logic. Lots of example code seems to do this, which makes me wonder whether there are times it is okay, or whether I should just not read too much into example code.

And I've not yet found an article which clearly outlines the reasons why using posedge on something other than a real clock is bad beyond a sentence or two description. Most of it seemed to relate to the propagation of the clock signal and the lack of a dedicated routing for any derived clocks.

I like to understand the reasons behind the common wisdom whenever possible. Any recommended reads on this topic? 

  • 0

@inflector,

You might find these techniques more reliable than a logic-generated clock for handling timing.  Of course, judging by the PmodCLP post ... you probably know most of this already, but I thought I might just point it out anyway.  (There's probably someone reading this that doesn't know a better answer, so ... that's what the post is for--to keep me from repeating myself.)
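One widely used technique of this kind is a one-cycle strobe used as a clock enable. The sketch below is generic and is not copied from the linked post; the module name, the parameter, and the 100 MHz system clock are all assumptions.

```verilog
// Hypothetical sketch: a once-per-microsecond strobe in place of a 1 MHz clock.
module us_strobe #(
	parameter CLOCKS_PER_US = 100	// Assumes a 100 MHz i_clk; must be >= 2
) (
	input  wire i_clk,
	output reg  o_us_stb	// High for exactly one i_clk cycle per microsecond
);
	reg [$clog2(CLOCKS_PER_US)-1:0] counter;

	initial counter  = 0;
	initial o_us_stb = 1'b0;

	always @(posedge i_clk)
	if (counter == CLOCKS_PER_US-1)
	begin
		counter  <= 0;
		o_us_stb <= 1'b1;
	end else begin
		counter  <= counter + 1'b1;
		o_us_stb <= 1'b0;
	end
endmodule
```

A state machine would then use `always @(posedge i_clk) if (o_us_stb) ...` rather than `always @(posedge us_clk)`, so everything stays in one clock domain and no clock routing is consumed.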

Dan

  • 0

Thanks Dan, I read that post sometime last week. It did contain some ideas I hadn't considered before, especially the divide by pi example. And, after reading the various links you pointed me to, I think I'm getting a handle on what can go wrong with fabric-generated logic clocks.
