
Troubleshooting a possible power issue


Question

Hello, I hope you're enjoying your vacation, if you've been given one.

For me this has meant finally being able to work on my spare-time experiment and reach closure on my upgraded design. Let me describe the process.

  1. The Arty Z7-20 is just for initial prototyping. I'm going to move to real production FPGA boards ASAP (probably in August), but for the time being I just want to go ahead with the Arty.
  2. The system is SystemVerilog RTL plus bare-metal C++.
  3. The initial design was a 6-stage pipeline at 100 MHz. Vivado estimated about 2.2 W of power; I suspect the real draw was much lower. I ran it over USB2. To be clear, my mainboard has fairly beefy USB2 ports that go beyond the usual specification.
  4. I later went to a 12-stage pipeline. I was unable to run it on USB2, but it runs rock stable on USB3.
  5. In the last few weeks I've upgraded it to 200 MHz. Vivado now estimates 5.5 W.
    1. I never expected this to run on USB3 power (I haven't tried, but I doubt my USB3 port can deliver 1 A),
    2. so I've hooked the Arty through its power jack to an industrial supply (details, if needed, in a later message).
    3. The board gives up.

If I leave the board free-running, it hangs almost instantly.

Here's how it goes by stepping it in the debugger:

  1. Boot is OK; DHCP fails (expected) and a fixed IP is established.
  2. Server correctly found; input requested and correctly received.
  3. Input fed to the PL.
  4. PL started.

As soon as I step past the breakpoint at {4}, the card hangs. The debugger never hits the next breakpoint.

I can tell this is more than just a software problem because my PL turns on the red LD5 when idle. It should then turn green, eventually blue, and animate LD0-3. This never happens.

I was thinking about hooking a bunch of capacitors across the supply to see if that improves things, but I guess there might be other issues to consider as well.

Do you have any suggestions?

 


It seems to me that you already have a suspect in mind: a power supply issue. So why not try to either confirm or eliminate that possibility?

If you have a decent oscilloscope, you can monitor the core voltages at the point where you think the problem occurs. If you don't have a scope, you can use the FPGA's XADC facility to monitor the internal supply voltages and set alarms on them. This might require a bit of alteration to your PL design. Perhaps you could add control bits that allow some functionality in the PL but not the rest, so that you can selectively enable parts of the design.
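For what it's worth, here's a minimal sketch of that idea: a free-running XADC in continuous-sequence mode with the over-temperature and supply alarms brought out, so they can drive LEDs or gate your enable bits. The INIT_4x register values are assumptions taken from typical templates; check UG480 for the exact encodings your design needs.

```
// Minimal sketch, not board-verified: free-running XADC with alarm outputs.
// The INIT_4x register values are assumptions -- see UG480 for the exact
// bit fields (averaging, sequencer channels, alarm thresholds).
module xadc_alarms (
    input  logic clk,         // e.g. your PL clock; also drives the (unused) DRP
    output logic ot,          // over-temperature alarm
    output logic alm_vccint,  // VCCINT out of range
    output logic alm_vccaux   // VCCAUX out of range
);
    logic [7:0] alm;

    XADC #(
        .INIT_40(16'h9000),  // config reg 0: averaging enabled (assumed)
        .INIT_41(16'h2ef0),  // config reg 1: continuous sequence, OT/supply alarms on (assumed)
        .INIT_42(16'h0400),  // config reg 2: DCLK divider (assumed)
        .INIT_48(16'h4701),  // sequencer: temperature, VCCINT, VCCAUX, VCCBRAM, cal
        .INIT_49(16'h0000)   // no VAUX channels in the sequence
    ) xadc_i (
        .DCLK(clk), .RESET(1'b0),
        // DRP port unused in this sketch
        .DADDR(7'h00), .DEN(1'b0), .DWE(1'b0), .DI(16'h0000), .DO(), .DRDY(),
        .VP(1'b0), .VN(1'b0), .VAUXP(16'h0000), .VAUXN(16'h0000),
        .CONVST(1'b0), .CONVSTCLK(1'b0),
        .ALM(alm), .OT(ot),
        .BUSY(), .CHANNEL(), .EOC(), .EOS(), .MUXADDR(),
        .JTAGBUSY(), .JTAGLOCKED(), .JTAGMODIFIED()
    );

    assign alm_vccint = alm[1];  // ALM[1] = VCCINT alarm (per UG480)
    assign alm_vccaux = alm[2];  // ALM[2] = VCCAUX alarm
endmodule
```

Latching OT or the ALM bits into a register that survives until you read it (or drives an LED) tells you whether a rail sagged at the moment your design went active.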

You don't mention it, but the ARM processors are fast enough to cause bus faults. I've seen this with the standard AXI GPIO IP when trying to flip bits at too high a rate. You can test this by adding delays between successive AXI read/write accesses.
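As a quick test of that theory, something along these lines would pace the accesses. A sketch, not your actual code: GPIO_BASE is a placeholder for whatever base address Vivado assigned to your AXI GPIO, and Xil_Out32/usleep come from the standalone BSP.

```
#include <cstdint>
#include "xil_io.h"  // Xil_Out32 / Xil_In32, standalone BSP
#include "sleep.h"   // usleep

// Placeholder: use the base address from your Vivado address editor.
constexpr std::uintptr_t GPIO_BASE = 0x41200000;

void toggle_paced() {
    for (int i = 0; i < 16; ++i) {
        Xil_Out32(GPIO_BASE, i & 1);  // one AXI write
        usleep(10);                   // back-off between successive accesses
    }
}
```

If the hang disappears with the delays in place, the AXI pacing theory gains weight; if it hangs regardless, look elsewhere.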

One thing that is important when debugging complex (SW + HW) problems is to divide and conquer: pare down the possible causes of the problem. Often it turns out to be more complicated than just one thing, but a binary approach to eliminating suspects is still the best approach most of the time. It never hurts to take a short break, come back for a fresh review of what's going on with the design, and try a few thought experiments. Of course, you've already done your PL design simulation and carefully looked for unexpected changes from previous design iterations... right? Sometimes the HW timing reports and messages hold a clue as well.

On-board FPGA power supplies are designed to provide a limited range of average and instantaneous power. Trying to change the design's behavior by throwing a larger power source or capacitors at it is not a recommended practice. If your prototype platform is limited, then you should adjust your prototype's performance to suit the platform. Of course, that the supply is the problem has yet to be confirmed.

Oh, and you can instrument your PL design to spit out XADC readings through an HDL UART using 2 spare pins and an external USB TTL UART cable. This decouples monitoring of the core temperature, voltages, currents, etc. from ARM issues and SW debugging. In fact, thinking about how to separate HW debugging from SW debugging by instrumenting the PL is never a bad idea...
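The transmit side of such a monitor doesn't need much. A bare-bones 8N1 transmitter along these lines would do — a sketch only: CLK_HZ and BAUD are assumptions to adjust for your board, and it relies on FPGA power-up initial values instead of a reset.

```
// Sketch of a minimal 8N1 UART transmitter for dumping monitor bytes
// out of a spare pin to a USB TTL UART cable.
module uart_tx #(
    parameter int CLK_HZ = 100_000_000,  // assumed PL clock
    parameter int BAUD   = 115_200
) (
    input  logic       clk,
    input  logic       send,   // pulse while idle, with data valid
    input  logic [7:0] data,
    output logic       tx,     // idles high
    output logic       busy
);
    localparam int DIV = CLK_HZ / BAUD;
    logic [9:0]             shift;  // start bit + 8 data bits + stop bit
    logic [$clog2(DIV)-1:0] cnt;
    logic [3:0]             bits;

    initial begin tx = 1'b1; busy = 1'b0; end  // power-up values

    always_ff @(posedge clk) begin
        if (!busy) begin
            tx <= 1'b1;
            if (send) begin
                shift <= {1'b1, data, 1'b0};  // shifted out LSB-first: start, data, stop
                bits  <= 4'd10;
                cnt   <= '0;
                busy  <= 1'b1;
            end
        end else if (cnt == DIV - 1) begin
            cnt   <= '0;
            tx    <= shift[0];
            shift <= shift >> 1;
            bits  <= bits - 1'b1;
            if (bits == 4'd1) busy <= 1'b0;  // stop bit is on its way out
        end else begin
            cnt <= cnt + 1'b1;
        end
    end
endmodule
```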

Edited by zygot

Thank you, zygot. I have difficulty boiling down the suggestions to a concrete course of action.

 

For the time being, I notice this:

3 hours ago, zygot said:

You don't mention it, but the ARM processors are fast enough to cause bus faults. I've seen this with the standard AXI GPIO IP when trying to flip bits at too high a rate. You can test this by adding delays between successive AXI read/write accesses.

That's a scary possibility! My design pours out a few monitor signals which are fed to a not-quite-PWM. If memory serves, it should be pulsing at about 200 kHz. It all goes through the FPGA fabric.

Indeed, the system does poll HW through AXI quite frequently; assuming no bugs, it should happen a few times per millisecond. But I am inclined to look elsewhere: in my debug runs I am stepping through manually and it still hangs at dispatch. It never even gets to execute the following instruction, let alone the polling.

4 hours ago, zygot said:

Of course, you've already done your PL design simulation and carefully looked for unexpected changes from previous design iterations.. right? Sometimes, the HW timing reports and messages have a clue as well. 

What should I be looking for?

 

Last few notes:

  1. I was not going to add caps at random to something featuring power sequencing. I meant to add them straight across the power supply output: I know this supply can't boot 3 Raspberry Pis reliably, and I'm pretty sure the hardware I have synthesized is more power hungry than one rPi. I know from previous experiments that caps can smooth out the dynamic loads enough to make it through, but that's the best I can reason about.
  2. No, I don't have a scope at home. I don't know where to hook the probes; I haven't even looked it up because I'm not skilled enough at soldering to bring out those traces.
  3. From the Arty Z7-20 reference manual, it seems the FPGA's 1.0 V rail is 2.1 A typical, while the 3.3 V rail is about 1.5 A. I am a bit surprised to find those are already in the right range to give me issues.
  4. Do I understand correctly that using the integrated device could help if the issue is related to ARM core communication?
2 hours ago, MaxDZ8 said:

have difficulty boiling down the suggestions to a concrete course of action.

  • Try to eliminate the power supply as the main issue:
    • implement XADC in PL logic using DRU,
    • add a UART to spit out alarm conditions and/or voltages. There are some code examples in the Project Vault area. There are some good 3V-compatible TTL USB UART cables and breakout boards from Adafruit and Sparkfun; I have 4-5 lying around, and in general most are in use on some project or another.
    • I don't know how your ARM cores connect to your HDL design, but try to add some enables to parts of your design to help identify which portion might be causing issues.
  • Try to create a separate debug path in your HDL. Obviously the SW debugger is of no help once the ARM core has faulted.
  • What are you looking for in simulation and HW synthesis or P&R messages? I don't know. Doubling your pipeline latency and clock can introduce unexpected design issues; a good simulation testbench can often help identify areas of interest.

The general idea is to track down the things that you suspect are a problem, and divide and conquer the parts that might not be obvious problem areas.
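On the "XADC in PL logic" item: the handshake on that port is just a one-cycle DEN pulse with the register address, then wait for DRDY and capture DO. A hedged sketch, where 7'h00 is the temperature status register per UG480; wire these signals to the XADC primitive's DRP port, with DCLK on the same clock:

```
// Sketch: one-shot read of the XADC temperature result register.
// Celsius = code / 4096.0 * 503.975 - 273.15 (UG480), where the 12-bit
// code sits in the upper bits of the 16-bit word.
module xadc_drp_temp (
    input  logic        clk,
    input  logic        start,      // pulse to launch one read
    input  logic        drdy,       // from XADC
    input  logic [15:0] do_in,      // XADC DO bus
    output logic        den,        // to XADC DEN
    output logic [6:0]  daddr,      // to XADC DADDR
    output logic [11:0] temp_code,  // latest raw temperature code
    output logic        valid       // pulses when temp_code updates
);
    assign daddr = 7'h00;  // temperature status register

    always_ff @(posedge clk) begin
        den   <= start;  // one-cycle enable, DWE held low = read
        valid <= drdy;
        if (drdy) temp_code <= do_in[15:4];  // result is left-justified
    end
endmodule
```

Feed temp_code (or the supply registers, e.g. 7'h01/7'h02/7'h06 for VCCINT/VCCAUX/VCCBRAM) to a UART byte by byte and you have a monitor that keeps reporting after the ARM side has faulted.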

 

 


Erm, I understand those things only up to a certain point.

I have dumped "fpga DRU" into DuckDuckGo and found those two links amid the garbage.

For "XADC in PL logic using DRU", I have found a little more interesting content here and there but I don't quite connect the dots.

What does that even mean?

 

 

Anyway, I guess it is a good idea to take a step back and start from easier things first. I've left out some important information.

The Arty Z7-20 has a magical green LED, LD12, labeled "DONE": it turns on after the FPGA bitstream has been uploaded. It is supposed to be on basically all the time.

What I observe: LD12 turns off when compute starts.

I'm not deep enough into the tech to understand how this LED is driven, but I'm inclined to believe a bitstream must be truly garbled to interfere with it.

 

Another interesting LED is LD13 (red), described in the reference manual as "on when all the supply rails reach their nominal voltage". This also goes off. The overall process is as follows:

  1. Power on the +12 V rail through the jack. Red LD13 on. Green LD8 on.
  2. Run Tera Term, connected to the COM port.
  3. Run Vitis. I launch the project as Assistant tab > Debug > context menu > Debug > Single Application Debug.
  4. In maybe a second the FPGA is programmed, LD12 on. Debugger on the first line of the program. Red LD5 on.
  5. (Not considered relevant: I must run the orchestrator here; it's the program cooking the data for easy consumption.)
  6. Canonical Ethernet traffic is observed.
  7. Debugger hits the breakpoint at parameter setup. Continue.
  8. Debugger hits the breakpoint at 'start compute'. Continue.
  9. LD12 blinks off.

The behaviour is the same when running from USB3.

 

At this point I'm not sure how adding UARTs or monitors, in either the PL or the hardened SoC features, can help me.

There's still the possibility that the PSU fails to adjust fast enough. I'm considering taking the thing to work with me so I can use the scope there to monitor the 12 V input, but that's pretty much it.

6 hours ago, MaxDZ8 said:

I have dumped "fpga DRU" into DuckDuckGo

Yeah, sorry about that. I meant to refer to the DRP. You might refer to UG780, which is the XADC User Guide. I think there are a few more related references in the Xilinx documentation.

I was just suggesting that you could try to tie a voltage drop on a specific core or IO bank rail to your HDL becoming active.

Trying to address the 12 V supply that powers the FPGA power supply is unlikely to be of much help. The supply that powers the FPGA is limited by its own design specification and can only deliver so much current at its outputs, regardless of the input power that drives it. If you are certain that you are having a supply issue, then you would seem to have two choices:

  • reduce the throughput of your HDL design; you've stated that you know it worked before the current implementation.
  • find a different hardware platform

It's not unusual to scale a prototype design to fit within the capabilities of the prototype hardware you are working with. LD13 is tied to the FPGA supply controller's "Power Good" status pin, so if it is ever de-asserted while your application is running, then you have problems that you won't solve with a debugger. LD12 is tied to the FPGA CONFIG DONE pin, and if you lose your configuration then the ARM AXI bus is sure to fault, as there is nothing left in the PL to handshake with.

I was just trying to throw out some suggestions for you to cogitate on... one never knows what might provide a useful path past a roadblock. Personally, I have no problem starting a side project as an effort toward solving a seemingly intractable problem. Sometimes these lead to new and fruitful lines of inquiry that I'd not have considered pursuing otherwise. Worst case, I end up with a new tool to refer to when I encounter a similar issue.

 

 

Edited by zygot

Thank you for the pointers. I have been interested in using the XADC for a long time, so I will take this chance to read UG780.

It is my understanding that bare-metal apps run on core 0. I will try rebuilding the Vitis project to run on core 1.

I hoped the issue would be the main PSU's dynamic response, but if both the Z7-10 and the Z7-20 use the TPS65400, then either the -10 variant is hugely overspec'd or the -20 is underpowered. The -20 is three times bigger!

Nonetheless, I have been trying to run some... something very similar to the old device from weeks ago (it has a few thousand extra flip-flops, but that's it). Well, it seems I can't get anything at all to run on it anymore!

My plan was to move this onto an A100, but with the current ridiculous silicon situation, odds are I'll need to wait at least another 2 months! 😥


Most of my bigger Series 7 projects use the XADC to monitor and alarm on the substrate temperature. It's just not reasonable to expect cheap general-purpose FPGA boards to provide the kind of thermal management these devices might need for demanding applications. It's also not reasonable to expect the power supply to be over-designed enough to meet the requirements of any arbitrary application. I suspect that even boards with a heat sink and fan, like the Genesys2, can be configured with a design that will overtax the core and IO bank supply design. DDR is typically located close to the FPGA and warms up the board and PCB planes quite a bit. While it appears that you've managed to do something that I haven't, which is make the power supply cry uncle, I've certainly seen substrate temperatures venture into the danger zone on both general-purpose platforms and custom-designed ones as well.

You can use the XADC fairly easily on ZYNQ devices from the cores, but accessing the XADC through the DRP in PL logic, with a UART, provides a means of monitoring temperature and voltages when the cores and the debugger are no longer talking to each other. Of course, if you lose your PL configuration, that design isn't much use either.
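For reference, the core-side route is only a few lines with the standalone XSysMon driver. A sketch: the device-ID macro name is an assumption and depends on your BSP, though on Zynq the PS-attached XADC usually appears as XPAR_SYSMON_0_DEVICE_ID.

```
#include "xsysmon.h"      // standalone XSysMon driver
#include "xparameters.h"  // device IDs generated by the BSP
#include "xstatus.h"      // XST_SUCCESS

static XSysMon SysMon;

// Returns the die temperature in Celsius, or a negative sentinel on failure.
float read_die_temp() {
    XSysMon_Config *cfg = XSysMon_LookupConfig(XPAR_SYSMON_0_DEVICE_ID);
    if (!cfg || XSysMon_CfgInitialize(&SysMon, cfg, cfg->BaseAddress) != XST_SUCCESS)
        return -1000.0f;
    u16 raw = XSysMon_GetAdcData(&SysMon, XSM_CH_TEMP);
    return XSysMon_RawToTemperature(raw);  // raw/65536 * 503.975 - 273.15
}
```

The catch, as noted above, is that this path dies with the cores, which is why the PL-side DRP + UART route is more useful here.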

Today's failure isn't necessarily a dead end; it can be an opportunity to learn something new or at least hone a skill. It's really all about your personal curiosity and attitude as to whether failure is a bad thing or a great thing....

 

Edited by zygot
19 hours ago, MaxDZ8 said:

My plan was to move this onto an A100, but with the current ridiculous silicon situation, odds are I'll need to wait at least another 2 months!

Xilinx-branded boards typically have more robust and heftier power supply designs... at a cost premium. The ZC702 is a Z7020 board with 2 FMC connectors and might work. Unfortunately, these are no longer in production, and the ones in distributor stockrooms have undergone a dramatic price increase over what they originally sold for. I don't know if Xilinx sells them directly anymore. The older TI power module designs, though, can be a pain when things go wrong, and are a bit clunky; I've had experience with this. Always refer to the schematic as part of your pre-purchase analysis.

I do like the idea of going with a non-ZYNQ platform if you don't need the ARM cores. It is indeed frustrating to find out that the platform you purchased can't keep up with your project requirements. Doing your due diligence before purchasing hardware is good practice and, as you've likely found out, skipping it is an expensive lesson. Fortunately, the Vivado tools can help with power estimation, though when a design has lots of output pins being driven, accurate estimation requires some detailed analysis.

Edited by zygot
Link to post
Share on other sites
  • 0

???

I'm not sure I understand those things. Probably because I have a different mindset.

First, a minor thing: the XADC user guide is UG480 (four hundred eighty).

For the purpose of future readers, I think it is a good idea to document the progress. What changed since last time?

On 4/9/2021 at 7:48 PM, MaxDZ8 said:

Nonetheless, I have been trying to run some... something very similar to the old device from weeks ago (it has a few thousand extra flip flops but that's it). Well, it seems I can't get anything really to run on it anymore

It turns out that in the refactor to meet 200 MHz timing I flipped a bit in the 'start work' functionality, which would cause the device to almost never transition back to the ready state. By itself, this caused the CPU code to stall. I took the occasion to rework the thing a bit to be more robust, so I could at least run something.

 

The data I have acquired today

I can't be bothered to pull the old, known-to-work design for testing. It was a 6-stage pipeline clocked at 100 MHz with a Vivado estimate of 2118 mW (that figure is accurate because I have it in my notes). I ran it four hours passively, without a heatsink, and it ended up clearly above ambient temperature, yet I wouldn't call it even lukewarm.

By now I have performed a couple of extra runs (I have added a small heatsink to the SoC).

  1. 6 stages, 100 MHz. I forgot to record the estimate, but if memory serves it was about 2.3 W. I left it running the whole night and it was definitely warm. I scanned it a bit with an IR thermometer and measured 36 °C at the SoC and 33 °C at the TPS65400.
  2. 12 stages, 100 MHz. Vivado estimates 3.313 W and a 63.2 °C junction temperature. After about an hour of running I measured 45 °C at the SoC and 39 °C at the TPS. I would believe +20 °C between heatsink and core to be enough of a difference; at that point the core would be at 65 °C. That starts being uncomfortable, but I think there's still more thermal headroom... except... I measured 58 °C on the caps between the SoC and LD4. I suspect the thermometer might have been fooled by the shiny surfaces, yet the whole board is definitely lukewarm. The situation seems to be worse on the back side.

I would classify both cases as "rock stable".

Additional considerations

On Digikey, the TPS65400 is €4.20668 in quantities of 250. The Arty Z7-20 is less than 180 EUR. Considering the design costs, all the other components, and assembly, I think it is reasonable to assume Digilent dimensioned the power supply for the -20 variant and shared the design with the Z7-10.

The reference manual seems a bit conservative with the currents. The TPS65400 datasheet notes on its first page that buck3 and buck4 can output 2 A. The junction limit is a hefty 150 °C.

The Arty Z7-20 schematic notes VCC1V5 and VCC1V8 both at 1.8 A, which sounds good to me. VCC1V0 seems a bit underpowered at 2.6 A, while VCC3V3 seems a lot lower at 1.6 A, but all things considered I doubt it really even needs that much.

Some simple and most likely wrong numbers: adding up the wattage of those rails (1.0 V × 2.6 A + 1.5 V × 1.8 A + 1.8 V × 1.8 A + 3.3 V × 1.6 A) gives me about 13.8 W, which would be 2.76 A at 5 V. Not impossible, but definitely more comfortable on 12 V. Notably, the power adapter for the PYNQ Z1, which is almost the same board as the Z7-20, pours out 3 A! I have no clue how a chip such as the 7Z020 can possibly dissipate even just the 13 W from the TPS, but I guess there should be some room.

Random thinking

Which clock should I be feeding to my MMCM/PLL? Is there any chance that feeding it the AXI clock can give issues? I honestly don't like how it routes; perhaps there is a better candidate?

 

I need to do more tests.

Edited by MaxDZ8
That was supposed to be a question.
4 hours ago, MaxDZ8 said:

For the purpose of future readers, I think it is a good idea to document the progress

I second that motion.

If you open the Xilinx Documentation Navigator and do a search for XADC, you will see quite a few references, some with code examples.

On the subject of your temperature measurements: I also use an IR thermometer as a quick 'safety' check when I connect new external hardware, mostly in case there is bus contention or outputs driving into ground**. As a more general measure of how your components are faring when there aren't any defects in a design, I would caution against putting too much faith in such readings. Fortunately, the Series 7 devices all have the capability of measuring substrate temperature using the XADC. This, in my view, is the proper way to assess thermal conditions. All of the components on your board have maximum operating temperature limits that need to be adhered to.

Let's recap the post so far. As I understand it, your initial observation was that once your PL design started running, the two LEDs indicating power supply health and FPGA configuration status both immediately signaled failure conditions. After this, communication between the SDK debugger and the ARM cores stopped. It's certainly reasonable, barring other considerations, to suspect that a drop-out on some of the power supply rails precipitated these events. The next step might be to try to prove this hypothesis, and perhaps find a way around the root cause.

The ZYNQ has a number of PLLs on the PS side that generate derived clocks for all of the internal peripherals: the DDR controller, the Ethernet PHY, UART baud rates and, of course, the AXI bus clocks. You can export these clocks to your PL. You can also use the PL's MMCM and PLL clock generators to generate PL logic clocks. There's no wrong choice as long as you adhere to basic clock-domain principles for passing signals between clock domains. There are a lot of ways to do this correctly, but of course there are a lot more ways to do it incorrectly. Most high-speed data transfer involves elastic storage, like a circular buffer or a FIFO; if two clock domains are involved, then the FIFO or buffer must have two clock ports.
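For a single control bit, the basic rule boils down to a two-flop synchronizer like this sketch (the ASYNC_REG attribute asks Vivado to keep the flops adjacent); anything wider than a bit should cross through a dual-clock FIFO instead:

```
// Sketch: classic two-flop synchronizer for one bit entering dest_clk's domain.
module sync2 (
    input  logic dest_clk,
    input  logic din,   // asynchronous to dest_clk
    output logic dout   // safe to sample in the dest_clk domain
);
    (* ASYNC_REG = "TRUE" *) logic [1:0] ff;

    always_ff @(posedge dest_clk)
        ff <= {ff[0], din};  // first flop may go metastable; second filters it

    assign dout = ff[1];
endmodule
```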

The full AXI bus is not trivial to work with, and there are a lot of ways to cause bus faults. Bus faults will certainly terminate your debugger session, but I have no idea how they could cause the power supply controller to fail to regulate an output rail, or cause the logic to lose its configuration. (I also don't do a lot of ZYNQ design work, so I haven't had the need to seriously investigate all of the details of those devices... and the interesting details are usually not easy to find in the literature.)

I'm not at all surprised by the push-back against my suggestion that it's unreasonable to expect a cheap general-purpose FPGA board aimed at the educational sector to handle the full capabilities of the FPGA device. The Series 7 devices are really quite capable. What separates expensive commercial or military grade products from something like the Arty Z7-20 is testing and guaranteed specifications; that's what you're paying for when you buy expensive products. It's easy to underestimate the cost of over-designing a power supply for a low-profit-margin product. I don't know of any vendor of general-purpose FPGA boards in the educational market that provides a demo project that even attempts to explore the maximum operating conditions of their boards. In fact, it seems to me that for most of these boards the vendors are banking on users creating projects that use only a small subset of the available external interfaces, IO, and FPGA resources, and that few will ever need to do timing closure with high clock rates on most of the logic. Beyond simply using a beefier power supply, other ancillary costs go along with enhanced performance: more PCB layers, heavier copper planes, etc. Estimating production costs versus profit is a complicated business... and many companies don't do it well.

** You might think that this is the result of bad design processes, and in the end I suppose you'd be correct. But it's a lot easier to get into these conditions than you might realize. A location constraint might be wrong or ignored by the tools (always, always check the post-route pin assignments), or perhaps you didn't get the timing constraints correct. Sometimes the tools automatically resolve a misunderstanding between your source and what they infer as your intent, and the only indication is a warning, among hundreds of messages, that something is terribly amiss. There are a lot of critters in the swamp of FPGA development waiting to take a bite out of you if you fail to notice their presence...

Edited by zygot
6 hours ago, MaxDZ8 said:

I flipped a bit in the 'start work' functionality, which would cause the device to almost never transition back to the ready state

This is where simulation can help, especially when making major modifications to a 'rock solid' design.

6 hours ago, MaxDZ8 said:

I can't be bothered to pull the old, known-to-work design for testing.

I've found that, when not confined to a code versioning system, archiving a project at a 'good stopping point' before making changes is good practice. This makes it easy to refer back to a previous known state of the project. Sometimes it just makes sense to keep archived snapshots of only the HDL source. Sometimes it makes sense to save the project as a new one, so that I can open either the old 'working' version or the new 'in progress' version. This is really no different from standard software development, except that the FPGA tools create way more intermediate files on the HD. Not everyone works the same way, but I thought this was worth mentioning... you know, for future readers who might find it interesting.
