
FPGA with a lot of RAM


Matthias92

Question

Hi everyone,

this is my very first question on your forum :)

I'm new to FPGAs, and this week I struggled to evaluate how difficult the following project will be:
We have a motion capture system tracking a hand and driving a very complicated levitation device. The project should work with as little delay as possible. At the moment the latency is about 17 ms and the target is to reduce it to around 7 ms. Most of the latency comes from the GigE-connected cameras, sampling at 200 Hz, but also from the operating system. Because the computation is complex, the difficult part is precomputed as a very big lookup table (8 GB) held in memory. To reduce latency, we want to work bare metal and, later on, eliminate the lookup table and use highly parallelized code to drive 128 devices at a 50 kHz update rate.
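To put the numbers in perspective, here is a back-of-envelope sketch I did using only the figures above: at 200 Hz a single frame interval is already 5 ms of the 7 ms budget, and each actuator update slot is only 20 µs.

```c
#include <stdio.h>

/* Back-of-envelope latency budget using the numbers from the post.
 * All values are assumptions taken from the text, not measurements. */
int main(void)
{
    const double cam_rate_hz = 200.0;     /* camera sample rate             */
    const double out_rate_hz = 50000.0;   /* actuator update rate           */
    const double target_ms   = 7.0;       /* desired end-to-end latency     */

    double frame_ms  = 1000.0 / cam_rate_hz;   /* 5 ms per camera frame     */
    double update_us = 1.0e6  / out_rate_hz;   /* 20 us per actuator update */

    printf("camera frame interval : %.1f ms\n", frame_ms);
    printf("actuator update slot  : %.1f us\n", update_us);
    printf("budget left after one frame interval: %.1f ms\n",
           target_ms - frame_ms);
    return 0;
}
```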

What I planned so far: Using the existing cameras would require a low-latency system handling the image processing (stereo camera registration and key point tracking). I know that these tasks can be implemented efficiently on an FPGA. To address 4 OptiTrack Prime 13 cameras, the NetFPGA-1G-CML Kintex-7 FPGA Development Board looks very promising. Can somebody estimate how difficult it will be to extract images from a GigE camera with Vivado?

The second part is frustrating: I do not know how to add DDR3 RAM from a laptop to this setup. Is it possible to add an adapter to the FMC connector and use the MIG to configure the interface? I searched for this but only found boards with SO-DIMM sockets or with RAM chips pre-soldered. The former are far too expensive and the latter do not have the required capacity. I have only used SPI and I2C on a microcontroller so far, so interfacing an Ethernet PHY or RAM, programmatically and especially physically, is still a mystery to me.

The third problem is optional: the target device is interfaced via USB 2 and drivers only exist for Windows. It is not easily possible to communicate with it directly from an FPGA in this scenario, is it?

In the end, I want to use high-level programming, like Vivado or Simulink. The project budget is limited, but my professor told me around 2000 Euro would be adequate.

I am thankful for all constructive advice, comments, literature and questions. Please tell me your opinion on whether an FPGA is the right choice, whether the project is manageable, and/or whether there is a better solution.

Best regards from Germany,

Matthias Popp


3 answers to this question


Hello Matthias,

When you move from a software prototype to embedded hardware, you first need to think about how the implementation changes. Embedded systems have different performance/latency profiles, and data access patterns are very important. See if you really need 8 GB of memory for your algorithm. If yes, an FPGA board with a SO-DIMM slot is your only viable choice; High-Bandwidth Memory is still expensive.

I would start by researching the feasibility of your algorithm in hardware. Get to know FPGA architecture and write prototypes either in VHDL or HLS.

The interfaces are secondary to the algorithm. For Ethernet you will need a MAC and a microprocessor running an IP stack inside the FPGA. See if the latency of a MicroBlaze + lwIP combo is acceptable. Otherwise, you are left with implementing an IP stack in hardware that extracts the image data and streams it into memory. Similarly, for USB you will need a host controller. You either license an IP core for the FPGA or go with one of the ARM+FPGA hybrids (Zynq).
The Zynq has both an Ethernet MAC and a USB controller included as hard cores, so you would only need to implement your processing algorithm in the FPGA fabric.
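To make the software-stack option concrete, here is a minimal sketch of what a UDP receive path on MicroBlaze + lwIP (raw API) could look like. The function names I define and the port number are illustrative, MAC/lwIP initialization is omitted, and a real GigE Vision camera would additionally need GVCP control and GVSP packet reassembly on top of this.

```c
#include "lwip/udp.h"
#include "lwip/pbuf.h"
#include "lwip/ip_addr.h"

/* Callback invoked by lwIP for every received UDP datagram. In a real
 * design this is where stream packet payloads would be copied into a
 * frame buffer in DDR. (Older lwIP versions use a non-const addr pointer.) */
static void stream_recv(void *arg, struct udp_pcb *pcb, struct pbuf *p,
                        const ip_addr_t *addr, u16_t port)
{
    /* p->payload / p->len hold this datagram; follow p->next if chained. */
    /* ... copy into a frame buffer here ... */
    pbuf_free(p);
}

/* Bind a UDP listener; the port number is only an example, as the camera's
 * stream channel port is negotiated over the control protocol in practice. */
void start_stream_listener(void)
{
    struct udp_pcb *pcb = udp_new();
    udp_bind(pcb, IP_ADDR_ANY, 50010);
    udp_recv(pcb, stream_recv, NULL);
}
```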



@Matthias92,

If your goal is realtime operation, an FPGA is usually a much better choice than many microprocessors.

As @elodg recommended, consider adjusting your algorithm.  A levitation tracking algorithm doesn't usually need the whole video image.  Consider cropping and downsampling that image to what you need.  16x16 might well be overkill for the problem, although I don't know your entire setup to be certain.  If you do that, you should then be able to drop your latency tremendously while still working at the full precision of the camera.
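For illustration, here is a naive software sketch of that crop + box-average step (the function name and the 8-bit grayscale format are just assumptions; on an FPGA this would become a small streaming pipeline rather than a loop over a stored frame):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: crop a region of interest from a grayscale frame
 * and box-average it down by 'factor' in each dimension.
 * 'out' must hold (roi_w/factor) * (roi_h/factor) bytes. */
void crop_and_downsample(const uint8_t *frame, size_t frame_w,
                         size_t roi_x, size_t roi_y,
                         size_t roi_w, size_t roi_h,
                         size_t factor, uint8_t *out)
{
    for (size_t oy = 0; oy < roi_h / factor; ++oy) {
        for (size_t ox = 0; ox < roi_w / factor; ++ox) {
            uint32_t sum = 0;
            for (size_t dy = 0; dy < factor; ++dy)
                for (size_t dx = 0; dx < factor; ++dx)
                    sum += frame[(roi_y + oy * factor + dy) * frame_w
                                 + roi_x + ox * factor + dx];
            out[oy * (roi_w / factor) + ox] =
                (uint8_t)(sum / (factor * factor));
        }
    }
}
```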

Dan


Thank you for your responses!

@elodg I talked with the researchers currently working on the algorithm about whether it is possible to compute it on the FPGA or to reduce the footprint of the lookup table. Unfortunately, both are impossible. It is an approximation to an NP-hard problem, computed with a heavily sequential numeric solver. The only possibility would be to reduce accuracy and use symmetries of the table to compress it. This would only achieve a ratio of about 2:1 to 3:1, so a lot of space is still needed.
But I like the idea of trying different approaches instead of throwing more RAM at the problem: I noticed that in about 60% of cases, successive queries are also very close together in the table (at most 1 MB offset). My next research will be whether I could use SD cards instead of RAM and cache the relevant entries in RAM. But I fear that the communication between FPGA and OS will be a nightmare. The Parallella Board with the Xilinx Zynq 7020 looks promising to me.
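To convince myself the caching idea can work, I sketched in plain C how a simple direct-mapped block cache in front of the SD card could look. The sd_read() driver is only a stub here, and the block and cache sizes are illustrative:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define BLOCK_SIZE (1u << 20)   /* 1 MiB per cached block (illustrative) */
#define NUM_SLOTS  16           /* 16 MiB of cache kept in DDR            */

/* Placeholder for the real SD/block-device driver on the target;
 * here it just zero-fills so the sketch compiles standalone. */
static int sd_read(uint64_t offset, void *dst, uint32_t len)
{
    (void)offset;
    memset(dst, 0, len);
    return 0;
}

static uint8_t  cache[NUM_SLOTS][BLOCK_SIZE];
static uint64_t cached_block[NUM_SLOTS];
static bool     valid[NUM_SLOTS];

/* Return a pointer to the table byte at 'addr', loading its block on a miss
 * (each miss would cost roughly one SD read, ~0.5-1 ms). */
const uint8_t *table_lookup(uint64_t addr)
{
    uint64_t block = addr / BLOCK_SIZE;
    unsigned slot  = (unsigned)(block % NUM_SLOTS);

    if (!valid[slot] || cached_block[slot] != block) {
        sd_read(block * BLOCK_SIZE, cache[slot], BLOCK_SIZE);
        cached_block[slot] = block;
        valid[slot] = true;
    }
    return &cache[slot][addr % BLOCK_SIZE];
}
```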

I will change my plan of using GigE cameras and will instead design my own stereo camera board based on this one and connect it to the Parallella Board. If I disable the HDMI, which is implemented in the FPGA part of the Z7020, I am left with 16 LVDS pairs, enough for 2 of the Python 2000 CMOS sensors at full speed.
Demo projects exist for this configuration with a similar SoC, even with face tracking!

Suitable SD cards seem to have a read latency of around half to a full millisecond, which should be fine as long as the rest of the program does not add too much delay.
This is where the recommendation from @D@n comes in. I will investigate whether, and how easily, it is possible to work directly on the incoming pixel stream. Kernel functions like the Sobel filter (edge detection) or other object detection algorithms like Viola–Jones should get along without a full buffered frame. Hopefully the small FPGA has enough space to implement these algorithms.
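As a first sketch of that streaming idea, here is a plain-C Sobel that only keeps two line buffers instead of a full frame. The fixed width, 8-bit pixels and ignored border handling are simplifying assumptions; an HLS version would be structured the same way:

```c
#include <stdint.h>
#include <stdlib.h>

#define WIDTH 1920                            /* assumed line width */

static uint8_t line0[WIDTH], line1[WIDTH];    /* two previous lines */

/* Feed pixels in raster order (x = column of the current pixel); returns the
 * gradient magnitude, clamped to 255, for the pixel one line and one column
 * behind the input position. Image borders are not handled in this sketch. */
uint8_t sobel_push(uint16_t x, uint8_t pix)
{
    static uint8_t win[3][3];                 /* 3x3 sliding window */

    /* shift the window left and load the new column from the line buffers */
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 2; ++c)
            win[r][c] = win[r][c + 1];
    win[0][2] = line1[x];                     /* row r-2 */
    win[1][2] = line0[x];                     /* row r-1 */
    win[2][2] = pix;                          /* current row */

    /* update the line buffers for the next row */
    line1[x] = line0[x];
    line0[x] = pix;

    int gx = (win[0][2] + 2 * win[1][2] + win[2][2])
           - (win[0][0] + 2 * win[1][0] + win[2][0]);
    int gy = (win[2][0] + 2 * win[2][1] + win[2][2])
           - (win[0][0] + 2 * win[0][1] + win[0][2]);

    int mag = abs(gx) + abs(gy);
    return (uint8_t)(mag > 255 ? 255 : mag);
}
```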

You can see that the setup is not certain at all.
Thank you for your expertise. If you would like to comment or criticize, feel free to do so! I hope you will answer further questions as well.

Best regards,

Matthias Popp

