Learn how peer-to-peer (P2P) streaming can significantly improve a digitizer system's performance. P2P streaming bypasses the host PC by enabling direct data transfers between the digitizer and graphics processing units (GPUs) or storage. This is a major advantage over conventional solutions, which require data to be copied via the host PC's RAM. With peer-to-peer, the CPU and RAM are instead free for other tasks. Watch the 3-minute overview video, and read more below.
High-performance digitizers that combine high resolution with a high sampling rate produce massive amounts of data. For example, the ADQ7 combines 14-bit resolution with a 10 GSPS sampling rate, resulting in 20 Gbyte of data per second! This exceeds the capacity of the data link (interface) to the host PC, so data reduction is crucial (fig. 1). Onboard field-programmable gate arrays (FPGAs) help address this problem. These powerful computational resources enable real-time signal processing that reduces the data rate to match the link capacity.
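The raw data rate follows from a quick back-of-the-envelope calculation; a minimal sketch in Python, assuming (as is common in digitizer data formats) that each 14-bit sample is transferred as a 16-bit word:

```python
# ADQ7: 14-bit samples are assumed to be padded to 16-bit (2-byte) words
sample_rate_sps = 10e9   # 10 GSPS
bytes_per_sample = 2     # 14 bits stored in 16 bits

raw_rate_gbyte_per_s = sample_rate_sps * bytes_per_sample / 1e9
print(raw_rate_gbyte_per_s)  # -> 20.0 Gbyte/s
```

This 20 Gbyte/s figure is what the onboard FPGA must reduce before the data reaches the host link.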
Figure 1. The onboard FPGA helps reduce the data rate so that it matches the link capacity without loss of signal information.
Data reduction can be achieved in many ways, for example:
Massive data reduction can be achieved via FPGA pre-processing. One example is real-time waveform averaging on the ADQ7 using FWATD. This combination has been used by, for example, mass spectrometry customers to reduce the output rate from 20 Gbyte/s to 40 Mbyte/s – a 500-times reduction without loss of signal characteristics!
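To illustrate how waveform averaging achieves this reduction, here is a simplified sketch in plain Python. The record length and averaging factor below are toy values chosen for illustration; in the actual product, FWATD performs this in real time inside the FPGA:

```python
def average_records(records):
    """Average N equally long records sample by sample.
    N input records produce one output record, so the
    output data rate is reduced by a factor of N."""
    n = len(records)
    length = len(records[0])
    return [sum(r[i] for r in records) / n for i in range(length)]

# Averaging 500 records per output record corresponds to
# e.g. 20 Gbyte/s in -> 40 Mbyte/s out.
records = [[i + j for i in range(4)] for j in range(500)]  # toy data
avg = average_records(records)
# Uncorrelated noise averages out while repetitive signal
# content is preserved.
```

For repetitive signals, averaging also improves the signal-to-noise ratio, which is why the reduction comes without loss of signal characteristics.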
FPGA pre-processing therefore allows for maximum flexibility in the mechanical design. Form factors such as USB 3.0, with seemingly limiting data transfer rates of a few hundred Mbyte/s, can still be fully utilized thanks to the data reduction. This in turn offers additional benefits, such as locating the digitizer close to the detector in order to minimize reflections.
Peer-to-peer streaming means that the data is sent to a computational node (for example a graphics processing unit (GPU) or disk storage) with little or no involvement of the host PC's central processing unit (CPU) or dynamic random-access memory (DRAM). There are three levels of data transfer:
Figure 2. Conventional streaming involves writing digitizer data to memory segment S1 in the PC's DRAM (arrows marked "a"), reading from segment S1 to the CPU (arrow "b"), writing to segment S2 (arrow "c"), and reading from S2 to GPU DRAM (arrows "d"). In total, four read/write operations are required, demanding 4 x 7 = 28 Gbyte/s of DRAM bandwidth.
Figure 3. With a pinned buffer there is only a single memory segment in the PC's DRAM (denoted "S" above). This solution requires only one write and one read operation, and hence 2 x 7 = 14 Gbyte/s of DRAM bandwidth.
Figure 4. Best performance is achieved using peer-to-peer streaming.
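The DRAM-bandwidth figures quoted in the captions above follow directly from counting how many times the stream traverses host memory; a small sketch, assuming the 7 Gbyte/s stream rate used in the figures:

```python
stream_rate = 7  # Gbyte/s sustained from the digitizer (as in the figures)

# Each traversal of host DRAM costs one write plus one read of the stream.
dram_load = {
    "conventional (two segments S1, S2)": 4 * stream_rate,  # 28 Gbyte/s
    "pinned buffer (single segment S)":   2 * stream_rate,  # 14 Gbyte/s
    "peer-to-peer (direct to GPU)":       0 * stream_rate,  #  0 Gbyte/s
}
for mode, gbyte_per_s in dram_load.items():
    print(f"{mode}: {gbyte_per_s} Gbyte/s host DRAM bandwidth")
```

This is why peer-to-peer gives the best performance: the host DRAM is removed from the data path entirely.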
One advantage of PXIe is that it supports a large number of hardware units in a single chassis. Multiple high-speed data streams can be set up via the backplane, for example between several ADQ7 units and ADQDSU disk storage units. This ensures optimal performance without loading the PXIe controller (host PC).
FPGAs and GPUs are powerful computational platforms, but they differ in architecture and in the way that they are programmed.
The onboard FPGA is crucial since it operates on the raw data stream and performs vital pre-processing and data reduction. FPGAs offer a high degree of parallelism, but computational resources such as multiply-accumulate (MAC) units are finite, and this can sometimes be limiting. One such example is the fast Fourier transform (FFT), where it can be challenging to implement long FFTs (with many frequency bins) inside the onboard FPGA.
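To see why long FFTs strain FPGA resources, note that frequency resolution equals the sampling rate divided by the FFT length; a small sketch, where the 1 kHz target resolution is a hypothetical example rather than a figure from this article:

```python
sample_rate_hz = 10e9        # ADQ7 at 10 GSPS
target_resolution_hz = 1e3   # hypothetical requirement

# Required FFT length (number of frequency bins)
fft_length = int(sample_rate_hz / target_resolution_hz)
print(fft_length)  # -> 10000000
# Ten million bins: an FFT this long is challenging to fit within
# the finite memory and MAC resources of an onboard FPGA.
```

At high sampling rates, even modest resolution requirements thus translate into very long FFTs, which is one reason such processing is often moved to a GPU.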
These devices are typically programmed using hardware description languages (HDLs) such as VHDL or Verilog, although so-called high-level synthesis (HLS) can also be used. Without the onboard FPGAs, it would be impossible to adjust the raw data rate to fit the capacity of the data link to the host PC. In general, it is beneficial to perform as much pre-processing and data reduction as possible in the FPGA, but some processing is better done by post-processing in a GPU instead.
GPUs are programmed using C-based frameworks such as CUDA or OpenCL. Many consider programming such devices simpler than programming FPGAs since the level of abstraction is higher. With GPUs, the programmer can use high-level data types such as floating-point numbers, whereas FPGAs typically use fixed-point arithmetic and require much more low-level interaction. GPUs offer shorter development and testing iterations, and extensive examples and tutorials are available. In contrast, debugging and testing of FPGA designs can be significantly more difficult and time-consuming, often relying on a combination of detailed simulation and hardware-in-the-loop verification of the device under test (DUT).
GPUs and FPGAs also differ in the types of computation they are suited for. GPUs are frequently used for AI and machine-learning algorithms, for example in medical imaging applications such as single-cell flow cytometry or swept-source optical coherence tomography (SS-OCT). The post-processing in the GPU is performed on a data set already reduced by the FPGA, but may still operate on data rates of up to 7 Gbyte/s via peer-to-peer streaming.
Figure 5. GPUs offer faster development iterations.
High-speed streaming to a CPU is possible, but these devices do not offer the same level of parallelism and computational power as GPUs. This type of streaming is therefore typically only a complement to GPU streaming. The CPU is instead better utilized for tasks with lower computational complexity such as system-level control, data display, etc.
Not all systems benefit from real-time data processing. Examples include airborne lidar and radar systems, where long flights are conducted to cover large geographical areas. In these systems it is not crucial to perform computations in real time; instead, the data is recorded to storage for subsequent offline processing and analysis.
High-speed storage/recording can be implemented without peer-to-peer technology, but the achievable performance depends very much on the system and the workload of the host PC. Transfer rates may vary significantly, so the solution is not very robust. For this type of system, rates are normally limited to between 1 and 4 Gbyte/s.
Figure 6. Disk storage unit ADQDSU (second generation) in PXIe format.
Teledyne SP Devices offers turnkey SSD storage solutions that support peer-to-peer streaming. These systems offer more robust operation with stable transfer rates for reliable uptime and performance. Please contact us for further information.
Peer-to-peer streaming is currently supported on the ADQ7DC, ADQ7WB, ADQ14, and ADQ32. Please contact us if you want to know more about which GPU models from AMD and Nvidia are supported, and under which operating systems.
Below is a list of relevant documentation grouped by type.
Associate Professor at Hong Kong University (HKU)
who has implemented a system supporting line scan rates of 10M lines/s
prof. Jakub Čížek, Department of Low Temperature Physics at Charles University, Prague
M. Sc. Grzegorz Nitecki, Faculty of Electronics, Military Academy of Technology, Warsaw, Poland
© 2004 - 2022 Teledyne Signal Processing Devices Sweden AB · Phone: +46 (0)13 465 0600 · Email: [email protected] · Privacy Notice