Peer-to-Peer Streaming

Learn how peer-to-peer (P2P) streaming can significantly improve a digitizer system's performance. P2P streaming bypasses the host PC by enabling direct data transfers between the digitizer and graphics processing units (GPUs) or storage. This is a major advantage over conventional solutions, which require data to be copied via the RAM of the host PC. With peer-to-peer streaming, both the CPU and RAM can instead be used for other tasks. Watch the 3-minute overview video, and read more below.

Pre-processing using field-programmable gate array (FPGA)

High-performance digitizers that combine high resolution with a high sampling rate produce massive amounts of data. For example, the ADQ7 combines 14-bit resolution with a 10 GSPS sampling rate, resulting in 20 Gbyte of data per second! This exceeds the capacity of the data link (interface) to the host PC, and data reduction is therefore crucial (fig. 1). Onboard field-programmable gate arrays (FPGAs) help address this problem. These powerful computational resources enable real-time signal processing that reduces the data rate so that it matches the link capacity.
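
As a quick back-of-the-envelope check of the numbers above, the sketch below computes the raw data rate and the reduction factor needed to fit the data onto a given link. The 7 Gbyte/s link capacity is simply the example rate used in the figures later in this article, not a specification of any particular interface.

    // Estimate the raw data rate of a digitizer and the reduction factor
    // needed to fit an assumed host-interface capacity. The digitizer values
    // follow the ADQ7 example above; the link capacity is illustrative only.
    #include <cstdio>

    int main()
    {
        const double sample_rate_hz   = 10e9; // 10 GSPS
        const double bytes_per_sample = 2.0;  // 14-bit samples stored in 16-bit words
        const double link_capacity    = 7e9;  // assumed usable link rate, bytes/s

        const double raw_rate = sample_rate_hz * bytes_per_sample;   // 20 Gbyte/s
        const double required_reduction = raw_rate / link_capacity;  // roughly 2.9x

        std::printf("Raw data rate:     %.1f Gbyte/s\n", raw_rate / 1e9);
        std::printf("Minimum reduction: %.1fx to fit a %.1f Gbyte/s link\n",
                    required_reduction, link_capacity / 1e9);
        return 0;
    }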

Figure 1. The onboard FPGA helps reduce the data rate so that it matches the link capacity without loss of signal information.

The data reduction can be achieved in many ways, for example:

  1. Triggered acquisition of a user-defined number of consecutive samples (so-called records). Data reduction is achieved by transferring only the records while the rest of the data is discarded. This functionality is supported by our firmware options FWDAQ and FWPD.
  2. In frequency-domain applications, digital down-conversion, which combines filtering and decimation, is commonly used to achieve data reduction (see the sketch after this list). This is supported by FWSDR (for ADQ14) and FW2DDC (for ADQ7).
  3. Application-specific data reduction based on known information about the acquired signal, for example real-time averaging of known repetitive signals or extracting signal characteristics from time-domain pulses. This is supported by the firmware options above as well as FWATD.
  4. Custom data reduction can be implemented using the firmware development kit. This offers full flexibility and can be implemented either by the customer or through design services offered to OEM customers.
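
As a rough illustration of item 2, the sketch below applies a crude moving-average filter and then keeps every M-th sample, which reduces the output data rate by a factor M. A real digital down-converter (such as FWSDR or FW2DDC) also mixes the signal to baseband and uses proper FIR filters; this sketch is only meant to show where the data reduction comes from.

    // Schematic of filtering followed by decimation: a naive boxcar filter over
    // M samples, keeping one output sample per M input samples (assumes M >= 1).
    // The output data rate is therefore the input rate divided by M. Not a real DDC.
    #include <cstdint>
    #include <vector>

    std::vector<int16_t> filter_and_decimate(const std::vector<int16_t>& in, int M)
    {
        std::vector<int16_t> out;
        out.reserve(in.size() / M);
        for (std::size_t i = 0; i + M <= in.size(); i += M) {
            long acc = 0;
            for (int k = 0; k < M; ++k)  // average M consecutive samples
                acc += in[i + k];
            out.push_back(static_cast<int16_t>(acc / M));
        }
        return out;
    }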

Massive data reduction can be achieved via FPGA pre-processing. One example is real-time waveform averaging on the ADQ7 using FWATD. This combination has been used by, for example, mass spectrometry customers to reduce the output rate from 20 Gbyte/s to 40 Mbyte/s – a reduction of 500 times without loss of signal characteristics!
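
The reduction from waveform averaging follows from the accumulation itself: N repetitive records are summed into a single output record, so the output rate drops by roughly a factor of N (record gating contributes further reduction). Below is a minimal host-side sketch of the principle; the record contents and averaging count are illustrative and not FWATD parameters.

    // Sum N repetitive records sample-by-sample and divide by N, producing one
    // averaged output record. The output data rate is roughly the input rate
    // divided by the number of records averaged.
    #include <cstdint>
    #include <vector>

    std::vector<int32_t> average_records(const std::vector<std::vector<int16_t>>& records)
    {
        const std::size_t len = records.front().size();  // assumes at least one record
        std::vector<int32_t> avg(len, 0);
        for (const auto& rec : records)
            for (std::size_t i = 0; i < len; ++i)
                avg[i] += rec[i];
        for (auto& v : avg)
            v /= static_cast<int32_t>(records.size());
        return avg;
    }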

FPGA pre-processing therefore allows for maximum flexibility in the mechanical design. Form factors such as USB 3.0, with seemingly limiting data transfer rates of a few hundred Mbyte/s, can still be fully utilized thanks to the data reduction. This in turn offers additional benefits such as locating the digitizer close to the detector in order to minimize reflections.

Peer-to-peer streaming

Peer-to-peer streaming means that data is sent to a computational node (for example a graphics processing unit (GPU) or disk storage) with little or no involvement of the central processing unit (CPU) or dynamic random-access memory (DRAM) of the host PC. There are three levels of data transfer:

    1. In a conventional setup, each piece of hardware in the system has a separate driver which connects to the user’s application. The drivers are assigned separate memory spaces in PC DRAM, and data transfer between the hardware units therefore requires significant copying. This results in a heavy load on both the CPU and DRAM of the host PC. For example, streaming data at 7 Gbyte/s means a 28 Gbyte/s load on the PC DRAM (fig. 2).

Figure 2. Conventional streaming involves writing digitizer data to memory segment S1 in the PC's DRAM (arrows marked "a"), reading from segment S1 to the CPU (arrow "b"), writing to segment S2 (arrow "c"), and reading from S2 to GPU DRAM (arrows "d"). In total, four read/write operations are needed, requiring 4 x 7 = 28 Gbyte/s of DRAM bandwidth.

    2. One way of improving the performance is to share the memory buffer between the drivers. This method is called pinned buffer, and it effectively halves the required copying so that, for example, streaming at 7 Gbyte/s results in a 14 Gbyte/s load on the PC DRAM (fig. 3). This method can be sufficient provided that the PC offers enough memory bandwidth. However, a drawback is that all the hardware drivers (including third-party drivers) need to support shared memory, and that is not always the case.

Figure 3. With a pinned buffer there is only a single memory segment in PC DRAM (denoted "S" above). This solution requires only one write and one read operation, and hence 2 x 7 = 14 Gbyte/s of DRAM bandwidth.

    3. The best performance is achieved using peer-to-peer (P2P) streaming. With this method, the data is sent directly between the digitizer and the endpoint via a PCIe switch (or root complex) with little or no involvement of the CPU or DRAM of the host PC (fig. 4). This significantly reduces the workload on the CPU and DRAM. P2P streaming is currently supported on both Windows and Linux for ADQ7 and ADQ32, and on Windows only for ADQ14.

Figure 4. Best performance is achieved using peer-to-peer streaming.
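
The DRAM loads quoted for the three methods follow directly from counting read and write operations per streamed byte. Below is a minimal sketch of that arithmetic, using the 7 Gbyte/s example rate from figures 2-4.

    // DRAM traffic for the three transfer methods, counting read/write
    // operations per streamed byte as in figures 2-4.
    #include <cstdio>

    int main()
    {
        const double stream_rate = 7e9;  // example digitizer output, bytes/s

        const double conventional  = 4 * stream_rate;  // write S1, read S1, write S2, read S2
        const double pinned_buffer = 2 * stream_rate;  // one write and one read of segment S
        const double peer_to_peer  = 0 * stream_rate;  // data bypasses host DRAM entirely

        std::printf("Conventional:  %.0f Gbyte/s of DRAM traffic\n", conventional / 1e9);
        std::printf("Pinned buffer: %.0f Gbyte/s of DRAM traffic\n", pinned_buffer / 1e9);
        std::printf("Peer-to-peer:  %.0f Gbyte/s of DRAM traffic\n", peer_to_peer / 1e9);
        return 0;
    }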

One advantage of PXIe is that it supports a large number of hardware units in a single chassis. Multiple high-speed data streams can be set up via the backplane, for example between several ADQ7 units and ADQDSU disk storage units. This ensures optimal performance without loading the PXIe controller (host PC).

Streaming to GPU

FPGAs and GPUs are powerful computational platforms, but they differ in architecture and in the way that they are programmed.

The onboard FPGA is crucial since it operates on the raw data stream and performs vital pre-processing and data reduction. FPGAs offer a high degree of parallelism, but computational resources such as multiply-accumulate (MAC) units are finite, and this can sometimes be limiting. One example is the fast Fourier transform (FFT), where it can be challenging to implement long FFTs (with many frequency bins) inside the onboard FPGA.

FPGAs are typically programmed using hardware description languages (HDLs) such as VHDL or Verilog, although so-called high-level synthesis (HLS) can also be used. Without the onboard FPGA, it would be impossible to adjust the raw data rate to fit the capacity of the data link to the host PC. In general, it is beneficial to perform as much pre-processing and data reduction as possible in the FPGA, but some processing is better left to post-processing in a GPU.

GPUs are programmed using C-like frameworks such as CUDA or OpenCL. Many consider programming them simpler than programming FPGAs since the level of abstraction is higher. With GPUs, the programmer can use high-level data types such as floating-point numbers, whereas FPGAs typically use fixed-point arithmetic and require much more low-level interaction. GPUs also offer shorter development and testing iterations, and there are extensive examples and tutorials available. In contrast, debugging and testing of FPGA designs can be significantly more difficult and time-consuming, often relying on a combination of detailed simulation and hardware-in-the-loop verification of the device under test (DUT).

GPUs and FPGAs also differ in the type of computations they are suited for. GPUs are frequently used for AI and machine-learning algorithms, for example within medical imaging such as single-cell flow cytometry or swept-source optical coherence tomography (SS-OCT). The post-processing in the GPU is performed on a data set that has already been reduced by the FPGA, but may still operate at data rates of up to 7 Gbyte/s via peer-to-peer streaming.

Figure 5. GPUs offer faster development iterations.
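
As a rough illustration of GPU post-processing, the sketch below runs a simple per-sample scale-and-offset CUDA kernel on a buffer that is assumed to already reside in GPU memory, as it would after a peer-to-peer transfer. The kernel, buffer size, and scale factor are illustrative only, and the actual peer-to-peer buffer setup through the digitizer API is not shown.

    // Minimal CUDA sketch: post-process samples assumed to already be resident
    // in GPU memory (for example delivered there via peer-to-peer streaming).
    // The conversion from ADC codes to floating-point values is illustrative.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale_and_offset(const short* in, float* out, int n,
                                     float scale, float offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = scale * static_cast<float>(in[i]) + offset;
    }

    int main()
    {
        const int n = 1 << 20;                    // one illustrative record of samples
        short* d_in  = nullptr;
        float* d_out = nullptr;
        cudaMalloc(&d_in,  n * sizeof(short));    // in a P2P setup this buffer would be
        cudaMalloc(&d_out, n * sizeof(float));    // the target of the digitizer's DMA
        cudaMemset(d_in, 0, n * sizeof(short));   // placeholder data for the sketch

        const int threads = 256;
        const int blocks  = (n + threads - 1) / threads;
        scale_and_offset<<<blocks, threads>>>(d_in, d_out, n, 1.0f / 8192.0f, 0.0f);
        cudaDeviceSynchronize();

        std::printf("Processed %d samples on the GPU\n", n);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }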

Streaming to CPU

High-speed streaming to a CPU is possible, but CPUs do not offer the same level of parallelism and computational power as GPUs. This type of streaming is therefore typically only a complement to GPU streaming. The CPU is better utilized for tasks with lower computational complexity, such as system-level control and data display.

Streaming to Solid-State Disk (SSD)

Not all systems benefit from real-time data processing. One example is airborne lidar or radar systems, where long flights are conducted to cover large geographical areas. In these systems it is not crucial to perform computations in real time; instead, the data is recorded to storage for subsequent offline processing and analysis.

High-speed storage/recording can be implemented without peer-to-peer technology, but the achievable performance depends heavily on the system and the workload of the host PC. The transfer rates may vary significantly, and the solution is therefore not very robust. For this type of system, the rates are normally limited to between 1 and 4 Gbyte/s.

Figure 6. Disk storage unit ADQDSU (second generation) in PXIe format.

Teledyne SP Devices offers turnkey SSD storage solutions that support peer-to-peer streaming. These systems provide more robust operation with stable transfer rates for reliable uptime and performance. Please contact us for further information.

Additional resources and downloads

Peer-to-peer streaming is currently supported on ADQ7DC, ADQ7WB, ADQ14, and ADQ32. Please contact us if you want to know more about which GPU models from AMD and Nvidia are supported and under which operating systems.


  • "The technical and intellectual support from the team at Teledyne SP Devices has been playing an important role in our research."

    Associate Professor at Hong Kong University (HKU)
    who has implemented a system supporting line scan rates of 10M lines/s

  • "I can state that ADQ7DC is the best digitizer for high resolution positron lifetime spectroscopy I found on the market."

    prof. Jakub Čížek, Department of Low Temperature Physics at Charles University, Prague

  • "The technical and intellectual support from the team at Teledyne SP Devices has been playing an important role in our research."

    Associate Professor at Hong Kong University (HKU)
    who has implemented a system supporting line scan rates of 10M lines/s

  • "The ADQ7DC digitizer is the best device of this type available on the market with high sampling rate, wide analog bandwidth, quality and stability of signal acquisition. A professional team of the SP Devices engineers ensure support and quick response to the inquiries."

    M. Sc. Grzegorz Nitecki, Faculty of Electronics, Military Academy of Technology, Warsaw, Poland
