Design and Evaluation of a Scalable Engine for 3D-FFT Computation in an FPGA Cluster

Abstract—The Three-Dimensional Fast Fourier Transform (3D FFT) is commonly used to solve the partial differential equations describing the system evolution of several physical phenomena, such as the motion of viscous fluids described by the Navier–Stokes equations. Simulation of such problems requires a parallel High-Performance Computing (HPC) architecture, since the size of the problem grows with the cube of the FFT size and the representation of a single point comprises several double-precision floating-point complex numbers. Modern HPC systems are considering the inclusion of FPGAs in their computing architecture because they combine effective hardware acceleration capabilities with dedicated communication facilities. Furthermore, the network topology can be optimized for the specific calculation that the cluster must perform, especially for algorithms limited by the data-exchange delay between processors. In this paper, we explore an HPC design that uses FPGA accelerators to compute the 3D FFT. We devise a scalable FFT engine based on a custom radix-2 double-precision core, used to implement the Decimation-in-Frequency version of the Cooley–Tukey FFT algorithm. The FFT engine can be adapted to different technology constraints and networking topologies by adjusting the number of cores and the configuration parameters in order to minimize the overall calculation time. We compare the possible configurations against the technological limits of available hardware. Finally, we evaluate the bandwidth required for continuous FFT execution in the APEnet toroidal mesh network.


I. INTRODUCTION
Continuous demand for efficient computing power is pushing designers to integrate dedicated hardware components in High-Performance Computing (HPC) architectures to improve computational efficiency. General-purpose CPUs can delegate specific tasks to hardware accelerators, decreasing the latency of computationally demanding tasks such as the training of neural networks [1]–[4], video or audio processing [5]–[8], environmental forecasting [9]–[11], security algorithms [12], automotive applications [13], etc. In other scenarios, hardware accelerators are used to reduce system power consumption [14], [15].
Modern HPC systems are evaluating the inclusion of FPGAs as components of their system architectures because they combine, in a single device, effective hardware acceleration capabilities and dedicated communication facilities. The resulting design is well suited to executing distributed tasks in computer clusters. A concrete example is Microsoft Catapult [16], a data center able to act as an HPC system thanks to the introduction of FPGA accelerators. In this context, FPGAs also allow optimizing the data exchange among the hardware acceleration modules, thanks to the direct connections supported by the network controllers integrated into the programmable hardware.

A. FFT for simulations
A widely used algorithm in the simulation of physical phenomena is the multidimensional FFT, and in particular the 3D FFT [17], employed in solving the partial differential equations of physical models, such as the Navier-Stokes equations that describe the motion of viscous fluids [18] or the Newtonian mechanics equations of Molecular Dynamics [19]. Simulation of such problems requires HPC since the size of the problem grows rapidly, mainly for three reasons:
• the dimension of the problem increases with the cube of the FFT size;
• the representation of a single data point typically requires a floating-point complex number (i.e., 32, 64 or 128 bits, depending on the precision);
• a single point usually comprises several components (for example, if the point represents a velocity it has three components).
Fig. 1 reports the required RAM vs. the FFT size N for different computer cluster configurations, starting from a single unit up to 1024 nodes (a 32x32 bi-dimensional cluster). The required RAM per node is compared with the RAM sizes of the FPGA development kits currently available on the market (512 MB, 2 GB and 8 GB). The figure shows that a single unit needs to handle about 0.25 GB for N = 256 and up to 1024 GB for N = 4096. This implies that typical physical problems have to be tackled on HPC clusters, where multiple compute units work in parallel to distribute the data domain. For example, considering an FPGA with 8 GB of local RAM such as the Xilinx UltraScale+ VU37P, an N = 8192 point 3D FFT can be executed only by using 1024 FPGAs, i.e., a 32x32 cluster.
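As a cross-check of these figures, the short sketch below reproduces the memory estimate of Fig. 1 under the stated assumption of one double-precision complex value (16 bytes) per point; the helper name and the printed node count are ours, for illustration only.

```python
# Memory required by an N^3 double-precision complex volume, total and
# per node (16 bytes per point: 2 x 64-bit floats). Hypothetical helper,
# not taken from the paper's code.
def ram_per_node_gb(n: int, nodes: int = 1) -> float:
    return 16 * n ** 3 / nodes / 2 ** 30

for n in (256, 4096, 8192):
    print(f"N={n}: total {ram_per_node_gb(n):.2f} GB, "
          f"per node on a 32x32 cluster {ram_per_node_gb(n, 1024):.4f} GB")
```

Running it recovers the values quoted above: about 0.25 GB for N = 256, 1024 GB for N = 4096, and 8 GB per node for N = 8192 on 1024 FPGAs.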
To better quantify the impact of the FFT in physics simulations, we computed the Navier-Stokes equations for a 1024x1024x1024 volume on an increasing number of processors, using the NewTurb cluster [20]. In Fig. 2 we report the percentage of execution time dedicated to the FFT and inverse FFT computation and the percentage spent in the other computing activities. As can be noted, the share of time dedicated to the non-FFT computation decreases with the number of processors, while the share of FFT computing time increases.

B. Related Work
Over recent years, FPGAs have been employed in several different application fields, such as the Internet of Things [21]–[23], wireless communications [24], [25], network protocols [26], [27], etc. The employment of FPGAs in HPC is relatively recent and still the subject of research and optimization. FPGAs were initially conceived to implement custom networking infrastructure among the nodes of a cluster [28]. Only recently have FPGAs been proposed as hardware accelerators. In [29] the authors evaluate the effectiveness of an OpenCL FPGA implementation of several algorithms and point out that FPGAs can be more energy-efficient than GPUs and sometimes faster. The integration of FPGAs in HPC clusters in the field of the 3D FFT has been discussed in [31], showing a distributed parallel implementation of a problem size of up to 128³ points on a 3D-torus direct network of 8³ FPGA-based nodes. In terms of solution time, the results clearly demonstrate the viability of FPGA clusters in terms of strong scaling. Moreover, [32] demonstrates how a dedicated inter-accelerator network (in this case each FPGA has a single 40 Gbps link) yields a significant increase in overall 3D FFT performance, even compared to a GPU cluster running cuFFT.

C. Main goals
In this paper, we explore a possible evolution of the APEnet HPC system. The APEnet network connections are designed to effectively support distributed processing and are implemented by FPGA cards. Considering that modern FPGAs have considerable processing power, this paper investigates the possibility of using the networking FPGA as a hardware accelerator for computing the 3D FFT. To this end, we design a Radix-2 module for complex double-precision numbers. We also design a scalable architecture for the mono-dimensional FFT computation that integrates the Radix-2 module and can be adapted to the various possible network topologies. Finally, we calculate the bandwidth required in the APEnet toroidal mesh network to support real computations.

A. APEnet
The APEnet project dates back to 2004 [28] and aims at developing a network fabric that allows assembling an HPC cluster with direct network topologies out of off-the-shelf components. Over the years, several card families have been developed.

B. The 3D FFT
The 3D Discrete Fourier Transform of an N x N x N data cube can be written as

X(k_1, k_2, k_3) = \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} \sum_{n_3=0}^{N-1} x(n_1, n_2, n_3) \, W_N^{n_1 k_1} W_N^{n_2 k_2} W_N^{n_3 k_3}, \quad W_N = e^{-j 2\pi / N}   (1)

Analyzing Eq. (1), it can be noted that the 3D FFT can be computed as a nested sequence of monodimensional FFTs, and the order in which the three dimensions are scanned is not relevant. For example, the FFT can be computed along the X direction, then the transformed data can be processed again along the Y direction and finally along the Z direction. Every direction requires N² monodimensional FFTs. After the 3D transformation, the data is used to perform the required processing and then anti-transformed. Thus, the entire compute step involves 6·N² monodimensional FFTs.
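As a minimal illustration of this property, the following sketch computes the 3D FFT as three sweeps of one-dimensional FFTs, one per axis, and checks the result against a direct 3D transform (NumPy is used here only as a reference implementation):

```python
import numpy as np

N = 8
x = np.random.rand(N, N, N) + 1j * np.random.rand(N, N, N)

# Three sweeps of N^2 monodimensional FFTs, one along each direction;
# the order of the sweeps does not matter.
y = np.fft.fft(x, axis=0)   # X direction
y = np.fft.fft(y, axis=1)   # Y direction
y = np.fft.fft(y, axis=2)   # Z direction

assert np.allclose(y, np.fft.fftn(x))
```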

C. Implementation of the monodimensional FFT
The Cooley-Tukey FFT algorithm is the most commonly used method to calculate the DFT and the inverse DFT. It is a divide-and-conquer algorithm that recursively breaks down a DFT into many smaller DFTs. The algorithm can be applied in two versions: decimation in time (DIT) or decimation in frequency (DIF). In the following, we use the DIF version.
The mono-dimensional FFT can be divided as follows:

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}
     = \sum_{n=0}^{N/2-1} x(n) W_N^{nk} + W_N^{kN/2} \sum_{n=0}^{N/2-1} x(n + N/2) W_N^{nk}
     = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k x(n + N/2) \right] W_N^{nk}

where in the last step we used W_N^{kN/2} = (-1)^k. If we consider the even and the odd samples separately, we obtain:

X(2k)   = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n + N/2) \right] W_{N/2}^{nk}   (5)
X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x(n + N/2) \right] W_N^{n} W_{N/2}^{nk}   (6)

where k = 0, 1, …, N/2 - 1. As can be noted, this formulation separates the even and the odd part of the FFT, thus halving the size of the FFT at every step. The simple kernel defined by eq. (5) and eq. (6) is depicted in Fig. 3 and is usually called a "butterfly". The butterfly kernel combines sample k and sample k+N/2 to produce two outputs, which can be processed separately in the next step. The complete FFT can be built from this basic butterfly kernel, with a structure of N/2 butterflies at each stage for a total of log₂N stages. In Fig. 4 we show an example of an 8-point FFT. Each stage includes N/2 = 4 butterfly units and the total number of stages is log₂N = 3. Each butterfly unit operates on a pair of results from the previous stage; the distance between the two selected samples decreases by a factor of two at every stage. Finally, the output values have to be reordered.
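To make the structure of Fig. 4 concrete, here is a plain, unoptimized radix-2 DIF sketch: each of the log₂N stages applies N/2 butterflies, the distance between paired samples halves at every stage, and the bit-reversed output order is undone at the end (variable names are ours):

```python
import numpy as np

def fft_dif(x):
    x = np.asarray(x, dtype=complex).copy()
    n = len(x)
    span = n // 2                       # distance between paired samples
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = np.exp(-2j * np.pi * k / (2 * span))  # twiddle factor
                a, b = x[start + k], x[start + k + span]
                x[start + k] = a + b                      # eq. (5) path
                x[start + k + span] = (a - b) * w         # eq. (6) path
        span //= 2                      # distance halves at every stage
    bits = n.bit_length() - 1
    idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    return x[idx]                       # undo bit-reversed ordering

assert np.allclose(fft_dif(np.arange(8.0)), np.fft.fft(np.arange(8.0)))
```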

A. FPGA Radix-2 Implementation
In this section, we detail the design of the Radix-2 module depicted in Fig. 3, which will be used as the main building block for the full monodimensional FFT implementation. Writing explicitly the real and imaginary parts of the equations representing the inner butterfly FFT operations, we have:

ℜ(A') = ℜ(A) + ℜ(B)
ℑ(A') = ℑ(A) + ℑ(B)
ℜ(B') = [ℜ(A) - ℜ(B)]·ℜ(W) - [ℑ(A) - ℑ(B)]·ℑ(W)
ℑ(B') = [ℜ(A) - ℜ(B)]·ℑ(W) + [ℑ(A) - ℑ(B)]·ℜ(W)

where A and B represent the butterfly inputs, A' and B' the outputs, and W is the twiddle factor. A direct hardware implementation of these equations for FPGA is reported in Fig. 5, where the complex operations are broken up into double-precision adders and multipliers operating on the real and imaginary components. To maximize data throughput, the FPGA uses concurrent operations, for a total of six adders and four multipliers.
Other approaches are possible, as reviewed in [34], each targeting a specific optimization, such as low power consumption [35], low area usage [36], or real-valued signals [37]. In the proposed implementation, each operator has its characteristic (and parametric) internal latency and a streaming interface on the input and output ports; such operators are typically available from the IP libraries of FPGA vendors, even in double-precision format with full IEEE-754 compliance. Thus, a balanced chain of operators fits perfectly into a pipelined architecture where, after the initial latency, FFT outputs are available at every clock cycle.
Computing is organized in three stages. In Stage A, the following operations are performed:

s_r = ℜ(A) + ℜ(B),  s_i = ℑ(A) + ℑ(B)
d_r = ℜ(A) - ℜ(B),  d_i = ℑ(A) - ℑ(B)   (14)

ℜ(W) and ℑ(W) are delivered to Stage B by a register chain with the same length as the latency of the adders. In Stage B, the four products d_r·ℜ(W), d_i·ℑ(W), d_r·ℑ(W) and d_i·ℜ(W) are computed, while ℜ(A') = s_r and ℑ(A') = s_i are delayed to the output stage on register chains. Finally, in Stage C, the products are combined to obtain ℜ(B') and ℑ(B'), and the results are delivered to the output. Between these stages, data is registered to ease resource placement and achieve a higher frequency. Adders have a programmable latency between 0 and 14 clock cycles and multipliers between 0 and 12, and they are all configured to accept operands on every clock cycle (one cycle per operation). The computational complexity of this radix-2 butterfly implementation is 10 floating-point operations per cycle. Defining the latencies of the three stages as L_A, L_B and L_C, the total latency of the butterfly operation block is:

L_tot = L_A + L_B + L_C + 4

where the constant value 4 is the number of registration cycles among the three stages. Varying L_tot has an impact on the working clock period T_CK, thus making it an adaptable parameter.

B. Scalable FFT Implementation
The full hardware implementation of the monodimensional FFT includes a matrix of N/2 x log₂N Radix-2 modules and log₂N swap modules, as shown in Fig. 6. The full implementation is clearly the most efficient from the computational point of view, because it can theoretically compute one monodimensional FFT per clock tick T_CK.

Fig. 6 Full monodimensional FFT implementation

However, the hardware resources required for a full implementation rapidly exceed those available in currently available FPGAs, and thus alternative designs have to be considered. Moreover, since data is spread over the whole cluster to parallelize the computation, nodes periodically need to exchange data; the required bandwidth among them turns out to be too high for the available communication technologies, so that the processors would spend most of their time idling.
To decrease the total number of radix-2 computing units required for the single FFT, we consider two strategies: i) reducing C, the number of columns, with C ≤ log₂N, or ii) reducing R, the number of rows of the computation matrix, with R ≤ N/2. To explain the basic design required by the two solutions, in Fig. 7 and Fig. 8 we depict the two extreme cases: C = 1 and R = 1, respectively.
The single-column solution uses N/2 radix-2 units reused log₂N times (see Fig. 7); the total FFT execution time is therefore T_CK · log₂N. More generally, if the number of columns is C, the FFT execution time is T_CK · log₂N / C. Considering the single-row solution of Fig. 8, every FFT stage needs to be repeated N/2 times; moreover, to feed the next stage we need to wait for all the N/2 computations, buffering the intermediate values. The resulting execution time in the case of R rows is T_CK · (N/2) / R. It is possible to implement a general solution with R rows and C columns. In this case, the total FFT execution time is:

T_FFT = T_CK · (N/2) · log₂N / (R·C)
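Under this execution-time model, the three design points behave as follows; the clock period used below is an arbitrary placeholder, not a synthesis result:

```python
import math

def t_fft_us(n, rows, cols, t_ck_ns=3.0):   # t_ck_ns is illustrative only
    return t_ck_ns * (n / 2) * math.log2(n) / (rows * cols) / 1000.0

n = 1024
print(t_fft_us(n, n // 2, int(math.log2(n))))  # full matrix: T_CK per FFT
print(t_fft_us(n, n // 2, 1))                  # single column: T_CK * log2(N)
print(t_fft_us(n, 1, int(math.log2(n))))       # single row: T_CK * N/2
```

The two extreme cases recover the expressions given above, and the full matrix (R = N/2, C = log₂N) collapses to one FFT per clock tick.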

C. Radix-2 Performance
The proposed double-precision FFT engine, in the case of R = 1 and C = log₂N, has been implemented on a Xilinx Virtex UltraScale+ VU37P device for full synthesis and simulation with Vivado version 2018.3. In Table I we report the resulting resource utilization. We can observe good scalability in terms of Slice LUTs, Slice Registers, and DSP blocks. Block RAM usage is relatively high; to reduce the memory impact when scaling to larger N, a possible solution is to use different types of memory resources, such as distributed memory or UltraRAM blocks.
In terms of energy efficiency, we estimated the power consumption using the Xilinx Power Analyzer tool: the implemented single FFT engine is capable of 4.62 GFLOPS/Watt in the 1024-point case and 5.64 GFLOPS/Watt in the 4096-point case. This preliminary result indicates that the FPGA can be considered a viable solution for highly power-efficient computing as well.

Fig. 9 Example of 2D decomposition with S processing nodes: a) in the X direction, b) in the Y direction and c) in the Z direction.
In Table II we report the results of several synthesis attempts with different floating-point operator latencies. We note that, by increasing the overall butterfly latency, we can achieve a higher working frequency for the engine. In our scenario, the total execution time is affected more by the working frequency than by the engine latency.

D. Network required capacity
To perform a distributed 3D FFT, data has to be divided among the available processors, which compute the monodimensional FFTs. The most efficient decomposition strategies are the 1D and the 2D ones [38]. The cluster processing efficiency depends on the number of available processors and on the size of the problem: if the nodes are relatively few, it is better to use the 1D strategy; on the contrary, if their number is high, the 2D strategy becomes convenient. In both decompositions, the nodes process the local data and then need to exchange a part of it to compute the FFTs along the other dimensions.
Since we are targeting large clusters, in the following we assume to be using the 2D partition strategy that is depicted in Fig. 9. Each node has to compute a pencil of the full cube in the three directions X, Y, and Z. The processing nodes are typically arranged in a 2D network, and we assume a switchless torus topology typical of parallel computer systems [33].
In the 2D decomposition, data is exchanged only among the nodes of the same row in the XY transposition and among the nodes of the same column in the YZ transposition [38]. If we use a total of S processing units in the 2D cluster, each node executes its share of the N² FFTs and, for every FFT, needs to exchange 16·N/√S bytes with every other node of its row/column (where 16 bytes = 2x64 bits is the representation of a single point). Since we have √S processors per row/column, the total amount of data that has to cross a single link of the torus network is 16·N·√S/2 bytes for every FFT.
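Combining this traffic estimate with the execution-time model of the previous section gives a rough per-link bandwidth requirement for continuous FFT execution; this is a minimal sketch assuming those two expressions, with a placeholder clock period, so the absolute numbers are illustrative only:

```python
import math

def link_gbit_s(n, s, rows, cols, t_ck_ns=3.0):
    bytes_per_fft = 16 * n * math.sqrt(s) / 2            # per torus link
    t_fft_s = t_ck_ns * 1e-9 * (n / 2) * math.log2(n) / (rows * cols)
    return bytes_per_fft * 8 / t_fft_s / 1e9             # Gbit/s

# e.g. N = 1024 on a 16x16 torus (S = 256) with a single-row engine
print(link_gbit_s(1024, 256, rows=1, cols=10))
```

Even this modest configuration yields a per-link requirement of hundreds of Gbit/s, consistent with the trend shown in Fig. 10.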
In Fig. 10 we report the network throughput of a single torus link for increasing S, the number of processing elements, and for various configurations of the FFT engine. We also report two link bandwidth capabilities available on the target FPGA device.
We note that, by increasing the engine dimensions in terms of R and C, the required bandwidth quickly ramps up to values not compatible with current link technologies, so that the communication delay dominates the total computation time.

CONCLUSIONS
We presented a scalable FFT engine suitable for HPC cluster architectures that employ FPGAs for both hardware acceleration and network communication. The FFT engine is based on a radix-2 double-precision floating-point module specifically designed for the FFT computation. A single-row version of the core has been implemented and synthesized on a Xilinx Virtex UltraScale+ VU37P to evaluate the achievable performance. The results have been used to estimate the link bandwidth capacity required by a 2D torus network.

ACKNOWLEDGMENT
We would like to thank the group of Luca Biferale for the use of the NewTurb cluster.