选自ASC26赛题,AMSS-NCKU程序是我国首个自主开发的数值相对论计算程序,基于有限差分方法与自适应网格加密技术,适用于双黑洞合并等强引力场系统的模拟。
Numerical Relativity Code AMSS-NCKU Optimization
Before detailing the experimental setup, we first clarify several key concepts. The term “full scale” is defined by a Final_Evolution_Time of 1000.0, while the “test scale” is defined by a Final_Evolution_Time of 5.0. To streamline performance comparisons, any reference to time implies the computational cost of a single evolutionary iteration step, unless explicitly stated otherwise. For the present analysis, the 64-process results are utilized as the baseline (equivalent to 20 seconds per iteration step), a choice that will be justified later in the text.
Topic Background
General Relativity (GR), formulated by Einstein, describes gravity as the geometric curvature of spacetime and serves as the theoretical cornerstone of modern astrophysics and cosmology. Its core governing equations—the Einstein Field Equations (EFEs)—elucidate the nonlinear coupling between matter distribution and spacetime structure. In the era of gravitational-wave astronomy, the research value of GR has become increasingly prominent. Whether driven by the breakthrough observations of the Laser Interferometer Gravitational-Wave Observatory (LIGO) or China’s ongoing Ali Primordial Gravitational Wave Detection project (aimed at exploring the frequency band), tasks such as signal identification, parameter estimation, and the analysis of physical mechanisms rely heavily on accurate theoretical modeling and numerical simulations of extreme astrophysical processes, particularly binary black hole (BBH) mergers.
In the highly dynamic, strong-field regime of BBH mergers, extreme spacetime curvature and relativistic velocities cause the nonlinear terms of the EFEs to dominate, rendering analytical approximation methods (such as post-Newtonian approximations and perturbation theory) invalid. Consequently, the discrete iterative schemes of Numerical Relativity (NR) emerge as the only viable approach for accurately solving the EFEs in these scenarios. Currently, mainstream technical frameworks typically rely on the BSSN formulation or the CCZ4 constraint-damping system under the ADM decomposition to ensure numerical stability. These frameworks utilize high-order finite difference methods coupled with Runge-Kutta time integration for discretization. Furthermore, to bridge the vast scale disparity between the black hole horizon singularity and gravitational wave wavelengths, core algorithms widely integrate Berger-Oliger-style adaptive mesh refinement (AMR) and employ MPI parallel communication strategies based on domain decomposition. This architecture enables efficient numerical simulations of extreme compact object coalescences, such as BBH mergers, on supercomputers.
However, these mainstream techniques still face severe challenges when addressing the demands of ultra-high precision and large-scale simulations. The low arithmetic intensity of finite difference operators is heavily bound by memory bandwidth, making it difficult to fully exploit the peak floating-point performance of modern hardware. Within the Berger-Oliger AMR architecture, frequent inter-level interpolations during mesh regridding induce massive MPI communication overhead, making the code highly susceptible to the “communication wall” bottleneck during large-scale parallelization. Additionally, load imbalances caused by dynamic mesh evolution, coupled with the complexity of porting intricate control logic to heterogeneous accelerators, constitute the core technical barriers limiting the scalability and computational efficiency of NR codes.
This work focuses on the in-depth performance optimization of the open-source numerical relativity code AMSS-NCKU, aiming to resolve computational bottlenecks in high-precision spacetime evolution. By fine-tuning key modules—including the finite difference computational kernels, AMR, and MPI parallel communication—we employ a comprehensive suite of optimization strategies. These include the decoupling of evolution-analysis loops, GPU heterogeneous acceleration, process affinity binding, communication aggregation optimization, 3D interpolation algorithm improvements, recursion elimination, and compiler optimization flag tuning. The ultimate objective is to minimize total simulation time, thereby solidifying numerical relativity’s position as a quintessential cutting-edge application in high-performance computing (HPC) and providing a highly efficient computational engine for future large-scale gravitational wave data analysis.
Clusters Configuration in AMSS-NCKU Problem
The configuration of the computational cluster we used is shown below.
| Component | Specification |
|---|---|
| CPU | Intel Xeon Gold 6348 Processor 2.60 GHz 28 cores × 2 |
| Memory | DDR4 ECC 3200MT/s 16GB × 16 |
| Disk | 480GB SSD SATA × 1 |
| HCA | Mellanox ConnectX-5 EDR 100 Gb/s InfiniBand HCA |
| Switch | Mellanox SB7800 EDR 100Gb/s InfiniBand Smart Edge Switches |
| Cable | Mellanox MCP1600-E0xxEyy 100Gb/s QSFP28 DAC Cable, 3m |
The software environment configuration is shown below.
| Component | Specification |
|---|---|
| Compiler | GCC-14.2.0 | Intel® oneAPI DPC++/C++ Compiler 2025.3.0 |
| Operating System | Rocky Linux 8.10 (Green Obsidian) |
| Software | Python-3.13 |
| Performance Analysis Tool | Intel Vtune Profiler |
We adopted the Intel oneAPI 2025.3 suite to leverage its advanced vectorization capabilities and high-performance MPI libraries.
Intel oneAPI Installation
module load gcc/14.2.0
# install Base Toolkitsh intel-oneapi-base-toolkit-2025.3.0.375_offline.sh \ -a --silent --eula accept \ --install-dir=/data/home/202431060045/Share/oneapi25
# install HPC Toolkit, include Fortran and MPIsh intel-oneapi-hpc-toolkit-2025.3.0.381_offline.sh \ -a --silent --eula accept \ --install-dir=/data/home/202431060045/Share/oneapi25
# Initialize the environmentsource /data/home/202431060045/Share/oneapi25/setvars.shInstallation Process
The makefile.inc file was modified as follows:
LDLIBS = -lifcore -lifport -limf -lsvml -lintlc -L/usr/lib64CXXAPPFLAGS = -O3 -Wno-deprecated -Dfortran3 -Dnewcf90appflags = -O3 -fppf90 = ifxf77 = ifxCXX = icpxCC = icxCLINKER = mpiicpxAMSS-NCKU Execution
To ensure stability under Intel oneAPI optimizations, we adjusted the Linux resource limits to avoid stack overflows and ensure sufficient pinned memory for InfiniBand communication.
#!/bin/bash#SBATCH -p cn#SBATCH -J AMSS-baseline#SBATCH --nodes=2#SBATCH --ntasks-per-node=32#SBATCH --cpus-per-task=1#SBATCH --chdir=/data/home/202431060045/yr/AMSS-NCKU-baseline#SBATCH --output=./logs/run_amss_%j.out#SBATCH --error=./logs/run_amss_%j.err
ulimit -s unlimitedulimit -l unlimited
module load anaconda/3-2023.03conda activate AMSSmodule load gcc/14.2.0
source /data/home/202431060045/Share/oneapi25/setvars.sh
ROOT="/data/home/202431060045/yr/AMSS-NCKU-baseline"
cd ${ROOT}
echo "Script started at: $(date +"%Y-%m-%d %H:%M:%S")"
python AMSS_NCKU_Program.py
echo "Script finished at: $(date +"%Y-%m-%d %H:%M:%S")"The total execution time of the baseline code is:
This Program Cost = 47894.33059215546 SecondsCorrectness Verification
To verify the reliability of the optimized code, we used the execution results of the initial unmodified code as a reference. The Root Mean Square (RMS) Error was computed using:
Analysis of the Overall Structure
The overall architecture of the AMSS-NCKU simulation code can be abstracted into three primary stages:
- Data initialization
- Spacetime evolution-analysis iteration
- Result visualization
The vast majority of computational cost is concentrated in the spacetime evolution-analysis iteration stage.
During the evolution phase, the main evolution loop is driven by the Evolve function, which invokes the RecursiveStep function at each time step to update the adaptive mesh. Specifically, RecursiveStep employs a top-down hierarchical traversal strategy: it first calls the Step function on the coarsest grid level (Level 0), then recursively proceeds to finer refinement levels.
Because refined grid levels require interpolated boundary conditions from coarser levels and must later project high-accuracy data back upward, strict temporal dependencies arise between AMR hierarchy levels. These dependencies become the primary bottleneck hindering task parallelization.

Recursion Elimination Strategy Based on Iteration
The deep recursive invocation mechanism of RecursiveStep introduces substantial stack overhead and complicates performance profiling.
Although directly decoupling the spacetime dependencies is mathematically difficult, we adopted a refactoring strategy that replaces recursive calls with explicit iteration loops. This transformation significantly reduces performance analysis complexity and exposes the true computational hotspots.
Recursive Algorithm (Before)
RecursiveStep(level): Step(level) RecursiveStep(level + 1) EnforceConstraints()Iterative Algorithm (After)
IterativeRecursiveStep(): while level >= start_level: execute_step() update_constraints() handle_return_logic()

Task-Based Process Splitting
The original evolution-analysis pipeline used a strongly serially coupled architecture. After each evolution step, the program blocked until analysis finished.
To eliminate this unnecessary synchronization barrier, we introduced an asynchronous process-group decoupling strategy.
The global MPI process pool is partitioned into:
- Evolution process group
- Analysis process group
Evolution processes transmit grid state data using MPI_Send, then immediately continue to the next timestep without waiting. Analysis processes independently receive data via MPI_Recv and execute post-processing asynchronously.
This approach significantly improves overlap between evolution and analysis workloads.

Odd-Even Communication Scheme
The initial “half-split block mapping” strategy assigned the first half of MPI ranks to evolution and the second half to analysis.
However, large-scale deployment caused severe cross-node communication overhead.
To solve this issue, we implemented an “Interleaved Pairing” strategy:
- Even ranks → evolution tasks
- Odd ranks → analysis tasks
This topology-aware mapping ensures most communicating process pairs reside on the same node or NUMA domain, reducing expensive InfiniBand traffic.


Distributed File I/O
After process-group partitioning, the original global output logic became incompatible with the separated communication domains.
We therefore refactored the monitor logic:
-
Evolution group rank 0 outputs:
bssn_BH.datbssn_constraint.dat
-
Analysis group rank 0 outputs:
bssn_psi4.datbssn_ADMQs.dat
This eliminates unnecessary inter-group synchronization and avoids I/O contention.

Optimization of 3D Spatial Interpolation
Performance profiling revealed that the Fortran-based polin3 interpolation routine dominated the analysis phase runtime.
The original implementation repeatedly invoked 1D interpolation inside deeply nested loops.
Before Optimization
for z: for y: interpolate_x()
for z: interpolate_y()
interpolate_z()To improve vectorization and cache locality, we redesigned the interpolation as array-level batch operations.
After Optimization
tmp = interpolate_x_batch(data)tmp2 = interpolate_y_batch(tmp)res = interpolate_z_batch(tmp2)This optimization reduced analysis latency from approximately 8 seconds to 3 seconds per step.
Mathematical Equivalence of Interpolation Refactoring
The 3D interpolation operator can be written as:
where the 1D Lagrange basis function is:
The tensor-product structure satisfies:
Therefore, sequential 1D interpolation is mathematically equivalent to direct tensor-product interpolation.
Exploration of FCFS Communication Scheduling
We explored a “first-come, first-served” communication strategy using MPI_Waitany.
The idea was to process incoming messages immediately after arrival, overlapping communication with unpacking work.
However, large-scale testing revealed several problems:
- Additional runtime overhead from frequent
MPI_Waitany - Increased synchronization complexity
- Race-condition risks
- Unstable runtime behavior
Ultimately, the original “bulk communication + unified processing” strategy proved more robust and predictable.
Compiler-Level Optimization
Compiler optimization flags significantly impacted both performance and numerical precision.
Aggressive optimization of C/C++ modules introduced unacceptable RMS errors (~5%), primarily due to unstable vectorized transformations.
Therefore:
- C/C++ modules → conservative optimization
- Fortran modules → aggressive optimization
The Fortran compiler flags were:
| Flag | Description |
|---|---|
-O3 | Highest optimization level |
-mtune=icelake-server | Tune for Ice Lake CPUs |
-ffast-math | Fast floating-point optimizations |
-funroll-loops | Loop unrolling |
-march=native | Host CPU optimization |
-fpp | Enable preprocessor |
This strategy improved runtime by roughly 1 second while maintaining RMS error at approximately .
GPU Implementation
We also explored GPU acceleration using:
- 8 × NVIDIA A100 80GB SXM GPUs
- CUDA 12.4
Although GPU kernels achieved strong acceleration individually, PCIe synchronization overhead severely limited end-to-end gains.
The optimized CPU implementation ultimately outperformed the GPU version in total runtime efficiency.
Optimizations of the GPU-Based BSSN Solver
Several GPU-side optimizations were introduced:
- Replaced deprecated
cudaThreadSynchronize - Corrected boundary-condition logic
- Introduced GPU memory pools
- Eliminated frequent
cudaMallocandcudaFree - Implemented multi-GPU parallelism
Performance profiling confirmed substantial throughput improvements.

Acceleration of the GPU Analysis Module
The interpolation routine Interp_Points was migrated entirely to the GPU.
Key optimizations included:
- Massive thread-level parallelism
- Device-side Neville interpolation
- Constant memory optimization
- Flattened nested pointer structures
The final interpolation efficiency improved by nearly 5×.


Acceleration of the GPU prolong3 Routine
The sixth-order prolongation operator prolong3 was also ported to CUDA.
Optimizations included:
- Custom interpolation kernels
- Symmetry boundary kernels
- Constant-memory coefficient storage
However, frequent Host-to-Device and Device-to-Host transfers limited the overall benefit.

Performance Comparison
Determining the Baseline
The baseline testing explored multiple MPI process counts.
| Process Number | Time per Iteration | RMS |
|---|---|---|
| 4 | 115.0041 s | 0.0203% |
| 8 | 83.0201 s | 0.0203% |
| 16 | 66.1602 s | 0.0203% |
| 32 | 49.9385 s | 0.0203% |
| 56 | 26.9602 s | 0.0203% |
| 64 | 19.9866 s | 0.0000% |
| 128 | 17.1985 s | 0.0203% |
The 64-process configuration was selected as the final baseline.
Time Comparison
The CPU and GPU optimization results are summarized below.

Summary
After extensive optimization and profiling, the CPU-optimized implementation was selected as the final solution.
Final Results
| Device | Version | Total Time (s) | Speedup |
|---|---|---|---|
| CPU | Baseline | 47894.33059215546 | 1.00× |
| CPU | Final Optimized | 9531.030975341797 | 5.02× |
| GPU | Baseline | 187266.8326153278486 | 1.00× |
| GPU | Final Optimized | 19443.30318969726588 | 9.63× |
The final RMS error remained at 0.0000%, fully satisfying precision requirements.
Final Output Figures



If this article helped you, please share it with others!
Some information may be outdated





