AMSS-NCKU - YRemmmm`s Blog

选自ASC26赛题，AMSS-NCKU程序是我国首个自主开发的数值相对论计算程序，基于有限差分方法与自适应网格加密技术，适用于双黑洞合并等强引力场系统的模拟。

Numerical Relativity Code AMSS-NCKU Optimization#

Before detailing the experimental setup, we first clarify several key concepts. The term “full scale” is defined by a Final_Evolution_Time of 1000.0, while the “test scale” is defined by a Final_Evolution_Time of 5.0. To streamline performance comparisons, any reference to time implies the computational cost of a single evolutionary iteration step, unless explicitly stated otherwise. For the present analysis, the 64-process results are utilized as the baseline (equivalent to 20 seconds per iteration step), a choice that will be justified later in the text.

Topic Background#

General Relativity (GR), formulated by Einstein, describes gravity as the geometric curvature of spacetime and serves as the theoretical cornerstone of modern astrophysics and cosmology. Its core governing equations—the Einstein Field Equations (EFEs)—elucidate the nonlinear coupling between matter distribution and spacetime structure. In the era of gravitational-wave astronomy, the research value of GR has become increasingly prominent. Whether driven by the breakthrough observations of the Laser Interferometer Gravitational-Wave Observatory (LIGO) or China’s ongoing Ali Primordial Gravitational Wave Detection project (aimed at exploring the $10^{-16}\sim10^{-18}\text{ Hz}$ frequency band), tasks such as signal identification, parameter estimation, and the analysis of physical mechanisms rely heavily on accurate theoretical modeling and numerical simulations of extreme astrophysical processes, particularly binary black hole (BBH) mergers.

In the highly dynamic, strong-field regime of BBH mergers, extreme spacetime curvature and relativistic velocities cause the nonlinear terms of the EFEs to dominate, rendering analytical approximation methods (such as post-Newtonian approximations and perturbation theory) invalid. Consequently, the discrete iterative schemes of Numerical Relativity (NR) emerge as the only viable approach for accurately solving the EFEs in these scenarios. Currently, mainstream technical frameworks typically rely on the BSSN formulation or the CCZ4 constraint-damping system under the ADM decomposition to ensure numerical stability. These frameworks utilize high-order finite difference methods coupled with Runge-Kutta time integration for discretization. Furthermore, to bridge the vast scale disparity between the black hole horizon singularity and gravitational wave wavelengths, core algorithms widely integrate Berger-Oliger-style adaptive mesh refinement (AMR) and employ MPI parallel communication strategies based on domain decomposition. This architecture enables efficient numerical simulations of extreme compact object coalescences, such as BBH mergers, on supercomputers.

However, these mainstream techniques still face severe challenges when addressing the demands of ultra-high precision and large-scale simulations. The low arithmetic intensity of finite difference operators is heavily bound by memory bandwidth, making it difficult to fully exploit the peak floating-point performance of modern hardware. Within the Berger-Oliger AMR architecture, frequent inter-level interpolations during mesh regridding induce massive MPI communication overhead, making the code highly susceptible to the “communication wall” bottleneck during large-scale parallelization. Additionally, load imbalances caused by dynamic mesh evolution, coupled with the complexity of porting intricate control logic to heterogeneous accelerators, constitute the core technical barriers limiting the scalability and computational efficiency of NR codes.

This work focuses on the in-depth performance optimization of the open-source numerical relativity code AMSS-NCKU, aiming to resolve computational bottlenecks in high-precision spacetime evolution. By fine-tuning key modules—including the finite difference computational kernels, AMR, and MPI parallel communication—we employ a comprehensive suite of optimization strategies. These include the decoupling of evolution-analysis loops, GPU heterogeneous acceleration, process affinity binding, communication aggregation optimization, 3D interpolation algorithm improvements, recursion elimination, and compiler optimization flag tuning. The ultimate objective is to minimize total simulation time, thereby solidifying numerical relativity’s position as a quintessential cutting-edge application in high-performance computing (HPC) and providing a highly efficient computational engine for future large-scale gravitational wave data analysis.

Clusters Configuration in AMSS-NCKU Problem#

The configuration of the computational cluster we used is shown below.

Component	Specification
CPU	Intel Xeon Gold 6348 Processor 2.60 GHz 28 cores × 2
Memory	DDR4 ECC 3200MT/s 16GB × 16
Disk	480GB SSD SATA × 1
HCA	Mellanox ConnectX-5 EDR 100 Gb/s InfiniBand HCA
Switch	Mellanox SB7800 EDR 100Gb/s InfiniBand Smart Edge Switches
Cable	Mellanox MCP1600-E0xxEyy 100Gb/s QSFP28 DAC Cable, 3m

The software environment configuration is shown below.

Component	Specification
Compiler	GCC-14.2.0 \| Intel® oneAPI DPC++/C++ Compiler 2025.3.0
Operating System	Rocky Linux 8.10 (Green Obsidian)
Software	Python-3.13
Performance Analysis Tool	Intel Vtune Profiler

We adopted the Intel oneAPI 2025.3 suite to leverage its advanced vectorization capabilities and high-performance MPI libraries.

Intel oneAPI Installation#

1
module load gcc/14.2.0
2

3
# install Base Toolkit
4
sh intel-oneapi-base-toolkit-2025.3.0.375_offline.sh \
5
    -a --silent --eula accept \
6
    --install-dir=/data/home/202431060045/Share/oneapi25
7

8
# install HPC Toolkit, include Fortran and MPI
9
sh intel-oneapi-hpc-toolkit-2025.3.0.381_offline.sh \
10
    -a --silent --eula accept \
11
    --install-dir=/data/home/202431060045/Share/oneapi25
12

13
# Initialize the environment
14
source /data/home/202431060045/Share/oneapi25/setvars.sh

Installation Process#

The makefile.inc file was modified as follows:

1
LDLIBS = -lifcore -lifport -limf -lsvml -lintlc -L/usr/lib64
2
CXXAPPFLAGS  = -O3 -Wno-deprecated -Dfortran3 -Dnewc
3
f90appflags = -O3 -fpp
4
f90          = ifx
5
f77          = ifx
6
CXX          = icpx
7
CC           = icx
8
CLINKER      = mpiicpx

AMSS-NCKU Execution#

To ensure stability under Intel oneAPI optimizations, we adjusted the Linux resource limits to avoid stack overflows and ensure sufficient pinned memory for InfiniBand communication.

1
#!/bin/bash
2
#SBATCH -p cn
3
#SBATCH -J AMSS-baseline
4
#SBATCH --nodes=2
5
#SBATCH --ntasks-per-node=32
6
#SBATCH --cpus-per-task=1
7
#SBATCH --chdir=/data/home/202431060045/yr/AMSS-NCKU-baseline
8
#SBATCH --output=./logs/run_amss_%j.out
9
#SBATCH --error=./logs/run_amss_%j.err
10

11
ulimit -s unlimited
12
ulimit -l unlimited
13

14
module load anaconda/3-2023.03
15
conda activate AMSS
16
module load gcc/14.2.0
17

18
source /data/home/202431060045/Share/oneapi25/setvars.sh
19

20
ROOT="/data/home/202431060045/yr/AMSS-NCKU-baseline"
21

22
cd ${ROOT}
23

24
echo "Script started at: $(date +"%Y-%m-%d %H:%M:%S")"
25

26
python AMSS_NCKU_Program.py
27

28
echo "Script finished at: $(date +"%Y-%m-%d %H:%M:%S")"

The total execution time of the baseline code is:

1
This Program Cost = 47894.33059215546 Seconds

Correctness Verification#

To verify the reliability of the optimized code, we used the execution results of the initial unmodified code as a reference. The Root Mean Square (RMS) Error was computed using:

RMS=\sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(\frac{|r_{1i}-r_{2i}|}{\max(r_{1i},r_{2i})}\right)^2}

Analysis of the Overall Structure#

The overall architecture of the AMSS-NCKU simulation code can be abstracted into three primary stages:

Data initialization
Spacetime evolution-analysis iteration
Result visualization

The vast majority of computational cost is concentrated in the spacetime evolution-analysis iteration stage.

During the evolution phase, the main evolution loop is driven by the Evolve function, which invokes the RecursiveStep function at each time step to update the adaptive mesh. Specifically, RecursiveStep employs a top-down hierarchical traversal strategy: it first calls the Step function on the coarsest grid level (Level 0), then recursively proceeds to finer refinement levels.

Because refined grid levels require interpolated boundary conditions from coarser levels and must later project high-accuracy data back upward, strict temporal dependencies arise between AMR hierarchy levels. These dependencies become the primary bottleneck hindering task parallelization.

AMSS-NCKU Architecture

Recursion Elimination Strategy Based on Iteration#

The deep recursive invocation mechanism of RecursiveStep introduces substantial stack overhead and complicates performance profiling.

Although directly decoupling the spacetime dependencies is mathematically difficult, we adopted a refactoring strategy that replaces recursive calls with explicit iteration loops. This transformation significantly reduces performance analysis complexity and exposes the true computational hotspots.

Recursive Algorithm (Before)#

1
RecursiveStep(level):
2
    Step(level)
3
    RecursiveStep(level + 1)
4
    EnforceConstraints()

Iterative Algorithm (After)#

1
IterativeRecursiveStep():
2
    while level >= start_level:
3
        execute_step()
4
        update_constraints()
5
        handle_return_logic()

Flame graph of Recursive Algorithm

Flame graph of Iterative Simulation

Task-Based Process Splitting#

The original evolution-analysis pipeline used a strongly serially coupled architecture. After each evolution step, the program blocked until analysis finished.

To eliminate this unnecessary synchronization barrier, we introduced an asynchronous process-group decoupling strategy.

The global MPI process pool is partitioned into:

Evolution process group
Analysis process group

Evolution processes transmit grid state data using MPI_Send, then immediately continue to the next timestep without waiting. Analysis processes independently receive data via MPI_Recv and execute post-processing asynchronously.

This approach significantly improves overlap between evolution and analysis workloads.

Evolution-Analysis Task Decoupling

Odd-Even Communication Scheme#

The initial “half-split block mapping” strategy assigned the first half of MPI ranks to evolution and the second half to analysis.

However, large-scale deployment caused severe cross-node communication overhead.

To solve this issue, we implemented an “Interleaved Pairing” strategy:

Even ranks → evolution tasks
Odd ranks → analysis tasks

This topology-aware mapping ensures most communicating process pairs reside on the same node or NUMA domain, reducing expensive InfiniBand traffic.

Half Communication

Odd-Even Communication

Distributed File I/O#

After process-group partitioning, the original global output logic became incompatible with the separated communication domains.

We therefore refactored the monitor logic:

Evolution group rank 0 outputs:
- bssn_BH.dat
- bssn_constraint.dat
Analysis group rank 0 outputs:
- bssn_psi4.dat
- bssn_ADMQs.dat

This eliminates unnecessary inter-group synchronization and avoids I/O contention.

Refactored Monitor Logic

Optimization of 3D Spatial Interpolation#

Performance profiling revealed that the Fortran-based polin3 interpolation routine dominated the analysis phase runtime.

The original implementation repeatedly invoked 1D interpolation inside deeply nested loops.

Before Optimization#

1
for z:
2
    for y:
3
        interpolate_x()
4

5
for z:
6
    interpolate_y()
7

8
interpolate_z()

To improve vectorization and cache locality, we redesigned the interpolation as array-level batch operations.

After Optimization#

1
tmp  = interpolate_x_batch(data)
2
tmp2 = interpolate_y_batch(tmp)
3
res  = interpolate_z_batch(tmp2)

This optimization reduced analysis latency from approximately 8 seconds to 3 seconds per step.

Mathematical Equivalence of Interpolation Refactoring#

The 3D interpolation operator can be written as:

\mathcal{I}_{xyz}f(x^*,y^*,z^*) = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{k=1}^{n_z} f_{ijk}L_i(x^*)L_j(y^*)L_k(z^*)

where the 1D Lagrange basis function is:

L_i(x^*) = \prod_{\substack{m=1 \\ m\neq i}}^{n_x} \frac{x^*-x_m}{x_i-x_m}

The tensor-product structure satisfies:

\mathcal{I}_{xyz} = \mathcal{I}_z \circ \mathcal{I}_y \circ \mathcal{I}_x

Therefore, sequential 1D interpolation is mathematically equivalent to direct tensor-product interpolation.

Exploration of FCFS Communication Scheduling#

We explored a “first-come, first-served” communication strategy using MPI_Waitany.

The idea was to process incoming messages immediately after arrival, overlapping communication with unpacking work.

However, large-scale testing revealed several problems:

Additional runtime overhead from frequent MPI_Waitany
Increased synchronization complexity
Race-condition risks
Unstable runtime behavior

Ultimately, the original “bulk communication + unified processing” strategy proved more robust and predictable.

Compiler-Level Optimization#

Compiler optimization flags significantly impacted both performance and numerical precision.

Aggressive optimization of C/C++ modules introduced unacceptable RMS errors (~5%), primarily due to unstable vectorized transformations.

Therefore:

C/C++ modules → conservative optimization
Fortran modules → aggressive optimization

The Fortran compiler flags were:

Flag	Description
`-O3`	Highest optimization level
`-mtune=icelake-server`	Tune for Ice Lake CPUs
`-ffast-math`	Fast floating-point optimizations
`-funroll-loops`	Loop unrolling
`-march=native`	Host CPU optimization
`-fpp`	Enable preprocessor

This strategy improved runtime by roughly 1 second while maintaining RMS error at approximately $10^{-6}$ .

GPU Implementation#

We also explored GPU acceleration using:

8 × NVIDIA A100 80GB SXM GPUs
CUDA 12.4

Although GPU kernels achieved strong acceleration individually, PCIe synchronization overhead severely limited end-to-end gains.

The optimized CPU implementation ultimately outperformed the GPU version in total runtime efficiency.

Optimizations of the GPU-Based BSSN Solver#

Several GPU-side optimizations were introduced:

Replaced deprecated cudaThreadSynchronize
Corrected boundary-condition logic
Introduced GPU memory pools
Eliminated frequent cudaMalloc and cudaFree
Implemented multi-GPU parallelism

Performance profiling confirmed substantial throughput improvements.

Flame graph of ABEGPU

Acceleration of the GPU Analysis Module#

The interpolation routine Interp_Points was migrated entirely to the GPU.

Key optimizations included:

Massive thread-level parallelism
Device-side Neville interpolation
Constant memory optimization
Flattened nested pointer structures

The final interpolation efficiency improved by nearly 5×.

Program call graph under Vtune

Optimized interpolation function

Acceleration of the GPU prolong3 Routine#

The sixth-order prolongation operator prolong3 was also ported to CUDA.

Optimizations included:

Custom interpolation kernels
Symmetry boundary kernels
Constant-memory coefficient storage

However, frequent Host-to-Device and Device-to-Host transfers limited the overall benefit.

Function call proportion

Performance Comparison#

Determining the Baseline#

The baseline testing explored multiple MPI process counts.

Process Number	Time per Iteration	RMS
4	115.0041 s	0.0203%
8	83.0201 s	0.0203%
16	66.1602 s	0.0203%
32	49.9385 s	0.0203%
56	26.9602 s	0.0203%
64	19.9866 s	0.0000%
128	17.1985 s	0.0203%

The 64-process configuration was selected as the final baseline.

Time Comparison#

The CPU and GPU optimization results are summarized below.

CPU/GPU Optimization Comparison

Summary#

After extensive optimization and profiling, the CPU-optimized implementation was selected as the final solution.

Final Results#

Device	Version	Total Time (s)	Speedup
CPU	Baseline	47894.33059215546	1.00×
CPU	Final Optimized	9531.030975341797	5.02×
GPU	Baseline	187266.8326153278486	1.00×
GPU	Final Optimized	19443.30318969726588	9.63×

The final RMS error remained at 0.0000%, fully satisfying precision requirements.

Final Output Figures#

ADM Constraint Grid Level 0

BH Trajectory XY

BH Trajectory 21 XY

YRemmmm