mobile wallpaper 1mobile wallpaper 2mobile wallpaper 3mobile wallpaper 4
1737 words
9 minutes
AMSS-NCKU
2026-05-01

选自ASC26赛题,AMSS-NCKU程序是我国首个自主开发的数值相对论计算程序,基于有限差分方法与自适应网格加密技术,适用于双黑洞合并等强引力场系统的模拟。

YRemmmm
/
AMSS-NCKU-ASC26
Waiting for api.github.com...
00K
0K
0K
Waiting...

Numerical Relativity Code AMSS-NCKU Optimization#

Before detailing the experimental setup, we first clarify several key concepts. The term “full scale” is defined by a Final_Evolution_Time of 1000.0, while the “test scale” is defined by a Final_Evolution_Time of 5.0. To streamline performance comparisons, any reference to time implies the computational cost of a single evolutionary iteration step, unless explicitly stated otherwise. For the present analysis, the 64-process results are utilized as the baseline (equivalent to 20 seconds per iteration step), a choice that will be justified later in the text.

Topic Background#

General Relativity (GR), formulated by Einstein, describes gravity as the geometric curvature of spacetime and serves as the theoretical cornerstone of modern astrophysics and cosmology. Its core governing equations—the Einstein Field Equations (EFEs)—elucidate the nonlinear coupling between matter distribution and spacetime structure. In the era of gravitational-wave astronomy, the research value of GR has become increasingly prominent. Whether driven by the breakthrough observations of the Laser Interferometer Gravitational-Wave Observatory (LIGO) or China’s ongoing Ali Primordial Gravitational Wave Detection project (aimed at exploring the 10161018 Hz10^{-16}\sim10^{-18}\text{ Hz} frequency band), tasks such as signal identification, parameter estimation, and the analysis of physical mechanisms rely heavily on accurate theoretical modeling and numerical simulations of extreme astrophysical processes, particularly binary black hole (BBH) mergers.

In the highly dynamic, strong-field regime of BBH mergers, extreme spacetime curvature and relativistic velocities cause the nonlinear terms of the EFEs to dominate, rendering analytical approximation methods (such as post-Newtonian approximations and perturbation theory) invalid. Consequently, the discrete iterative schemes of Numerical Relativity (NR) emerge as the only viable approach for accurately solving the EFEs in these scenarios. Currently, mainstream technical frameworks typically rely on the BSSN formulation or the CCZ4 constraint-damping system under the ADM decomposition to ensure numerical stability. These frameworks utilize high-order finite difference methods coupled with Runge-Kutta time integration for discretization. Furthermore, to bridge the vast scale disparity between the black hole horizon singularity and gravitational wave wavelengths, core algorithms widely integrate Berger-Oliger-style adaptive mesh refinement (AMR) and employ MPI parallel communication strategies based on domain decomposition. This architecture enables efficient numerical simulations of extreme compact object coalescences, such as BBH mergers, on supercomputers.

However, these mainstream techniques still face severe challenges when addressing the demands of ultra-high precision and large-scale simulations. The low arithmetic intensity of finite difference operators is heavily bound by memory bandwidth, making it difficult to fully exploit the peak floating-point performance of modern hardware. Within the Berger-Oliger AMR architecture, frequent inter-level interpolations during mesh regridding induce massive MPI communication overhead, making the code highly susceptible to the “communication wall” bottleneck during large-scale parallelization. Additionally, load imbalances caused by dynamic mesh evolution, coupled with the complexity of porting intricate control logic to heterogeneous accelerators, constitute the core technical barriers limiting the scalability and computational efficiency of NR codes.

This work focuses on the in-depth performance optimization of the open-source numerical relativity code AMSS-NCKU, aiming to resolve computational bottlenecks in high-precision spacetime evolution. By fine-tuning key modules—including the finite difference computational kernels, AMR, and MPI parallel communication—we employ a comprehensive suite of optimization strategies. These include the decoupling of evolution-analysis loops, GPU heterogeneous acceleration, process affinity binding, communication aggregation optimization, 3D interpolation algorithm improvements, recursion elimination, and compiler optimization flag tuning. The ultimate objective is to minimize total simulation time, thereby solidifying numerical relativity’s position as a quintessential cutting-edge application in high-performance computing (HPC) and providing a highly efficient computational engine for future large-scale gravitational wave data analysis.


Clusters Configuration in AMSS-NCKU Problem#

The configuration of the computational cluster we used is shown below.

ComponentSpecification
CPUIntel Xeon Gold 6348 Processor 2.60 GHz 28 cores × 2
MemoryDDR4 ECC 3200MT/s 16GB × 16
Disk480GB SSD SATA × 1
HCAMellanox ConnectX-5 EDR 100 Gb/s InfiniBand HCA
SwitchMellanox SB7800 EDR 100Gb/s InfiniBand Smart Edge Switches
CableMellanox MCP1600-E0xxEyy 100Gb/s QSFP28 DAC Cable, 3m

The software environment configuration is shown below.

ComponentSpecification
CompilerGCC-14.2.0 | Intel® oneAPI DPC++/C++ Compiler 2025.3.0
Operating SystemRocky Linux 8.10 (Green Obsidian)
SoftwarePython-3.13
Performance Analysis ToolIntel Vtune Profiler

We adopted the Intel oneAPI 2025.3 suite to leverage its advanced vectorization capabilities and high-performance MPI libraries.

Intel oneAPI Installation#

module load gcc/14.2.0
# install Base Toolkit
sh intel-oneapi-base-toolkit-2025.3.0.375_offline.sh \
-a --silent --eula accept \
--install-dir=/data/home/202431060045/Share/oneapi25
# install HPC Toolkit, include Fortran and MPI
sh intel-oneapi-hpc-toolkit-2025.3.0.381_offline.sh \
-a --silent --eula accept \
--install-dir=/data/home/202431060045/Share/oneapi25
# Initialize the environment
source /data/home/202431060045/Share/oneapi25/setvars.sh

Installation Process#

The makefile.inc file was modified as follows:

LDLIBS = -lifcore -lifport -limf -lsvml -lintlc -L/usr/lib64
CXXAPPFLAGS = -O3 -Wno-deprecated -Dfortran3 -Dnewc
f90appflags = -O3 -fpp
f90 = ifx
f77 = ifx
CXX = icpx
CC = icx
CLINKER = mpiicpx

AMSS-NCKU Execution#

To ensure stability under Intel oneAPI optimizations, we adjusted the Linux resource limits to avoid stack overflows and ensure sufficient pinned memory for InfiniBand communication.

#!/bin/bash
#SBATCH -p cn
#SBATCH -J AMSS-baseline
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --chdir=/data/home/202431060045/yr/AMSS-NCKU-baseline
#SBATCH --output=./logs/run_amss_%j.out
#SBATCH --error=./logs/run_amss_%j.err
ulimit -s unlimited
ulimit -l unlimited
module load anaconda/3-2023.03
conda activate AMSS
module load gcc/14.2.0
source /data/home/202431060045/Share/oneapi25/setvars.sh
ROOT="/data/home/202431060045/yr/AMSS-NCKU-baseline"
cd ${ROOT}
echo "Script started at: $(date +"%Y-%m-%d %H:%M:%S")"
python AMSS_NCKU_Program.py
echo "Script finished at: $(date +"%Y-%m-%d %H:%M:%S")"

The total execution time of the baseline code is:

This Program Cost = 47894.33059215546 Seconds

Correctness Verification#

To verify the reliability of the optimized code, we used the execution results of the initial unmodified code as a reference. The Root Mean Square (RMS) Error was computed using:

RMS=1Mi=1M(r1ir2imax(r1i,r2i))2RMS=\sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(\frac{|r_{1i}-r_{2i}|}{\max(r_{1i},r_{2i})}\right)^2}

Analysis of the Overall Structure#

The overall architecture of the AMSS-NCKU simulation code can be abstracted into three primary stages:

  1. Data initialization
  2. Spacetime evolution-analysis iteration
  3. Result visualization

The vast majority of computational cost is concentrated in the spacetime evolution-analysis iteration stage.

During the evolution phase, the main evolution loop is driven by the Evolve function, which invokes the RecursiveStep function at each time step to update the adaptive mesh. Specifically, RecursiveStep employs a top-down hierarchical traversal strategy: it first calls the Step function on the coarsest grid level (Level 0), then recursively proceeds to finer refinement levels.

Because refined grid levels require interpolated boundary conditions from coarser levels and must later project high-accuracy data back upward, strict temporal dependencies arise between AMR hierarchy levels. These dependencies become the primary bottleneck hindering task parallelization.

AMSS-NCKU Architecture


Recursion Elimination Strategy Based on Iteration#

The deep recursive invocation mechanism of RecursiveStep introduces substantial stack overhead and complicates performance profiling.

Although directly decoupling the spacetime dependencies is mathematically difficult, we adopted a refactoring strategy that replaces recursive calls with explicit iteration loops. This transformation significantly reduces performance analysis complexity and exposes the true computational hotspots.

Recursive Algorithm (Before)#

RecursiveStep(level):
Step(level)
RecursiveStep(level + 1)
EnforceConstraints()

Iterative Algorithm (After)#

IterativeRecursiveStep():
while level >= start_level:
execute_step()
update_constraints()
handle_return_logic()

Flame graph of Recursive Algorithm

Flame graph of Iterative Simulation


Task-Based Process Splitting#

The original evolution-analysis pipeline used a strongly serially coupled architecture. After each evolution step, the program blocked until analysis finished.

To eliminate this unnecessary synchronization barrier, we introduced an asynchronous process-group decoupling strategy.

The global MPI process pool is partitioned into:

  • Evolution process group
  • Analysis process group

Evolution processes transmit grid state data using MPI_Send, then immediately continue to the next timestep without waiting. Analysis processes independently receive data via MPI_Recv and execute post-processing asynchronously.

This approach significantly improves overlap between evolution and analysis workloads.

Evolution-Analysis Task Decoupling


Odd-Even Communication Scheme#

The initial “half-split block mapping” strategy assigned the first half of MPI ranks to evolution and the second half to analysis.

However, large-scale deployment caused severe cross-node communication overhead.

To solve this issue, we implemented an “Interleaved Pairing” strategy:

  • Even ranks → evolution tasks
  • Odd ranks → analysis tasks

This topology-aware mapping ensures most communicating process pairs reside on the same node or NUMA domain, reducing expensive InfiniBand traffic.

Half Communication

Odd-Even Communication


Distributed File I/O#

After process-group partitioning, the original global output logic became incompatible with the separated communication domains.

We therefore refactored the monitor logic:

  • Evolution group rank 0 outputs:

    • bssn_BH.dat
    • bssn_constraint.dat
  • Analysis group rank 0 outputs:

    • bssn_psi4.dat
    • bssn_ADMQs.dat

This eliminates unnecessary inter-group synchronization and avoids I/O contention.

Refactored Monitor Logic


Optimization of 3D Spatial Interpolation#

Performance profiling revealed that the Fortran-based polin3 interpolation routine dominated the analysis phase runtime.

The original implementation repeatedly invoked 1D interpolation inside deeply nested loops.

Before Optimization#

for z:
for y:
interpolate_x()
for z:
interpolate_y()
interpolate_z()

To improve vectorization and cache locality, we redesigned the interpolation as array-level batch operations.

After Optimization#

tmp = interpolate_x_batch(data)
tmp2 = interpolate_y_batch(tmp)
res = interpolate_z_batch(tmp2)

This optimization reduced analysis latency from approximately 8 seconds to 3 seconds per step.


Mathematical Equivalence of Interpolation Refactoring#

The 3D interpolation operator can be written as:

Ixyzf(x,y,z)=i=1nxj=1nyk=1nzfijkLi(x)Lj(y)Lk(z)\mathcal{I}_{xyz}f(x^*,y^*,z^*) = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{k=1}^{n_z} f_{ijk}L_i(x^*)L_j(y^*)L_k(z^*)

where the 1D Lagrange basis function is:

Li(x)=m=1minxxxmxixmL_i(x^*) = \prod_{\substack{m=1 \\ m\neq i}}^{n_x} \frac{x^*-x_m}{x_i-x_m}

The tensor-product structure satisfies:

Ixyz=IzIyIx\mathcal{I}_{xyz} = \mathcal{I}_z \circ \mathcal{I}_y \circ \mathcal{I}_x

Therefore, sequential 1D interpolation is mathematically equivalent to direct tensor-product interpolation.


Exploration of FCFS Communication Scheduling#

We explored a “first-come, first-served” communication strategy using MPI_Waitany.

The idea was to process incoming messages immediately after arrival, overlapping communication with unpacking work.

However, large-scale testing revealed several problems:

  • Additional runtime overhead from frequent MPI_Waitany
  • Increased synchronization complexity
  • Race-condition risks
  • Unstable runtime behavior

Ultimately, the original “bulk communication + unified processing” strategy proved more robust and predictable.


Compiler-Level Optimization#

Compiler optimization flags significantly impacted both performance and numerical precision.

Aggressive optimization of C/C++ modules introduced unacceptable RMS errors (~5%), primarily due to unstable vectorized transformations.

Therefore:

  • C/C++ modules → conservative optimization
  • Fortran modules → aggressive optimization

The Fortran compiler flags were:

FlagDescription
-O3Highest optimization level
-mtune=icelake-serverTune for Ice Lake CPUs
-ffast-mathFast floating-point optimizations
-funroll-loopsLoop unrolling
-march=nativeHost CPU optimization
-fppEnable preprocessor

This strategy improved runtime by roughly 1 second while maintaining RMS error at approximately 10610^{-6}.


GPU Implementation#

We also explored GPU acceleration using:

  • 8 × NVIDIA A100 80GB SXM GPUs
  • CUDA 12.4

Although GPU kernels achieved strong acceleration individually, PCIe synchronization overhead severely limited end-to-end gains.

The optimized CPU implementation ultimately outperformed the GPU version in total runtime efficiency.


Optimizations of the GPU-Based BSSN Solver#

Several GPU-side optimizations were introduced:

  • Replaced deprecated cudaThreadSynchronize
  • Corrected boundary-condition logic
  • Introduced GPU memory pools
  • Eliminated frequent cudaMalloc and cudaFree
  • Implemented multi-GPU parallelism

Performance profiling confirmed substantial throughput improvements.

Flame graph of ABEGPU


Acceleration of the GPU Analysis Module#

The interpolation routine Interp_Points was migrated entirely to the GPU.

Key optimizations included:

  • Massive thread-level parallelism
  • Device-side Neville interpolation
  • Constant memory optimization
  • Flattened nested pointer structures

The final interpolation efficiency improved by nearly 5×.

Program call graph under Vtune

Optimized interpolation function


Acceleration of the GPU prolong3 Routine#

The sixth-order prolongation operator prolong3 was also ported to CUDA.

Optimizations included:

  • Custom interpolation kernels
  • Symmetry boundary kernels
  • Constant-memory coefficient storage

However, frequent Host-to-Device and Device-to-Host transfers limited the overall benefit.

Function call proportion


Performance Comparison#

Determining the Baseline#

The baseline testing explored multiple MPI process counts.

Process NumberTime per IterationRMS
4115.0041 s0.0203%
883.0201 s0.0203%
1666.1602 s0.0203%
3249.9385 s0.0203%
5626.9602 s0.0203%
6419.9866 s0.0000%
12817.1985 s0.0203%

The 64-process configuration was selected as the final baseline.


Time Comparison#

The CPU and GPU optimization results are summarized below.

CPU/GPU Optimization Comparison


Summary#

After extensive optimization and profiling, the CPU-optimized implementation was selected as the final solution.

Final Results#

DeviceVersionTotal Time (s)Speedup
CPUBaseline47894.330592155461.00×
CPUFinal Optimized9531.0309753417975.02×
GPUBaseline187266.83261532784861.00×
GPUFinal Optimized19443.303189697265889.63×

The final RMS error remained at 0.0000%, fully satisfying precision requirements.

Final Output Figures#

ADM Constraint Grid Level 0

BH Trajectory XY

BH Trajectory 21 XY

Share

If this article helped you, please share it with others!

AMSS-NCKU
https://blog.yremmmm.com/posts/amss-ncku/
Author
why?
Published at
2026-05-01
License
CC BY-NC-SA 4.0

Some information may be outdated

Table of Contents