Accelerating Auxiliary-Field Quantum Monte Carlo Simulations of Solids with Graphical Processing Units
At a Glance
| Metadata | Details |
|---|---|
| Publication Date | 2020-05-21 |
| Journal | Journal of Chemical Theory and Computation |
| Authors | Fionn D. Malone, Shuai Zhang, Miguel A. Morales |
| Institutions | Lawrence Livermore National Laboratory (Quantum Simulations Group), United States |
| Citations | 34 |
| Analysis | Full AI Review Included |
Executive Summary
- Core Achievement: Successfully accelerated Auxiliary-Field Quantum Monte Carlo (AFQMC) simulations of solid-state systems by leveraging Graphical Processing Units (GPUs) and exploiting crystal momentum conservation.
- Performance Gain: Achieved an overall speedup of approximately 40× over an optimized CPU implementation for large systems in the k-point representation.
- Scaling and System Size: The GPU implementation enables routine simulation of systems previously considered prohibitive, handling up to 1728 electrons and 7344 basis functions (e.g., carbon diamond with a 6×6×6 k-point grid).
- Accuracy Demonstrated: The cohesive energy of carbon in the diamond structure was systematically converged to the thermodynamic and complete basis set (CBS) limits, yielding a result (-7.56 ± 0.01 eV/atom) within 0.02 eV of the experimental value.
- Key Optimization: Efficiency relies on reformulating the AFQMC algorithm to utilize batched dense linear algebra (BLAS/LAPACK), which is optimal for modern GPU architectures handling numerous small matrix operations inherent in the k-point method.
- Memory Efficiency: The k-point representation offers superior memory scaling for periodic systems, reducing storage by a factor of Nk relative to dense or sparse representations.
Technical Specifications
| Parameter | Value | Unit | Context |
|---|---|---|---|
| Peak Speedup (GPU vs. CPU) | ~40 | Factor | Overall AFQMC block execution time (k-point representation). |
| Energy Evaluation Speedup | Up to 20 | Factor | Speedup achieved by batching k-point sums during local energy calculation. |
| Cohesive Energy (AFQMC Extrap. cc) | -7.56(1) | eV/atom | Carbon diamond, extrapolated to thermodynamic and CBS limit. |
| Experimental Cohesive Energy | -7.545 | eV/atom | Carbon diamond (corrected for zero-point effects). |
| Maximum Basis Functions Simulated | 7344 | Functions | GTH-TZVP basis set, 6×6×6 k-point grid. |
| Maximum Electrons Simulated | 1728 | Electrons | Corresponds to the 6×6×6 k-point grid simulation. |
| GPU Architecture Used | V100 | GPU | Used on Summit supercomputer nodes. |
| Cholesky Factorization Threshold | 1 × 10⁻⁵ | Ha | Maximum error allowed on the diagonal elements. |
| Memory Scaling (k-point) | O(NLM) | Scaling | Storage cost for half-rotated Cholesky vectors (NL is number of Cholesky vectors, M is number of basis functions). |
Key Methodologies
The acceleration of the phaseless AFQMC (ph-AFQMC) algorithm for solids was achieved through a combination of algorithmic reformulation and hardware-specific optimization:
- k-Point Representation: The Hamiltonian (one- and two-electron integrals) is explicitly represented using band and k-point indices, exploiting lattice translational symmetry. This reduces the number of stored two-electron integrals by a factor of Nk (the number of k-points); a small counting sketch after this list illustrates the reduction.
- Cholesky Factorization: Electron repulsion integrals (ERIs) are factorized using a modified Cholesky decomposition, which is compatible with the k-point representation and further reduces storage and computational complexity (a minimal factorization sketch also follows the list).
- GPU Implementation Strategy: The algorithm was designed to maximize GPU utilization by adhering to three principles:
- Concurrent Walker Processing: All walkers in the population are processed simultaneously.
- Dense Linear Algebra: Algorithmic steps are implemented using dense matrix operations, favoring Level 3 BLAS (e.g., GEMM) over less compute-intensive operations.
- Batched Operations: Batched BLAS/LAPACK routines (available via libraries like MAGMA) are used extensively to group numerous small matrix operations (typical of the k-point method) into a single, efficient GPU kernel launch, as illustrated in the batched-matmul sketch after this list.
- Optimized Energy Evaluation: The local energy calculation, often the bottleneck, was reformulated to use batched operations over the k-point sums, achieving a significant speedup (up to 20×) by processing multiple terms and walkers concurrently.
- Hybrid Propagation: The hybrid propagation scheme was used, allowing walkers to propagate for multiple steps (20 iterations per block) before performing the expensive local energy evaluation and walker orthogonalization; the block-loop sketch at the end of this list shows this structure.
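To make the crystal-momentum bookkeeping behind the k-point representation concrete, the following is a small, self-contained counting sketch (not taken from the paper): on a toy 2×2×2 Monkhorst-Pack grid it enumerates k-point quadruplets and confirms that momentum conservation fixes the fourth index, leaving Nk³ stored integral blocks rather than Nk⁴, which is the factor-of-Nk reduction quoted above.

```python
import numpy as np
from itertools import product

# Toy 2x2x2 Monkhorst-Pack grid: a two-electron integral block
# (kp kq | kr ks) survives only if kp - kq + kr - ks is a reciprocal
# lattice vector, i.e. vanishes modulo the grid. Illustrative only.
nk = 2
kpts = np.array(list(product(range(nk), repeat=3)))   # Nk = 8 k-points
Nk = len(kpts)

conserving = sum(
    1
    for p, q, r, s in product(range(Nk), repeat=4)
    if not ((kpts[p] - kpts[q] + kpts[r] - kpts[s]) % nk).any()
)

print(f"{Nk**4} unconstrained k-point quadruplets")
print(f"{conserving} momentum-conserving quadruplets (= Nk^3)")
assert conserving == Nk**3   # storage reduced by a factor of Nk
```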
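The modified Cholesky step can likewise be pictured with a minimal dense sketch. The function below is illustrative only: it factorizes a generic positive semi-definite ERI matrix with a pivoted (modified) Cholesky loop that stops once the largest remaining diagonal error drops below the 1 × 10⁻⁵ Ha threshold quoted in the specifications table. The function name and the explicit residual matrix are simplifications for clarity; the paper's implementation exploits the k-point block structure rather than one dense matrix.

```python
import numpy as np

def modified_cholesky(eri, threshold=1e-5):
    """Pivoted (modified) Cholesky factorization of a PSD ERI matrix.

    Returns Cholesky vectors L (shape N_L x n) such that eri ~= L.T @ L,
    stopping when the largest remaining diagonal residual falls below
    `threshold`. Illustrative sketch, not the production k-point code.
    """
    residual = np.array(eri, dtype=float, copy=True)
    diag = np.diag(residual).copy()
    vectors = []
    while diag.max() > threshold:
        p = int(np.argmax(diag))                 # pivot: largest diagonal error
        vec = residual[:, p] / np.sqrt(diag[p])  # new Cholesky vector
        vectors.append(vec)
        residual -= np.outer(vec, vec)           # rank-1 downdate of the residual
        diag = np.diag(residual).copy()
    return np.array(vectors)

# Toy usage: factorize a random PSD matrix and check the reconstruction.
rng = np.random.default_rng(0)
A = rng.standard_normal((36, 36))
eri = A @ A.T
L = modified_cholesky(eri)
assert np.allclose(L.T @ L, eri, atol=1e-4)
```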
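The payoff of batched operations is easiest to see with a small array example. The sketch below is hypothetical NumPy (array names, shapes, and the overlap-like product are placeholders rather than QMCPACK's actual data layout): a double Python loop over walkers and k-points issues many tiny matrix multiplications, while a single stacked matmul expresses the same work in one call, the pattern that maps onto a single batched cuBLAS/MAGMA GEMM launch on a GPU.

```python
import numpy as np

# Toy dimensions: many k-points, a batch of walkers, small per-k-point blocks.
Nk, Nw, n_occ, n_basis = 64, 10, 8, 26
rng = np.random.default_rng(1)

# Hypothetical per-k-point trial-wavefunction blocks and walker blocks.
trial = rng.standard_normal((Nk, n_occ, n_basis))
walkers = rng.standard_normal((Nw, Nk, n_basis, n_occ))

# Naive pattern: Nk * Nw tiny GEMMs launched one at a time; on a GPU each
# launch is too small to saturate the device.
overlaps_loop = np.empty((Nw, Nk, n_occ, n_occ))
for w in range(Nw):
    for k in range(Nk):
        overlaps_loop[w, k] = trial[k] @ walkers[w, k]

# Batched pattern: one stacked matmul over all walkers and k-points.
# With CuPy (or a batched cuBLAS/MAGMA call) the same expression becomes
# a single batched-GEMM kernel launch.
overlaps_batched = trial[None] @ walkers

assert np.allclose(overlaps_loop, overlaps_batched)
```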
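Finally, the hybrid-propagation block structure reduces to a simple outer loop. In the sketch below, `propagate`, `orthogonalize`, and `local_energy` are placeholder callables standing in for the actual AFQMC kernels; only the block structure (20 cheap propagation steps before each expensive orthogonalization and energy measurement) mirrors the scheme described above.

```python
def run_block(walkers, propagate, orthogonalize, local_energy, steps_per_block=20):
    """One AFQMC block in the hybrid scheme: many cheap propagation steps,
    then a single expensive re-orthogonalization and local-energy measurement."""
    for _ in range(steps_per_block):
        walkers = propagate(walkers)        # importance-sampled propagation step
    walkers = orthogonalize(walkers)        # restore numerical stability of walker orbitals
    return walkers, local_energy(walkers)   # energy evaluated once per block

# Toy usage with trivial stand-ins for the real kernels.
final_walkers, energy = run_block(
    walkers=[1.0, 2.0, 3.0],
    propagate=lambda w: w,
    orthogonalize=lambda w: w,
    local_energy=lambda w: sum(w) / len(w),
)
```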
Commercial Applications
This research significantly advances the capability of highly accurate electronic structure methods, enabling reliable predictive modeling for complex materials:
- Advanced Materials Discovery: Provides the necessary accuracy (approaching chemical accuracy, < 0.02 eV) to predict fundamental properties like cohesive energy, phase stability, and defect formation energies in novel solid-state materials.
- Semiconductor and Microelectronics: Enables high-fidelity simulation of correlated electron effects in crystalline solids, crucial for designing new wide-bandgap semiconductors (like diamond or SiC) and understanding quantum materials.
- High-Performance Computing (HPC) Software: The optimized GPU implementation (QMCPACK) serves as a blueprint for porting other complex quantum many-body algorithms to next-generation exascale architectures.
- Computational Catalysis and Energy: Applicable to modeling strongly correlated materials used in energy applications, such as complex oxides, battery electrodes, and heterogeneous catalysts, where standard Density Functional Theory (DFT) methods often fail.
- Quantum Technology Benchmarking: Provides highly reliable reference calculations for validating and benchmarking results obtained from emerging quantum computing hardware and algorithms.
Original Abstract
We outline how auxiliary-field quantum Monte Carlo (AFQMC) can leverage graphical processing units (GPUs) to accelerate the simulation of solid state systems. By exploiting conservation of crystal momentum in the one- and two-electron integrals, we show how to efficiently formulate the algorithm to best utilize current GPU architectures. We provide a detailed description of different optimization strategies and profile our implementation relative to standard approaches, demonstrating a factor of 40 speedup over a CPU implementation. With this increase in computational power, we demonstrate the ability of AFQMC to systematically converge solid state calculations with respect to basis set and system size by computing the cohesive energy of carbon in the diamond structure to within 0.02 eV of the experimental result.