ICPP 2024: Gotland, Sweden
- Proceedings of the 53rd International Conference on Parallel Processing, ICPP 2024, Gotland, Sweden, August 12-15, 2024. ACM 2024, ISBN 979-8-4007-1793-2
Algorithm Optimization
- Wojciech Kwedlo:
  Parallel Iterative Mistake Minimization (IMM) clustering algorithm for shared-memory systems. 1-10
- Subhajit Sahu, Kishore Kothapalli, Dip Sankar Banerjee:
  Fast Leiden Algorithm for Community Detection in Shared Memory Setting. 11-20
- Xianglin Wang, Xin Yi, Hengbiao Yu, Chun Huang, Lin Peng:
  Parallel Optimization for Accelerating the Generation of Correctly Rounded Elementary Functions. 21-31
- Abhishek V. N. Taraka Josyula, Pritesh Verma, Amar Gaonkar, Amlan Barua, Nikhil Hegde:
  Optimizing a Super-Fast Eigensolver for Hierarchically Semiseparable Matrices. 32-41
Best Paper Finalists
- Donney Fan, Ben Liang:
  Online Non-preemptive Multi-Resource Scheduling for Weighted Completion Time on Multiple Machines. 42-51
- Yi Zong, Peinan Yu, Haopeng Huang, Wei Xue:
  FP16 Acceleration in Structured Multigrid Preconditioner for Real-World Applications. 52-62
- Kan Zhong, Zhiwang Yu, Qiao Li, Xianqiang Luo, Linbo Long, Yujuan Tan, Ao Ren, Duo Liu:
  DPC: DPU-accelerated High-Performance File System Client. 63-72
Co-design
- Sonia Rani Gupta, Nikela Papadopoulou, Jing Chen, Miquel Pericàs:
  Co-Design of Convolutional Algorithms and Long Vector RISC-V Processors for Efficient CNN Model Serving. 73-83
- Quentin Anthony, Jacob Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda:
  The Case for Co-Designing Model Architectures with Hardware. 84-96
- Qifeng Pan, Ralf Schneider:
  Improving efficiency of Monte Carlo method via code intrinsic framework. 97-106
- Yongseok Soh, Ramakrishnan Kannan, Piyush Sao, Jee W. Choi:
  Accelerated Constrained Sparse Tensor Factorization on Massively Parallel Architectures. 107-116
Communication and Networks
- Ujjaini Mukhopadhyay, Alok Tripathy, Oguz Selvitopi, Katherine A. Yelick, Aydin Buluç:
  Sparsity-Aware Communication for Distributed Graph Neural Network Training. 117-126
- Jing Xu, Zhan Wang, Fan Yang, Ning Kang, Zhenlong Ma, Guojun Yuan, Guangming Tan, Ninghui Sun:
  FNCC: Fast Notification Congestion Control in Data Center Networks. 127-137
- Wen Xu, Juncheng Wang, Ben Liang, Gary Boudreau, Hamza Sokun:
  Distributed Minimax Fair Optimization over Hierarchical Networks. 138-147
Communication and Scalability
- Jing Peng, Zihan Li, Shaohuai Shi, Bo Li:
  Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning. 148-157
- Xinbiao Gan, Tiejun Li, Qiang Zhang, Bo Yang, Xinhai Chen, Jie Liu:
  SuperCSR: A Space-Time-Efficient CSR Representation for Large-scale Graph Applications on Supercomputers. 158-167
- Tim Beringer, Jakob Stock, Arya Mazaheri, Felix Wolf:
  Dissecting Convolutional Neural Networks for Runtime and Scalability Prediction. 168-178
GPU Memory
- Jiajian Zhang, Fangyu Wu, Hai Jiang, Guangliang Cheng, Genlang Chen, Qiufeng Wang:
  SyncMalloc: A Synchronized Host-Device Co-Management System for GPU Dynamic Memory Allocation across All Scales. 179-188
- Abdun Nihaal, Madhu Mutyam:
  Selective Memory Compression for GPU Memory Oversubscription Management. 189-198
- Gabin Schieffer, Jacob Wahlgren, Jie Ren, Jennifer Faj, Ivy Peng:
  Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper. 199-209
In-situ Workflow
- Jiahui Liu, Tobias Edwards, Kristina Durovic, Philipp Schlatter, Tino Weinkauf:
  In-Situ Binary Segmentation of 3D time-dependent Flows into Laminar and Turbulent Regions. 210-219
- Dewi Yokelson, Mikhail Titov, Srinivasan Ramesh, Ozgur O. Kilic, Matteo Turilli, Shantenu Jha, Allen D. Malony:
  Enabling Performance Observability for Heterogeneous HPC Workflows with SOMA. 220-230
- Jaime Cernuda, Jie Ye, Anthony Kougkas, Xian-He Sun:
  HStream: A hierarchical data streaming engine for high-throughput scientific applications. 231-240
Parallel Algorithm
- Jialin Li, Zhichen Feng, Yaqian Gao, Shaobo Tian, Haoyuan Zhang, Huang Ye, Jian Zhang:
  High-Performance 3D convolution on the Latest Generation Sunway Processor. 241-251
- Gaurav Bhardwaj, Bapi Chatterjee, Abhinav Sharma, Sathya Peri, Siddharth Nayak:
  Kanva: A Lock-free Learned Search Data Structure. 252-261
- Haopeng Huang, Yuyang Jin, Wei Xue:
  BoostN: Optimizing Imbalanced Neighborhood Communication on Homogeneous Many-Core System. 262-272
Parallel Language
- Buddhi Ashan Mallika Kankanamalage, Satish Puri, Sushil K. Prasad:
  Extending Segment Tree for Polygon Clipping and Parallelizing using OpenMP and OpenACC Directives. 273-283
- Ruben Laso, Diego Krupitza, Sascha Hunold:
  Exploring Scalability in C++ Parallel STL Implementations. 284-293
- Zaman Lantra, Steven A. Wright, Gihan R. Mudalige:
  OP-PIC - an Unstructured-Mesh Particle-in-Cell DSL for Developing Nuclear Fusion Simulations. 294-304
Scheduling Cloud
- Svetlana Kulagina, Henning Meyerhenke, Anne Benoit:
  Mapping Large Memory-constrained Workflows onto Heterogeneous Platforms. 305-316
- Liang Zhang, Hongzi Zhu, Yunzhe Li, Jiangang Shen, Minyi Guo:
  The Blind and the Elephant: A Preference-aware Edge Video Analytics Scheduler for Maximizing System Benefit. 317-326
- Tomasz Kanas, Krzysztof Rzadca:
  Diminishing cold starts in serverless computing with approximation algorithms. 327-336
- Jiawei Huang, Qile Wang, Zhaoyi Li, Yijun Li, Zihao Chen, Sitan Li, Jing Shao, Jingling Liu, Min Zhan, Jianxin Wang:
  Achieving Efficient Scheduling based on Accurate Measurement of Small Flows in Data Center. 337-346
- Huadong Li, Hui Liu, Aoqi Chen, Xirui Ma, Junzhao Du:
  Thawbringer: An Orchestrator to Mitigate Cascading Cold Starts of Serverless Function Chains. 347-356
- Ying Zheng, Lei Jiao, Han Yang, Lulu Chen, Ying Liu, Yuxiao Wang, Yuedong Xu, Xin Wang, Zongpeng Li:
  Online Scheduling and Pricing for Multi-LoRA Fine-Tuning Tasks. 357-366
- Xin Tan, Jiamin Li, Yitao Yang, Jingzong Li, Hong Xu:
  Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths. 367-376
Scientific Simulations
- Yaqian Gao, Jian Zhang, Huang Ye, Xuebin Chi:
  Large-scale Phase-Field Simulations for Solid-Solid Phase Transformations involving Elastic Energy. 377-387
- Shui Jiang, Rongliang Fu, Lukas Burgholzer, Robert Wille, Tsung-Yi Ho, Tsung-Wei Huang:
  FlatDD: A High-Performance Quantum Circuit Simulator using Decision Diagram and Flat Array. 388-399
- Yi Zhang, Ziyu Zhang, Yang Zhao, Junshi Chen, Hong An, Zhanming Wang, Longkui Chen:
  Multi-level Load Balancing Strategies for Massively Parallel Smoothed Particle Hydrodynamics Simulation. 400-410
- Ran Zhao, Chao Li, Xiaowei Guo, Sen Zhang, Xi Yang, Tao Tang, Canqun Yang:
  A Motion Trace Decomposition-based overset grid method for parallel CFD simulations with moving boundaries. 411-420
Distributed Systems
- Conor James Green, Mithuna Thottethodi:
  NetSmith: An Optimization Framework for Machine-Discovered Network Topologies. 421-432
- Chen Chen, Li Shen, Yingwen Chen:
  A Distributed Framework for Subgraph Isomorphism Leveraging CPU and GPU Heterogeneous Computing. 433-442
- Jinbin Hu, Ying Liu, Hao Wang, Jin Wang:
  AutoPipe: Automatic Configuration of Pipeline Parallelism in Shared GPU Cluster. 443-452
- Ruisong Zhou, Yuzhan Zhang, Chunhua Li, Ke Zhou, Peng Wang, Gong Zhang, Ji Zhang, Guangyu Zhang:
  HyperDB: a Novel Key Value Store for Reducing Background Traffic in Heterogeneous SSD Storage. 453-463
Federated Learning
- Fuyuan Xia, Chenhao Ying, David S. L. Wei, Wei Chen, Weiting Zhang, Haiming Jin, Yuan Luo:
  ChronusFed: Reinforcement-Based Adaptive Partial Training for Heterogeneous Federated Learning. 464-473
- Md Sirajul Islam, Simin Javaherian, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng:
  FedClust: Tackling Data Heterogeneity in Federated Learning through Weight-Driven Client Clustering. 474-483
- Jiangshan Hao, Fang Dong, Bingheng Cen, Shucun Fu, Ruiting Zhou, Ding Ding:
  HASFL: Harnessing Heterogeneous Models Across Diverse Devices for Enhanced Federated Learning. 484-493
- Na Lv, Zhi Shen, Chen Chen, Zhifeng Jiang, Jiayi Zhang, Quan Chen, Minyi Guo:
  FedCA: Efficient Federated Learning with Client Autonomy. 494-503
GPU Cluster Optimization
- Bowen Zhang, Shuxin Li, Zhuozhao Li:
  MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters. 504-513
- Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Yuanyuan Wang, Fu Wu, Jiezhong Qiu, Aimin Pan:
  Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment. 514-523
- Bowen Yuchi, Heng Shi, Guoqing Bao:
  SPHINX: Search Space-Pruning Heterogeneous Task Scheduling for Deep Neural Networks. 524-533
Graph on GPU
- Chenle Yu, Sara Royuela, Eduardo Quiñones:
  Enhancing Heterogeneous Computing Through OpenMP and GPU Graph. 534-543
- Shinnung Jeong, Sungjun Cho, Yongwoo Lee, Hyunjun Park, Seonyeong Heo, Gwangsun Kim, Youngsok Kim, Hanjun Kim:
  CR2: Community-aware Compressed Regular Representation for Graph Processing on a GPU. 544-554
- Chen Zhao, Ting Yu, Zhigao Zheng, Yuanyuan Zhu, Song Jin, Bo Du, Dacheng Tao:
  SpeedCore: Space-efficient and Dependency-aware GPU Parallel Framework for Core Decomposition. 555-564
- Chih-Chun Chang, Boyang Zhang, Tsung-Wei Huang:
  GSAP: A GPU-Accelerated Stochastic Graph Partitioner. 565-575
- Mahesh Lakshminarasimhan, Mary W. Hall, Samuel Williams, Oscar Antepara:
  BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs. 576-586
- Mithinti Srikanth, Prashant Singh, G. Ramakrishna:
  GPU Algorithms for Fastest Path Problem in Temporal Graphs. 587-596
Memory and Storage
- Wenda Tang, Ying Han, Tianxiang Ai, Guanghui Li, Bin Yu, Xin Yang:
  Yggdrasil: Reducing Network I/O Tax with (CXL-Based) Distributed Shared Memory. 597-606
- Ziwei Xiong, Dejun Jiang, Jin Xiong:
  DiStore: A Fully Memory Disaggregation Friendly Key-Value Store with Improved Tail Latency and Space Efficiency. 607-617
- Liuying Ma, Zhenqing Liu, Jin Xiong, Yue Wu, Renhai Chen, Xi Peng, Ying Zhang, Gong Zhang, Dejun Jiang:
  zQoS: Unleashing full performance capabilities of NVMe SSDs while enforcing SLOs in distributed storage systems. 618-628
Memory Optimization
- Stavroula Zouzoula, Mohammad Ali Maleki, Muhammad Waqar Azhar, Pedro Trancoso:
  Scratchpad Memory Management for Deep Learning Accelerators. 629-639
- Jihu Guo, Rui Xia, Jie Liu, Xiaoxiong Zhu, Xiang Zhang:
  CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. 640-649
- Qisheng Jiang, Lei Jia, Chundong Wang:
  GNNDrive: Reducing Memory Contention and I/O Congestion for Disk-based GNN Training. 650-659
- XinYu Piao, Jong-Kook Kim:
  GMM: An Efficient GPU Memory Management-based Model Serving System for Multiple DNN Inference Models. 660-668
Performance Optimization
- Kaveh Mahdavi:
  A Hybrid Machine Learning Method for Cross-Platform Performance Prediction of Parallel Applications. 669-678
- Fugeng Zhu, Xinxin Qi, Peng Zhang, Jianbin Fang, Tao Tang, Yonggang Che, Kainan Yu, Jing Xie, Chun Huang, Jie Ren:
  Optimizing Stencil Computation on Multi-core DSPs. 679-690
- Ricardo Jesus, Michèle Weiland:
  Evaluating and optimising compiler code generation for NVIDIA Grace. 691-700
Resource Allocation
- Yingwen Chen, Wenxin Li, Huan Zhou, Xiangrui Yang, Yanfei Yin:
  DeInfer: A GPU resource allocation algorithm with spatial sharing for near-deterministic inferring tasks. 701-711
- Jiazhen Zhu, Wenda Tang, Xianglong Meng, Nan Gong, Tianxiang Ai, Guanghui Li, Bin Yu, Xin Yang:
  PheCon: Fine-Grained VM Consolidation with Nimble Resource Defragmentation in Public Cloud Platforms. 712-721
- Dingyu Yang, Ziyang Xiao, Dongxiang Zhang, Shuhao Zhang, Jian Cao, Gang Chen:
  PREACT: Predictive Resource Allocation for Bursty Workloads in a Co-located Data Center. 722-731
Scheduling Cloud
- Siyuan Chen, Decheng Zuo, Zhan Zhang:
  FlexSP: (1 + β)-Choice based Flexible Stream Partitioning for Stateful Operators. 732-741
- Wen Gao, Zhiwen Yu, Hui Xiong, Bin Guo, Liang Wang, Yuan Yao:
  Parallel Task Scheduling in Autonomous Robotic Systems: An Event-Driven Multimodal Prediction Approach. 742-751
- Bin Gao, Zhehui Wang, Zhuomin He, Tao Luo, Weng-Fai Wong, Zhi Zhou:
  IMI: In-memory Multi-job Inference Acceleration for Large Language Models. 752-761
Scheduling Edge
- Bei Ouyang, Shengyuan Ye, Liekang Zeng, Tianyi Qian, Jingyi Li, Xu Chen:
  Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-tuning. 762-771
- Huadong Li, Hui Liu, Aoqi Chen, Xirui Ma, Qiaoqiao Liu, Junzhao Du:
  RIA: Return on Investment Auto-scaler for Serverless Edge Functions. 772-781
- Yan Zhuang, Zhenzhe Zheng, Yunfeng Shao, Bingshuai Li, Fan Wu, Guihai Chen:
  Nebula: An Edge-Cloud Collaborative Learning Framework for Dynamic Edge Environments. 782-791
- Jieyu Lin, Minghao Li, Sai Qian Zhang, Alberto Leon-Garcia:
  Murmuration: On-the-fly DNN Adaptation for SLO-Aware Distributed Inference in Dynamic Edge Environments. 792-801
Tools
- Praseetha M, Madhu Mutyam, Venkata Kalyan Tavva:
  Cache Line Pinning for Mitigating Row Hammer Attack. 802-811
- Jie Ye, Jaime Cernuda, Neeraj Rajesh, Keith Bateman, Orcun Yildiz, Tom Peterka, Arnur Nigmetov, Dmitriy Morozov, Xian-He Sun, Anthony Kougkas, Bogdan Nicolae:
  Viper: A High-Performance I/O Framework for Transparently Updating, Storing, and Transferring Deep Neural Network Models. 812-821
- Siyu Wu, Hailong Yang, Xin You, Ruihao Gong, Yi Liu, Zhongzhi Luan, Depei Qian:
  PRoof: A Comprehensive Hierarchical Profiling Framework for Deep Neural Networks with Roofline Analysis. 822-832
- Simon Schwitanski, Yussur Mustafa Oraji, Cornelius Pätzold, Joachim Jenke, Felix Tomski, Matthias S. Müller:
  RMASanitizer: Generalized Runtime Detection of Data Races in Remote Memory Access Applications. 833-844
Compression
- Tri Nguyen, Md Hasanur Rahman, Sheng Di, Michela Becchi:
  Significantly Improving Fixed-Ratio Compression Framework for Resource-limited Applications. 845-855
- André Weißenberger, Bertil Schmidt:
  Massively Parallel Inverse Block-sorting Transforms for bzip2 Decompression on GPUs. 856-865
- Zichen Tang, Junlin Huang, Rudan Yan, Yuxin Wang, Zhenheng Tang, Shaohuai Shi, Amelie Chi Zhou, Xiaowen Chu:
  Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning. 866-875
Configurable Hardware
- Kyle Zhao Bin Chen, Tarek S. Abdelrahman, Reza Azimi, Tomasz S. Czajkowski, Maziar Goudarzi:
  RoDMap: A Reserve-on-Demand Mapper for Spatially-Configured Coarse-Grained Reconfigurable Arrays. 876-886
- Jie Cheng, Lifu Hu, Wei Xu, Hanhua Chen, Tian Xia:
  Hardware Acceleration of Minimap2 Genomic Sequence Alignment Algorithm. 887-897
- Weilin Zhu, Wei Tong, Hujun Ge, Zuoxian Zhang, Mengran Zhang, Wen Zhou:
  LpaqHP: A High-Performance FPGA Accelerator for LPAQ Compression. 898-907
Distributed Memory
- Piyush Sao, Andrey Prokopenko, Damien Lebrun-Grandié:
  PANDORA: A Parallel Dendrogram Construction Algorithm for Single Linkage Clustering on GPU. 908-918
- Yifan Li, Giulia Guidi:
  High-Performance Sorting-Based K-mer Counting in Distributed Memory with Flexible Hybrid Parallelism. 919-928
- Rui Zhang, Yukai Huang, Sicheng Liang, Shangyi Sun, Shaonan Ma, Chengying Huan, Lulu Chen, Zhihui Lu, Yang Xu, Ming Yan, Jie Wu:
  Revisiting Learned Index with Byte-addressable Persistent Storage. 929-938
Energy-aware Computing
- Hanfei Geng, Yi Sun, Yuanzhe Li, Jichao Leng, Xiangyu Zhu, Xianyuan Zhan, Yuanchun Li, Feng Zhao, Yunxin Liu:
  TESLA: Thermally Safe, Load-Aware, and Energy-Efficient Cooling Control System for Data Centers. 939-949
- Hanlong Liao, Guoming Tang, Deke Guo, Yi Wang, Ruide Cao:
  Rethinking Low-Carbon Edge Computing System Design with Renewable Energy Sharing. 950-960
- Tiago Da Silva Barros, Davide Ferré, Frédéric Giroire, Ramon Aparicio-Pardo, Stephane Perennes:
  Scheduling Machine Learning Compressible Inference Tasks with Limited Energy Budget. 961-970
Federated Learning
- Haoyu Chen, Yuxin Zhang, Jin Zhao, Xin Wang, Yuedong Xu:
  Gradient Free Personalized Federated Learning. 971-980
- Yinlong Li, Hao Zhang, Siyao Cheng, Jie Liu:
  Federated Edge Learning with Blurred or Pseudo Data Sharing. 981-990
- Dezhong Yao, Ziquan Zhu, Tongtong Liu, Zhiqiang Xu, Hai Jin:
  Rethinking Personalized Federated Learning from Knowledge Perspective. 991-1000
GPU Optimization
- Xu Zhang, Guangda Zhang, Lu Wang, Shiqing Zhang, Xia Zhao:
  AdCoalescer: An Adaptive Coalescer to Reduce the Inter-Module Traffic in MCM-GPUs. 1001-1011
- Jaebeom Jeon, Minseong Gil, Junsu Kim, Jaeyong Park, Gunjae Koo, Myung Kuk Yoon, Yunho Oh:
  VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing. 1012-1021
- Qianchao Zhu:
  FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUs. 1022-1031
Memory-centric Computing
- Pingdan Xiao, Qinghui Hong, Sichun Du, Jiliang Zhang:
  CIM-KF: Efficient Computing-in-memory Circuits for Full-Process Execution of Kalman Filter Algorithm. 1032-1041
- Mohammad Sabri Abrebekoh, Marc Riera Villanueva, Antonio González:
  ReDy: A Novel ReRAM-centric Dynamic Quantization Approach for Energy-efficient CNNs. 1042-1051
- Tong Wu, Shuibing He, Jianxin Zhu, Weijian Chen, Siling Yang, Ping Chen, Yanlong Yin, Xuechen Zhang, Xian-He Sun, Gang Chen:
  AUTOHET: An Automated Heterogeneous ReRAM-Based Accelerator for DNN Inference. 1052-1061
- Meven Mognol, Dominique Lavenier, Julien Legriel:
  Parallelization of the Banded Needleman & Wunsch Algorithm on UPMEM PiM Architecture for Long DNA Sequence Alignment. 1062-1071
Simulations on GPU
- Zhiyi Zhang, Pengfei Zhang, Zhuopin Xu, Bingjie Yan, Qi Wang:
  Im2col-Winograd: An Efficient and Flexible Fused-Winograd Convolution for NHWC Format on GPUs. 1072-1081
- Taisuke Boku, Masatake Sugita, Ryohei Kobayashi, Shinnosuke Furuya, Takuya Fujie, Masahito Ohue, Yutaka Akiyama:
  Improving Performance on Replica-Exchange Molecular Dynamics Simulations by Optimizing GPU Core Utilization. 1082-1091