Abdul Dakkak is a Ph.D. candidate in Computer Science at the University of Illinois at Urbana-Champaign (UIUC), advised by Professor Wen-mei Hwu. He is a senior compiler developer at Wolfram Research, where he leads the Wolfram Compiler effort. Abdul's research interests lie at the intersection of programming languages and accelerated computing, with a focus on compiling high-level languages into performant code that runs on different hardware. In the process, he has developed industry-grade tools for compiling, running, profiling, and introspecting real-world applications to optimize their performance across both the hardware and software stack. As a primary developer of the Wolfram Compiler, Abdul has developed the Wolfram type system and architected the Wolfram runtime. As a result, compiled Wolfram code matches the speed of hand-optimized C code and can target accelerators and multi-node systems.
Abdul has also been involved in teaching. He developed tools that enable teaching in large classrooms and is the author of WebGPU and RAI. Both WebGPU and RAI have over 100k users and are used at more than 14 universities (including the University of Michigan, BSC/UPC, UIC, the University of Tennessee, …) to evaluate over 2.5 million labs. He has helped teach the Coursera HPP course (3 times), the introductory and advanced CUDA courses (2 times), and the PUMPS summer school at BSC (4 times).
In addition, Abdul has been developing MLModelScope, a distributed platform that allows people to deploy, profile, and experiment with ML/DL frameworks and models. The tools are used to inform system design for Deep Learning model serving and to develop highly tuned GPU kernels for model inference.
PhD Candidate in Computer Science, 2013-
University of Illinois Urbana-Champaign
B.A. in Pure Mathematics, 2009
University of Toledo
The popularity of data- and scientific-oriented applications, abundance of on-demand compute resources, and scarcity of domain expert programmers have given rise to high-level scripting languages. These high-level scripting languages offer a fast way to translate ideas into code, but tend to pay a heavy performance overhead. To alleviate the performance penalty, implementations of these languages often offer a compilation path for a subset of the language. In this paper we present the design and implementation of the Wolfram Language compiler, the production compiler for the Wolfram Language. We show how popular language features and runtime behavior, expected by Wolfram Language developers, are efficiently implemented within the compiler. We then show how the compiler provides a frictionless path to migrate programs from the interpreter to the compiler. We evaluate the compiler and show that compiled code matches the performance of highly tuned hand-written C code. The compiler has been released as a prominent feature of Wolfram Engine v12 and is readily available to developers.
There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which limits the thoroughness and usefulness of the profiling results. This paper proposes XSP, an across-stack profiling design that gives a holistic and hierarchical view of ML model execution. XSP leverages distributed tracing to aggregate and correlate profile data from different sources. XSP introduces a leveled and iterative measurement approach that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead. We couple the profiling design with an automated analysis pipeline to systematically analyze 65 state-of-the-art ML models. We demonstrate that XSP provides insights which would be difficult to discern otherwise.
As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in improving their response time. An important avenue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, the current profiling tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced. We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the “lower-bound” latency of DL models using the benchmark data and informs optimizations of model execution. The “lower-bound” latency metric estimates the ideal model execution on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and using Tensor Cores.
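One way to picture the “lower-bound” latency idea is the minimal sketch below. It is an illustration under stated assumptions, not Benanza's actual analyzer: it assumes layers run sequentially, that each layer is identified by a hypothetical `(operator, shape)` key, and that the benchmark database maps each key to measured latencies per algorithm. The names `lower_bound_latency` and `benchmark_db` are made up for this example.

```python
# Hypothetical sketch: estimate an ideal ("lower-bound") model latency by
# summing, for every layer, the best latency recorded in a benchmark database.
# Assumes sequential layer execution; all names here are illustrative.

def lower_bound_latency(layers, benchmark_db):
    """Return the sum of the fastest benchmarked latency (ms) of each layer."""
    total = 0.0
    for layer in layers:
        # benchmark_db[layer] maps an algorithm name to its measured latency.
        latencies_by_algorithm = benchmark_db[layer]
        total += min(latencies_by_algorithm.values())
    return total


conv = ("conv", (1, 3, 224, 224))
relu = ("relu", (1, 64, 112, 112))

benchmark_db = {
    conv: {"algo_a": 0.5, "algo_b": 0.75},  # made-up latencies in ms
    relu: {"default": 0.25},
}

model_layers = [conv, relu, relu]
estimate = lower_bound_latency(model_layers, benchmark_db)
print(estimate)
```

Comparing such an estimate with the latency measured in a framework would indicate how much room for optimization remains; a fuller treatment would also account for layers that can execute in parallel.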
Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4 × 4 or 16 × 16) to accelerate HPC and deep learning workloads. Although TCUs are prevalent, promise increases in performance and/or energy efficiency, and are heavily used within supercomputers to achieve exascale performance, they suffer from over-specialization: only general matrix multiplication (GEMM) operations on small matrices are supported. In this paper we express both reduction and scan in terms of matrix multiplication operations and map them onto TCUs. To our knowledge, this paper is the first to try to broaden the class of algorithms expressible as TCU operations and is the first to show benefits of this mapping in terms of program simplicity, efficiency, and performance. We implement the algorithms using NVIDIA V100 TCUs and achieve 89% to 98% of peak memory copy bandwidth, and are orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes (common in HPC and deep learning applications). Our implementation achieves this while decreasing the power consumption by up to 22% for reduction and 16% for scan.
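The core idea of writing reduction and scan as matrix multiplications can be sketched on the CPU in a few lines. This is a plain-Python illustration of the underlying algebra, not the paper's TCU implementation: it assumes each row of a small matrix holds one segment, reduces the rows by multiplying with a column of ones, and computes an inclusive prefix sum of each row by multiplying with an upper-triangular matrix of ones. The function names are chosen for this example only.

```python
# Sketch: segmented reduction and inclusive scan expressed as matrix products.
# Each row of `segments` is one segment; names are illustrative.

def matmul(a, b):
    """Naive dense matrix product of two lists of lists."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def segmented_reduce(segments):
    """Sum each row: segments @ ones_column."""
    n = len(segments[0])
    ones_column = [[1] for _ in range(n)]
    return [row[0] for row in matmul(segments, ones_column)]


def segmented_scan(segments):
    """Inclusive prefix sum of each row: segments @ upper_triangular_ones."""
    n = len(segments[0])
    # upper[k][j] == 1 when k <= j, so column j adds up elements 0..j.
    upper = [[1 if k <= j else 0 for j in range(n)] for k in range(n)]
    return matmul(segments, upper)


segments = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(segmented_reduce(segments))  # [10, 26]
print(segmented_scan(segments))    # [[1, 3, 6, 10], [5, 11, 18, 26]]
```

On a TCU, the same products would be issued as small fixed-size GEMM operations, which is what makes short segments a natural fit for the hardware.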
Deep neural networks (DNNs) have become core computation components within low-latency Function as a Service (FaaS) prediction pipelines. Cloud computing, as the de facto backbone of modern computing infrastructure, has to be able to handle user-defined FaaS pipelines containing diverse DNN inference workloads while maintaining isolation and latency guarantees with minimal resource waste. The current solution for guaranteeing isolation and latency within FaaS is inefficient. A major cause of the inefficiency is the need to move large amounts of data within and across servers. We propose TrIMS as a novel solution to address this issue. TrIMS is a generic memory sharing technique that enables constant data to be shared across processes or containers while still maintaining isolation between users. TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy, an efficient resource management layer that provides isolation, and a succinct set of abstractions, application APIs, and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework and show up to 24x speedup in latency for image classification models, up to 210x speedup for large models, and up to 8x system throughput improvement.