Abdul Dakkak is a Principal Research Software Engineer at Microsoft Research AI (MSRAI) working on next-generation compilers for end-to-end machine learning. Before then, Abdul Dakkak received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign (UIUC). He was a senior compiler developer at Wolfram Research, leading the design and development of the Wolfram Compiler effort for over 6 years. Abdul’s research interest lies at the intersection of machine learning, compilers, programming languages, and accelerated computing. His focus is compiling high-level languages into performant code running on different hardware. In the process, he has developed industry-grade tools for compiling, running, profiling, and introspecting real-world applications to optimize their performance across both the hardware and software stack. As a primary developer of the Wolfram Compiler, Abdul has developed the Wolfram type system and architected the Wolfram runtime. As a result, the compiled Wolfram code matches the speed to hand-optimized C code and can target accelerator and multi-node systems.
Aside from the compiler work, Abdul also has been developing MLModelScope, which is a distributed platform allowing people to deploy, profile, and experiment with ML/DL frameworks and models. The tools are used to inform system design for Deep Learning model serving and develop highly tuned GPU kernels for model inference.
Abdul has been involved in teaching activities. He developed tools to enable teaching for large classrooms and is the author of WebGPU and RAI. Both WebGPU and RAI have over 100k users and are used across over 14 universities (including the University of Michigan, BSC/UPC, UIC, the University of Tennessee, …) to evaluate over 2.5 million labs. He has aided in teaching the Coursera HPP course (3 times), the introductory and advanced CUDA courses (2 times), and the PUMPS summer school at BSC (4 times).
PhD in Computer Science, 2013-2020
University of Illinois Urbana-Champaign
B.A. in Pure Mathematics, 2009
University of Toledo
The fast Fourier Transform (FFT), a reduced-complexity formulation of the Discrete Fourier Transform (DFT), is an important tool in many areas of science and engineering. FFTW is a well-known package that follows this approach and is currently one of the fastest available implementations of the FFT. NVIDIA introduced its version of FFTW called cuFFT that achieves high performance on the GPUs. In this work we present a novel way to map the FFT algorithm on the newly introduced Tensor Cores by adapting the the Cooley-Tukey recursive FFT algorithm. We present four major types of optimizations that enhance the performance of our approach for varying FFT sizes and show that the approach consistently outperforms cuFFT with a speedup of about 15% to 250% on average.
The current Deep Learning (DL) landscape is fast-paced and is rife with non-uniform models, hardware/software (HW/SW) stacks, but lacks a DL benchmarking platform to facilitate evaluation and comparison of DL innovations, be it models, frameworks, libraries, or hardware. Due to the lack of a benchmarking platform, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone - stifling the adoption of the innovations. In this work, we first identify 10 design features which are desirable within a DL benchmarking platform. These features include: performing the evaluation in a consistent, reproducible, and scalable manner, being framework and hardware agnostic, supporting real-world benchmarking workloads, providing in-depth model execution inspection across the HW/SW stack levels, etc. We then propose MLModelScope, a DL benchmarking platform design that realizes the 10 objectives. MLModelScope proposes a specification to define DL model evaluations and techniques to provision the evaluation workflow using the user-specified HW/SW stack. MLModelScope defines abstractions for frameworks and supports board range of DL models and evaluation scenarios. We implement MLModelScope as an open-source project with support for all major frameworks and hardware architectures. Through MLModelScope’s evaluation and automated analysis workflows, we performed case-study analyses of 37 models across 4 systems and show how model, hardware, and framework selection affects model accuracy and performance under different benchmarking scenarios. We further demonstrated how MLModelScope’s tracing capability gives a holistic view of model execution and helps pinpoint bottlenecks.
Deep Learning (DL) innovations are being introduced at a rapid pace. However, the current lack of standard specification of DL tasks makes sharing, running, reproducing, and comparing these innovations difficult. To address this problem, we propose DLSpec, a model-, dataset-, software-, and hardware-agnostic DL specification that captures the different aspects of DL tasks. DLSpec has been tested by specifying and running hundreds of DL tasks.
As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in improving their response time. An important venue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, the current profiling tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced. We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the “lower-bound” latency of DL models using the benchmark data and informs optimizations of model execution. The “lower-bound” latency metric estimates the ideal model execution on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and using Tensor Cores.
The past few years have seen a surge of applying Deep Learning (DL) models for a wide array of tasks such as image classification, object detection, machine translation, etc. While DL models provide an opportunity to solve otherwise intractable tasks, their adoption relies on them being optimized to meet latency and resource requirements. Benchmarking is a key step in this process but has been hampered in part due to the lack of representative and up-to-date benchmarking suites. This is exacerbated by the fast-evolving pace of DL models. This paper proposes DLBricks, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks on CPUs. DLBricks decomposes DL models into a set of unique runnable networks and constructs the original model’s performance using the performance of the generated benchmarks. DLBricks leverages two key observations: DL layers are the performance building blocks of DL models and layers are extensively repeated within and across DL models. Since benchmarks are generated automatically and the benchmarking time is minimized, DLBricks can keep up-to-date with the latest proposed models, relieving the pressure of selecting representative DL models. Moreover, DLBricks allows users to represent proprietary models within benchmark suites. We evaluate DLBricks using 50 MXNet models spanning 5 DL tasks on 4 representative CPU systems. We show that DLBricks provides an accurate performance estimate for the DL models and reduces the benchmarking time across systems (e.g. within 95% accuracy and up to 4.4× benchmarking time speedup on Amazon EC2 c5.xlarge).
The popularity of data- and scientific-oriented applications, abundance of on-demand compute resources, and scarcity of domain expert programmers have given rise to high-level scripting languages. These high-level scripting languages offer a fast way to translate ideas into code, but tend to pay a heavy performance overhead. In order to alleviate the performance penalty, each implementation of these languages often offers a compilation path to a subset of the language. In this paper we present the design and implementation of the Wolfram Language compiler, the production compiler for the Wolfram Language. We show how popular language features and runtime behavior, expected by Wolfram Language developers, are efficiently implemented within the compiler. We then show how the compiler provides a friction-less path to migrate programs from the interpreter to the compiler. We evaluate the compiler and show that compiled code matches the performance of highly tuned hand-written C code. The compiler has been released as a prominent feature of Wolfram Engine v12 and is readily available to developers.
There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which limits the thoroughness and usefulness of the profiling results. This paper proposes XSP — an across-stack profiling design that gives a holistic and hierarchical view of ML model execution. XSP leverages distributed tracing to aggregate and correlate profile data from different sources. XSP introduces a leveled and iterative measurement approach that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead. We couple the profiling design with an automated analysis pipeline to systematically analyze $65$ state-of-the-art ML models. We demonstrate that XSP provides insights which would be difficult to discern otherwise.
Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4 X 4 or 16 X 16) to accelerate HPC and deep learning workloads. Although TCUs are prevalent and promise increase in performance and/or energy efficiency and are heavily used within supercomputers to achieve exascale performance, they suffer from over specialization — with only general matrix multiplication (GEMM) operations on small matricies being supported. \n\nIn this paper we express both reduction and scan in terms of matrix multiplication operations and map them onto TCUs. To our knowledge, this paper is the first to try to\nbroaden the class of algorithms expressible as TCU operations and\nis the first to show benefits of this mapping in terms of: program simplicity, efficiency, and performance. We implement the algorithms using NVIDIA V100 TCUs and achieve 89% − 98% of peak memory copy bandwidth, and are orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes (common in HPC and deep learning applications). Our implementation achieves this while decreasing the power consumption by up to 22% for reduction and 16% for scan.
Deep neural networks (DNNs) have become core computation components within low latency Function as a Service (FaaS) prediction pipelines. Cloud computing, as the de-facto backbone of modern computing infrastructure, has to be able to handle user-defined FaaS pipelines containing diverse DNN inference workloads while maintaining isolation and latency guarantees with minimal resource waste. The current solution for guaranteeing isolation and latency within FaaS is inefficient. A major cause of the inefficiency is the need to move large amount of data within and across servers. We propose TrIMS as a novel solution to address this issue. TrIMS is a generic memory sharing technique that enables constant data to be shared across processes or containers while still maintaining isolation between users. TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy, an efficient resource management layer that provides isolation, and a succinct set of abstracts, application APIs, and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x speedup in latency for image classification models, up to 210x speedup for large models, and up to 8x system throughput improvement.