Tutorial at IISWC 2019 - Challenges and Solutions for End-to-End and Across Stack ML Benchmarking

Abstract

The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks. It lacks standard tools and methodologies to evaluate and profile models or systems. Due to the absence of standard tools, the state of the practice for evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error-prone — stifling the adoption of the innovations in a rapidly moving field.

Date
Nov 3, 2019 3:30 PM
Location
Orlando, FL

The goal of this tutorial is to discuss these challenges and the solutions that help address them when evaluating ML models. The tutorial will educate the audience on both evaluation scenarios and hardware metrics (such as different evaluation load behaviors, power efficiency, and utilization) that benchmarking should capture. It will also educate attendees on state-of-the-art tools and best practices developed at the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), some of which have won Best Research Paper Awards at well-known international conferences. The tutorial will bring together experts from both industry and academia to discuss how these tools and methodologies can be leveraged for:

  1. Effective ML benchmarking across hardware and software stacks
  2. Repeatable and consistent ML benchmarking to characterize the performance of models, frameworks, and hardware
  3. Identifying pitfalls and myths of current ML benchmarking methodologies
  4. Utilizing a model evaluation specification to help model authors and framework developers communicate evaluation parameters with the hardware and system communities

Outcomes

The tutorial will present the challenges faced in optimizing and profiling within the ML domain. It will show how both the software and hardware stacks contribute to the performance, accuracy, and efficiency of end-to-end model evaluation. This will be done by first identifying some common accuracy and performance pitfalls and myths prevalent in both the ML and architecture communities.

Through the tutorial, attendees will learn how to identify performance bottlenecks of current models, frameworks, and hardware by leveraging state-of-the-art tools. The presenters will then use these tools to support both architectural and software claims made during the course of the presentation. The tutorial will also train the audience on how to use these tools to support similar claims by easily traversing from an application performance profile all the way down to the hardware details to identify application bottlenecks.

Coverage

The tutorial will take a top-down approach, starting with models, then frameworks, then system libraries, and finally hardware and accelerators.

  1. Models: the tutorial will look at a diverse set of models, including image classification, image segmentation, and style transfer. It will identify common layers (Conv, BN, ReLU, etc.) and building-block patterns (Inception, ResNet, etc.) within these networks and present tools for extracting these patterns. We will use these patterns to take a data-driven approach to explaining why certain compiler and hardware optimizations are prevalent in industry and research. We will also identify patterns within the weights that can be exploited for quantization or compression of models.
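The kind of pattern extraction described above can be sketched in a few lines. This toy example (not the C3SR tooling itself) represents a network as a flat list of layer types and counts both individual layers and recurring Conv→BN→ReLU triples, a common building-block pattern; the layer list is made up for illustration.

```python
from collections import Counter

# Hypothetical network, flattened to a list of layer-type names.
layers = [
    "Conv", "BN", "ReLU",
    "Conv", "BN", "ReLU",
    "Conv", "BN", "ReLU", "MaxPool",
    "Conv", "BN", "ReLU",
    "FC",
]

# Per-layer-type frequency.
layer_counts = Counter(layers)

def count_pattern(seq, pattern):
    """Count occurrences of the sub-sequence `pattern` in `seq`."""
    n = len(pattern)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern)

conv_bn_relu = count_pattern(layers, ["Conv", "BN", "ReLU"])
print(layer_counts["Conv"], conv_bn_relu)  # 4 4
```

Real tools walk a framework's graph representation rather than a flat list, but the idea is the same: once recurring blocks are identified, their frequency motivates which fusions and hardware optimizations pay off.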

  2. Frameworks: we will look at popular frameworks such as TensorFlow, TFLite, TensorRT, PyTorch, Caffe, Caffe2, ONNX Runtime, and CNTK. We will compare the frameworks in terms of both performance and ease of programmability. For example, we will look at how eager and lazy evaluation approaches affect the programmability of a model and how different storage formats affect performance.
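The eager-versus-lazy distinction above can be illustrated with a minimal sketch. This is not any framework's real API; it only mimics the spirit of eager execution (as in PyTorch) versus deferred graph execution (as in TensorFlow 1.x).

```python
# Eager: each operation runs immediately, so every intermediate value
# can be printed or debugged as ordinary program state.
def eager_pipeline(x):
    y = x * 2          # executes now
    z = y + 3          # executes now
    return z

# Lazy: operations only build a graph node; nothing computes until an
# explicit execute() call, which lets a runtime optimize the whole graph.
class LazyOp:
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs

    def execute(self):
        args = [i.execute() if isinstance(i, LazyOp) else i
                for i in self.inputs]
        return self.fn(*args)

def lazy_pipeline(x):
    y = LazyOp(lambda a: a * 2, x)   # deferred
    z = LazyOp(lambda a: a + 3, y)   # deferred
    return z                         # a graph, not a number

print(eager_pipeline(5))             # 13
print(lazy_pipeline(5).execute())    # 13, but only when asked
```

The trade-off is the one the tutorial examines: eager code is easier to write and debug, while a lazy graph gives the runtime a global view for scheduling and optimization.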

  3. System Libraries: the tutorial will look at how end-to-end accuracy is affected by underlying system libraries, such as the choice of JPEG decoding method. The tutorial will also examine the performance of BLAS and DNN functions from vendor libraries such as cuDNN and MKL-DNN: how they are used, how well they utilize hardware resources, and potential optimizations to either the system libraries or the way they are called.
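Why a decoding library can change accuracy may be non-obvious, so here is a deliberately tiny illustration (not a real decoder comparison; all numbers are made up). Two JPEG decoders can emit slightly different pixel values for the same file, and for an input near a model's decision boundary that tiny difference can flip the predicted class.

```python
def predict(pixels, weights):
    """A trivial two-class linear 'model': argmax over class scores."""
    scores = [sum(w * p for w, p in zip(ws, pixels)) for ws in weights]
    return scores.index(max(scores))

weights = [[1.0, -1.0], [-1.0, 1.0]]   # two classes over two 'pixels'

# Hypothetical outputs of two decoders for the same compressed image:
pixels_decoder_a = [0.501, 0.500]
pixels_decoder_b = [0.499, 0.500]      # off by 0.002 on one pixel

print(predict(pixels_decoder_a, weights))   # 0
print(predict(pixels_decoder_b, weights))   # 1 -- the prediction flips
```

Aggregated over a validation set, such flips show up as a measurable top-1 accuracy difference attributable purely to the preprocessing library, which is exactly the kind of cross-stack effect the tutorial measures.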

  4. Hardware: we will finally show how all of the above affect the hardware or accelerators running the model. We will show how one can identify hardware behavior and map that behavior back to the application source code. We will show both the efficiency and utilization of current-generation accelerators for both edge and server workloads.