GTC 2019 - MLModelScope: Evaluate and Measure Machine Learning Models


The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform frameworks, models, and system stacks, but lacks standard tools to facilitate the evaluation and measurement of models. In the absence of such tools, the current practice for evaluating and comparing the benefits of proposed AI innovations (be they hardware or software) on end-to-end AI pipelines is both arduous and error-prone, stifling the adoption of those innovations.

Mar 22, 2019 3:30 PM
San Jose, CA

We propose MLModelScope, a hardware/software-agnostic platform to facilitate the evaluation, measurement, and introspection of ML models within AI pipelines. MLModelScope provides a consistent evaluation, profiling, aggregation, and reporting system that helps pinpoint the source of system bottlenecks and automates the evaluation and performance aggregation of models across frameworks and systems.

Replication of model accuracy and performance results depends on: the use of a specific HW/SW stack; the training dataset; and the pre/post-processing steps applied to the inputs and outputs. MLModelScope specifies these requirements via a model manifest file. To maintain the SW stack and to guarantee isolation, evaluation occurs within Docker containers. Within MLModelScope, frameworks are exposed through a common abstraction layer and are referred to as predictors. Predictors are responsible for evaluating models (using the manifest file) and capturing profile information from framework, system, and hardware profilers. MLModelScope leverages distributed tracing to aggregate the traces from the different profilers into a single timeline. All evaluation and profile data are stored in a database and aggregated offline.
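To make the manifest and predictor ideas concrete, here is a minimal sketch in Python. All field names, class names, and values below are illustrative assumptions, not the actual MLModelScope schema or API: a manifest pins the framework, container image (SW stack), dataset, and pre/post-processing, and a predictor consumes it to produce evaluation records with attached trace data.

```python
# Hypothetical sketch, NOT the real MLModelScope manifest schema or API.
# A manifest pins everything needed to replicate a result: framework
# version, container image (SW stack), dataset, and pre/post-processing.
manifest = {
    "name": "AlexNet",
    "framework": {"name": "TensorFlow", "version": "1.x"},        # assumed fields
    "container": "registry.example.com/tf-gpu:latest",            # hypothetical image
    "dataset": "ILSVRC2012-validation",
    "preprocess": {"resize": [227, 227]},
    "postprocess": "argmax",
}

class Predictor:
    """A framework behind the common abstraction layer (hypothetical API)."""

    def __init__(self, manifest):
        self.manifest = manifest

    def predict(self, batch):
        # A real predictor would run the model inside its container and
        # attach framework/system/hardware profiler spans; here we only
        # emit one trace-like record per input to show the shape of the data.
        return [
            {"input": x, "trace": {"model": self.manifest["name"]}}
            for x in batch
        ]

p = Predictor(manifest)
results = p.predict(["img0.jpg", "img1.jpg"])
print(len(results))  # 2
```

The key design point this illustrates is that the predictor never hard-codes framework details: everything that affects reproducibility lives in the manifest, so the same evaluation can be replayed on a different system.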

Currently, MLModelScope (1) supports TensorRT, TensorFlow, PyTorch, MXNet, Caffe, Caffe2, and CNTK; (2) runs on ARM, PowerPC, and x86 systems with CPUs, GPUs, and FPGAs; and (3) has built-in framework, library, and system profilers. MLModelScope allows users to extend and customize it by adding models, frameworks, or library and system profilers.
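One plausible way to picture the extensibility described above is a plug-in registry: frameworks and profilers register factory functions under a name, and the platform enumerates the registry at evaluation time. The function and variable names here are assumptions for illustration, not MLModelScope's actual extension mechanism.

```python
# Hypothetical sketch of a plug-in registry, NOT the real MLModelScope API.
FRAMEWORKS = {}
PROFILERS = {}

def register_framework(name, factory):
    """Make a framework available to the platform under `name`."""
    FRAMEWORKS[name] = factory

def register_profiler(name, factory):
    """Make a library/system profiler available under `name`."""
    PROFILERS[name] = factory

# A user extending the platform registers their additions once:
register_framework("TensorFlow", lambda: "tf-predictor")
register_profiler("gpu-kernel", lambda: "kernel-profiler")

print(sorted(FRAMEWORKS))  # ['TensorFlow']
```

With this pattern, adding a new framework or profiler requires no change to the platform core, which is consistent with the extend-and-customize workflow the abstract describes.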

Using MLModelScope, we evaluated AlexNet across six frameworks on six representative systems with Kepler, Maxwell, Pascal, and Volta GPUs. The work presents model latency across frameworks, sub-model and sub-layer latency analysis, and model latency across systems. The results show that TensorRT performs best across different batch sizes; that different frameworks may choose different GPU kernels for the same convolution layer; that data copying dominates non-persistent inference; and that an NVLink+Pascal system can outperform a non-NVLink+Volta system. This case study shows how MLModelScope helps compare different HW/SW offerings and gives users a holistic view into the execution of models at different granularities.