An open-source, framework- and hardware-agnostic, extensible and customizable distributed platform designed for evaluating and profiling ML models across datasets, frameworks, and systems.
Leveraging NVIDIA’s Tensor Cores to express collective communication primitives as matrix multiplications, and exploring the benefits in program simplicity, efficiency, and performance.
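The core idea can be sketched in NumPy: since Tensor Cores compute matrix products, a reduction collective over n buffers can be phrased as a single (1 × n) × (n × k) matmul with a ones row vector. This is a minimal illustrative sketch, not the actual CUDA/Tensor Core implementation; the function name is hypothetical.

```python
import numpy as np

def reduce_via_matmul(buffers):
    """Sum a list of equal-length 1-D arrays using one matrix multiply.

    A reduction over n buffers of length k becomes
    (1 x n ones) @ (n x k stacked buffers) -> (1 x k) result,
    the shape of computation Tensor Cores accelerate.
    """
    stacked = np.stack(buffers)            # shape (n, k)
    ones = np.ones((1, stacked.shape[0]))  # (1, n) ones row vector
    return (ones @ stacked).ravel()        # (1, k) -> (k,)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(reduce_via_matmul([a, b]))  # elementwise sum of a and b
```

Expressing the collective this way lets a single, highly optimized matmul kernel replace a hand-written reduction loop.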
Automatic μBenchmark Generation to Compute “Lower-bound” Latency and Inform Optimizations of Deep Learning Models on GPUs.
Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments.
An extensible and customizable GPU benchmarking framework.
A Scalable Project Submission System for Parallel Programming Courses.