The past few years have seen a surge in the use of Machine Learning (ML) and Deep Learning (DL) algorithms for traditional HPC tasks such as feature detection, numerical analysis, and graph analytics. While ML and DL help solve HPC tasks, their adoption has been hampered in part by the difficulty of understanding ML/DL workloads and their interactions with system utilization. Optimizing these algorithms requires characterizing their performance and resource utilization across the hardware/software (HW/SW) stack, but the lack of easy-to-use tools to automate this process, and the resulting reliance on manual characterization by researchers, remain bottlenecks. To alleviate this, we propose an across-stack profiling scheme and integrate it within MLModelScope, a hardware- and software-agnostic tool for evaluating and benchmarking ML/DL at scale. We demonstrate the across-stack profiling and characterization functionality through the evaluation of state-of-the-art ML/DL models, and present insights that are only made possible by this design.