ML inference services engage with users directly and must respond quickly and accurately. Moreover, these services face dynamic request workloads that demand changes to their computing resources; failing to right-size those resources results in either latency service level objective (SLO) violations or wasted compute. Adapting to dynamic workloads while balancing all three pillars of accuracy, latency, and resource cost is challenging. In this talk, I will present our recent solutions to these challenges: InfAdapter and IPA. InfAdapter combines model switching and autoscaling to trade off accuracy, cost, and latency in inference serving systems, while IPA dynamically reconfigures ML inference pipelines to achieve the same tradeoff.
InfAdapter: InfAdapter proactively selects a set of ML model variants, together with their resource allocations, to meet the latency SLO while maximizing an objective function of accuracy and cost. Model variants are different versions of pre-trained models for the same ML task that differ in resource requirements, latency, and accuracy. We show that serving multiple model variants instead of a single one yields a more fine-grained adaptation space, achieving better tradeoffs among the conflicting objectives.
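To make the tradeoff concrete, here is a minimal Python sketch of the kind of variant-set selection InfAdapter performs. The variant names, accuracy/cost/latency numbers, and the weights alpha and beta are all hypothetical, and the equal traffic split and brute-force search are simplifications; the actual system also decides per-variant resource allocations and traffic splits.

```python
from itertools import combinations

# Hypothetical variant catalog for one ML task (names and numbers are
# illustrative, not measurements from the paper): each variant has an
# accuracy score, a per-replica cost in CPU cores, and a p99 latency.
VARIANTS = {
    "resnet18":  {"accuracy": 0.70, "cost": 1, "latency_ms": 40},
    "resnet50":  {"accuracy": 0.76, "cost": 2, "latency_ms": 90},
    "resnet152": {"accuracy": 0.78, "cost": 4, "latency_ms": 180},
}

def select_variant_set(slo_ms, alpha=1.0, beta=0.01):
    """Brute-force the subset of variants that maximizes
    alpha * accuracy - beta * cost, keeping only variants that
    meet the latency SLO. Accuracy of a set is approximated as the
    mean over its members, i.e., an equal traffic split."""
    feasible = {n: v for n, v in VARIANTS.items() if v["latency_ms"] <= slo_ms}
    best, best_score = None, float("-inf")
    for r in range(1, len(feasible) + 1):
        for subset in combinations(feasible, r):
            acc = sum(feasible[n]["accuracy"] for n in subset) / len(subset)
            cost = sum(feasible[n]["cost"] for n in subset)
            score = alpha * acc - beta * cost
            if score > best_score:
                best, best_score = subset, score
    return best, best_score

# With a 100 ms SLO, resnet152 is filtered out and resnet50 wins
# the accuracy-cost tradeoff under these (made-up) weights.
print(select_variant_set(slo_ms=100))
```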
IPA: IPA extends InfAdapter to enable runtime adaptation at the pipeline level, where multiple ML services are composed in a DAG. IPA dynamically configures batch sizes, resource allocations, and model variants to optimize accuracy, minimize cost, and meet user-defined latency SLAs using Integer Programming. It supports multi-objective settings that achieve different tradeoffs between the accuracy and cost objectives while remaining responsive to varying workloads and dynamic traffic patterns.
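For flavor, below is a minimal Integer Programming sketch using the open-source PuLP modeler (an assumption; the talk does not prescribe a toolkit) that picks one variant per pipeline stage to maximize a weighted accuracy-minus-cost objective under an end-to-end latency budget. The two-stage pipeline, all numbers, and the additive accuracy proxy are hypothetical; IPA's actual formulation also chooses batch sizes and replica counts.

```python
import pulp  # open-source LP/ILP modeler; pip install pulp

# Hypothetical two-stage pipeline (detector -> classifier). Each variant:
# (name, accuracy, cost, latency_ms); all values are illustrative.
STAGES = {
    "detect":   [("yolo-s", 0.60, 1, 30), ("yolo-l", 0.70, 3, 80)],
    "classify": [("resnet18", 0.70, 1, 40), ("resnet50", 0.76, 2, 90)],
}
SLA_MS = 150             # end-to-end latency budget
ALPHA, BETA = 1.0, 0.05  # weights trading accuracy against cost

prob = pulp.LpProblem("ipa_sketch", pulp.LpMaximize)
# x[stage, variant] = 1 iff that variant serves that stage.
x = {(s, name): pulp.LpVariable(f"x_{s}_{name}", cat=pulp.LpBinary)
     for s, variants in STAGES.items() for (name, *_) in variants}

# Exactly one variant per stage.
for s, variants in STAGES.items():
    prob += pulp.lpSum(x[s, name] for (name, *_) in variants) == 1

# Pipeline latency (sum over stages) must meet the SLA.
prob += pulp.lpSum(x[s, n] * lat
                   for s, vs in STAGES.items()
                   for (n, _, _, lat) in vs) <= SLA_MS

# Objective: additive accuracy proxy minus weighted cost.
prob += pulp.lpSum(x[s, n] * (ALPHA * acc - BETA * cost)
                   for s, vs in STAGES.items()
                   for (n, acc, cost, _) in vs)

prob.solve(pulp.PULP_CBC_CMD(msg=0))
for (s, n), var in x.items():
    if var.value() == 1:
        print(f"{s} -> {n}")  # e.g., detect -> yolo-s, classify -> resnet50
```

Under these made-up numbers the solver pairs the small detector with the larger classifier, since the all-large configuration violates the 150 ms budget; the same mechanism lets IPA shift quality between stages as the budget or workload changes.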
Assistant Professor @ University of South Carolina, Google