ML inference services engage with users directly and must respond quickly and accurately. Moreover, these services face dynamic request workloads that demand changes to their computing resources; failing to right-size those resources results in either latency service level objective (SLO) violations or wasted compute. Adapting to dynamic workloads while balancing all three pillars of accuracy, latency, and resource cost is challenging. In this talk, I will present our recent solutions to these challenges: InfAdapter and IPA. InfAdapter combines model switching and autoscaling to trade off accuracy, cost, and latency in inference serving systems, while IPA dynamically reconfigures ML inference pipelines to achieve the same tradeoff.
InfAdapter: InfAdapter proactively selects a set of ML model variants, together with their resource allocations, to meet the latency SLO while maximizing an objective function of accuracy and cost. Model variants are different versions of pre-trained models for the same ML task that differ in resource requirements, latency, and accuracy. We show that serving multiple model variants instead of a single one yields a more fine-grained adaptation space, achieving better tradeoffs among the conflicting objectives.
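To make the tradeoff concrete, here is a minimal Python sketch of the kind of variant-set selection InfAdapter performs. The variant names, accuracy/cost/latency numbers, and the weights alpha and beta are all hypothetical, and the equal traffic split and brute-force search are simplifications; the actual system also decides per-variant resource allocations and traffic splits.

```python
from itertools import combinations

# Hypothetical variant catalog for one ML task (names and numbers are
# illustrative, not measurements from the paper): each variant has an
# accuracy score, a per-replica cost in CPU cores, and a p99 latency.
VARIANTS = {
    "resnet18":  {"accuracy": 0.70, "cost": 1, "latency_ms": 40},
    "resnet50":  {"accuracy": 0.76, "cost": 2, "latency_ms": 90},
    "resnet152": {"accuracy": 0.78, "cost": 4, "latency_ms": 180},
}

def select_variant_set(slo_ms, alpha=1.0, beta=0.01):
    """Brute-force the subset of variants that maximizes
    alpha * accuracy - beta * cost, keeping only variants that
    meet the latency SLO. Accuracy of a set is approximated as the
    mean over its members, i.e., an equal traffic split."""
    feasible = {n: v for n, v in VARIANTS.items() if v["latency_ms"] <= slo_ms}
    best, best_score = None, float("-inf")
    for r in range(1, len(feasible) + 1):
        for subset in combinations(feasible, r):
            acc = sum(feasible[n]["accuracy"] for n in subset) / len(subset)
            cost = sum(feasible[n]["cost"] for n in subset)
            score = alpha * acc - beta * cost
            if score > best_score:
                best, best_score = subset, score
    return best, best_score

# With a 100 ms SLO, resnet152 is filtered out and resnet50 wins
# the accuracy-cost tradeoff under these (made-up) weights.
print(select_variant_set(slo_ms=100))
```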
IPA: IPA extends InfAdapter to enable runtime adaptation at the pipeline level, where multiple ML services are composed in a DAG. IPA dynamically configures batch sizes, resource allocations, and model variants to optimize accuracy, minimize cost, and meet user-defined latency SLAs using Integer Programming. It supports multi-objective settings that achieve different tradeoffs between the accuracy and cost objectives while remaining responsive to varying workloads and dynamic traffic patterns.
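For flavor, below is a minimal Integer Programming sketch using the open-source PuLP modeler (an assumption; the talk does not prescribe a toolkit) that picks one variant per pipeline stage to maximize a weighted accuracy-minus-cost objective under an end-to-end latency budget. The two-stage pipeline, all numbers, and the additive accuracy proxy are hypothetical; IPA's actual formulation also chooses batch sizes and replica counts.

```python
import pulp  # open-source LP/ILP modeler; pip install pulp

# Hypothetical two-stage pipeline (detector -> classifier). Each variant:
# (name, accuracy, cost, latency_ms); all values are illustrative.
STAGES = {
    "detect":   [("yolo-s", 0.60, 1, 30), ("yolo-l", 0.70, 3, 80)],
    "classify": [("resnet18", 0.70, 1, 40), ("resnet50", 0.76, 2, 90)],
}
SLA_MS = 150             # end-to-end latency budget
ALPHA, BETA = 1.0, 0.05  # weights trading accuracy against cost

prob = pulp.LpProblem("ipa_sketch", pulp.LpMaximize)
# x[stage, variant] = 1 iff that variant serves that stage.
x = {(s, name): pulp.LpVariable(f"x_{s}_{name}", cat=pulp.LpBinary)
     for s, variants in STAGES.items() for (name, *_) in variants}

# Exactly one variant per stage.
for s, variants in STAGES.items():
    prob += pulp.lpSum(x[s, name] for (name, *_) in variants) == 1

# Pipeline latency (sum over stages) must meet the SLA.
prob += pulp.lpSum(x[s, n] * lat
                   for s, vs in STAGES.items()
                   for (n, _, _, lat) in vs) <= SLA_MS

# Objective: additive accuracy proxy minus weighted cost.
prob += pulp.lpSum(x[s, n] * (ALPHA * acc - BETA * cost)
                   for s, vs in STAGES.items()
                   for (n, acc, cost, _) in vs)

prob.solve(pulp.PULP_CBC_CMD(msg=0))
for (s, n), var in x.items():
    if var.value() == 1:
        print(f"{s} -> {n}")  # e.g., detect -> yolo-s, classify -> resnet50
```

Under these made-up numbers the solver pairs the small detector with the larger classifier, since the all-large configuration violates the 150 ms budget; the same mechanism lets IPA shift quality between stages as the budget or workload changes.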
Assistant Professor @ University of South Carolina, Google