In this blog post, we explore how batching can be used to serve queries more efficiently. We first discuss padding removal, a key technique in modern inference engines that enables effective batching. We then present practical strategies for forming batches and selecting an appropriate batch size. Finally, we walk through the implementation details and share the resulting performance improvements: a 50% reduction in GPU inference latency, despite using 3X fewer GPUs.
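
Before diving in, here is a minimal sketch of the core idea behind padding removal: rather than padding every sequence in a batch to the longest one, the tokens are concatenated ("packed") and cumulative sequence lengths are recorded so each sequence can be recovered. The function names (`pad_batch`, `pack_batch`, `cu_seqlens`) are illustrative, not taken from any particular inference engine.

```python
def pad_batch(seqs, pad_id=0):
    """Classic padding: every sequence is extended to the batch maximum,
    so compute is wasted on pad tokens."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

def pack_batch(seqs):
    """Padding removal: concatenate all tokens and keep cumulative
    sequence lengths (offsets) so no pad tokens are processed."""
    packed, cu_seqlens = [], [0]
    for s in seqs:
        packed.extend(s)
        cu_seqlens.append(cu_seqlens[-1] + len(s))
    return packed, cu_seqlens

def unpack(packed, cu_seqlens):
    """Recover the original sequences from the packed representation."""
    return [packed[a:b] for a, b in zip(cu_seqlens, cu_seqlens[1:])]

batch = [[1, 2, 3], [4], [5, 6]]
padded = pad_batch(batch)        # 3 x 3 = 9 slots, 3 of them wasted padding
packed, cu = pack_batch(batch)   # 6 tokens, zero waste
assert unpack(packed, cu) == batch
```

The packed layout is what variable-length attention kernels consume: they take the flat token buffer plus the offset array instead of a rectangular padded tensor, which is why removing padding makes batching requests of very different lengths cheap.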