Hidden Rate Limits: How Providers Throttle LLM Throughput During Peak Demand

Mar 21, 2024

Background

Have you ever noticed your LLM provider being slower during the day and faster at night? Every resource-constrained API needs rate limits. Scaling LLM APIs is particularly challenging because the constraining resource is GPU throughput, and GPUs are expensive and hard to come by. This means that most providers have a "cap" on their overall throughput (i.e. the number of tokens / second their fleet of GPUs can produce).

The problem is that demand for LLM applications is highly variable. Most providers publish transparent rate limits on the number of tokens / second and the number of requests you can make against their APIs, but at peak demand they also apply hidden rate limits, such as throttling generation speed to stretch capacity. Over the past few days we investigated the main LLM providers and observed up to a 40% difference in average speed (tokens / second) from leading models like GPT4.
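The transparent limits are easy to handle in code: when you exceed them the API returns an explicit error you can retry. Below is a minimal sketch of that, assuming the OpenAI Python client v1.x; note that the hidden throttling we discuss in this post shows up as slower streaming, not as an error, so it can't be caught this way.

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete_with_backoff(prompt: str, max_retries: int = 5) -> str:
    """Retry with exponential backoff when the documented rate limit is hit."""
    delay = 1.0
    for _ in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except RateLimitError:
            # Explicit 429 from the API: wait and try again.
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("still rate limited after retries")
```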

Investigation

Yesterday we deployed a script that measures tokens / second from Claude3, GPT4/3.5, and flyflow every minute, and let it run for 24 hours to capture how throughput changes throughout the day. The results show a huge variance in the latency you get from different providers depending on the time of day you call the API. This can drastically degrade the experience of many LLM applications when users hit them at peak hours.
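For context, here is a minimal sketch of the kind of probe involved, not the exact script we ran: it streams a fixed prompt through the OpenAI Python client v1.x and uses the streamed chunk count as a rough proxy for output tokens. The prompt, model name, and token-counting shortcut are illustrative assumptions.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_tokens_per_second(model: str = "gpt-4") -> float:
    """Stream a fixed prompt and return observed output tokens / second."""
    start = time.monotonic()
    tokens = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 200-word story about a lighthouse."}],
        stream=True,
    )
    for chunk in stream:
        # Each streamed chunk carries roughly one token of output.
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1
    elapsed = time.monotonic() - start
    return tokens / elapsed

if __name__ == "__main__":
    while True:
        print(f"{time.strftime('%H:%M:%S')}  {measure_tokens_per_second():.1f} tok/s")
        time.sleep(60)  # sample once a minute, as in the 24-hour run
```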

As the chart above shows, when we measure GPT4's tokens / second over time, the performance you get is extremely variable. At the peak, GPT4 averages ~35 tokens / second, but at the trough (roughly 9-10am Eastern) it averages only ~22 tokens / second, a roughly 35% decrease. This happens as the API adjusts to increased demand on a limited number of GPUs. For latency-sensitive applications, this is completely unacceptable.

Comparing Providers

When building LLM applications, it's key to weigh the tradeoff between speed / cost and model quality. Below we chart the latency of the most popular models available today.

Here, the highest-quality model (claude-3-opus) is also the slowest. For their intelligence level, GPT3.5 and Haiku are considerably faster. The flyflow fine-tuned model outperforms the rest by a wide margin. The latency of every model shifts over time with usage.

Flyflow

Flyflow uses fine-tuning to optimize for speed and cost while maintaining quality. With a single-line code change (changing the base_url of your openai client), we give you access to over 15 open source and closed source models. We then collect the requests / responses from that traffic and use them to fine-tune a custom model that matches the base foundation model's quality while drastically increasing speed and decreasing cost. If this is something your team is interested in, feel free to book a demo and we can get you set up with an API key.
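As a sketch of what that one-line change looks like with the OpenAI Python client v1.x (the endpoint URL and API key below are placeholders, not documented values):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.flyflow.example/v1",  # placeholder flyflow endpoint
    api_key="YOUR_FLYFLOW_API_KEY",             # placeholder key from the demo setup
)

# The rest of your code stays the same; requests go through flyflow instead.
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```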