Published: November 3, 2025

"Just use OpenAI API"

Until you need:
- Custom fine-tuned models
- <50ms p99 latency
- $0.001/1K tokens (not $1.25/1M input)

Then you build your own inference platform. Here's how to do that:

Most engineers think "build your own" means:
- Rent some GPUs
- Load model with vLLM
- Wrap it in FastAPI
- Ship it

The complexity hits you around week 2.

Remember: You're not building a system to serve one model to one user. You're building a system that handles HUNDREDS of concurrent requests, across multiple models, with wildly different latency requirements. That's a fundamentally different problem.

What you actually need:
> A request router that understands model capabilities.
> A dynamic batcher that groups requests without killing latency.
> A KV cache manager that doesn't OOM your GPUs.
> A model instance pool that handles traffic spikes.

And that's just the core
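To make the "dynamic batcher" item concrete, here is a minimal sketch of one common policy: flush a batch when it is full or when its oldest request has waited too long. The function name `form_batches` and the parameters `max_batch` and `max_wait_ms` are hypothetical, chosen for illustration; a real batcher would run against a live queue rather than a list of timestamps.

```python
def form_batches(arrivals_ms, max_batch, max_wait_ms):
    """Group request arrival times (ms) into batches.

    A batch is flushed when it reaches max_batch requests, or when a new
    request arrives more than max_wait_ms after the oldest queued one.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        # The oldest queued request has waited too long: flush before adding.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: flush immediately.
        if len(current) == max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(form_batches([0, 1, 2, 30, 31], max_batch=4, max_wait_ms=10))
    print(form_batches([0, 1, 2, 30, 31], max_batch=2, max_wait_ms=10))
```

A smaller `max_wait_ms` caps queueing delay at the cost of smaller, less efficient batches, which is the trade-off the batcher has to manage.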

Your <50ms p99 requirement breaks down as:
- Network overhead: 10-15ms (you can't fix this)
- Queueing delay: 5-20ms (if you batch wrong, this explodes)
- First token latency: 20-40ms (model dependent)
- Per-token generation: 10-50ms (grows with context length)

You have maybe
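The ranges above can be summed to see how little headroom is left. This is a rough sketch that assumes each component is counted exactly once (so only a single per-token step after the first token); the helper name `budget_headroom` is hypothetical.

```python
def budget_headroom(components, budget_ms):
    """Return (best_case, worst_case) headroom in ms against a latency budget.

    components maps a name to a (low_ms, high_ms) range.
    """
    best = sum(low for low, _ in components.values())
    worst = sum(high for _, high in components.values())
    return budget_ms - best, budget_ms - worst


# Ranges taken from the breakdown in the thread.
COMPONENTS = {
    "network": (10, 15),
    "queueing": (5, 20),
    "first_token": (20, 40),
    "per_token": (10, 50),
}

if __name__ == "__main__":
    best, worst = budget_headroom(COMPONENTS, budget_ms=50)
    print(f"best-case headroom: {best} ms, worst-case headroom: {worst} ms")
```

With these inputs the best case leaves 5ms of slack and the worst case overshoots the budget by 75ms.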

The first principle of inference platforms: Continuous batching ≠ Static batching Static batching waits for 8 requests, then processes them together. Continuous batching processes 8 requests and adds request #9 mid-generation. vLLM does this. TensorRT-LLM does this. Your
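The difference is easiest to see in a toy simulation. The sketch below is not how vLLM or TensorRT-LLM are implemented; it assumes one decode step produces one token for every active request, and that a static batch returns all its results when its longest request finishes. All function names are hypothetical.

```python
from collections import deque


def static_batching(requests, batch_size):
    """requests: list of (arrival_step, tokens_to_generate), tokens >= 1.

    Returns the finish step of each request, in input order.
    """
    order = sorted(enumerate(requests), key=lambda r: r[1][0])
    finish = {}
    gpu_free_at = 0
    for i in range(0, len(order), batch_size):
        batch = order[i:i + batch_size]
        # The batch starts once the GPU is free and its last member has arrived.
        start = max(gpu_free_at, max(arrival for _, (arrival, _) in batch))
        # Everyone waits for the longest request in the batch.
        gpu_free_at = start + max(tokens for _, (_, tokens) in batch)
        for rid, _ in batch:
            finish[rid] = gpu_free_at
    return [finish[i] for i in range(len(requests))]


def continuous_batching(requests, max_batch):
    """Same input and output as static_batching, but slots are refilled every step."""
    pending = deque(sorted(enumerate(requests), key=lambda r: r[1][0]))
    active = {}  # request id -> tokens remaining
    finish = {}
    step = 0
    while pending or active:
        # Admit arrived requests into free slots, even mid-generation.
        while pending and len(active) < max_batch and pending[0][1][0] <= step:
            rid, (_, tokens) = pending.popleft()
            active[rid] = tokens
        # One decode iteration: every active request emits one token.
        step += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finish[rid] = step
                del active[rid]
    return [finish[i] for i in range(len(requests))]


if __name__ == "__main__":
    reqs = [(0, 4), (0, 2), (1, 2), (5, 3)]
    print("static:    ", static_batching(reqs, batch_size=2))
    print("continuous:", continuous_batching(reqs, max_batch=2))
```

In this example the short requests finish earlier under continuous batching because they no longer wait for the longest member of their batch or for the next batch to fill.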

KV cache memory makes things difficult. Llama 70B at 4K context needs 560GB of KV cache for just 32 concurrent requests. Your H100 has 80GB total. PagedAttention (from vLLM) solved this by treating KV cache like virtual memory. Manual implementation? You'll OOM before you
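A back-of-the-envelope sizing shows why KV cache dominates memory planning. The sketch below assumes fp16 values (2 bytes) and the published Llama 2 70B shape: 80 layers, 128-dim heads, 8 KV heads with grouped-query attention, or 64 if every attention head kept its own KV. These inputs are assumptions, not numbers from the thread, and the result shifts a lot with precision and attention layout, so the totals will not necessarily match the figure quoted above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Bytes of KV cache: keys and values for every layer, head, token and request."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch


GIB = 1024 ** 3

if __name__ == "__main__":
    # Assumed shape: 80 layers, head_dim 128, 4K context, 32 concurrent requests.
    gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=32)
    mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096, batch=32)
    print(f"grouped-query attention: {gqa / GIB:.0f} GiB")
    print(f"full multi-head attention: {mha / GIB:.0f} GiB")
```

Either way, the cache competes with the model weights for the same 80GB, which is why paging it in fixed-size blocks rather than reserving the full context per request matters.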

"We have 20 fine-tuned models for different tasks" Now your platform needs model routing based on user intent. Dynamic loading and unloading so you don't keep 20 models in memory. Shared KV cache across similar base models. LoRA adapter swapping in <100ms. This is where
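One simple way to avoid keeping all 20 models resident is a least-recently-used pool. This is only a sketch: `ModelPool`, `capacity`, and the `loader` callback are hypothetical names, and a real pool would also have to free GPU memory and handle in-flight requests before evicting.

```python
from collections import OrderedDict


class ModelPool:
    """Keeps at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # hypothetical callback: name -> loaded model
        self.loaded = OrderedDict()   # name -> model, oldest first
        self.evicted = []             # eviction log, for observability

    def get(self, name):
        if name in self.loaded:
            # Cache hit: mark as most recently used.
            self.loaded.move_to_end(name)
            return self.loaded[name]
        if len(self.loaded) >= self.capacity:
            # Cache full: unload the least recently used model.
            old_name, _ = self.loaded.popitem(last=False)
            self.evicted.append(old_name)
        self.loaded[name] = self.loader(name)
        return self.loaded[name]


if __name__ == "__main__":
    pool = ModelPool(capacity=2, loader=lambda name: f"weights:{name}")
    for name in ["summarize", "classify", "summarize", "extract"]:
        pool.get(name)
    print(list(pool.loaded), pool.evicted)
```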

Use OpenAI API when you're under 100K requests/month, using standard models, can tolerate 500ms+ latency, and can accept a cost per request around 10x higher than raw compute. Build your own when you have custom models, are doing 500K+ requests/month, need sub-100ms p99, or when cost optimization

Let's do the actual math:

OpenAI GPT-5 pricing: $1.25 per 1M input tokens, $10 per 1M output tokens
1M requests × (1K input tokens + 500 output tokens) = $1,250 input + $5,000 output = $6,250

Your H100 inference platform at $2/hour:
1M requests at 100 req/sec = 2.8 hours = $5.60
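The same arithmetic as a small script, so the inputs can be swapped for your own traffic. The function names are hypothetical; the prices, token counts, and the 100 req/sec throughput are the thread's numbers, and the GPU side counts only rental time for a single H100.

```python
def api_cost(requests, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of an API priced per 1M input and per 1M output tokens."""
    per_request = in_tokens * in_price_per_m + out_tokens * out_price_per_m
    return requests * per_request / 1_000_000


def gpu_cost(requests, req_per_sec, dollars_per_hour):
    """Dollar cost of GPU rental time needed to serve the requests."""
    hours = requests / req_per_sec / 3600
    return hours * dollars_per_hour


if __name__ == "__main__":
    print(f"API: ${api_cost(1_000_000, 1000, 500, 1.25, 10):,.2f}")
    print(f"GPU: ${gpu_cost(1_000_000, 100, 2):,.2f}")
```

The GPU figure comes out a few cents below $5.60 because the thread rounds 2.78 hours up to 2.8.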

Production inference platforms have four layers:
- Request handling (load balancer, rate limiter, queue)
- Orchestration (model router, dynamic batcher, priority scheduler)
- Inference engine (vLLM/TRT-LLM, KV cache manager, multi-GPU coordinator)
- Observability (per-component

The mistakes that kill DIY inference platforms:
> Ignoring queueing theory. Your GPU isn't the bottleneck - your queue is. Requests pile up faster than you can batch them.
> Optimizing throughput over latency. Sure you hit 1000 tokens/sec in aggregate, but user experience is
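The queueing point can be illustrated with the textbook M/M/1 formula for mean time spent waiting in queue, Wq = ρ / (μ − λ), where λ is the arrival rate, μ the service rate, and ρ = λ/μ. Treating a batched GPU server as M/M/1 is a strong simplification and an assumption of this sketch, but the shape of the curve is the useful part.

```python
def mm1_mean_wait_ms(arrival_per_s, service_per_s):
    """Mean time a request waits in queue (ms) for an M/M/1 system."""
    if arrival_per_s >= service_per_s:
        return float("inf")  # the queue grows without bound
    utilization = arrival_per_s / service_per_s
    return 1000 * utilization / (service_per_s - arrival_per_s)


if __name__ == "__main__":
    # Assumed service rate of 100 req/sec; only the load changes.
    for load in (50, 90, 99):
        print(f"{load} req/s -> {mm1_mean_wait_ms(load, 100):.0f} ms mean wait")
```

Going from 50% to 99% utilization multiplies the mean wait by roughly 100 in this model, even though the server itself got no slower.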

Here's where it gets interesting: speculative decoding, prefix caching, and continuous batching work AGAINST each other. Speculative decoding wants more compute upfront for faster generation. Prefix caching wants more memory to reuse common contexts. Continuous batching wants

The production checklist for inference platforms:
> Use continuous batching (vLLM or TensorRT-LLM, not raw PyTorch).
> Implement request prioritization from day one.
> Monitor per-component latency, not just end-to-end.
> Auto-scale based on queue depth, not CPU.
> Track
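As an illustration of scaling on queue depth rather than CPU, here is a minimal sketch of a replica calculation. The name `desired_replicas`, the target of 16 queued requests per replica, and the min/max bounds are all hypothetical values to tune for your own workload.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=1, max_replicas=8):
    """Replica count that keeps queued requests per replica at or below the target."""
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp to the allowed range so the pool never scales to zero or without limit.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 40, 1000):
        print(depth, "queued ->", desired_replicas(depth, target_per_replica=16), "replicas")
```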

That's it for today. Building an inference platform is a 6-month engineering project with hidden costs everywhere. But when you hit scale? It pays for itself in weeks. The key is knowing when to build vs when to rent. See ya tomorrow!

@athleticKoder Building an inference platform isn't just about managing models, it's scaling under real-world load. What's your approach to dynamic loading and routing?

@athleticKoder A lot to unpack here, but very good rundown to find real chokepoints in REAL applications. What I've observed a lot is that engineers derive functional requirements from the initial toy problem, not really for the production case. Just look at KV cache needs, THEY ARE MASSIVE

@curlyhacks1 EXACTLY

@athleticKoder unfortunately for most, "just use openai api" is not at the top of the curve

@athleticKoder Most companies give up and use LLM inference providers to save themselves the time and stress it will take them to build and maintain theirs. This was an interesting breakdown on how it's done 👍

@Paulfruitful_ This🤣🤣🤣

@smakosh @llmgateway dude, gateway doesn't solve this problem.

@athleticKoder Have you tried Ray Serve? To what extent does it solve the problems you describe here?

@Frodo_Mercury nope didn't try ray @robertnishihara is best person to answer this imo

@athleticKoder bookmarking this right now to never read again

@athleticKoder Ah yes, because when the OpenAI API gets too slow or pricey, nothing says 'problem solved' like creating your own infrastructure and becoming the next tech billionaire... Me and my delusional brain...

@KandaBhaji_x If your app goes viral on a free plan be ready to sell your kidney to pay openai bills

@athleticKoder If it's difficult and provides lots of value, there must be a company that sells setting this up as a service, right?

@athleticKoder Sounds like you're talking about using something like LitServe + the lightning platform to help you build your own inference platform. https://github.com/Lightning-A...
