Podcast: Semi Doped
Published On: Tue May 12 2026

Description: Gimlet Labs runs an inference cloud built on heterogeneous silicon. Their software traces a PyTorch workload, segments it into its component parts, and schedules each piece onto the best-suited hardware. Chips from different vendors are connected on a single high-speed fabric.

In this interview, Gimlet co-founder Natalie Serrino and former Intel executive Beltir walk through the architecture: tracing the graph, finding optimal split points, and lowering each segment to TensorRT on NVIDIA and to equivalent toolchains on other hardware. They also cover the three customer segments Gimlet sells into (frontier labs, sovereign clouds, and AI natives) and a concrete demo. On GPT-OSS 120B at 8K input / 1K output tokens, the speculative-decoding draft model runs on a d-Matrix Corsair card while NVIDIA B200s run the verifier. This shifts the throughput-vs-interactivity Pareto frontier roughly 4× relative to GPU-only speculative decoding.

The most surprising takeaway: most Neoclouds gave significant equity to a single silicon vendor in exchange for capacity. Hardware amortization is around 70% of their annual costs, and the equity terms prevent them from diversifying their silicon. As a result, the only software innovation they can ship is disaggregation on top of one vendor's stack, never disaggregation across vendors.
Gimlet's two-track model (deploying orchestration software inside customer data centers, plus running their own Neocloud built on mixed silicon) is the answer to that constraint.

Read the full transcript on Chipstrat.

Chapters:
0:00 Intro and the chips no one's connected before
0:33 Inference cloud for agents
1:02 From Intel to Gimlet
2:14 The case for heterogeneous inference
4:03 Disaggregating inference by resource profile
6:24 Tracing PyTorch into a schedulable graph
8:08 Connecting chips never connected before
10:52 CPUs as the agentic workhorse
12:01 Tool calls in the same data center as the LLM
13:21 Latency vs throughput on a shared fabric
14:57 Three customer buckets
15:54 Sovereigns: make an API call, not a porting project
19:37 "Cracked software is the platform"
22:24 Why merchant silicon vendors need partners
25:18 Hyperscalers outsourcing CapEx, not just kernels
28:49 AI natives: latency budgets, not just price
32:06 The d-Matrix partnership
33:31 The Pareto frontier chart
35:56 Speculative decode on Corsair: 4× shift
37:27 4× faster, or 3× more customers?
41:22 Why most Neoclouds can't follow this model
42:34 Gimlet's two-track business model
44:30 CoreWeave vs Together vs Gimlet
45:15 Series A and hiring

Relevant reading:
The Information on Gimlet helping OpenAI optimize for Cerebras: https://www.theinformation.com/newsletters/ai-agenda/startup-helping-openai-optimize-ai-cerebras-chips
Sachin Katti and Zain Asgar coauthored research at Stanford: https://arxiv.org/abs/2507.19635

Follow Chipstrat:
Newsletter: https://www.chipstrat.com
X: https://x.com/chipstrat