The Inference Shift
How Cheap Chips Could Put Frontier AI in Everyone’s Hands
March 2026
Research and fact-checking assistance from Claude (Anthropic). Specific claims have been verified against primary sources where possible.
Dr. Amara Osei runs a veterinary clinic in Tamale, a city in northern Ghana. She treats livestock for farmers who earn less in a year than an American spends on a single AI subscription. When a goat presents with symptoms she hasn’t seen before, she doesn’t have a colleague down the hall to consult. She has a phone with intermittent data. A shelf of textbooks from 2014. Her own training. A frontier AI model could synthesize the veterinary literature, cross-reference symptoms, and suggest a differential diagnosis in seconds. But frontier AI lives in data centers in Virginia and Oregon, accessed through $20-to-$200 monthly subscriptions that assume a Western salary and reliable broadband. For Dr. Osei, it might as well be on the moon.
This paper is about a set of technologies that could put it in her hands for the price of a used phone. And about why the economic consequences of that are much larger than most people realize.
The thesis
When an AI model answers a question, that process is called inference. Training is where a model learns. Inference is where it works. Right now, inference mostly runs on expensive GPU chips in enormous data centers, and you pay for access by the month or by the token. What I’m arguing is that inference is migrating toward cheap, simple hardware, and that the migration will restructure more of the economy than the current conversation about AI acknowledges.
Five techniques, all independently demonstrated, collectively eliminate most of what makes inference expensive.
The loudest two attack the math. Neural networks normally store their knowledge as precise decimal numbers and multiply millions of them together for every response. Ternary-weight training replaces those decimals with just three values: negative one, zero, and positive one. Multiplication becomes addition and subtraction, the kind of arithmetic a chip from 2012 can handle. Matmul-free architectures push further, stripping out the large matrix multiplications that are the entire reason GPUs became the default AI hardware. After those two changes, inference no longer needs the most expensive chips on Earth.
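To make "multiplication becomes addition and subtraction" concrete, here is a toy sketch of a dot product, the core operation inside a neural network layer, when the weights are restricted to -1, 0, and +1. This is an illustration of the arithmetic, not any lab's actual kernel.

```python
# Illustrative sketch: a dot product with ternary weights {-1, 0, +1}
# needs only addition and subtraction, never multiplication.

def ternary_dot(weights, activations):
    """Accumulate activations according to ternary weight signs."""
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x      # +1 weight: add the activation
        elif w == -1:
            total -= x      # -1 weight: subtract it
        # 0 weight: skip entirely (sparsity for free)
    return total

weights = [1, 0, -1, 1, -1, 0]
activations = [2.0, 5.0, 3.0, 1.0, 4.0, 7.0]
print(ternary_dot(weights, activations))  # 2 - 3 + 1 - 4 = -4.0
```

A chip implementing this loop needs adders and a sign check, nothing more, which is why such old and cheap silicon suffices.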
The other three are less dramatic individually, but they stack. They all attack memory, the bottleneck nobody outside chip design talks about. As a model generates a long response, it carries an expanding internal state called the KV cache. Multi-head Latent Attention compresses that state. KV cache quantization shrinks it again. And Mixture-of-Experts means only a fraction of the model fires for any given question, so you never load the whole thing at once. A model that used to demand a rack of servers might fit on a single commodity board.
If these techniques converge at production scale, the implications are concrete. Inference-optimized chips could be manufactured on 28-nanometer fabrication lines, two generations behind the cutting edge and already operated domestically by China. They could use ordinary memory and open-source RISC-V processor designs that carry no US licensing restrictions. A complete inference device at that point might cost no more than a game console.
I want to be precise about what I’m arguing. This is not a prediction that Nvidia collapses or that data centers become obsolete. Training will still require massive centralized infrastructure for the foreseeable future. The argument is narrower but still significant: the economics of AI inference, currently dominated by centralized GPU clouds, face a repricing over the next five to seven years. Efficiency gains are shifting where the money is made, away from trillion-dollar cloud infrastructure and toward distributed commodity hardware.
How large that repricing gets depends on one open question: whether ternary training can scale to frontier-quality models. BitNet b1.58, the leading ternary approach, has been published at up to 8 billion parameters, with quality matching conventional models demonstrated at 3 billion. Frontier models are 25 to 80 times larger than the biggest published ternary model. Nobody has shown results at that scale yet. Maybe it scales cleanly. Maybe it hits a wall at 30 billion parameters — some mathematical property of ternary weights that breaks when the network gets deep enough. I genuinely don’t know, and I haven’t met anyone who does. But the results so far make me more optimistic than not.
What I do know is this: the technical components exist individually. China’s manufacturing base is positioned to exploit them. Current US export controls, designed to restrict access to advanced chip fabrication, are aimed at the wrong target, because this threat runs on older fabrication lines the controls don’t cover. And the institutional world has barely started thinking about what happens if inference leaves the data center.
Why this matters to actual people
The financial repricing matters to investors. But most people aren’t investors, so let’s talk about the human story.
Right now, AI is a service. You rent access by the month or pay per token. This is fine if you’re a well-funded company or a professional in a wealthy country. It is a wall for most of the world. A $20 monthly subscription is a meaningful expense for a teacher in rural Indonesia. A $200 enterprise tier is out of the question for a two-person engineering consultancy in Bolivia. Per-token API pricing punishes exactly the kind of exploratory, iterative use that produces the most value. The current model concentrates AI capability in organizations that already have resources.
The inference shift would convert AI from a service into a product. A one-time purchase, like a calculator or a laptop. You buy the device, you own the capability. No subscription, no internet requirement, no recurring cost, no one monitoring your queries.
Right now a junior engineer at a 12-person firm and a senior researcher at a multinational have meaningfully different access to AI analysis. Statistical modeling, literature synthesis, design optimization: these require either cloud subscriptions or serious hardware. If a low-cost inference appliance runs frontier-class models locally, both have the same tool. The question becomes whether you know what to ask and what to do with the answer, not whether you can afford the subscription. Expertise wins. Budget stops mattering as much.
Dr. Osei in Tamale could run a veterinary diagnostic model on a device that costs less than the antibiotics she prescribes in a month. A machinist in a small Ohio shop could optimize CNC toolpaths with the same AI that a Boeing engineer uses, without the enterprise contract. A graduate student in Nairobi could do computational chemistry on a device she owns outright, working offline when the power grid is unreliable.
The cost trajectory is already visible, though the target keeps moving because the best available model keeps getting bigger. Running 2024’s best open-source model required a $55,000 server. By early 2025, quantized versions of that same generation fit on a $2,000 consumer GPU. In 2026, the best open models are much larger (DeepSeek-V3 at 671 billion parameters), and running them locally takes a $5,600 Mac Studio. The cost of running any fixed level of capability is falling fast. The cost of running the frontier keeps falling too, just less dramatically, because the frontier itself moves. The question isn’t whether costs are falling. The question is where the floor is and how fast we reach it.
Personal devices aren’t the only beneficiaries. Compute hardware currently accounts for 30 to 50 percent of an autonomous robot’s cost. At commodity chip prices, that drops to single digits, and the bottleneck shifts to mechanical engineering: motors, actuators, materials. Agricultural robots that could scout fields and identify crop disease become economically viable for mid-size farms, not just industrial operations. That transition depends on robotics challenges beyond inference hardware, but the enabling economics are the same.
I have been unable to locate a major think tank, government body, or university that has published a comprehensive analysis of what happens economically when AI inference leaves the data center. The technical community building local inference tools and the institutional community planning for AI’s societal impacts seem to occupy largely separate worlds.
This gap has a precedent. In the early 1990s, engineers building packet-switching networks understood that distributed computing would reshape everything. Institutions making policy still assumed mainframes were permanent.
The efficiency convergence
Three of these five techniques are already running in production. The other two have working prototypes but unanswered questions at scale. To understand why they matter, we need a brief detour into how AI models actually work under the hood.
What’s already here
Mixture-of-Experts (MoE) is the oldest trick on this list and the one nobody argues about anymore. A traditional AI model uses all of its parameters, the numerical values it learned during training, for every query, regardless of what you ask. Wasteful, when you think about it. An MoE model works more like a hospital with specialists. Ask a question and the model routes it to a small group of “expert” sub-networks, leaving the rest idle. DeepSeek-V3 shows how far this goes: 671 billion total parameters, but only 37 billion activate per query. That’s 5.5%. Compute cost scales with the 37 billion, not the 671 billion. Nearly all leading frontier models now use MoE designs.
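A toy version of the routing step makes the "hospital with specialists" picture concrete. The expert count and top-k value below are illustrative, not DeepSeek's actual configuration; real routers are learned layers with load-balancing losses.

```python
# Toy Mixture-of-Experts router: score all experts, activate only the
# top-k. Expert count and k are illustrative, not any real model's.
import math
import random

NUM_EXPERTS = 8   # real MoE models may have hundreds of experts
TOP_K = 2         # experts activated per token

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits):
    """Pick the top-k experts and their renormalized gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
for expert, gate in route(logits):
    print(f"expert {expert}: gate weight {gate:.2f}")

# DeepSeek-V3's headline ratio: active params / total params.
print(f"active fraction: {37 / 671:.1%}")  # 5.5%
```

Everything outside the chosen experts stays idle, which is why compute cost tracks the 37 billion active parameters rather than the 671 billion total.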
Multi-head Latent Attention (MLA) compresses a model’s working memory. As an AI generates text, it keeps a running record of everything it has processed so far, called the key-value cache. Think of it as a scratchpad that grows with every sentence. In long conversations, this scratchpad gets enormous and expensive. MLA compresses it. DeepSeek’s implementation reduces the cache by 93-98% compared to standard designs. At 128,000 tokens of context, roughly a 200-page book, the entire cache fits in about 8.6 GB. Before MLA, the binding constraint on running a model was often conversation length. After MLA, it’s storing the model’s learned weights. The scratchpad is no longer the problem. Other companies, including Moonshot AI and Zhipu AI, have adopted MLA in their own models.
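The 8.6 GB figure can be reproduced with back-of-envelope arithmetic. The calculation below assumes DeepSeek-V3's published shape (61 layers, a 512-dim compressed latent plus a 64-dim positional component per token per layer, 2-byte values) and compares it against a hypothetical uncompressed baseline with the same 128 attention heads of dimension 128.

```python
# Back-of-envelope KV cache sizes, MLA vs an uncompressed baseline.
# Dimensions assume DeepSeek-V3's published architecture.

def mla_cache_gib(tokens, layers=61, latent_dim=512, rope_dim=64, bytes_per_value=2):
    per_token = layers * (latent_dim + rope_dim) * bytes_per_value
    return tokens * per_token / 2**30

def standard_cache_gib(tokens, layers=61, heads=128, head_dim=128, bytes_per_value=2):
    # Uncompressed multi-head attention stores full keys AND values.
    per_token = layers * 2 * heads * head_dim * bytes_per_value
    return tokens * per_token / 2**30

ctx = 128 * 1024  # roughly a 200-page book of context
print(f"MLA cache:      {mla_cache_gib(ctx):.1f} GiB")       # ~8.6
print(f"baseline cache: {standard_cache_gib(ctx):.0f} GiB")
print(f"reduction:      {1 - mla_cache_gib(ctx) / standard_cache_gib(ctx):.1%}")
```

The reduction lands at the high end of the 93-98% range quoted above; the exact number depends on which baseline you compare against.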
Post-training quantization is the simplest idea of the three: round the numbers. A full-precision model stores each parameter as a 16-bit decimal. Quantization rounds those values to fewer bits, shrinking the model’s memory footprint the way compressing an image makes the file smaller at the cost of some detail. INT4 quantization (4 bits per value) preserves roughly 98-99% of the original model’s quality at a quarter of the memory cost. Over 6,500 models on HuggingFace already use it. Google’s TurboQuant, presented at ICLR 2026, achieves 3-bit compression of the key-value cache with what Google describes as negligible accuracy loss. Unsloth’s 1.58-bit dynamic weight quantization enables running DeepSeek R1, normally 1.4 terabytes at full precision, in roughly 131 GB.
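A minimal sketch shows the round-the-numbers idea. This is symmetric per-tensor INT4 in its simplest form; production schemes (GPTQ, AWQ, the HuggingFace variants mentioned above) add per-group scales, zero points, and outlier handling, which is how they keep quality near 98-99%.

```python
# Minimal post-training quantization sketch: map 16-bit float weights
# onto 4-bit signed integers, then dequantize. Real schemes are more
# sophisticated; this shows only the core round-trip.

def quantize_int4(weights):
    """Symmetric INT4: scale floats onto integers in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.31, 0.07, 2.96, -0.58]
q, scale = quantize_int4(weights)
for w, r in zip(weights, dequantize(q, scale)):
    print(f"{w:+.2f} -> {r:+.2f} (error {abs(w - r):.3f})")
```

Each weight now occupies 4 bits plus a shared scale instead of 16 bits, a 4x memory saving, at the cost of the small rounding errors the loop prints.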
Stack these techniques and the numbers get striking. A Llama 70B model that needs 181 GB at full precision runs in about 21 GB with 1.58-bit weights and a 3-bit cache. That is an 8.6x compression. The model that needed a $55,000 server now fits on a $2,000 consumer GPU. These quantized models exist and run today, with modest but measurable quality degradation.
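The stacked figures can be checked with rough arithmetic. The sketch below assumes a Llama-2-70B-style layout (80 layers, 8 grouped KV heads of dimension 128) and a 128K-token context; the totals land within a few percent of the 181 GB and 21 GB quoted above, with the residue down to rounding and which buffers you count.

```python
# Back-of-envelope memory budget for a 70B model, full precision vs
# stacked compression (ternary weights + 3-bit KV cache). Architecture
# numbers assume a Llama-2-70B-style layout.
PARAMS = 70e9
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
TOKENS = 128 * 1024

def weights_gb(bits):
    return PARAMS * bits / 8 / 1e9

def kv_cache_gb(bits):
    per_token = LAYERS * 2 * KV_HEADS * HEAD_DIM  # keys + values
    return TOKENS * per_token * bits / 8 / 1e9

full = weights_gb(16) + kv_cache_gb(16)         # fp16 everything
compressed = weights_gb(1.58) + kv_cache_gb(3)  # 1.58-bit weights, 3-bit cache
print(f"full precision: {full:.0f} GB")
print(f"compressed:     {compressed:.0f} GB")
print(f"ratio:          {full / compressed:.1f}x")
```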
What’s promising but unproven
The two remaining techniques are more speculative, and the uncertainty matters.
BitNet b1.58 is a Microsoft Research project that trains models from scratch using only three values per parameter: -1, 0, and +1. Where quantization rounds a finished model’s numbers down, BitNet builds a model that never uses large numbers in the first place. In Microsoft’s published benchmarks, a 3-billion-parameter BitNet model matches the quality of a conventional model the same size while using 71% less energy for each token it generates. The arithmetic inside BitNet is almost entirely integer addition. No floating-point multiplication. This matters enormously for hardware design, which we’ll come back to.
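The published BitNet b1.58 recipe ternarizes each weight tensor by its mean absolute value ("absmean" quantization). The sketch below shows that step in isolation; in actual BitNet training it runs inside the forward pass with a straight-through estimator so gradients can flow through the rounding.

```python
# Sketch of BitNet b1.58-style "absmean" quantization: scale a weight
# tensor by its mean absolute value, then round each entry to -1, 0,
# or +1. Training-time gradient handling is omitted.

def absmean_ternarize(weights):
    gamma = sum(abs(w) for w in weights) / len(weights) or 1.0
    ternary = [max(-1, min(1, round(w / gamma))) for w in weights]
    return ternary, gamma  # gamma survives as a per-tensor scale

w = [0.31, -0.09, 0.77, -0.52, 0.02, -0.88]
t, gamma = absmean_ternarize(w)
print(t)  # [1, 0, 1, -1, 0, -1]
```

After this step, every weight is one of three values, so the layer's arithmetic reduces to the add-and-subtract pattern described earlier, with one floating-point rescale per tensor.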
But there’s a catch. Microsoft has published results only up to about 8 billion parameters. They have not demonstrated BitNet at the 70-billion or 400-billion parameter scales where today’s most capable models operate. That silence is worth taking seriously. If BitNet worked beautifully at 100 billion parameters, we would expect Microsoft to say so. The scaling properties of ternary-weight models remain an open question. There are theoretical reasons to think they might scale (larger models have more redundancy to exploit) but “might” and “does” are different words.
Matmul-free transformers, from UC Santa Cruz, push even further. They replace the matrix multiplications at the core of transformer models with simpler operations like element-wise addition and ternary accumulation. Early results at small scale show competitive quality with 61% lower energy use during generation. But like BitNet, they have been tested only on small models. The scaling question is the same, and the answer is the same: we don’t know yet.
None of this is proven at frontier scale. But the results so far are strong enough, and the economic incentives large enough, that dismissing the possibility would be a mistake. This is the kind of trajectory that serious planners should account for.
The hardware implications of ternary arithmetic
If ternary inference does scale, the hardware consequences are dramatic. A ternary multiply-accumulate unit needs around 5 transistors per operation. A floating-point unit needs 5,000 to 10,000. That is a thousand-to-one difference in silicon complexity for the core arithmetic. A chip built for ternary inference could pack far more compute onto the same area of silicon, using older and cheaper manufacturing processes.
And the energy comparison is just as lopsided. A single floating-point multiplication at 16-bit precision costs roughly 3.7 picojoules. A ternary integer addition: about 0.03 picojoules. Hundred-to-one per operation.
But here is the admission that complicates the clean narrative: in real AI workloads, the energy cost of moving data between memory and processor dominates the energy cost of the arithmetic itself. A DRAM memory access costs about 640 picojoules. That single memory fetch burns more than 20,000 times the energy of a ternary addition and roughly 170 times that of a floating-point multiply. The memory wall, the gap between how fast processors compute and how fast memory can feed them, is the actual bottleneck. Ternary arithmetic widens that gap further because the compute becomes so cheap that the processor spends even more of its time waiting for data.
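The per-operation figures above can be turned into a quick energy budget. The reuse factor in the second half is a hypothetical stand-in for on-chip caching; the point is how quickly the memory term swamps the arithmetic term.

```python
# Quick energy budget from the per-operation estimates above
# (approximate published figures; exact values vary by process node).
FP16_MUL_PJ = 3.7      # one 16-bit floating-point multiply
TERNARY_ADD_PJ = 0.03  # one integer addition for a ternary weight
DRAM_FETCH_PJ = 640.0  # one DRAM access

print(f"fp16 multiply vs ternary add: {FP16_MUL_PJ / TERNARY_ADD_PJ:.0f}x")
print(f"DRAM fetch vs ternary add:    {DRAM_FETCH_PJ / TERNARY_ADD_PJ:.0f}x")
print(f"DRAM fetch vs fp16 multiply:  {DRAM_FETCH_PJ / FP16_MUL_PJ:.0f}x")

# The memory wall in one line: amortize one DRAM fetch over N
# arithmetic ops (N is a hypothetical on-chip reuse factor) and the
# fetch still dominates, so cheap math alone buys little.
reuse = 100
fp16_budget = FP16_MUL_PJ + DRAM_FETCH_PJ / reuse
ternary_budget = TERNARY_ADD_PJ + DRAM_FETCH_PJ / reuse
print(f"net energy advantage at reuse={reuse}: "
      f"{fp16_budget / ternary_budget:.1f}x")
```

A hundredfold cheaper arithmetic collapses to a small single-digit system-level win once memory traffic is included, which is exactly why the compression techniques matter as much as the cheap math.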
This means ternary inference chips are not magic. They need careful memory system design to realize their theoretical advantage. Model compression helps (smaller models mean fewer memory fetches), and techniques like MLA reduce cache pressure, but the memory wall remains real. Any serious assessment of ternary hardware must account for it.
That said, the combination of ternary arithmetic, aggressive quantization, and MLA-style cache compression attacks the problem from multiple angles simultaneously. Simpler arithmetic. Smaller models. Smaller caches. Less data to move. No single technique solves the memory wall. Together, they erode it.
China and the export control paradox
Since 2022, the United States has imposed increasingly strict export controls on semiconductor technology sold to China. The restrictions target the hardware needed to train large AI models: advanced lithography machines, high-bandwidth memory, and cutting-edge chips from NVIDIA and AMD. The logic is straightforward. Training frontier AI models requires the most advanced hardware on earth. Control the hardware, control the capability.
For training, this logic holds. Building a model like GPT-5 or DeepSeek-V4 from scratch demands thousands of the most advanced GPUs available, and those GPUs require manufacturing processes China cannot yet replicate domestically. Export controls impose real costs and real delays on China’s ability to train frontier models.
For inference, the logic is collapsing.
The domestic supply chain
China has been building semiconductor manufacturing capacity at older process nodes for years, steadily enough that it’s easy to overlook. SMIC, China’s largest chipmaker, is expanding its 28-nanometer production lines. By training-chip standards these are not cutting-edge, but 28nm is a mature, well-understood process that China can run without any imported equipment subject to current restrictions. Meanwhile, Chinese memory manufacturers are producing commodity DDR5, and Chinese companies have adopted RISC-V, the open-source processor architecture, for chip designs that owe nothing to American or British intellectual property.
None of this is secret. The question that matters is: what kind of AI hardware can you build with these ingredients?
A rough sketch of an inference appliance
If ternary or low-bit integer inference works at scale, the hardware requirements drop to a level that China’s domestic supply chain can meet entirely. A rough bill of materials — not an engineering specification, just a sketch of the cost structure — looks something like this: a RISC-V chip with a ternary accelerator, fabbed at 28nm, runs $8 to $20. Add 64 GB of commodity LPDDR5 memory at $30 to $60. The rest — board, power supply, enclosure, networking, storage — comes in around $20 to $35. Total bill of materials: somewhere between $60 and $115.
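Summing the sketch's ranges makes the cost structure explicit. These are the article's rough figures, not sourced component quotes, and the labels are illustrative.

```python
# The bill-of-materials sketch above as (low, high) USD ranges.
# Rough figures only; not sourced component quotes.
bom = {
    "RISC-V SoC with ternary accelerator (28nm)": (8, 20),
    "64 GB commodity LPDDR5": (30, 60),
    "board, PSU, enclosure, networking, storage": (20, 35),
}
low = sum(lo for lo, hi in bom.values())
high = sum(hi for lo, hi in bom.values())
print(f"total BOM: ${low} - ${high}")  # prints: total BOM: $58 - $115
```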
Those numbers are approximate. Real hardware development introduces costs they ignore: firmware, testing, yield losses, software ecosystem. But the order of magnitude is what matters. We are not talking about million-dollar server racks. We are talking about something in the price range of a game console.
The mismatch
Put the two lists side by side — what export controls restrict, and what a ternary inference chip actually needs — and almost nothing overlaps. EUV lithography is restricted; ternary chips use DUV, which is unrestricted and domestically available in China. Compute chips above certain performance thresholds are restricted; integer addition chips fall so far below those thresholds they wouldn’t trigger a second look. High-bandwidth memory is restricted; ternary inference runs on commodity DDR5 that Chinese fabs already produce. Every component in the restricted column has a commodity substitute in the ternary column. Even the processor architecture dodges the problem: RISC-V is open-source, no American or British license required.
The export control regime was designed to restrict a type of computing — dense floating-point matrix multiplication at scale — that ternary inference largely abandons.
The feedback loop
This is where the analysis turns uncomfortable for policymakers, because the dynamic is self-reinforcing:
1. The US restricts China’s access to advanced training hardware.
2. Chinese labs, squeezed for resources, optimize aggressively for efficiency.
3. Those efficiency innovations are published as open-source research and open-weight models.
4. The innovations reduce the hardware requirements for running AI inference.
5. The reduced requirements fall below export control thresholds.
6. China manufactures compliant inference hardware on unrestricted 28nm production lines.
7. Go to step 1.
Each turn of this loop makes inference cheaper, less dependent on advanced hardware, and harder to control. The pressure that export controls apply to training is the same pressure that drives the efficiency research that undermines export controls on inference. The policy contains the seed of its own obsolescence.
DeepSeek is already a product of this dynamic. Constrained by chip restrictions, their team produced some of the most important efficiency innovations in the field (MLA, aggressive MoE routing, training optimizations) and published them openly. The restrictions worked exactly as intended on the training side: they forced Chinese labs to do more with less. The unintended consequence is that “more with less” is precisely the formula that makes inference hardware simple enough to escape the controls entirely.
You cannot embargo addition.
That sentence sounds glib, but it captures a real structural problem. Export controls are built around the complexity of the hardware they restrict. Floating-point matrix multiplication at scale requires complex, advanced hardware. Integer addition does not. If the core operation of AI inference shifts from the former to the latter, the controls lose their technical footing. You would need to restrict something so basic, so universal, that the restrictions would sweep in vast categories of ordinary computing equipment. No current policy framework is designed to do that, and designing one would be enormously disruptive to global trade in electronics.
Policy options, honestly assessed
What can policymakers actually do? The options are limited, and none of them are clean.
They could tighten thresholds: lower the performance ceiling that triggers restrictions. This buys time but accelerates the feedback loop: tighter constraints produce more efficiency pressure, which produces more innovations that route around the constraints. The treadmill speeds up.
They could try to restrict model weights instead of hardware. Control the distribution of trained models rather than the chips that run them. This would be a major policy shift and faces brutal enforcement problems. Model weights are files. They can be copied, shared, and distributed over the internet. Once an open-weight model is released, restricting its spread is more like controlling information than controlling physical goods. Some policy thinkers advocate this approach anyway, arguing that imperfect enforcement is better than none.
They could accept that inference will become cheap and widely available, and pour resources into maintaining the training lead, making sure the most capable models are trained in the US and allied countries first. This concedes the inference layer but tries to hold the frontier at the training layer. Probably the most realistic option, but it means accepting a world where American-trained models run on Chinese-built hardware.
Or they could negotiate. Use the current hardware advantage as leverage in broader technology talks before it erodes. The US has more bargaining power today than it will in three years.
None of these options are comfortable. The feedback loop does not have an obvious off switch. But pretending the loop doesn’t exist, continuing to treat export controls as a durable solution for controlling AI inference capability, is the least realistic option of all.
The money
Here is the uncomfortable math behind the AI boom: in 2026, the major technology companies will spend somewhere between $600 billion and $700 billion building AI infrastructure. By the end of 2028, cumulative spending will approach $3 trillion. These are real dollars being converted into physical things: chips, servers, cooling systems, power substations, buildings full of humming metal.
The revenue to justify that spending does not yet exist. OpenAI, the most commercially successful AI company in the world, brought in roughly $13.1 billion in revenue in 2025 against $8 billion in operating costs. That’s not a disaster, but it’s not the kind of return that justifies trillions in industry-wide infrastructure. Across big tech, 94% of operating cash flows are now going to capital expenditures. The companies building AI are, collectively, betting almost everything they earn on the conviction that demand catches up before the bills come due. Whether it does is the question nobody can answer honestly.
The Jevons Paradox in action
In 1865, the economist William Stanley Jevons observed something counterintuitive about coal. As steam engines became more efficient and used less coal per unit of work, total coal consumption didn’t fall. It rose. Dramatically. Cheaper energy per task meant people found more tasks worth doing.
That same pattern is playing out in AI inference right now, and a single data point from January 2025 shows how. When DeepSeek released its R1 model, demonstrating that capable reasoning models could be trained for a fraction of what Western labs were spending, the market panicked. NVIDIA lost $589 billion in market capitalization in a single day. Cheaper AI means less hardware, so the buildout is overkill. That was the logic, and it seemed obvious. Instead, spending accelerated. Markets panicked for a day and then doubled down.
Stanford’s HAI index tells the fuller story. Since November 2022, the cost of running an inference query has fallen roughly 280-fold. Two hundred and eighty times cheaper. If the simple economics of “cheaper means less spending” held, infrastructure investment should have cratered. Instead, it grew enormously, because every time inference got cheaper, people found new things to do with it. Coding assistants. Document analysis. Image generation. Agentic workflows that chain dozens of inference calls together. The denominator shrank, but the numerator exploded.
This is Jevons, exactly. Cheaper AI per query means more queries, not fewer.
But Jevons has limits. Coal consumption did eventually plateau, not because efficiency stopped improving, but because alternatives emerged and saturation set in. The question is whether AI inference demand will keep outrunning efficiency gains for another two years, another five, or another decade. Nobody actually knows. The current trajectory favors continued growth, but trajectories change.
The smartphone parallel
There’s a historical analogy that helps here. When smartphones put a capable computer in every pocket, many people assumed mobile computing would cannibalize cloud computing. Why pay for remote servers when everyone carries a processor?
What actually happened was the opposite. Smartphones created entirely new categories of cloud-dependent applications. Photo backup, streaming video, ride-hailing, real-time navigation: all of these run partly on your phone and partly in a data center. Mobile devices didn’t replace the cloud. They became the cloud’s best customers.
If local AI inference follows the same pattern, and there are good reasons to think it might, then putting AI capability on personal devices won’t eliminate demand for cloud AI. It will create new hybrid workflows where local models handle simple tasks and cloud models handle the hard ones. Your device drafts the email; a data center reviews the contract. Both need infrastructure.
This is the strongest bull case for current spending levels: not that local inference won’t happen, but that it will feed cloud demand rather than replace it.
But the bull case has a gap. Jevons tells you total demand grows. It doesn’t tell you that every provider’s share grows with it. If local devices absorb the simpler half of all inference queries, cloud demand needs to more than double just to keep the data centers full. Total inference could increase tenfold and centralized infrastructure could still be overbuilt, if enough of that growth happens on devices people own.
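A toy scenario shows how total growth and cloud share can move in opposite directions. The growth multiples and local-share fractions below are illustrative, not forecasts.

```python
# Toy scenario: total inference demand grows by some multiple while
# local devices absorb a share of it. Cloud demand is what remains.
# All numbers are illustrative, not forecasts.
baseline_cloud = 1.0  # today's centralized inference, normalized

scenarios = [
    # (total growth multiple, fraction served locally)
    (2, 0.5),    # demand doubles, local takes half: cloud flat
    (10, 0.9),   # demand 10x, local takes 90%: cloud still flat
    (10, 0.5),   # demand 10x, local takes half: cloud 5x
]
for growth, local_share in scenarios:
    cloud = baseline_cloud * growth * (1 - local_share)
    print(f"{growth}x total, {local_share:.0%} local -> "
          f"cloud demand {cloud:.1f}x today")
```

The middle row is the overbuild case: inference tenfold, data centers no fuller than today.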
Training as ballast
Training frontier AI models is expensive and getting more so. GPT-4’s training run cost over $100 million. The next generation of models will likely cost billions. The generation after that, tens of billions. And training is not a one-time expense: companies retrain, fine-tune, and run continuous evaluation pipelines that consume enormous compute.
Even if inference shifts substantially toward local devices, training will remain centralized for the foreseeable future. You cannot train a frontier model on a desktop. The physics of it, the data movement, the parallelism, the sheer scale, requires exactly the kind of dense, high-bandwidth infrastructure being built today. This means a large share of current capital spending is justified by training demand alone, regardless of what happens with inference.
Who’s exposed
Not everyone is equally exposed. Smaller cloud companies that built their entire business around renting GPU time for inference sit closest to the blast radius. If a meaningful share of inference moves to personal devices, their core product shrinks. Microsoft, Google, and Amazon have more diversified bets and can repurpose infrastructure across training, inference, and their own products. Companies supplying training infrastructure (high-end chips, high-bandwidth memory, networking) are probably the safest, since training demand shows no signs of slowing regardless of what happens at the edge.
Got a 401(k)? Index funds? A retirement account tilted toward US tech? Then you already own a piece of this bet, whether you meant to or not. All seven of the largest companies in the S&P 500 are pouring money into AI infrastructure. If inference economics get repriced, that doesn’t just move stock tickers. It ripples into the retirement savings of tens of millions of people who never made a conscious decision to bet on data centers.
On the employment side, there’s a pattern here worth worrying about. Right now the buildout has created a surge of jobs: data center construction, operations, the energy infrastructure feeding all of it. If centralized inference demand grows more slowly than planned, some of those jobs prove temporary. And for anyone whose livelihood depends on selling AI access (consultants reselling cloud API subscriptions, startups built as thin wrappers around hosted models) the shift from service to product changes the ground under their feet. The value moves from access to expertise: not “can you get me AI?” but “can you help me use it well?”
I want to be fair here: the spending is not irrational. It is a bet on a particular future, one where centralized inference demand keeps growing fast enough to fill all these data centers. That future is plausible. It might even be probable. But it is not certain, and $3 trillion is a lot of money to bet on a plausible-but-uncertain outcome.
A rough timeline
What follows is my best guess at timing, and I want to be upfront about how often technology timelines age badly. Anyone who predicted self-driving cars by 2020 should be humble right now. With that said:
2026-2028: The hybrid era
This phase is already underway. Quantized versions of large language models, compressed to run in less memory at some cost to precision, are running on high-end consumer hardware today. A machine with a good GPU and 32 to 64 gigabytes of memory can run models that would have required a data center two years ago.
The practical experience during this phase: a technically inclined person with $5,000 to $10,000 to spend on hardware can run AI models locally that perform at maybe 80 to 95 percent of the best cloud-hosted models. The gap is real but shrinking. For writing assistance, code completion, document summarization, and image generation, local models are already good enough for most tasks. For complex reasoning, long-context analysis, or cutting-edge multimodal work, cloud models still win clearly.
A technology to watch is LPDDR6, the next generation of low-power system memory, expected to reach consumer devices in this window. It could enable desktop machines with 256 to 512 gigabytes of unified memory at reasonable prices, which would dramatically expand what local models can do. Memory, not processing power, is the primary bottleneck for running large models locally.
For most people during this phase, cloud AI remains the default. Local inference is a hobbyist pursuit and a tool for developers. But the seeds of the shift are planted.
2028-2030: The fork
This is where the timeline genuinely branches, and honesty requires presenting both paths.
Path A: Ternary computing works at scale. If the research into ternary neural networks delivers on its promise, and the early results give real reason to think it might, the implications are dramatic. A model running in ternary uses a fraction of the memory and energy of the same model in conventional formats. Combined with continued hardware improvements, this could put genuinely powerful AI into a low-cost desktop device. Not a hobbyist rig. A consumer appliance, sold at Best Buy, that runs local AI well enough for everyday use. In this future, the shift from cloud to local inference happens fast, within a product cycle or two.
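The core trick Path A depends on fits in a few lines. This is a toy sketch of BitNet-style "absmean" ternarization with made-up weights, not the actual training procedure (which quantizes during training rather than after), but it shows why multiplication disappears:

```python
# Toy ternary quantization. Weights and input are invented for illustration.
W = [[0.42, -0.91, 0.07], [-0.33, 0.58, -0.12]]   # full-precision weights
x = [1.0, 2.0, 3.0]                                # input activations

# "Absmean" scale: the mean absolute weight value.
scale = sum(abs(w) for row in W for w in row) / sum(len(r) for r in W)

# Quantize each weight to -1, 0, or +1.
ternary = lambda w: max(-1, min(1, round(w / scale)))
Wt = [[ternary(w) for w in row] for row in W]

# The matrix multiply now needs no multiplication at all:
# add x[j] where the weight is +1, subtract where it is -1.
def row_dot(row, x):
    acc = 0.0
    for w, xj in zip(row, x):
        if w == 1:
            acc += xj
        elif w == -1:
            acc -= xj
    return acc * scale   # one multiply per output to undo the scaling

y = [row_dot(row, x) for row in Wt]
print(Wt, [round(v, 3) for v in y])
# → [[1, -1, 0], [-1, 1, 0]] [-0.405, 0.405]
```

The inner loop is pure addition and subtraction, exactly the arithmetic a 2012-era chip handles comfortably.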
Path B: Ternary doesn’t scale. Most promising research directions hit walls before reaching commercial viability, and this one might too. In this case, the shift still happens, just slower and driven by conventional improvements in chip efficiency and quantization techniques like INT4 and INT8. Local inference hardware stays more expensive than in Path A; it still handles most daily AI tasks, but with more noticeable trade-offs in quality. Cloud remains dominant for anything demanding. The infrastructure buildout has more time to generate returns before demand shifts.
What does this feel like for a regular person? In Path A, by 2029 you might buy a device that sits on your desk and handles almost everything you currently use ChatGPT for, without an internet connection and without a subscription fee. In Path B, you get partway there: local AI handles your simpler requests, but you still reach for cloud services a few times a week for the harder stuff.
2030-2032: The new normal
By the early 2030s, under either path, routine AI inference has largely moved to local devices. A low-cost device, whether it’s a dedicated appliance, a feature of your next computer, or built into your phone, runs models equivalent to what the best cloud services offered in 2028 or 2029. For writing, coding, image work, conversation, and analysis of personal documents, there’s no reason to send your data to someone else’s server.
But a permanent two-tier structure emerges, and this is the most important prediction in this section. Local devices handle everyday AI tasks. Cloud infrastructure handles the hardest problems: training new frontier models, running the largest and most capable systems, processing workloads that require more memory or compute than any personal device can offer. This mirrors what happened with computing generally. You do most things on your laptop. Some things still need a server farm.
The fiber optic crash of 2000-2002 offers a useful parallel for what this transition might look like economically. In the late 1990s, telecom companies laid enormous amounts of fiber optic cable based on projections of internet traffic growth. The projections were actually right: internet traffic did grow as fast as predicted and then some. But the companies that built the infrastructure mostly went bankrupt anyway, because they’d borrowed too much money and the technology improved faster than expected, meaning less fiber was needed per unit of traffic than they’d planned for. The infrastructure eventually got used. The investors who paid for it mostly lost their money.
Something similar could happen with AI data centers. The demand for AI will probably keep growing. But if local inference absorbs a meaningful share of that demand, some fraction of the centralized infrastructure being built today won’t be needed at the scale its builders are planning for. Not all of it. Not even most of it, probably. But enough to matter financially.
This is the central tension of the next five years: AI demand is real and growing, the technology to serve that demand locally is also real and growing, and trillions of dollars are being wagered on the assumption that centralized infrastructure will remain the primary way people access AI. That assumption holds today. It probably holds tomorrow. But by 2030, it will be at least partially wrong, and the financial consequences of “partially wrong” at this scale are serious.
The Western response
A natural objection: if China ships a 28 nm ternary inference chip, Western chipmakers will respond with something better. Almost certainly. And that response accelerates the disruption rather than preventing it.
A Western ternary chip fabricated at TSMC’s 3 nm node would be dramatically more capable than a Chinese 28 nm equivalent. The physics are straightforward. Smaller transistors mean more of them per square millimeter, which means more parallel processing elements on the same piece of silicon.
| | 28 nm (Chinese fab) | 3 nm (TSMC) |
| --- | --- | --- |
| Processing elements per chip | ~8,750 | ~100,000+ |
| Ternary ops/sec | 80-140T | 1,000-4,000T |
| Power draw | 5-15 W | 10-20 W |
| Chip manufacturing cost | ~$2.80 | $30-50 |
| Retail price at volume | game-console range | below game-console range |
The 3 nm version delivers 10-30x more compute at similar power. You could shrink the chip to 10 mm² and still match the 28 nm version’s performance. That’s small enough to fit in a pair of glasses or a wristwatch. A locally-run language model on your face, no cloud connection required.
So picture the competitive sequence. China ships a 28 nm ternary chip at game-console pricing. It’s slow by frontier standards but good enough for 7-14B parameter models at conversational speed. Western companies respond within 12-18 months with a 3 nm version: cheaper, 10-30x the performance, runs 200B+ dense models or 600B+ MoE models locally with the compression techniques described earlier. Consumers get a better product at a lower price. The market is validated. Inference hardware becomes a commodity. Margins collapse to consumer electronics levels, somewhere around 15-20%.
Several Western companies are positioned to make this move. Tenstorrent, Jim Keller’s RISC-V inference startup, closed a $693M Series D. Apple already has unified memory architecture and a mature ML framework. Qualcomm ships neural processing units (NPUs) in over 80% of mobile devices sold worldwide. And NVIDIA itself spent roughly $20B to acquire Groq, a company built entirely around inference-specialized chips.
You don’t spend $20B on an inference-specialized company if you believe GPU-based inference is safe. The Groq deal is NVIDIA hedging against its own product line.
And here’s why none of this saves the GPU pricing model. Whether the winning chip is made in Shenzhen or Hsinchu, the economic result is the same: inference migrates from $25,000 GPUs drawing 700W to consumer-priced chips drawing 5-20W. NVIDIA’s training revenue probably survives. Training workloads genuinely need the massive parallelism and memory bandwidth that H100s and B200s provide. But training is the smaller market. Inference is where the volume lives, and inference margins are the ones about to compress.
Which means the Western response doesn’t slow anything down. It finishes the job.
What happens next
Two variables control the pace of everything above. The first is whether ternary-native training scales to 70B+ parameter models. Current evidence says yes for small models. Whether the approach holds at scale is an open experimental question, not a theoretical one. Someone has to build the thing and find out. The second variable is whether anyone actually commits to fabricating the chip. A design on paper is worth nothing. Tape-out, the step where a chip design gets sent to a foundry for manufacturing, is expensive, and the first mover takes real financial risk on an unproven architecture.
A third variable is more predictable: DRAM supply normalization. Memory prices follow well-documented commodity cycles. The current AI-driven shortage will ease as new fab capacity comes online, and when it does, the cost of the 64-128 GB memory configurations that local inference requires will drop by half or more. This isn’t a question of if. It’s a question of which quarter.
The financial stakes are large enough to be destabilizing. If inference becomes a commodity hardware problem, hundreds of billions of dollars in current market valuations are wrong. Not slightly wrong. Wrong by multiples. Companies priced on the assumption that inference requires expensive accelerators are priced on an assumption with a short shelf life.
The geopolitical consequences might matter more. Export controls on advanced chips are the primary lever the United States uses to slow Chinese AI development. If China routes around those controls by building inference hardware on nodes it already has, that lever stops working. Not because the controls are lifted, but because the thing they restrict is no longer the thing that matters. The policy designed to maintain advantage may be the policy that eliminates it, by forcing the development of an architecture that makes the restricted technology less relevant.
I don’t know if the first ternary chip ships in 2027 or 2030 or never. The technical risks are real, and I’ve laid them out as clearly as I can. But I keep coming back to the economics: the incentives are enormous, the physics are permissive, and multiple independent actors are converging on the same conclusion from different directions. That pattern usually means something.
Think about Dr. Osei in Tamale. What changes for her isn’t abstract. It’s a box on her desk that costs less than her monthly supply budget. It runs a diagnostic model trained on every veterinary textbook published in the last twenty years. It works when the internet doesn’t. Nobody charges her per question. The technology to build it exists in pieces today. The economics say someone will assemble those pieces. The only real questions are when, and whether the institutions that could prepare for this shift will notice in time.
The metric to watch: the cheapest complete system that can run the current best open-source language model at interactive speed. In 2024, that cost was roughly $55,000. By early 2026, it had fallen to about $5,600 (for a much larger model). Track this number. When it reaches consumer electronics pricing, the shift is no longer speculative. It’s underway. When it reaches impulse-purchase territory, it’s over.
We may look back on the export control era the way we now look at 1970s oil embargoes: a period when restricting a critical resource didn’t weaken the target so much as it forced them to find something better. That is not a comfortable conclusion. But it is where the evidence points.
Further reading
The claims in this essay draw on primary sources. For readers who want to go deeper, these are good starting points:
Ternary training: Ma et al., “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits” (Microsoft Research, Feb 2024). The foundational BitNet b1.58 paper. Also: “BitNet b1.58 2B4T Technical Report” (arXiv:2504.12285) for the open-source 2B release, and “Bitnet.cpp: Efficient Edge Inference for Ternary LLMs” (ACL 2025) for CPU inference benchmarks.
Matmul-free models: Zhu et al., “Scalable MatMul-free Language Modeling” (UC Santa Cruz, NeurIPS 2024; arXiv:2406.02528). Demonstrates competitive LLM performance without matrix multiplication at up to 2.7B parameters.
DeepSeek architecture: “DeepSeek-V3 Technical Report” (arXiv:2412.19437). Covers Multi-head Latent Attention, MoE routing, and the training efficiency innovations. Also: Epoch AI, “How Has DeepSeek Improved the Transformer Architecture?” for an accessible walkthrough.
KV cache compression: Google Research, “TurboQuant: Redefining AI Efficiency with Extreme Compression” (ICLR 2026). Compresses key-value caches to 3 bits with no retraining and no measurable accuracy loss. Tom’s Hardware and VentureBeat both have accessible writeups.
Inference cost trends: Stanford HAI, “AI Index Report 2025.” Documents the 280x decline in inference costs since November 2022 and hardware cost trends.
China semiconductor expansion: Rhodium Group, “Thin Ice: US Pathways to Regulating China-Sourced Legacy Chips” (2025). Also: DIGITIMES, “China’s 28nm Foundry Capacity to Hit 31% by 2027” (May 2025) for SMIC and HLMC expansion specifics.
Export control dynamics: US-China Economic and Security Review Commission, “Two Loops: How China’s Open AI Strategy Reinforces Its Industrial Dominance” (March 2026). The “digital loop / physical loop” framework referenced above. Also: American Affairs, “Innovation Under Pressure: China’s Semiconductor Industry at a Crossroads” (Feb 2026).
RISC-V and Chinese chip design: Reuters, “Alibaba Unveils Next-Gen Chip for Agentic AI” (March 2026). Covers the XuanTie C950, currently the highest-performing RISC-V processor.
Inference hardware competition: SemiAnalysis, “Nvidia: The Inference Kingdom Expands” (GTC 2026 coverage). Covers the $20B Groq acquisition and NVIDIA’s inference strategy. Also: Tenstorrent’s Series D announcement (Dec 2024) for the RISC-V inference startup landscape.
Photonic interconnects: IEEE Spectrum, “Lightmatter’s Optical Interposers Could Start Speeding Up AI in 2025.” Also: Lightmatter’s Open Compute Project reference architecture initiative for co-packaged optics standardization.
The DeepSeek market shock: Yahoo Finance / Reuters, “Nvidia Stock Plummets, Loses Record $589 Billion as DeepSeek Prompts Questions Over AI Spending” (Jan 27, 2025).
Appendix: When the boxes talk to each other
Everything above sticks to hardware that exists or is in active development. This section doesn’t. I’m going to speculate, and I want to be clear about that. The core question is simple: if cheap inference devices arrive, someone will connect more than one of them together. What happens then?
Eight boxes on a desk
You don’t need a rack of hardware to make this interesting. Eight inference devices, at the commodity prices discussed above, would cost somewhere around $1,500 to $2,000 total. That’s less than a decent laptop. They’d draw about 200 watts collectively, less than a gaming PC. And if each device carries 64 to 128 GB of memory, the cluster holds somewhere between half a terabyte and a full terabyte, enough to run eight large specialized models simultaneously.
Why would you want eight models instead of one? Because complex work has parts, and the parts need different skills.
A single AI model, no matter how good, can only do one thing at a time. Ask it to analyze a problem, then draft a solution, then critique that solution, then check the math, and you’re waiting through four sequential passes. But if you have eight devices, you can run a different model on each one: a structural analysis model, a cost estimation model, a materials model, a code generation model, and a critic that reviews the others’ work, all operating on different facets of the same problem at the same time. The technical term is “multi-agent orchestration.” A coordinator model assigns tasks, collects outputs, and synthesizes a final result. This pattern already exists in cloud-based AI workflows. What changes with cheap hardware is that you can do it locally, privately, and without per-query costs.
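A minimal sketch of that coordinator pattern, with plain functions standing in for the specialist models. In a real cluster each call would be a network request to a separate box; the function names and outputs here are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for specialist models running on separate devices.
def structural_analysis(task):
    return f"stress report for {task}"

def cost_estimate(task):
    return f"cost estimate for {task}"

def critic(outputs):
    # The reviewer model sees every specialist output at once.
    return "critique of: " + "; ".join(outputs)

SPECIALISTS = [structural_analysis, cost_estimate]

def orchestrate(task):
    # Fan the same task out to all specialists in parallel,
    # then hand the collected results to the critic.
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        outputs = list(pool.map(lambda model: model(task), SPECIALISTS))
    return critic(outputs)

print(orchestrate("bracket design A"))
# → critique of: stress report for bracket design A; cost estimate for bracket design A
```

The structure is the whole point: the specialists run concurrently, and the coordinator only synthesizes, which is why eight cheap boxes can substitute for four sequential passes through one model.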
At this scale, the connection between devices is a solved problem. These models communicate by passing text to each other: prompts in, responses out. That’s kilobytes, not the gigabytes-per-second that splitting a single model across chips would require. A standard network switch or direct copper links handle it easily. No exotic hardware needed.
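A quick sanity check on "kilobytes, not gigabytes." The message sizes and rates below are my assumptions for a busy eight-device cluster, not measurements:

```python
# Bandwidth needed for text-passing coordination between devices.
bytes_per_token = 4            # rough average for English text
tokens_per_message = 2_000     # a long prompt or a long response
messages_per_second = 50       # a busy eight-device cluster

needed_mb_s = bytes_per_token * tokens_per_message * messages_per_second / 1e6
gigabit_mb_s = 125             # usable throughput of ordinary gigabit Ethernet

print(f"{needed_mb_s:.1f} MB/s needed, {gigabit_mb_s} MB/s available")
# → 0.4 MB/s needed, 125 MB/s available
assert needed_mb_s < 0.01 * gigabit_mb_s   # under 1% of one cheap link
```

Even with generous assumptions, text passing uses well under one percent of a standard network link, which is why no exotic interconnect is needed at this scale.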
What a single person can do with this setup is probably underappreciated.
A lot of knowledge work requires teams not because one person can’t understand the whole problem, but because one person can’t perform all the analysis at the same time. A senior engineer understands stress analysis, cost estimation, and manufacturing constraints, but performing all three analyses takes days of sequential work, or hiring junior staff to parallelize it. Eight devices turn that into an overnight job for one person.
Maybe not a perfect job. Maybe the cost model needs a sanity check and two of the FEA runs need different boundary conditions. But the senior engineer spends Tuesday reviewing and correcting instead of spending Monday through Thursday doing. A solo consultant with domain expertise and a small cluster starts to compete with a small firm. A small firm starts to look like a much larger one.
The competitive advantage shifts. The question stops being “how many people can I hire” and starts being “does anyone here actually know what they’re talking about.” An experienced engineer with deep domain knowledge gets more from eight boxes than a ten-person team of junior analysts running standard procedures, because the bottleneck moves from compute and labor to expert judgment. Consulting firms that bill for bodies are the ones exposed. Staffing models built on the assumption that complex analysis requires large teams. When parallel analysis costs a one-time hardware purchase, the value of a warm body running a standard procedure drops with it.
I want to be clear that this isn’t the “robots took our jobs” story. The minimum viable team for many kinds of knowledge work just shrinks, and it shrinks in favor of people who actually know things. The force multiplier goes to the person with domain expertise, not the person with the biggest payroll.
The wall outlet as the ceiling
How far does this scale before you hit a physical constraint? Further than you might expect. A standard 15-amp household circuit in the United States delivers about 1,800 watts. Set aside 200 watts for overhead, a network switch and a cooling fan, and you have 1,600 watts available for inference hardware. At 25 watts per device, that’s 64 devices on a single wall outlet.
Sixty-four devices change the character of what you’re looking at. By the early 2030s, commodity high-bandwidth memory (something comparable to today’s HBM2, which by then would be two or three generations behind the cutting edge) gets cheap enough to include in each device. 128 GB per device, times 64, gives you a raw total of 8.2 terabytes. And the memory wall problem I mentioned earlier? HBM largely solves it, offering roughly five times the data throughput of standard DDR5, which means the processor spends less time starving for data and more time actually computing.
Not all 8.2 terabytes are usable for a single task — coordination overhead and working space consume some — but realistically 5.5 to 6.5 terabytes are available for model weights and active processing. I had to sit with that number for a minute. That is a lot of memory for something plugged into a wall outlet.
For reference, DeepSeek-V3 at 671 billion parameters fits in about 131 GB when aggressively quantized. You could hold a dozen large specialized models and a handful of smaller utility models all at once, with room to spare. In terms of throughput, each device can run a 400B+ dense model independently. Sixty-four of them could run 30 to 50 simultaneous large-model instances at conversational speed, or split truly enormous models, well above a trillion parameters, across the cluster for tasks that need a single very large model rather than many smaller ones.
At 64 devices, however, the connection problem gets harder. Eight devices passing text to each other is easy. Sixty-four devices coordinating on complex tasks, or splitting a single model’s computation across many chips, requires much more data to move much faster. Photonic interconnects, using light instead of electrical signals through copper, are the likely solution, dropping energy per bit by 5 to 10x and cutting latency enough that the user wouldn’t notice the computation was distributed. Companies like Ayar Labs and Lightmatter ship early versions for data centers today. Consumer-priced versions are probably a 2030+ development. They don’t exist yet, and that’s worth stating plainly.
The total hardware cost at this scale is comparable to a single high-end GPU server. The electricity cost is modest: 1,600 watts running continuously comes to about $1,680 per year. Less than $5 a day.
The 1,600W cluster at a glance: 64 devices with commodity high-bandwidth memory, hardware cost comparable to a single GPU server, 8.2 TB raw memory (~5.5-6.5 TB usable), 30-50 simultaneous 400B+ model instances (or fewer instances of trillion-parameter models split across devices), ~$1,680/year electricity. Fits on a single household circuit. Runs 24/7.
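The cluster arithmetic above is easy to re-derive. The $0.12/kWh utility rate and the ~70% usable-memory fraction are my assumptions, chosen to match the ranges in the text:

```python
# Re-deriving the 64-device cluster figures.
devices = 64
watts_each = 25
overhead_w = 200               # network switch and cooling fan

circuit_w = 120 * 15           # 15 A US household circuit, ~1,800 W
assert devices * watts_each + overhead_w <= circuit_w

raw_tb = devices * 128 / 1000  # 128 GB of high-bandwidth memory per device
usable_tb = raw_tb * 0.7       # after coordination overhead and working space

kwh_per_year = devices * watts_each / 1000 * 24 * 365
dollars_per_year = kwh_per_year * 0.12   # assumed $0.12/kWh

print(f"{raw_tb:.1f} TB raw, ~{usable_tb:.1f} TB usable, "
      f"${dollars_per_year:,.0f}/year in electricity")
# → 8.2 TB raw, ~5.7 TB usable, $1,682/year in electricity
```

Everything fits inside one household circuit with 200 watts to spare, and the annual electricity bill lands within rounding distance of the ~$1,680 figure quoted above.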
What this looks like in practice
Numbers are abstract. Here are two examples of what a 64-device cluster could do, chosen because they represent real workflows that real people struggle with today.
A small engineering firm. Say a ten-person mechanical engineering shop. They need a structural bracket. Happens all the time. Right now their workflow is: one engineer spends two days modeling candidates in CAD, runs FEA overnight on maybe five of them, eyeballs the manufacturing constraints over coffee, gets a cost estimate from the shop floor by Thursday. With a cluster running overnight, one model generates candidate designs. A second runs finite element analysis on each, checking stress concentrations and failure modes. A third evaluates each design against manufacturing constraints: can the shop’s CNC machines actually cut this geometry, or does it require five-axis milling they don’t have? A fourth estimates material and machining cost. By morning the engineers have fifty vetted designs ranked by strength-to-weight ratio and cost. I’m not saying the AI designs are all good. Probably half of them have some issue the models missed. But starting Monday morning with fifty options to sort through instead of five? That changes the week. And none of their proprietary geometry ever left the building, for the electricity cost of a light bulb.
An investigative newsroom. A newspaper runs models that continuously scan public court filings, corporate disclosures, property records, and government contract databases. One model reads and summarizes documents. Another identifies connections between entities (this company’s board member is also a consultant for that government agency). A third cross-checks flagged patterns against the newsroom’s existing reporting. When the system finds something that looks like a story, it writes a two-paragraph brief and sends it to an editor. This kind of persistent, broad monitoring of public records is currently possible only for large news organizations with dedicated data journalism teams. A cluster makes it accessible to a regional newspaper with five reporters.
What this is not
It would be irresponsible to present the clustering scenario without its limitations.
The software is immature. Making multiple AI models collaborate effectively on a single problem is at least as hard as making multiple humans collaborate effectively, and anyone who has managed a team knows how well that usually goes. Current multi-agent orchestration frameworks work for narrow, well-defined tasks. General-purpose orchestration that handles unexpected failures, contradictory outputs, and ambiguous task decomposition is an active research area, not a solved problem.
Not all tasks benefit from parallelism. If you ask a simple question and one model answers it correctly in two seconds, throwing seven more at the problem doesn’t help. The multi-device approach pays off for complex, multi-step work where different subtasks require different capabilities. For quick lookups and casual conversation, a single device is fine.
At the 64-device scale, photonic interconnects that don’t yet exist at consumer prices become a real dependency. Eight devices can get by with standard networking. Sixty-four devices doing tightly coordinated work probably cannot, at least not without noticeable performance compromises.
And the timeline matters. The eight-device scenario could plausibly arrive within a few years of commodity inference hardware shipping. The 64-device scenario with photonic interconnects is a 2031-2033 proposition at the earliest, dependent on multiple technologies maturing simultaneously. Any one of those dependencies could slip.
I think the direction is right even if the timing is uncertain. The physics and economics point toward clustering, the way they pointed toward personal computers in the 1970s and smartphones in the early 2000s. But pointing toward something and arriving there are different, and the gap between them is where predictions go to die.
