- Introduces a low-rank approach to compressing the KV cache, one of the key bottlenecks in long-context AI
- Speeds up attention computation by up to 6.9x and overall generation throughput by up to 3.1x, moving beyond memory savings to faster inference
- Selected as a Spotlight paper at ICML 2026, representing about 2.2% of reviewed submissions and about 8.4% of accepted papers
- Following the attention around Google’s TurboQuant at ICLR 2026, STAR-KV offers another approach to advancing KV cache compression
- Paper available on arXiv; source code released on GitHub
SEOUL, South Korea, July 2, 2026 /PRNewswire/ -- Dnotitia Inc. (Dnotitia), a company specializing in long-term memory AI and semiconductor-based AI infrastructure technologies, has released the paper and source code for “STAR-KV: Low-Rank KV Cache Compression through Soft Thresholding for Adaptive Rank Management.” The technology was developed through a joint research effort involving UC San Diego’s VVIP Lab and Dnotitia researchers, and the paper was selected as a Spotlight paper at ICML 2026 (International Conference on Machine Learning 2026), one of the world’s leading conferences in machine learning.

Dnotitia contributed STAR-KV, selected as an ICML 2026 Spotlight paper, achieving up to 20x KV cache compression and faster inference through low-rank compression and GPU optimization
In the experiments reported in the paper, low-rank compression alone reduced the KV cache by up to 75%. Combined with the mixed-precision quantization method proposed in the paper, STAR-KV compressed the full KV cache by up to 20x. The technology also improves computation speed through custom GPU kernels, increasing attention computation speed by up to 6.9x and overall generation throughput by up to 3.1x. STAR-KV also showed higher accuracy than leading existing KV cache compression methods.
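As a rough illustration of the general idea behind low-rank compression with soft thresholding, the sketch below shrinks the singular values of a synthetic key matrix and keeps only the components that survive. It is a generic textbook construction, not the algorithm from the STAR-KV paper: the function name `soft_threshold_low_rank`, the threshold `tau`, the matrix sizes, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def soft_threshold_low_rank(matrix, tau):
    """Factor `matrix` into A @ B after soft-thresholding its singular values.

    Hypothetical helper for illustration only; not the STAR-KV method.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Soft thresholding: shrink every singular value by tau and clip at zero.
    s_shrunk = np.maximum(s - tau, 0.0)
    # The rank is whatever survives the threshold, so it adapts to the data.
    rank = int(np.count_nonzero(s_shrunk))
    a = u[:, :rank] * s_shrunk[:rank]   # per-token coefficients (tokens x rank)
    b = vt[:rank, :]                    # shared basis (rank x head_dim)
    return a, b

# Synthetic "key" matrix: approximately rank 16 plus a little noise (assumed sizes).
rng = np.random.default_rng(0)
tokens, head_dim, true_rank = 512, 128, 16
keys = rng.standard_normal((tokens, true_rank)) @ rng.standard_normal((true_rank, head_dim))
keys += 0.01 * rng.standard_normal((tokens, head_dim))

a, b = soft_threshold_low_rank(keys, tau=5.0)
rank = a.shape[1]
ratio = keys.size / (a.size + b.size)   # values stored before vs. after
rel_err = np.linalg.norm(keys - a @ b) / np.linalg.norm(keys)

print(f"rank kept: {rank}, compression: {ratio:.1f}x, relative error: {rel_err:.3f}")
```

In this toy setting the threshold decides how many components are kept, so the rank follows the data instead of being fixed in advance. Quantizing the stored factors to fewer bits would reduce the footprint further, which is the kind of combination the release describes.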
KV cache compression has become a key technical challenge in AI infrastructure. As research into reducing the memory bottleneck of long-context AI gains momentum, including the attention around Google’s TurboQuant at ICLR 2026, STAR-KV offers a new approach that combines low-rank compression with quantization and GPU execution optimization.
The KV cache is temporary memory stored on the GPU so that a large language model (LLM) does not have to recompute context it has already processed. As AI evolves into agentic systems that use multiple documents, conversation history, code, search results, and outputs from external tools, the amount of context a model must process is growing rapidly. In this environment, the KV cache has emerged as a key bottleneck affecting both GPU memory usage and inference cost.
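The role of the KV cache can be illustrated with a minimal single-head attention sketch. Everything here is an assumption for illustration (random weights, a head dimension of 8, five tokens, and the helper name `attention`); it shows only that keys and values computed for earlier tokens can be stored and reused, and that the stored cache grows by one row per token.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention for one head (illustrative helper)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8                                        # assumed head dimension
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))         # stand-in for five token embeddings

# Decode token by token, appending each new key/value row to the cache
# instead of recomputing keys and values for the whole prefix.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
cached_outputs = []
for x in tokens:
    x = x[None, :]
    k_cache = np.vstack([k_cache, x @ w_k])
    v_cache = np.vstack([v_cache, x @ w_v])
    cached_outputs.append(attention(x @ w_q, k_cache, v_cache)[0])

# Recomputing keys and values from scratch gives the same result for the last token.
recomputed_last = attention(tokens[-1:] @ w_q, tokens @ w_k, tokens @ w_v)[0]
print(np.allclose(cached_outputs[-1], recomputed_last), k_cache.shape)
```

The cache saves computation at the price of memory that grows linearly with context length, which is why long contexts make it a bottleneck.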
According to the STAR-KV paper, when a LLaMA-3.1-8B model processes a 128K-token context at a batch size of 4, the KV cache accounts for about 81% of total GPU memory. As long-context AI becomes more widely used, KV cache compression is increasingly seen as a core AI infrastructure technology for processing long context at lower cost.
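A back-of-the-envelope calculation shows how a figure of this size can arise. The sketch below is not taken from the paper. It assumes the publicly documented LLaMA-3.1-8B configuration (32 layers, 8 key/value heads, head dimension 128, roughly 8.03 billion parameters), 16-bit storage, 128K meaning 131,072 tokens, and "total GPU memory" meaning only weights plus KV cache, with activations and framework overhead ignored.

```python
# Assumed model configuration and precision (not stated in the release).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2          # 16-bit keys, values, and weights
NUM_PARAMS = 8.03e9          # approximate parameter count

def kv_cache_bytes(context_tokens, batch_size):
    """Bytes needed to store keys and values for every layer and token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens * batch_size

kv = kv_cache_bytes(context_tokens=128 * 1024, batch_size=4)
weights = NUM_PARAMS * BYTES_PER_VALUE
share = kv / (kv + weights)

print(f"KV cache: {kv / 2**30:.0f} GiB, share of (KV + weights): {share:.0%}")
# prints: KV cache: 64 GiB, share of (KV + weights): 81%
```

Under these assumptions, a 20x reduction would bring the 64 GiB cache down to about 3.2 GiB.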
ICML, where the STAR-KV paper was accepted, is widely regarded as one of the top international conferences in AI and machine learning, alongside NeurIPS and ICLR. ICML 2026 will be held from July 6 to 11 at COEX in Seoul. This year, 23,918 papers entered review, 6,352 were accepted, and 536 were selected as Spotlight papers. Spotlight papers account for about 2.2% of all reviewed submissions and about 8.4% of accepted papers.
Going forward, Dnotitia plans to further advance STAR-KV for use in real-world AI service environments and explore its application to open-source LLM inference frameworks such as vLLM.
“Technologies that help AI process longer context faster and at lower cost are advancing rapidly,” said MK Chung, CEO of Dnotitia. “STAR-KV addresses the core bottlenecks in KV cache capacity and attention processing speed, and Dnotitia aims to contribute to the AI inference ecosystem through open sourcing.”