LiteRT: The Universal Framework for On-Device AI

LiteRT, the evolution of TFLite, is now the universal framework for on-device AI. It delivers up to 1.4x faster GPU, new NPU support, and streamlined GenAI deployment for models like Gemma.


[Google for Developers]

JAN. 28, 2026

Lu Wang, Software Engineer
Chintan Parikh, Product Manager
Jingjiang Li, Software Engineer
Terry Heo, Software Engineer


Since we first introduced LiteRT in 2024, we have focused on evolving our ML tech stack from its TensorFlow Lite (TFLite) foundation into a modern on-device AI framework. While TFLite set the standard for classical ML, our mission is to empower developers to deploy today’s cutting-edge AI on-device just as seamlessly as they integrated classical ML in the past.

At Google I/O ‘25, we shared a preview of this evolution: a high-performance runtime designed specifically for advanced hardware acceleration. Today, we are excited to announce that these advanced acceleration capabilities have fully graduated into the LiteRT production stack, available now for all developers.

This milestone solidifies LiteRT as the universal on-device inference framework for the AI era, representing a significant leap over TFLite because it is:

  • Faster: delivers 1.4x faster GPU performance than TFLite, and introduces new, state-of-the-art NPU acceleration.
  • Simpler: provides a unified, streamlined workflow for GPU and NPU acceleration across edge platforms.
  • Powerful: supports superior cross-platform GenAI deployment for popular open models like Gemma.
  • Flexible: offers first-class PyTorch/JAX support via seamless model conversion.

All of this is delivered while maintaining the same reliable, cross-platform deployment you have trusted since TFLite.

Here is how LiteRT empowers you to build the next generation of on-device AI.

High-performance cross-platform GPU acceleration

Moving beyond the initial GPU acceleration on Android announced at I/O ‘25, we are excited to introduce full, comprehensive GPU support across Android, iOS, macOS, Windows, Linux, and Web. This expansion provides developers with a reliable, high-performance acceleration option that scales significantly beyond classical CPU inference.

LiteRT maximizes reach by introducing robust support for OpenCL, OpenGL, Metal, and WebGPU via ML Drift, our next-generation GPU engine, allowing you to deploy models efficiently across mobile, desktop, and web. On Android, LiteRT optimizes this further by automatically prioritizing OpenCL when available for peak performance, while falling back to OpenGL for broader device coverage.

Powered by ML Drift, LiteRT GPU has achieved a significant leap in efficiency, delivering substantial performance gains that average 1.4x faster than the legacy TFLite GPU delegate and significantly reduce latency across a broad range of models. See more benchmark results in our previous announcement.

To enable high-performance AI applications, we have also introduced key technical advancements to optimize end-to-end latency, specifically asynchronous execution and zero-copy buffer interoperability. These features significantly reduce unnecessary CPU overhead and boost overall performance, fulfilling the stringent requirements for real-time use cases like background segmentation and speech recognition (ASR). In practice, these optimizations can result in up to 2x faster performance, as demonstrated in our Segmentation sample app. For a closer look at the improvements, see our technical deep dive.

The following examples demonstrate how easily you can leverage GPU acceleration with the new CompiledModel API in C++:

// 1. Create a compiled model targeting the GPU accelerator.
auto compiled_model = CompiledModel::Create(env, "mymodel.tflite",
                                            kLiteRtHwAcceleratorGpu);

// 2. Create an input TensorBuffer that wraps an OpenGL buffer (e.g. from
// image pre-processing) with zero-copy.
auto input_buffer = TensorBuffer::CreateFromGlBuffer(env, tensor_type,
                                                     opengl_buffer);
std::vector<TensorBuffer> input_buffers{input_buffer};
auto output_buffers = compiled_model.CreateOutputBuffers();

// 3. Execute the model.
compiled_model.Run(input_buffers, output_buffers);

// 4. Access the model output, e.g. as an AHardwareBuffer.
auto ahwb = output_buffers[0].GetAhwb();

C++

See the LiteRT DevSite for more instructions on cross-platform development and GPU acceleration.

Streamlined NPU integration with peak performance

While CPU and GPU offer broad versatility for AI tasks, the NPU is the key to unlocking the smooth, responsive, and high-speed AI experience that modern applications demand. However, fragmentation across hundreds of NPU SoC variants often forces developers to navigate a maze of disparate compilers and runtimes. Furthermore, because traditional ML infrastructure has historically lacked deep integration with specialized NPU SDKs, the result has been complex, ad-hoc deployment workflows that are difficult to manage in production.

LiteRT addresses these challenges by providing a unified, simplified NPU deployment workflow that abstracts away low-level, vendor-specific SDKs and handles fragmentation across numerous SoC variants. We have streamlined this into a simple, three-step process to get your models running with NPU acceleration easily:

  • AOT Compilation for the target SoCs (optional): Use the LiteRT Python library to pre-compile your .tflite model for target SoCs.
  • Deploy with Google Play for On-device AI (PODAI) if on Android: Leverage PODAI to automatically deliver the model and runtime to a compatible device.
  • Inference using LiteRT Runtime: LiteRT handles NPU delegation and provides robust fallback to GPU or CPU if needed.

For a full, detailed guide, including a Colab and sample apps, visit our LiteRT NPU documentation.

To provide flexible integration options that fit your specific deployment needs, LiteRT offers both ahead-of-time (AOT) and on-device (JIT) compilation. This allows you to choose the best strategy based on your application’s unique requirements:

  • AOT compilation: Optimal for complex models with known target SoCs. It minimizes initialization and memory footprint at launch for an "instant-start" experience.
  • On-device compilation: Best for distributing small models across various platforms. It requires no preparation, though first-run initialization costs are higher.

We are collaborating closely with silicon leaders across the industry to bring high-performance NPU acceleration to developers. Our first production-ready integrations with MediaTek and Qualcomm are available now. Read our technical deep-dives to see how we achieved best-in-class NPU performance, reaching speeds up to 100x faster than CPU and 10x faster than GPU:

[...]
