TensorFlow to TF Lite

The eIQ AI Toolkit does not support conversion from TensorFlow to quantized TF Lite, because this step is typically handled directly with TensorFlow’s built-in conversion utilities. This guide demonstrates how to perform the conversion.

TensorFlow uses multiple model representations. We will be focusing on the following:

  • The SavedModel format

  • Keras models saved as a single file (HDF5 or the Keras v3 .keras format)

First, we need to install the required Python packages.

[ ]:
!pip install tensorflow==2.18.1
!pip install numpy
!pip install pillow
!pip install kagglehub
[ ]:
import zipfile
import tensorflow as tf
import numpy as np
import kagglehub
import os

from PIL import Image

Load a SavedModel model

The SavedModel format is a directory that contains a protobuf binary along with a TensorFlow checkpoint.

[ ]:
saved_model_dir = "saved_model/my_model"
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)

Load a Keras model

A Keras model can be saved using either the HDF5 standard or the new Keras v3 saving format. Typically, it is stored as a single file with one of the following extensions: .h5, .hdf5, or .keras.
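
If you first need to produce such a file, a minimal sketch is shown below (assuming model is an in-memory Keras model; the filenames are illustrative):

[ ]:
# Save in the Keras v3 native format
model.save("model.keras")
# Save in the legacy HDF5 format
model.save("model.h5")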

[ ]:
filename = "model.h5"
model = tf.keras.models.load_model(filename)
converter = tf.lite.TFLiteConverter.from_keras_model(model)

Load a model from Keras Applications

If you do not have a pretrained TensorFlow model, you can use Keras Applications to select from a collection of prebuilt ones.

[ ]:
model = tf.keras.applications.MobileNet(
    input_shape=None,
    alpha=1.0,
    depth_multiplier=1,
    dropout=0.001,
    include_top=True,
    weights='imagenet',
    input_tensor=None,
    pooling=None,
    classes=1000,
    classifier_activation='softmax'
)
converter = tf.lite.TFLiteConverter.from_keras_model(model)

Quantization in TensorFlow

Quantization is a technique that maps values onto a smaller set of discrete levels, typically reducing the model’s memory footprint and latency. The default quantization mode in the TensorFlow Lite Converter is Dynamic Range Quantization, which produces int8 weights and fp32 activations.
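
For illustration, a minimal sketch of Dynamic Range Quantization follows (the names dr_converter and dynamic_range_model are illustrative); only the optimization flag is set and no representative dataset is required:

[ ]:
# Dynamic Range Quantization: int8 weights, fp32 activations
dr_converter = tf.lite.TFLiteConverter.from_keras_model(model)
dr_converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_range_model = dr_converter.convert()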

However, ML accelerators such as the NPUs on i.MX8MP and i.MX93, or the eIQ Neutron NPU, support only int8/int16 operations. Therefore, this default scheme will not fully utilize the NPU.

It is recommended to either:

  • Quantize the entire model to int8, or

  • Use a hybrid scheme with int8 weights and int16 activations.

Calibration Dataset

Quantization requires a calibration dataset to determine the range of values, ideally using data that closely represents what will be used in production. More details can be found in the TensorFlow documentation.

In the example below, we will create a calibration dataset representing ImageNet using its subset tiny-imagenet-200, available at: https://www.kaggle.com/datasets/mariamalkuwaiti/tiny-imagenet-200-zip.

[ ]:
# If you are unable to download the dataset through the Python API, you may directly download it from the link above
filepath = kagglehub.dataset_download("mariamalkuwaiti/tiny-imagenet-200-zip")
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(os.path.abspath(""))
[ ]:
def representative_data_gen(data_dir=None, num_samples=100):
    # Collect up to num_samples calibration image paths from the dataset
    if data_dir is None:
        data_dir = os.path.join(os.path.abspath(""), "tiny-imagenet-200")
    image_paths = []
    for class_dir in os.listdir(data_dir):
        class_path = os.path.join(data_dir, class_dir, 'images')
        if not os.path.isdir(class_path):
            continue
        for img_file in os.listdir(class_path):
            if img_file.endswith('.JPEG'):
                image_paths.append(os.path.join(class_path, img_file))
            if len(image_paths) >= num_samples:
                break
        if len(image_paths) >= num_samples:
            break

    # Preprocess images for the MobileNet application
    for img_path in image_paths:
        img = Image.open(img_path).convert('RGB')
        img = img.resize((224, 224))
        img_array = np.array(img, dtype=np.float32)
        img_array = tf.keras.applications.mobilenet.preprocess_input(img_array)
        img_array = np.expand_dims(img_array, axis=0)
        yield [img_array]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = lambda: representative_data_gen()

INT8 Quantization

Converts both weights and activations to int8. This is the recommended approach for achieving the lowest latency and the best operator support.

[ ]:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

INT8W/INT16A Quantization

Converts weights to int8 and activations to int16. This mode offers slightly better accuracy compared to full int8 quantization; however, it results in higher latency, a larger memory footprint, and reduced operator support for int16 kernels.

[ ]:
converter.target_spec.supported_ops = [tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8]
converter.inference_input_type = tf.float32
converter.inference_output_type = tf.int16

The TFLiteConverter setup is now complete. You can proceed to convert the model and write it to a file:

[ ]:
filename = "output_model.tflite"
tflite_model = converter.convert()
with open(filename, "wb") as f:
    f.write(tflite_model)
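
Optionally, you can run a quick sanity check on the converted model with the TF Lite Interpreter on the development host; the random input below is only a placeholder, not real data:

[ ]:
# Load the converted model and run one inference with a random input
interpreter = tf.lite.Interpreter(model_path=filename)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Build a dummy input matching the model's expected shape and dtype
shape = input_details[0]['shape']
dtype = input_details[0]['dtype']
if np.issubdtype(dtype, np.integer):
    dummy_input = np.random.randint(-128, 128, size=shape).astype(dtype)
else:
    dummy_input = np.random.random_sample(shape).astype(dtype)

interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
print("Output shape:", output.shape)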

Depending on your target hardware platform, follow these guidelines to enable inference acceleration using an NPU:

  • For MCX-N, i.MX RT700, i.MX95, i.MX943, and similar devices, refer to the guide: Deploying a TF Lite Model to eIQ Neutron NPU.

  • For i.MX 93, refer to the guide: Deploying a TF Lite Model to i.MX 93.

On other platforms, you can run the model directly, and the NPU will be utilized when available.