Model Deployment Flow

Prerequisites

  • Before running neutron-converter, the model must go through a quantization step.

    • The generic conversion between dialects and the quantization steps are illustrated here: Conversion & Quantization.

    • If a TFLite FP32 model is available, it can be passed through tflite-profiler and tflite-quantizer to obtain a quantized TFLite model:

      • Generate a profile file with the tflite-profiler tool, based on the model and a representative dataset.

      • Quantize the model with the tflite-quantizer tool, which applies profile-guided quantization.

  • Running neutron-converter requires the following minimum set of parameters:

    • Target SoC

    • Input model (the output of the quantization steps)

 

Tools description

tflite-profiler

This tool profiles a FLOAT (FP16/FP32) model by computing the dynamic range [min, max] of each FLOAT tensor and writes the results to a file called profile. Each line of the file describes one tensor: its index, its minimum and maximum values, and the frequency recorded for each bin:

<tensor_index_1>,<min_value_1>,<max_value_1>,<freq_tensor_1_bin_1>,<freq_tensor_1_bin_2>,..,<freq_tensor_1_bin_n>,
<tensor_index_2>,<min_value_2>,<max_value_2>,<freq_tensor_2_bin_1>,<freq_tensor_2_bin_2>,..,<freq_tensor_2_bin_n>,
...
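
As an illustration of the format above, here is a minimal Python sketch that reads a profile file into per-tensor records. It is not part of the toolchain. The names TensorProfile, parse_profile_line and parse_profile are hypothetical, and the field types (integer index, floating-point min, max and frequencies) are assumptions, since the format description does not state them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TensorProfile:
    """One line of the profile file (hypothetical container, not a tool API)."""
    tensor_index: int
    min_value: float
    max_value: float
    bin_frequencies: List[float]


def parse_profile_line(line: str) -> TensorProfile:
    # Split on commas and drop empty fields, such as the one
    # produced by the trailing comma at the end of each line.
    fields = [f.strip() for f in line.strip().split(",") if f.strip()]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    return TensorProfile(
        tensor_index=int(fields[0]),       # assumption: index is an integer
        min_value=float(fields[1]),
        max_value=float(fields[2]),
        # assumption: frequencies are numeric (counts or normalized values)
        bin_frequencies=[float(f) for f in fields[3:]],
    )


def parse_profile(text: str) -> List[TensorProfile]:
    # One record per non-empty line.
    return [parse_profile_line(l) for l in text.splitlines() if l.strip()]
```

Calling parse_profile on the contents of a profile file returns one record per tensor, which can be handy for checking that the representative dataset produced plausible ranges before quantizing.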

tflite-quantizer

This tool quantizes a FLOAT (FP16/FP32) TensorFlow Lite model using profile-guided quantization.

The quantizer uses dynamic range information from a profile file (generated by tflite-profiler) to determine optimal quantization parameters for each tensor.
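
To make the link between a dynamic range and quantization parameters concrete, the following Python sketch derives a scale and zero point from a [min, max] pair using the common asymmetric (affine) int8 scheme. This is an assumption for illustration only: the algorithm actually used by tflite-quantizer is not described here, and it may also use the per-bin frequencies (for example, to clip outliers). The function names are hypothetical.

```python
from typing import Tuple


def affine_quant_params(min_value: float, max_value: float,
                        qmin: int = -128, qmax: int = 127) -> Tuple[float, int]:
    """Return (scale, zero_point) for a generic affine int8 scheme (a sketch)."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min_value, 0.0)
    hi = max(max_value, 0.0)
    if hi == lo:
        return 1.0, 0  # degenerate case: the tensor is all zeros
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    # Keep the zero point inside the quantized range.
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    # real value -> quantized integer, saturating at the range limits
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))
```

A wider profiled range yields a larger scale, so each quantized step covers more of the real axis. This is why a representative dataset matters: a range that is too wide wastes resolution, while one that is too narrow saturates values.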


neutron-converter

The neutron-converter tool is a CLI (Command Line Interface) tool used to convert models in TFLite format for execution on the Neutron NPU.

This tool has the following traits:

  • Consumes a standard TFLite model containing standard TFLite operators.

  • Produces a custom TFLite model containing both standard and custom TFLite operators:

    • Standard operators that are supported by the NPU are grouped together and mapped to one or more NeutronGraph custom operators in the converted model, to be executed by the NPU.

    • Standard operators that are NOT supported by the NPU are left unmodified in the converted model, to be executed by the CPU.

  • Maps mathematical primitives from the TFLite graph (operators) to execution primitives from the Neutron Library (kernels).

    • Note that this mapping is NOT necessarily 1:1. It can be N:1 and, in some special cases, N:M.

  • The converted model is then consumed by the Neutron Runtime, which consists of three components:

    • TFLite Runtime - Runs on CPU with a registered mechanism to dispatch all the NeutronGraph custom operators to the NeutronDriver component.

    • NeutronDriver - Acts as an interface between the CPU and the NPU and communicates directly with the NeutronFirmware.

    • NeutronFirmware - Directly drives the execution of the NPU hardware.
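
The split between NPU and CPU operators described above can be pictured with a simplified Python sketch. It assumes a linear sequence of operator names and a hypothetical set of NPU-supported operators. The real converter works on the full TFLite graph and its grouping logic is not documented here, so this only illustrates the idea that runs of supported operators collapse into a single NeutronGraph custom operator while unsupported ones stay on the CPU.

```python
from typing import List, Set, Tuple


def partition_ops(ops: List[str],
                  npu_supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive NPU-supported operators; keep CPU operators separate.

    Each ("NPU", [...]) entry stands for one NeutronGraph custom operator.
    Each ("CPU", [op]) entry is a standard operator left unmodified.
    """
    partitions: List[Tuple[str, List[str]]] = []
    for op in ops:
        target = "NPU" if op in npu_supported else "CPU"
        if target == "NPU" and partitions and partitions[-1][0] == "NPU":
            # Extend the current NPU group (an N:1 operator-to-custom-op mapping).
            partitions[-1][1].append(op)
        else:
            partitions.append((target, [op]))
    return partitions


# Hypothetical example: the supported set below is an assumption, not a
# statement about what the Neutron NPU actually supports.
example = partition_ops(
    ["CONV_2D", "RELU", "SOME_UNSUPPORTED_OP", "ADD"],
    npu_supported={"CONV_2D", "RELU", "ADD"},
)
```

In this example, the first two operators would be fused into one NeutronGraph custom operator, the unsupported operator would run on the CPU through the TFLite Runtime, and the final operator would form a second NeutronGraph custom operator.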

More information can be found in Neutron NPU Software Tools.