Model Deployment Flow
Prerequisites
Before running the Neutron Converter, the model must go through a quantization step.
The generic conversion between dialects and the quantization steps are illustrated in **Conversion & Quantization**.
If a TFLite FP32 model is available, it can be passed through *tflite-profiler* and *tflite-quantizer* to obtain a quantized TFLite model:
1. Generate a profile file with the **tflite-profiler** tool, based on the model and a representative dataset.
2. Quantize the model with the *tflite-quantizer* tool, using profile-guided quantization.
Running neutron-converter requires the following minimum set of parameters:
- Target SoC
- Input model (the result of the quantization steps)
Tools description
tflite-profiler
This tool profiles a FLOAT (FP16/FP32) model by computing the dynamic range [min, max] for each FLOAT tensor and writing the information to a file called profile with the following format:
<tensor_index_1>,<min_value_1>,<max_value_1>,<freq_tensor_1_bin_1>,<freq_tensor_1_bin_2>,..,<freq_tensor_1_bin_n>,
<tensor_index_2>,<min_value_2>,<max_value_2>,<freq_tensor_2_bin_1>,<freq_tensor_2_bin_2>,..,<freq_tensor_2_bin_n>,
...
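To make the record layout concrete, here is a minimal Python sketch of a reader for such a profile file. It is not part of the tool set. It assumes one record per line, an integer tensor index, floating-point min/max values, and numeric bin frequencies; the names `TensorProfile` and `parse_profile` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TensorProfile:
    tensor_index: int
    min_value: float
    max_value: float
    histogram: List[float]  # per-bin frequencies, in file order


def parse_profile(text: str) -> List[TensorProfile]:
    """Parse profile text laid out as index,min,max,bin_1,...,bin_n, per line.

    Assumption: fields are plain decimal numbers and records are newline-separated.
    """
    profiles = []
    for line in text.splitlines():
        # Each record ends with a trailing comma, so drop empty fields.
        fields = [f.strip() for f in line.split(",") if f.strip()]
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        profiles.append(
            TensorProfile(
                tensor_index=int(fields[0]),
                min_value=float(fields[1]),
                max_value=float(fields[2]),
                histogram=[float(f) for f in fields[3:]],
            )
        )
    return profiles


# Made-up sample data, only to exercise the parser.
sample = "0,-1.5,2.0,3,10,7,\n1,0.0,6.0,12,4,4,\n"
for p in parse_profile(sample):
    print(p.tensor_index, p.min_value, p.max_value, p.histogram)
```

The sample values are invented; a real profile would come from running tflite-profiler on a model and a representative dataset.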
tflite-quantizer
This tool quantizes a FLOAT (FP16/FP32) TensorFlow Lite model using profile-guided quantization.
The quantizer uses dynamic range information from a profile file (generated by tflite-profiler) to determine optimal quantization parameters for each tensor.
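As an illustration of how a dynamic range can translate into quantization parameters, the sketch below derives a scale and zero point for int8 using the common affine (asymmetric) scheme. This is an assumption for illustration only: the actual algorithm inside tflite-quantizer is not described here and may differ, for example by using the histogram bins to clip outliers. The function names are hypothetical.

```python
def affine_quant_params(min_value, max_value, qmin=-128, qmax=127):
    """Return (scale, zero_point) mapping [min_value, max_value] onto [qmin, qmax].

    Assumption: standard affine int8 quantization, real = scale * (q - zero_point).
    """
    # Extend the range to include 0.0 so that zero is exactly representable.
    min_value = min(min_value, 0.0)
    max_value = max(max_value, 0.0)
    if max_value == min_value:
        return 1.0, 0  # degenerate all-zero range
    scale = (max_value - min_value) / (qmax - qmin)
    zero_point = int(round(qmin - min_value / scale))
    # Keep the zero point inside the representable integer range.
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to its clamped integer representation."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer representation back to a real value."""
    return scale * (q - zero_point)


# Example with a made-up profiled range of [-64.0, 63.5].
scale, zero_point = affine_quant_params(-64.0, 63.5)
print(scale, zero_point)
print(quantize(10.0, scale, zero_point))
```

A wider profiled range yields a larger scale, meaning coarser resolution per integer step, which is why accurate range information from a representative dataset matters.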
neutron-converter
The neutron-converter tool is a CLI (Command Line Interface) tool used to convert models in TFLite format for execution on the Neutron NPU.
This tool has the following traits:
- Consumes a standard TFLite model containing standard TFLite operators.
- Produces a custom TFLite model containing both standard and custom TFLite operators:
  - Standard operators that are supported by the NPU are extracted together and mapped to one or more NeutronGraph custom operators in the converted model, to be executed by the NPU.
  - Standard operators that are NOT supported by the NPU are left unmodified in the converted model, to be executed by the CPU.
- Maps mathematical primitives from the TFLite graph (operators) to execution primitives from the Neutron Library (kernels).
  Note that this mapping is NOT necessarily 1:1; it can be N:1 and, in some special cases, N:M.
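The partitioning idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the converter's actual algorithm: it treats the model as a linear sequence of operators (a real model is a graph), and both the `partition_ops` helper and the set of NPU-supported operators are hypothetical.

```python
from typing import List, Set, Tuple


def partition_ops(ops: List[str], npu_supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive NPU-supported operators into NeutronGraph segments.

    Unsupported operators are kept as individual CPU entries, unmodified.
    """
    segments: List[Tuple[str, List[str]]] = []
    for op in ops:
        target = "NeutronGraph" if op in npu_supported else "CPU"
        if target == "NeutronGraph" and segments and segments[-1][0] == "NeutronGraph":
            # Extend the current NPU segment with another supported operator.
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))
    return segments


# Hypothetical support list, for illustration only; it does not reflect
# which operators the Neutron NPU actually supports.
supported = {"CONV_2D", "RELU", "FULLY_CONNECTED", "SOFTMAX"}
model_ops = ["CONV_2D", "RELU", "TOPK_V2", "FULLY_CONNECTED", "SOFTMAX"]

for target, group in partition_ops(model_ops, supported):
    print(target, group)
```

In this made-up example, the unsupported operator in the middle splits the model into two NeutronGraph segments with one CPU operator between them, which shows how several standard operators can collapse into a single custom operator.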
The converted model is then consumed by the Neutron Runtime, which consists of 3 components:
- TFLite Runtime - Runs on the CPU with a registered mechanism to dispatch all the NeutronGraph custom operators to the NeutronDriver component.
- NeutronDriver - Acts as an interface between the CPU and the NPU and communicates directly with the **NeutronFirmware**.
- NeutronFirmware - Directly drives the execution of the NPU hardware.
More information can be found in Neutron NPU Software Tools.