On-Device Profiling

If benchmark results on real hardware fall short of expectations, the next step is to use the on-device profiling feature to identify bottlenecks on the actual target hardware.

On-Device Profiling Allows You To

  • Measure actual latency on the target device

  • Profile layer-by-layer execution timing

  • Detect platform-specific bottlenecks (NPU, memory bandwidth, etc.)

  • Compare performance across hardware options

  • Analyze DDR read/write bandwidth and GPU cycle utilization
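Latency measurements from repeated inference runs are usually summarized as mean, median, and tail percentiles. The helper below is a generic sketch of that post-processing, not an AI Hub API; it assumes you have collected a plain list of per-inference latencies in milliseconds.

```python
import math
import statistics

def summarize_latency(latencies_ms):
    """Summarize repeated latency measurements (mean, median, p95).

    Generic helper, not an AI Hub API: assumes `latencies_ms` is a list
    of per-inference latencies in milliseconds from a profiling run.
    """
    xs = sorted(latencies_ms)
    # Nearest-rank 95th percentile.
    p95 = xs[max(0, math.ceil(0.95 * len(xs)) - 1)]
    return {
        "mean_ms": statistics.mean(xs),
        "median_ms": statistics.median(xs),
        "p95_ms": p95,
    }
```

Comparing median against p95 is a quick way to spot jitter (e.g., from memory-bandwidth contention) that an average alone would hide.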

Typical Devices Available

The AI Hub boardfarm provides remote access to NXP hardware. Available devices include:

  • i.MX 8M Plus, i.MX 8M family (VSI NPU/GPU)

  • i.MX 93 (Ethos-U NPU)

  • i.MX 95, i.MX 943, i.MX 952 family (eIQ Neutron NPU)

  • MCX N94x, MCX N54x, i.MX RT700 (eIQ Neutron NPU)

Note

Device availability depends on the current boardfarm inventory. Only TensorFlow Lite (.tflite) models are supported for on-device profiling.
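Because only .tflite models are accepted, it can save a failed submission to sanity-check the file locally first. The sketch below relies on the fact that TensorFlow Lite FlatBuffer files carry the `TFL3` file identifier at byte offset 4; the function name is illustrative, not part of the AI Hub.

```python
from pathlib import Path

def is_tflite(path: str) -> bool:
    """Rough sanity check that a file looks like a TFLite model.

    TensorFlow Lite FlatBuffer files carry the 'TFL3' file identifier
    at byte offset 4. This does not validate the model itself.
    """
    data = Path(path).read_bytes()
    return len(data) >= 8 and data[4:8] == b"TFL3"
```

This only checks the container format; a model that passes can still fail to profile if it uses operators the selected backend does not support.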

Run an On-Device Profiling Session

The On-device profiling page lets you select your target hardware, model, and software stack, then run the profiling on a physical device in the AI Hub boardfarm.

On-Device Profiling Configuration

Steps:

  1. Switch to the AI Toolkit tab in the top navigation bar.

  2. In the left sidebar, under Model evaluation, click On-device profiling.

  3. On the On-device profiling page, review the information box:

    • Only TensorFlow Lite (.tflite) models are supported.

  4. Configure the profiling parameters:

    • Select device — choose the target hardware from the available devices in the boardfarm (e.g., i.MX 8M Plus Applications Processor).

    • Select backend — choose the inference backend (e.g., npu, cpu).

    • Select model — choose the model to profile from your uploaded models (e.g., mobilenet_v1_10_224_int8).

    • Select Yocto image — choose the Yocto BSP image running on the target device (e.g., 2026 Q1, 2025 Q4).

  5. Optionally, enter a Custom run name to label this profiling session.

  6. Click the Profile model button to submit the profiling job.

  7. A confirmation message appears: “Profiling is in progress”. You can:

    • Click Profiling history to monitor progress.

    • Click Profile another to start a new profiling session.

Review Profiling Results

After the profiling session completes, navigate to Profiling history in the left sidebar and click on the entry to view the detailed results.

On-Device Profiling Results

Session metadata includes:

  • Type — On device

  • Target — target device identifier (e.g., imx8mpevk)

  • Engine — inference engine used (e.g., NPU)

  • Tensor arena size — memory arena allocated for tensor operations

  • Model size — size of the model file

  • Total inference time — total inference time in milliseconds

Per-node profiling statistics table:

  • Node id — unique identifier for each operator node

  • Name — layer type (e.g., Convolution, BatchNorm, TensorCopy)

  • Order — execution order of the node

  • Op name — hardware-specific operation name (e.g., VXNNE_OP_*)

  • Inputs / Outputs — number of input and output tensors

  • Input shape / Output shape — tensor dimensions

  • Processing type — execution mode (e.g., Parallel Processing)

  • Execution time — execution time in milliseconds for that node

  • DDR Read / DDR Write — DDR memory read and write bandwidth usage

  • GPU Idle cycles / GPU Total cycles — GPU utilization metrics

Use these results to identify performance bottlenecks (e.g., high-latency layers, memory bandwidth saturation) and validate that the model meets your latency requirements for the target hardware.
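A common way to work with the per-node table is to rank nodes by execution time and derive GPU utilization from the cycle counters. The sketch below uses hypothetical exported rows (the field names and units are assumptions, not an AI Hub export format) to show the analysis.

```python
# Hypothetical per-node rows, as they might be copied out of the
# profiling results table (field names and units are assumptions).
nodes = [
    {"node_id": 0, "name": "Convolution", "exec_ms": 1.82,
     "gpu_idle": 12_000, "gpu_total": 90_000},
    {"node_id": 1, "name": "BatchNorm", "exec_ms": 0.07,
     "gpu_idle": 8_000, "gpu_total": 10_000},
    {"node_id": 2, "name": "TensorCopy", "exec_ms": 0.31,
     "gpu_idle": 25_000, "gpu_total": 30_000},
]

# Rank nodes by execution time to surface the hottest layers first.
hotspots = sorted(nodes, key=lambda n: n["exec_ms"], reverse=True)

for n in hotspots:
    # Fraction of GPU cycles spent doing useful work for this node.
    util = 1 - n["gpu_idle"] / n["gpu_total"]
    print(f"{n['name']:<12} {n['exec_ms']:>6.2f} ms  GPU util {util:.0%}")
```

A node with high execution time but low GPU utilization often points at a memory-bound layer, which the DDR Read / DDR Write columns can confirm.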

Note

Refer to the Profiling section for detailed information.

Next Steps