This demo requires kernel version 5.15 or later.
The full VIM3 C++ demo is complex and not very user-friendly, so we provide a lite version. This document explains how to use it.
YOLOv8n is an object detection model: it draws a bounding box around each object it detects in an image.
Inference results on VIM3.
Inference speed on VIM3: about 137 ms per frame with a USB camera, and about 115 ms per frame with a MIPI camera.
Download the official YOLOv8 code from ultralytics/ultralytics:
$ git clone https://github.com/ultralytics/ultralytics
Refer to the README.md to create and train a YOLOv8n model. The versions used here are ultralytics==8.0.86 and PyTorch==1.10.1.
We provide a Docker image that contains the environment required to convert the model.
Follow the official Docker documentation to install Docker: Install Docker Engine on Ubuntu.
Run the command below to pull the Docker image:
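If you have not trained a model yet, a minimal training sketch using the ultralytics Python API looks like the following; the dataset YAML, epoch count, and image size are placeholders to adjust for your own data.

# Minimal YOLOv8n training sketch (ultralytics 8.0.86).
# "coco128.yaml", epochs and imgsz are placeholders for your own configuration.
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")   # build YOLOv8n from scratch
# model = YOLO("yolov8n.pt")   # or start from the pretrained weights

model.train(data="coco128.yaml", epochs=100, imgsz=640)
# The best weights are saved to ./runs/detect/train/weights/best.pt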
$ docker pull numbqq/npu-vim3
$ git lfs install
$ git lfs clone https://github.com/khadas/aml_npu_sdk.git
$ cd aml_npu_sdk/acuity-toolkit/demo && ls
0_import_model.sh  1_quantize_model.sh  2_export_case_code.sh  data  dataset_npy.txt  dataset.txt  extractoutput.py  inference.sh  input.npy  model
After training the model, modify ultralytics/ultralytics/nn/modules/head.py as follows.
diff --git a/ultralytics/nn/modules/head.py b/ultralytics/nn/modules/head.py
index 0b02eb3..0a6e43a 100644
--- a/ultralytics/nn/modules/head.py
+++ b/ultralytics/nn/modules/head.py
@@ -42,6 +42,9 @@ class Detect(nn.Module):
 
     def forward(self, x):
         """Concatenates and returns predicted bounding boxes and class probabilities."""
+        if torch.onnx.is_in_onnx_export():
+            return self.forward_export(x)
+
         shape = x[0].shape  # BCHW
         for i in range(self.nl):
             x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
@@ -80,6 +83,15 @@ class Detect(nn.Module):
             a[-1].bias.data[:] = 1.0  # box
             b[-1].bias.data[:m.nc] = math.log(5 / m.nc / (640 / s) ** 2)  # cls (.01 objects, 80 classes, 640 img)
 
+    def forward_export(self, x):
+        results = []
+        for i in range(self.nl):
+            dfl = self.cv2[i](x[i]).contiguous()
+            cls = self.cv3[i](x[i]).contiguous()
+            results.append(torch.cat([cls, dfl], 1).permute(0, 2, 3, 1))
+        return tuple(results)
+
If you installed the ultralytics package via pip, make this modification in the installed package instead.
Create a Python file, for example export.py (the name used below), with the following content to export the ONNX model.
from ultralytics import YOLO
model = YOLO("./runs/detect/train/weights/best.pt")
results = model.export(format="onnx")
$ python export.py
Use Netron to check that your model's outputs look like this. If they do not, re-check your changes to head.py.
Enter aml_npu_sdk/acuity-toolkit/demo and copy yolov8n.onnx into demo/model. Then modify 0_import_model.sh, 1_quantize_model.sh, and 2_export_case_code.sh as follows.
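Besides inspecting the graph in Netron, you can sanity-check the exported outputs programmatically. With the head.py patch above, a 640x640 input to an 80-class model should produce three outputs of shape (1, 80, 80, 144), (1, 40, 40, 144) and (1, 20, 20, 144), where the 144 channels are 80 class scores followed by 64 DFL box channels. A small sketch using onnxruntime (not part of the SDK; install it with pip install onnxruntime if you want to run this check):

# Sanity-check the exported ONNX model's output shapes with onnxruntime.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("yolov8n.onnx")
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)                      # e.g. images [1, 3, 640, 640]

dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
for out in sess.run(None, {inp.name: dummy}):
    print(out.shape)
# Expected for an 80-class, 640x640 model:
# (1, 80, 80, 144), (1, 40, 40, 144), (1, 20, 20, 144)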
#!/bin/bash
 
NAME=yolov8n
ACUITY_PATH=../bin/
 
pegasus=${ACUITY_PATH}pegasus
if [ ! -e "$pegasus" ]; then
    pegasus=${ACUITY_PATH}pegasus.py
fi
 
#Onnx
$pegasus import onnx \
    --model  ./model/${NAME}.onnx \
    --output-model ${NAME}.json \
    --output-data ${NAME}.data 
 
#generate inputmeta  --source-file dataset.txt
$pegasus generate inputmeta \
	--model ${NAME}.json \
	--input-meta-output ${NAME}_inputmeta.yml \
	--channel-mean-value "0 0 0 0.0039215"  \
	--source-file dataset.txt
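In 0_import_model.sh, the option --channel-mean-value "0 0 0 0.0039215" means the toolkit applies (pixel - mean) * scale with a zero mean per channel and a scale of 0.0039215 ≈ 1/255, i.e. the input image is simply normalized to [0, 1]. A minimal Python sketch of the equivalent preprocessing (assuming a 640x640 RGB input; adjust if your inputmeta uses a different size or channel order):

# Equivalent of --channel-mean-value "0 0 0 0.0039215": (x - 0) * (1/255)
import cv2
import numpy as np

img = cv2.imread("test.jpg")                    # HWC, BGR, uint8
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)      # assumption: model expects RGB
img = cv2.resize(img, (640, 640))               # assumption: 640x640 input
img = img.astype(np.float32) * 0.0039215        # normalize to [0, 1]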
#!/bin/bash
 
NAME=yolov8n
ACUITY_PATH=../bin/
 
pegasus=${ACUITY_PATH}pegasus
if [ ! -e "$pegasus" ]; then
    pegasus=${ACUITY_PATH}pegasus.py
fi
 
#--quantizer asymmetric_affine --qtype  uint8
#--quantizer dynamic_fixed_point  --qtype int8 (or int16; note: S905D3 does not support int16 quantization)
#--quantizer perchannel_symmetric_affine --qtype int8 (or int16; note: only T3 (0xBE) supports per-channel quantization)
$pegasus  quantize \
	--quantizer dynamic_fixed_point \
	--qtype int8 \
	--rebuild \
	--with-input-meta  ${NAME}_inputmeta.yml \
	--model  ${NAME}.json \
	--model-data  ${NAME}.data
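1_quantize_model.sh quantizes the model to int8, calibrating on the images referenced through the inputmeta file (the --source-file dataset.txt configured in the previous step, which in the demo lists one image path per line). To calibrate with your own images, a small helper like the following can regenerate dataset.txt; the ./data folder name and extensions are only examples:

# Hypothetical helper: rebuild dataset.txt with one calibration-image path per line.
from pathlib import Path

image_dir = Path("./data")  # example folder containing calibration images
paths = sorted(p for p in image_dir.iterdir()
               if p.suffix.lower() in {".jpg", ".jpeg", ".png"})

with open("dataset.txt", "w") as f:
    for p in paths:
        f.write(f"{p}\n")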
#!/bin/bash
 
NAME=yolov8n
ACUITY_PATH=../bin/
 
pegasus=$ACUITY_PATH/pegasus
if [ ! -e "$pegasus" ]; then
    pegasus=$ACUITY_PATH/pegasus.py
fi
 
$pegasus export ovxlib \
    --model ${NAME}.json \
    --model-data ${NAME}.data \
    --model-quantize ${NAME}.quantize \
    --with-input-meta ${NAME}_inputmeta.yml \
    --dtype quantized \
    --optimize VIPNANOQI_PID0X88  \
    --viv-sdk ${ACUITY_PATH}vcmdtools \
    --pack-nbg-unify
 
rm -rf ${NAME}_nbg_unify
 
mv ../*_nbg_unify ${NAME}_nbg_unify
 
cd ${NAME}_nbg_unify
 
mv network_binary.nb ${NAME}.nb
 
cd ..
 
#save normal case demo export.data 
mkdir -p ${NAME}_normal_case_demo
mv  *.h *.c .project .cproject *.vcxproj BUILD *.linux *.export.data ${NAME}_normal_case_demo
 
# delete normal_case demo source
#rm  *.h *.c .project .cproject *.vcxproj  BUILD *.linux *.export.data
 
rm *.data *.quantize *.json *_inputmeta.yml
If you use a VIM3L, change --optimize to VIPNANOQI_PID0X99.
After modifying the scripts, return to the aml_npu_sdk directory and run convert-in-docker.sh.
If the conversion succeeds, the converted model and generated library code will be placed in demo/yolov8n_nbg_unify.
$ cd ../../
$ bash convert-in-docker.sh
$ cd acuity-toolkit/demo/yolov8n_nbg_unify
$ ls
BUILD  main.c  makefile.linux  nbg_meta.json  vnn_global.h  vnn_post_process.c  vnn_post_process.h  vnn_pre_process.c  vnn_pre_process.h  vnn_yolov8n.c  vnn_yolov8n.h  yolov8n.nb  yolov8n.vcxproj
Get the source code: khadas/vim3_npu_applications_lite
$ git clone https://github.com/khadas/vim3_npu_applications_lite
$ sudo apt update
$ sudo apt install libopencv-dev python3-opencv cmake
Put yolov8n.nb into vim3_npu_applications_lite/yolov8n_demo_x11_usb/nn_data.
Replace yolov8n_demo_x11_usb/vnn_yolov8n.c and yolov8n_demo_x11_usb/include/vnn_yolov8n.h with the vnn_yolov8n.c and vnn_yolov8n.h generated by the conversion.
# Compile
$ cd vim3_npu_applications_lite/yolov8n_demo_x11_usb
$ bash build_vx.sh
$ cd bin_r_cv4
# usb
$ ./yolov8n_demo_x11_usb -m ../nn_data/yolov8n_88.nb -t usb -d /dev/video0
# mipi
$ ./yolov8n_demo_x11_usb -m ../nn_data/yolov8n_88.nb -t mipi -d /dev/video50
Put yolov8n.nb into vim3_npu_applications_lite/yolov8n_demo_x11_usb_multithreading/nn_data.
Replace yolov8n_demo_x11_usb_multithreading/vnn_yolov8n.c and yolov8n_demo_x11_usb_multithreading/include/vnn_yolov8n.h with the vnn_yolov8n.c and vnn_yolov8n.h generated by the conversion.
# Compile
$ cd vim3_npu_applications_lite/yolov8n_demo_x11_usb_multithreading
$ bash build_vx.sh
$ cd bin_r_cv4
# usb
$ ./yolov8n_demo_x11_usb_multithreading -m ../nn_data/yolov8n_88.nb -t usb -d /dev/video0 -n 2
# mipi
$ ./yolov8n_demo_x11_usb_multithreading -m ../nn_data/yolov8n_88.nb -t mipi -d /dev/video50 -n 2
The last parameter, -n, sets the number of inference threads.