CN111325084A - Dish information identification method and terminal based on YOLO neural network

Info

Publication number: CN111325084A
Application number: CN201910806784.XA
Authority: CN
Inventors: 于文涛; 郝继伟
Original assignee: Xi'an Iridium Shiyun Catering Management Co ltd
Current assignee: Shenzhen Xiaoniu Zhixun Technology Co ltd
Priority date: 2019-08-29
Filing date: 2019-08-29
Publication date: 2020-06-23

Abstract

The invention belongs to the technical field of data processing and discloses a dish information identification method and terminal based on a YOLO neural network. The method comprises: making an xml file in VOC format; converting the VOC-format xml file into the txt file required by the YOLO neural network; setting the training parameters of the YOLO neural network; preprocessing the images in the training set and sending the resized images into the YOLO neural network for training; observing the loss curve during training and judging whether the YOLO neural network has converged; and packaging the code into pyd and lib files with the pybind11 library and calling the YOLO neural network from Python to identify dishes. Compared with existing dish identification methods, the method provided by the invention has higher accuracy and identification speed; dish identification on video at 1080P resolution can exceed 30 FPS.

Description

Dish information identification method and terminal based on YOLO neural network

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a dish information identification method and terminal based on a YOLO neural network.

Background

The closest prior art at present is as follows. A dish refers to any of a wide variety of prepared foods, such as stir-fried greens with mushrooms, shredded pork with green peppers, and braised pork with potatoes. Dish identification means identifying the name of a dish. In restaurants and canteens, the staff responsible for taking payment need to settle the bill according to the dishes a customer has ordered. Traditional dish identification relies entirely on the human eye. However, because dishes come in many varieties, and the color, aroma, taste and shape of a dish differ each time it is cooked, manual dish identification is slow and has a low accuracy rate. YOLO is an excellent target detection network proposed in recent years; it predicts the category and the bounding box of a target at the same time, converting the target detection problem into a regression problem. YOLO balances speed and performance, achieving very high accuracy and recall while maintaining a very high detection speed.

The first prior art discloses a dish identification method and system that treats dish identification as a classification problem, so it can identify only one dish at a time and cannot simultaneously identify multiple dishes placed together. In the actual situation of a restaurant or canteen, however, the staff responsible for taking payment need to calculate the meal cost for one dish or for multiple dishes ordered by a customer. Dish identification is therefore a target detection problem, namely detecting one dish or several dishes placed together. Because the first prior art uses a classification method, it cannot work normally when several dishes are placed together, and the result it produces is meaningless.

The second prior art discloses a dish identification method comprising the following steps: 1) acquiring a web request, wherein the server responds to the web request and acquires the corresponding image; 2) saving the image: acquiring the input data stream, generating an image file name and saving the file to disk; 3) image preprocessing: resizing and normalizing the input image; 4) processing with a pre-trained convolutional neural network to detect and classify objects in the image, ending if no dish is detected, and outputting the corresponding dish information in combination with the classification result if a dish is detected. This method adopts a shallow convolutional neural network with few layers, so its performance is poor; the image features it extracts are limited, so the final dish identification accuracy is low.

In summary, the problems of the prior art are as follows: in the dish identification method in the prior art, a plurality of dishes put together cannot be identified at the same time; the final dish identification accuracy is low.

The difficulty of solving the technical problems is as follows:

in the prior art, the technical problems cannot be fundamentally solved only by simple modification, and the difficulty in solving the technical problems is very high, so that only a new method can be innovated. The present invention is a great innovation of the prior art and can solve the above-mentioned technical problems.

The significance of solving the technical problems is as follows:

Solving the first technical problem of the prior art means that multiple dishes placed together can be identified at the same time. Solving the second technical problem of the prior art means that the accuracy of dish identification can be improved.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a dish information identification method and terminal based on a YOLO neural network.

The invention is realized in such a way that a dish information identification method based on a YOLO neural network comprises the following steps:

firstly, making an xml file in a VOC format, labeling the real category and the boundary box of dishes in an image, and automatically generating the xml file required by a YOLO neural network;

secondly, converting the xml file in the VOC format into a txt file required by a YOLO neural network;

thirdly, setting training parameters of a YOLO neural network, wherein the learning rate is 0.01, the batch size is 64, the dropout is 0.25, and the number of iterations is 100,000;

fourthly, preprocessing the images in the training set, and sending the images with the adjusted sizes into a YOLO neural network for training;

fifthly, observing a loss curve in the training process, and judging whether the YOLO neural network is converged; if the convergence occurs, stopping training; if not, continuing training;

and sixthly, packaging the code into pyd and lib files by a pybind11 library, and calling a YOLO neural network by using a python language to realize the identification of the dishes.

Further, the dish information identification method based on the YOLO neural network performs feature extraction of dish images through a Darknet-53 network.

Further, the dish information identification method based on the YOLO neural network divides the extracted feature maps with three different sizes into grids with different sizes, and carries out boundary frame prediction and category judgment on dishes with different sizes.

Further, in the dish information identification method based on the YOLO neural network, the size of the image input into the YOLO neural network is uniformly defined as 416 × 416, and feature maps of three different sizes, namely 13 × 13, 26 × 26 and 52 × 52, are obtained through a series of convolution, up-sampling, residual unit and tensor concatenation operations.

Further, in the dish information identification method based on the YOLO neural network, the three feature maps of different sizes select corresponding prediction box sizes according to the extent of their receptive fields, each selecting bounding boxes of 3 sizes, wherein:

for the output feature map of size 13 × 13, the corresponding preset template boxes, mapped onto the 416 × 416 input image, give prediction box sizes of 116 × 90, 156 × 198 and 373 × 326 respectively;

for the output feature map of size 26 × 26, the corresponding prediction box sizes are 30 × 61, 62 × 45 and 59 × 119 respectively;

for the output feature map of size 52 × 52, the corresponding prediction box sizes are 10 × 13, 16 × 30 and 33 × 23 respectively.

Further, the dish information identification method based on the YOLO neural network further includes:

step one, image preprocessing: setting the size of the input image of the YOLO neural network to 416 × 416 and dividing the image into a number of grid cells matching the size of the output feature map; for the 13 × 13 scale, the original input image is divided into 13 × 13 grid cells, and the output is a 3-dimensional tensor of 13 × 13 × 47, so that each grid cell corresponds to 47 values;

step two, outputting 3 prediction boxes of different sizes for the grid cell in which the center point of a dish is located, wherein, of the 47 output values, the first part corresponds to the number of identified dish categories, which is 32; the number of corresponding prediction boxes is 3; and the last 12 parameters are $b_x$, $b_y$, $b_w$ and $b_h$ for each of the 3 bounding boxes;

step three, performing target bounding box prediction and category judgment according to whether the center point of the dish falls in the grid cell, and outputting the dish identification result.

Another object of the present invention is to provide a dish information identification system based on a YOLO neural network, which runs the dish information identification method based on the YOLO neural network and comprises:

the image feature extraction module, used for extracting the features of dish images through a Darknet-53 network;

the feature map dividing module, used for dividing the three extracted feature maps of different sizes into grids of different sizes;

and the judging module, used for carrying out bounding box prediction and category judgment on dishes of different sizes.

Another object of the present invention is to provide a computer program for implementing the dish information identification method based on the YOLO neural network.

Another object of the present invention is to provide an information data processing terminal for implementing the dish information identification method based on the YOLO neural network.

Another object of the present invention is to provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to execute the dish information identification method based on the YOLO neural network.

In summary, the advantages and positive effects of the invention are: the invention provides a dish identification method based on a YOLO neural network, wherein YOLO is an excellent target detection network, and dish identification by using the network can achieve very high accuracy and very high identification speed.

The prior art adopts a convolutional neural network with few layers and therefore performs poorly, whereas the invention adopts the YOLO neural network. The YOLO neural network is used in the dish identification field for the first time, which solves the problem that dish identification could not achieve both identification accuracy and identification speed, and allows multiple dishes to be identified at once. Compared with existing dish identification methods, the method provided by the invention has higher accuracy and identification speed: the accuracy is improved by 30% and the identification speed is improved by 15%. Deep learning learns automatically during network training and extracts image features automatically, so it is not limited to traditional hand-crafted features. Dish identification is accelerated with a GPU graphics card, and the dish identification speed on video at 1080P resolution can exceed 30 FPS.

Drawings

Fig. 1 is a flowchart of a dish information identification method based on a YOLO neural network according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a YOLO neural network provided in an embodiment of the present invention.

Fig. 3 is a schematic diagram of a calculation process of bounding box prediction according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of a dish information identification method based on the YOLO neural network according to an embodiment of the present invention.

Fig. 5 is a schematic diagram illustrating changes in parameters of the YOLO neural network structure and each layer according to an embodiment of the present invention.

Fig. 6 is a schematic view of a loss curve in the process of identifying and training the YOLO neural network dishes according to the embodiment of the present invention.

Fig. 7 is a schematic view of an IOU curve in the process of identifying and training the YOLO neural network dishes according to the embodiment of the present invention.

Fig. 8 is an original image input by the dish identification method according to the embodiment of the present invention.

Fig. 9 is a schematic diagram of an identification result output by the dish identification method according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides a dish information identification method and a dish information identification terminal based on a YOLO neural network, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the dish information identification method based on the YOLO neural network according to the embodiment of the present invention includes the following steps:

S101: making an xml file in VOC format, marking the real category and the bounding box of the dishes in the image, and automatically generating the xml file required by the YOLO neural network;

S102: converting the xml file in VOC format into the txt file required by the YOLO neural network;

S103: setting the training parameters of the YOLO neural network, wherein the learning rate is 0.01, the batch size is 64, the dropout is 0.25, and the number of iterations is 100,000;

S104: preprocessing the images in the training set, and sending the resized images into the YOLO neural network for training;

S105: observing the loss curve during training, and judging whether the YOLO neural network has converged; if it has converged, stopping training; if not, continuing training;

S106: packaging the code into pyd and lib files with the pybind11 library, and calling the YOLO neural network from Python to identify dishes.

In a preferred embodiment of the invention, the xml file in VOC format is made with the software tool LabelImg, which makes it convenient to label and box-select the dish data set, marking the real category and bounding box of each dish in the image and automatically generating the xml file required by the YOLO neural network.
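The conversion in step S102 can be sketched as follows. This is a minimal illustration, not the patent's own code: it assumes the txt file uses the common Darknet-style line `class_id x_center y_center width height`, with all four values normalised by the image size, and the class list `CLASS_NAMES` and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical class list; a real list would hold all 32 dish names.
CLASS_NAMES = ["boiled_fish", "shredded_pork", "rice"]


def voc_to_yolo_lines(xml_text, class_names=CLASS_NAMES):
    """Convert one VOC annotation (given as XML text) into YOLO txt lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)

    lines = []
    for obj in root.iter("object"):
        class_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # Assumed txt layout: box centre and size, normalised by the image size.
        x_c = (xmin + xmax) / 2.0 / img_w
        y_c = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines


# Hypothetical annotation for one dish in a 1920 x 1080 frame.
SAMPLE_XML = """<annotation>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>rice</name>
    <bndbox><xmin>480</xmin><ymin>270</ymin><xmax>1440</xmax><ymax>810</ymax></bndbox>
  </object>
</annotation>"""

print(voc_to_yolo_lines(SAMPLE_XML))
```

Normalising by the image size keeps the labels valid when the image is later resized to the network input size.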

In a preferred embodiment of the invention, the hardware environment of the training process is as follows: the CPU is a 20-core Intel Xeon(R) E5-2640 v4 with a main frequency of 2.4 GHz, and the memory is 64 GB. Training is accelerated with a GPU, an NVIDIA GeForce GTX 1080 Ti/PCIe/SSE2 with a video memory size of 20 GB. The software environment of the training process is as follows: the operating system is Ubuntu 16.04 LTS, the OpenCV version is 3.3.0, and the TensorFlow version is 1.2.1.
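The training parameters of step S103 and the convergence judgment of step S105 might be expressed as below. The parameter values come from the text; the convergence rule is only an assumed heuristic (the text says the loss curve is observed, without giving a criterion), and `has_converged`, `window` and `tolerance` are illustrative names and values.

```python
# Training parameters stated in step S103.
TRAIN_PARAMS = {
    "learning_rate": 0.01,
    "batch_size": 64,
    "dropout": 0.25,
    "max_iterations": 100_000,
}


def has_converged(losses, window=100, tolerance=0.01):
    """Assumed heuristic: the loss has plateaued when the mean of the most
    recent window improves on the previous window by at most `tolerance`
    (relative). Both default values are illustrative, not from the patent.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return (previous - recent) <= tolerance * abs(previous)


# A flat loss curve counts as converged; a still-falling one does not.
print(has_converged([0.5] * 200))
print(has_converged([1.0 / (i + 1) for i in range(200)]))
```

In a training loop, the check would be evaluated periodically and training stopped once it returns True or `max_iterations` is reached.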

The technical solution of the present invention is further described below with reference to the accompanying drawings.

The invention provides a method for identifying dishes, which adopts a YOLO neural network to detect the names of one or more dishes; the dish identification process comprises the following steps:

In the Residual modules in fig. 2, ×1 indicates that the number of residual units is 1, ×2 indicates that the number of residual units is 2, and so on; the total number of residual units in the YOLO neural network is 1 + 2 + 8 + 8 + 4 = 23. Each convolutional layer is followed by a batch normalization (BN) operation and the Leaky ReLU nonlinear activation function. Upsampling is mainly used to fuse shallow features with deep features so as to detect dishes better. To cope with the size differences among dishes in the images, the output part in fig. 2 has 3 feature maps of different sizes (13 × 13, 26 × 26, 52 × 52). For each of the 3 feature map sizes, template boxes with different aspect ratios and different areas are selected according to the size of the feature map; mapping back from each feature map gives the corresponding region in the original dish image, and the final regression on that region yields the predicted target region (the optimal regression region) used for detecting dishes in the image.

The number of output feature maps depends directly on the number of categories of the targets to be identified, and is calculated by formula (1):

$$\mathrm{filters\_num} = 3 \times (\mathrm{class\_num} + 5) \tag{1}$$

Taking the identification of 32 kinds of dishes by the method of the invention as an example, the formula gives 111 feature maps output by the identification network at each of the three sizes.
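Formula (1) can be checked with a few lines of Python. The reading of the constant 5 as four box coordinates plus one objectness score follows the usual YOLOv3 convention and is not spelled out in the text; the function name and the `boxes_per_cell` parameter are illustrative.

```python
def filters_num(class_num, boxes_per_cell=3):
    """Formula (1): number of output feature maps per scale."""
    # 5 = four box values (x, y, w, h) plus one objectness score (usual YOLOv3 reading).
    return boxes_per_cell * (class_num + 5)


print(filters_num(32))  # 3 * (32 + 5) = 111
```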

In bounding box prediction, the template boxes (anchor boxes) are determined by dimension clustering, and the coordinates of the center point of the bounding box relative to the upper-left corner of the grid cell are obtained by predicting the relative position directly. The bounding box prediction process is shown in fig. 3.

As can be seen from fig. 3, bounding box prediction includes a window fine-tuning process, which makes the network's localization more accurate and increases the IOU value. The coordinates of the output box predicted on the feature map are $b_x$, $b_y$, $b_w$, $b_h$, i.e. the position and size of the bounding box relative to the feature map, given by:

$$b_x = \sigma(t_x) + C_x \tag{2}$$

$$b_y = \sigma(t_y) + C_y \tag{3}$$

$$b_w = P_w e^{t_w} \tag{4}$$

$$b_h = P_h e^{t_h} \tag{5}$$

The learning targets of the network are $t_x$, $t_y$, $t_w$, $t_h$, where $t_x$, $t_y$ are the coordinate offsets of the prediction box and $t_w$, $t_h$ are scale factors. $G_x$, $G_y$ are the coordinates of the center point of the actual box (ground truth) on the feature map, and $G_w$, $G_h$ are the width and height of the ground truth on the feature map. $C_x$, $C_y$ are the coordinates of the upper-left corner of the grid cell on the feature map; in the YOLO neural network, the width and height of each grid cell on the feature map are both 1. $P_w$, $P_h$ are the width and height of the preset template box mapped onto the feature map. $t_x$, $t_y$ are calculated directly as the offset of the center of the bounding box from the upper-left corner of the grid cell:

$$t_x = G_x - C_x \tag{6}$$

$$t_y = G_y - C_y \tag{7}$$

$t_w$, $t_h$ are the logarithms of the ratios of the width and height of the box containing the object to the width and height of the template box:

$$t_w = \log(G_w / P_w) \tag{8}$$

$$t_h = \log(G_h / P_h) \tag{9}$$
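Equations (6) to (9) can be sketched as a small encoding function. It assumes that the responsible grid cell is the one containing the ground-truth center, so that its upper-left corner is the floor of the center coordinates (each cell has width and height 1 on the feature map); the function name `encode_targets` and the example values are hypothetical.

```python
import math


def encode_targets(g_x, g_y, g_w, g_h, p_w, p_h):
    """Compute (t_x, t_y, t_w, t_h) from equations (6)-(9).

    All inputs are in feature-map units: (g_x, g_y, g_w, g_h) is the ground
    truth box and (p_w, p_h) is the template (anchor) box.
    """
    # Assumption: the cell containing the centre is responsible, so C = floor(G).
    c_x, c_y = math.floor(g_x), math.floor(g_y)
    t_x = g_x - c_x               # equation (6)
    t_y = g_y - c_y               # equation (7)
    t_w = math.log(g_w / p_w)     # equation (8)
    t_h = math.log(g_h / p_h)     # equation (9)
    return (c_x, c_y), (t_x, t_y, t_w, t_h)


# A ground-truth box exactly the size of the 116 x 90 template on the 13 x 13 map
# (stride 32, so the template is 3.625 x 2.8125 in feature-map units).
cell, targets = encode_targets(6.5, 6.25, 3.625, 2.8125, 3.625, 2.8125)
print(cell, targets)
```

When the ground truth matches the template box exactly, the scale targets are zero, which is why well-chosen template boxes make the regression easier.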

As can be seen from equations (2) to (5), the position of the bounding box is calculated from ($t_x$, $t_y$, $t_w$, $t_h$). When $b_x$, $b_y$ are calculated, the sigmoid function compresses $t_x$, $t_y$ into the interval [0, 1], which effectively ensures that the target center stays in the grid cell performing the prediction and prevents excessive offset. To obtain a more stable model, the predicted position of the bounding box is constrained to [0, 1], i.e. $b_x$, $b_y$, $b_w$, $b_h$ are divided by the width and height of the feature map respectively:

$$b_x = \frac{\sigma(t_x) + C_x}{w} \tag{10}$$

$$b_y = \frac{\sigma(t_y) + C_y}{h} \tag{11}$$

After division by $w$ and $h$, multiplying the four values $b_x$, $b_y$, $b_w$, $b_h$ by the width and height of the network input image (e.g. 416 × 416) gives the position and size of the bounding box in the 416 × 416 coordinate system, i.e. the desired target box can be output.
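The decoding path described above can be sketched as follows. The center uses the sigmoid form of equations (2) and (3); the width and height use $b_w = P_w e^{t_w}$ and $b_h = P_h e^{t_h}$, the inverse of equations (8) and (9), which is assumed here to be what equations (4) and (5) express. The function names and defaults are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t, cell, anchor, grid=13, input_size=416):
    """Turn network outputs into a box (center x, center y, width, height)
    in input-image pixels.

    t      -- (t_x, t_y, t_w, t_h) predicted for one template box
    cell   -- (C_x, C_y), upper-left corner of the grid cell
    anchor -- (P_w, P_h), template box size in feature-map units
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = anchor
    # Box on the feature map.
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)   # assumed form, inverse of equation (8)
    b_h = p_h * math.exp(t_h)   # assumed form, inverse of equation (9)
    # Divide by the feature-map size, then multiply by the input size.
    scale = input_size / grid
    return (b_x * scale, b_y * scale, b_w * scale, b_h * scale)


# Zero outputs place the box at the middle of cell (6, 6) with the template's own size.
print(decode_box((0.0, 0.0, 0.0, 0.0), (6, 6), (116 / 32, 90 / 32)))
```

With all four outputs at zero, the decoded box is centered at (208, 208) in the 416 × 416 image and measures 116 × 90, i.e. the template box itself.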

When the YOLO neural network predicts bounding boxes, it uses logistic regression. Logistic regression assigns an objectness score to the region enclosed by each template box (anchor), i.e. how likely that region is to contain a target. This step is performed before prediction so that unnecessary anchors are removed, which reduces the amount of calculation.

Thus, the YOLO neural network operates on only 1 anchor prior, namely the best prior. Logistic regression is used to find, among the 9 anchor priors, the one with the highest objectness score. Logistic regression models the mapping from a prior to its objectness score by passing a linear model through a sigmoid curve.

The confidence of the YOLO neural network is defined from two factors: the probability $P_r(\mathrm{object})$ that the bounding box contains a target, and the accuracy of the bounding box. When the bounding box is background (i.e. contains no object), $P_r(\mathrm{object}) = 0$; when the bounding box contains an object, $P_r(\mathrm{object}) = 1$. The accuracy of the bounding box is represented by the IOU (intersection over union) of the predicted box and the actual box (ground truth), recorded as $\mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$.

Confidence is defined as follows:

$$\mathrm{Confidence} = P_r(\mathrm{object}) \times \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$$

From the above formula, the confidence is the product of two factors, so the accuracy of the prediction box is also reflected in it.
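As a worked illustration of the two factors, the sketch below computes the IOU of a predicted box and a ground-truth box and multiplies it by the object indicator. The corner-based box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions chosen for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence(contains_object, predicted_box, truth_box):
    """Confidence = Pr(object) * IOU(pred, truth)."""
    pr_object = 1.0 if contains_object else 0.0
    return pr_object * iou(predicted_box, truth_box)


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IOU = 1 / (4 + 4 - 1) = 1/7.
print(confidence(True, (0, 0, 2, 2), (1, 1, 3, 3)))
print(confidence(False, (0, 0, 2, 2), (1, 1, 3, 3)))
```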

Based on the above theory, the specific process of dish identification by the YOLO neural network is shown in fig. 4. First, the image is preprocessed: the size of the input image of the YOLO neural network is set to 416 × 416, and the image is divided into a number of grid cells matching the size of the output feature map. Taking the 13 × 13 scale in fig. 4 as an example, the original input image is divided into 13 × 13 grid cells, and the output is a 3-dimensional tensor of 13 × 13 × 47 (the cuboid in fig. 4), so that each grid cell corresponds to 47 values. For the gray grid cell in which the center point of a dish falls, 3 prediction boxes of different sizes are output. Of the 47 output values, the first part corresponds to the number of identified dish categories, which is 32 when 32 kinds of dishes are identified; the number of corresponding prediction boxes is 3; and the last 12 parameters are $b_x$, $b_y$, $b_w$, $b_h$ for each of the 3 bounding boxes. Finally, target bounding box prediction and category judgment are performed according to whether the center point of the dish falls in the grid cell, and the dish identification result is output.
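The 47-value layout and the grid-cell lookup described above can be sketched as below. The split follows the order given in the text (32 category values, 3 values for the prediction boxes, 12 box parameters); treating the middle 3 values as one score per prediction box is an interpretation, and the function names are hypothetical.

```python
NUM_CLASSES = 32
NUM_BOXES = 3
CELL_LEN = NUM_CLASSES + NUM_BOXES + 4 * NUM_BOXES  # 32 + 3 + 12 = 47


def split_cell_output(vec):
    """Split the 47 values of one grid cell into its three parts."""
    if len(vec) != CELL_LEN:
        raise ValueError(f"expected {CELL_LEN} values, got {len(vec)}")
    class_scores = vec[:NUM_CLASSES]
    # Assumed reading: one score per prediction box.
    box_scores = vec[NUM_CLASSES:NUM_CLASSES + NUM_BOXES]
    flat = vec[NUM_CLASSES + NUM_BOXES:]
    # Each box is (b_x, b_y, b_w, b_h).
    boxes = [tuple(flat[i * 4:(i + 1) * 4]) for i in range(NUM_BOXES)]
    return class_scores, box_scores, boxes


def responsible_cell(center_x, center_y, input_size=416, grid=13):
    """Grid cell (column, row) in which a dish center, in input pixels, falls."""
    stride = input_size / grid
    return int(center_x // stride), int(center_y // stride)


class_scores, box_scores, boxes = split_cell_output(list(range(47)))
print(len(class_scores), box_scores, boxes)
print(responsible_cell(208, 208))
```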

In the process of dish identification by the YOLO neural network, features of the dish image are first extracted through the Darknet-53 network; the three extracted feature maps of different sizes are then divided into grids of different sizes, and bounding box prediction and category judgment are carried out on dishes of different sizes.

After clustering, the number of priors of the YOLO neural network used in dish identification is determined as k = 9, and dishes in the input image are predicted through preset template boxes of different sizes to obtain the corresponding 9 bounding boxes of different sizes.
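The text does not detail the clustering procedure. One common form of dimension clustering is k-means over the labeled box widths and heights, assigning each box to the centroid with which it has the highest IOU; the sketch below assumes that form. The deterministic seeding, the function names and the toy data are illustrative choices, not taken from the patent.

```python
def wh_iou(a, b):
    """IOU of two (width, height) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (width, height) pairs into k template boxes."""
    # Deterministic seeding spread over boxes sorted by area (an assumption;
    # random seeding is more common in practice).
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IOU is equivalent to smallest distance d = 1 - IOU.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append((
                    sum(b[0] for b in members) / len(members),
                    sum(b[1] for b in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda b: b[0] * b[1])


# Toy data with two obvious size groups; the method in the text would use k = 9.
print(kmeans_anchors([(10, 10), (12, 12), (100, 100), (110, 90)], k=2))
```

Using IOU instead of Euclidean distance keeps large boxes from dominating the clustering error.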

The three feature maps of different sizes select corresponding prediction box sizes according to the extent of their receptive fields, each selecting bounding boxes of 3 sizes, wherein:

the output feature map of size 13 × 13 has the largest receptive field and is therefore suitable for detecting large dishes, such as a large serving of boiled fish; its preset template boxes, mapped onto the input image (416 × 416), give prediction box sizes of 116 × 90, 156 × 198 and 373 × 326 respectively;

the output feature map of size 26 × 26 has a medium receptive field and is used for detecting medium-sized dishes, such as a medium serving of fish-flavored shredded pork; its corresponding prediction box sizes are 30 × 61, 62 × 45 and 59 × 119 respectively;

the output feature map of size 52 × 52 has the smallest receptive field and is used for detecting small dishes, such as a small bowl of rice; its corresponding prediction box sizes are 10 × 13, 16 × 30 and 33 × 23 respectively.
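The template box sizes listed above are given in input-image pixels. Dividing by the stride of each scale (416 divided by the grid size) converts them to feature-map units, which is the form needed for the template width and height used in the bounding box equations. The dictionary and function names below are illustrative.

```python
INPUT_SIZE = 416

# Template box sizes (width, height) in input-image pixels, keyed by grid size.
ANCHORS = {
    13: [(116, 90), (156, 198), (373, 326)],   # largest receptive field, large dishes
    26: [(30, 61), (62, 45), (59, 119)],       # medium dishes
    52: [(10, 13), (16, 30), (33, 23)],        # smallest receptive field, small dishes
}


def anchors_on_feature_map(grid):
    """Template boxes of one scale expressed in feature-map units."""
    stride = INPUT_SIZE / grid  # 32, 16 or 8 pixels per grid cell
    return [(w / stride, h / stride) for (w, h) in ANCHORS[grid]]


for grid in (13, 26, 52):
    print(grid, anchors_on_feature_map(grid))
```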

Finally, the YOLO neural network uses logistic regression to find the template box with the highest objectness score among the 9 template boxes, i.e. it outputs the predicted bounding box closest to the real dish.
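A minimal sketch of this selection step, assuming the network emits one raw objectness value (logit) per template box and that the logistic (sigmoid) function turns it into a score; the function name and the example values are hypothetical.

```python
import math


def best_anchor(objectness_logits):
    """Return (index, score) of the template box with the highest objectness score."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in objectness_logits]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical raw outputs for the 9 template boxes.
logits = [-2.0, 0.0, 3.0, -1.0, 0.5, 1.5, -3.0, 2.0, 0.1]
print(best_anchor(logits))
```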

The technical effects of the present invention will be described in detail with reference to experiments.

The loss curve and IOU curve of the present invention during a certain training process are shown in FIG. 6 and FIG. 7.

The effect of the dish identification method of the present invention is shown in fig. 8 and 9.

TABLE 1 Performance of the dish identification method of the present invention

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A dish information identification method based on a YOLO neural network is characterized by comprising the following steps:

2. The dish information identification method based on the YOLO neural network of claim 1, wherein feature extraction of a dish image is performed through a Darknet-53 network.

3. The dish information identification method based on the YOLO neural network of claim 1, wherein the method divides the three extracted feature maps of different sizes into grids of different sizes, and performs bounding box prediction and category judgment on dishes of different sizes.

4. The dish information identification method based on the YOLO neural network of claim 1, wherein the size of the image input to the YOLO neural network is uniformly defined as 416 × 416, and feature maps of three different sizes, 13 × 13, 26 × 26 and 52 × 52, are obtained through a series of convolution, upsampling, residual unit and tensor concatenation operations.

5. The dish information identification method based on the YOLO neural network of claim 4, wherein the feature maps of three different sizes select corresponding prediction box sizes according to the extent of their receptive fields, each selecting bounding boxes of 3 sizes, wherein:

6. The dish information identification method based on the YOLO neural network of claim 1, further comprising:

step two, outputting 3 prediction boxes of different sizes for the grid cell in which the center point of the dish is located, wherein, of the 47 output values, the first part corresponds to the number of identified dish categories, which is 32; the number of corresponding prediction boxes is 3; and the last 12 parameters are $b_x$, $b_y$, $b_w$, $b_h$ corresponding to the 3 bounding boxes respectively;

7. A dish information identification system based on a YOLO neural network, which runs the dish information identification method based on the YOLO neural network of any one of claims 1 to 6, wherein the dish information identification system based on the YOLO neural network comprises:

8. A computer program for implementing the method for identifying dish information based on the YOLO neural network as claimed in any one of claims 1 to 6.

9. An information data processing terminal for implementing the dish information identification method based on the YOLO neural network as claimed in any one of claims 1 to 6.

10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the dish information identification method based on the YOLO neural network of any one of claims 1 to 6.