Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a dish information identification method and terminal based on a YOLO neural network.
The invention is realized in such a way that a dish information identification method based on a YOLO neural network comprises the following steps:
firstly, making an xml file in a VOC format, labeling the real category and the boundary box of dishes in an image, and automatically generating the xml file required by a YOLO neural network;
secondly, converting the xml file in the VOC format into a txt file required by a YOLO neural network;
thirdly, setting the training parameters of the YOLO neural network, wherein the learning rate is 0.01, the batch size is 64, the dropout is 0.25, and the number of iterations is 100,000;
fourthly, preprocessing the images in the training set, and sending the images with the adjusted sizes into a YOLO neural network for training;
fifthly, observing a loss curve in the training process, and judging whether the YOLO neural network is converged; if the convergence occurs, stopping training; if not, continuing training;
and sixthly, packaging the code into .pyd and .lib files with the pybind11 library, and calling the YOLO neural network from the Python language to realize the identification of the dishes.
Further, the dish information identification method based on the YOLO neural network performs feature extraction of dish images through a Darknet-53 network.
Further, the dish information identification method based on the YOLO neural network divides the extracted feature maps of three different sizes into grids of different sizes, and carries out bounding box prediction and category judgment on dishes of different sizes.
Further, in the dish information identification method based on the YOLO neural network, the size of an image input into the YOLO neural network is uniformly defined as 416 × 416, and feature maps of three different sizes, namely 13 × 13, 26 × 26 and 52 × 52, are obtained through a series of convolution, up-sampling, residual unit and tensor concatenation operations.
Further, in the dish information identification method based on the YOLO neural network, the three feature maps of different sizes select corresponding prediction box sizes according to the extent of their receptive fields, with bounding boxes of 3 sizes each, wherein:
for the output feature map of size 13 × 13, the preset template boxes, mapped to the 416 × 416 input image, give prediction box sizes of 116 × 90, 156 × 198 and 373 × 326 respectively;
for the output feature map of size 26 × 26, the corresponding prediction box sizes are 30 × 61, 62 × 45 and 59 × 119 respectively;
for the output feature map of size 52 × 52, the corresponding prediction box sizes are 10 × 13, 16 × 30 and 33 × 23 respectively.
Further, the dish information identification method based on the YOLO neural network further includes:
step one, image preprocessing: setting the size of the input image of the YOLO neural network to 416 × 416 and dividing the image into a corresponding number of grid cells according to the size of the output feature map; for the 13 × 13 feature map, the original input image is divided into 13 × 13 grid cells, and each grid cell corresponds to one vector of the 3-dimensional output tensor of 13 × 13 × 47;
step two, outputting 3 prediction boxes of different sizes for the grid cell where the center point of the dish is located, wherein, of the 47 output values, the first part corresponds to the identified dish categories, of which there are 32; the number of corresponding prediction boxes is 3; and the last 12 parameters are bx, by, bw and bh of the 3 bounding boxes respectively;
and step three, performing target bounding box prediction and category judgment according to whether the center point of the dish falls within the grid cell, and outputting the dish identification result.
Another object of the present invention is to provide a dish information identification system based on a YOLO neural network, which runs the dish information identification method based on a YOLO neural network, the dish information identification system comprising:
the image feature extraction module, used for extracting the features of the dish images through a Darknet-53 network;
the feature map dividing module, used for dividing the extracted feature maps of three different sizes into grids of different sizes;
and the judging module, used for carrying out bounding box prediction and category judgment on dishes of different sizes.
Another object of the present invention is to provide a computer program for implementing the dish information identification method based on the YOLO neural network.
Another object of the present invention is to provide an information data processing terminal for implementing the dish information identification method based on the YOLO neural network.
Another object of the present invention is to provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to execute the dish information identification method based on the YOLO neural network.
In summary, the advantages and positive effects of the invention are: the invention provides a dish identification method based on a YOLO neural network, wherein YOLO is an excellent target detection network, and dish identification by using the network can achieve very high accuracy and very high identification speed.
The prior art uses convolutional neural networks with a small number of layers and therefore performs poorly, whereas the invention adopts the YOLO neural network. The method is used in the dish identification field for the first time and solves the problem that identification accuracy and identification speed could not both be achieved in that field; a plurality of dishes can thereby be identified. Compared with existing dish identification methods, the dish identification method provided by the invention improves the accuracy by 30% and the identification speed by 15%. Deep learning learns automatically during network training and extracts image features automatically, without being limited to traditional hand-crafted features. With GPU acceleration for deep learning, the dish identification speed on video with 1080P resolution can exceed 30 FPS.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the invention provides a dish information identification method and a dish information identification terminal based on a YOLO neural network, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the dish information identification method based on the YOLO neural network according to the embodiment of the present invention includes the following steps:
s101: making an xml file in a VOC format, marking the real category and the boundary box of dishes in the image, and automatically generating the xml file required by a YOLO neural network;
s102: converting the XML file in the VOC format into a txt file required by a YOLO neural network;
s103: setting the training parameters of the YOLO neural network, wherein the learning rate is 0.01, the batch size is 64, the dropout is 0.25, and the number of iterations is 100,000;
s104: preprocessing the images in the training set, and sending the images with the adjusted sizes into a YOLO neural network for training;
s105: observing a loss curve in the training process, and judging whether the YOLO neural network is converged; if the convergence occurs, stopping training; if not, continuing training;
s106: packaging the code into .pyd and .lib files with the pybind11 library, and calling the YOLO neural network from the Python language to realize the identification of the dishes.
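Step S105 leaves the convergence judgment to observation of the loss curve. As a minimal sketch of one way such a judgment could be automated, the following compares the mean loss over the two most recent windows of iterations. The function name `has_converged`, the window length and the tolerance are illustrative assumptions and are not values given by the method.

```python
def has_converged(losses, window=100, tol=1e-3):
    """Return True when the mean loss has stopped changing.

    Hypothetical criterion (not specified by the method): the mean of the
    last `window` losses is compared with the mean of the `window` before it.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    # Converged when the relative change between windows is below `tol`.
    return abs(previous - recent) <= tol * max(abs(previous), 1e-12)
```

A training loop could call `has_converged` on the recorded loss values after each batch of iterations and stop once it returns True; otherwise training continues, as in S105.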
In a preferred embodiment of the invention, the xml file in VOC format is made with the software tool LabelImg, which makes it convenient to label and box-select the dish data set: the real categories and bounding boxes of the dishes in the image are labeled, and the xml file required by the YOLO neural network is generated automatically.
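The conversion in S102 can be sketched as follows. The sketch assumes the txt layout commonly used by Darknet-style YOLO training (one line per object: class index, then center x, center y, width and height, all normalized by the image size); the text does not spell out the layout, so this is an assumption. The function name `voc_to_yolo_lines`, the class list and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical VOC annotation of the kind LabelImg writes.
SAMPLE_XML = """
<annotation>
  <size><width>416</width><height>416</height><depth>3</depth></size>
  <object>
    <name>boiled_fish</name>
    <bndbox><xmin>104</xmin><ymin>104</ymin><xmax>312</xmax><ymax>312</ymax></bndbox>
  </object>
</annotation>
"""


def voc_to_yolo_lines(xml_text, class_names):
    """Convert one VOC annotation into normalized YOLO-style txt lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    lines = []
    for obj in root.iter("object"):
        class_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # Box center and size, normalized to [0, 1] by the image size.
        x_c = (xmin + xmax) / 2.0 / img_w
        y_c = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines


print(voc_to_yolo_lines(SAMPLE_XML, ["rice", "boiled_fish"]))
```

Each returned line would be written to the txt file that accompanies the corresponding training image.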
In a preferred embodiment of the invention, the hardware environment of the training process is as follows: the CPU is a 20-core Intel Xeon(R) E5-2640 v4 with a main frequency of 2.4 GHz, and the memory is 64 GB. Training is accelerated by a GPU, an NVIDIA GeForce GTX 1080 Ti/PCIe/SSE2 with a video memory size of 20 GB. The software environment of the training process is as follows: the operating system is Ubuntu 16.04 LTS, the OpenCV version is 3.3.0, and the TensorFlow version is 1.2.1.
The technical solution of the present invention is further described below with reference to the accompanying drawings.
The invention provides a method for identifying dishes, which adopts a YOLO neural network to detect the names of one or more dishes; the dish identification process comprises the following steps:
In the Residual module in fig. 2, ×1 indicates that the number of residual units is 1, ×2 indicates that the number of residual units is 2, and so on; the number of residual units in the YOLO neural network is 1 + 2 + 8 + 8 + 4 = 23. Each convolutional layer is followed by a BN (batch normalization) operation and a Leaky ReLU nonlinear activation function. Upsampling is mainly used to fuse shallow features with deep features so as to achieve a better detection effect on dishes. To cope with the size differences between dishes in different images, the output part in fig. 2 has 3 feature maps of different sizes (13 × 13, 26 × 26, 52 × 52). For each of the 3 sizes, template boxes with different aspect ratios and areas are selected according to the size of the feature map; the region of the original dish image corresponding to each feature map is derived by mapping back from the feature map, and the predicted target region (the optimal regression region) is taken as the final regression region for detecting the dish image.
The number of output feature maps is directly related to the number of categories of the target to be identified, and is calculated by formula (1):
filters_num=3*(class_num+5) (1)
Taking the identification of 32 types of dishes by the dish identification method provided by the invention as an example, formula (1) gives 3 × (32 + 5) = 111 output feature maps for each of the three sizes.
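Formula (1) can be checked with a few lines of Python. The function name and the `boxes_per_cell` parameter are illustrative; the comment on the constant 5 reflects the usual YOLO convention of four box coordinates plus one confidence value per box.

```python
def filters_num(class_num, boxes_per_cell=3):
    """Number of output feature maps per scale, following formula (1)."""
    # 5 = 4 box coordinates (bx, by, bw, bh) + 1 confidence value per box.
    return boxes_per_cell * (class_num + 5)


print(filters_num(32))  # 111 for the 32 dish categories of the example
```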
In bounding box prediction, the template boxes (anchor boxes) are determined by a dimension clustering method, and the coordinates of the bounding box center relative to the upper-left corner of the grid cell are obtained by directly predicting the relative position. The bounding box prediction process is shown in fig. 3.
As can be seen from fig. 3, the bounding box prediction includes a window fine-tuning process, which makes the network localization more accurate and increases the IOU value. The coordinate values of the output box predicted on the feature map are $b_x$, $b_y$, $b_w$, $b_h$, i.e. the position and size of the bounding box relative to the feature map, given by:
$b_x = \sigma(t_x) + C_x$ (2)
$b_y = \sigma(t_y) + C_y$ (3)
$b_w = P_w e^{t_w}$ (4)
$b_h = P_h e^{t_h}$ (5)
The learning targets of the network are $t_x, t_y, t_w, t_h$, where $t_x, t_y$ are the coordinate offsets of the prediction box and $t_w, t_h$ are the scaling factors. $G_x, G_y$ are the coordinates of the center point of the actual box (ground truth) in the feature map, and $G_w, G_h$ are the width and height of the ground truth on the feature map. $C_x, C_y$ are the coordinates of the upper-left corner of the grid cell in the feature map; in the YOLO neural network the width and height of each grid cell in the feature map are both 1. $P_w, P_h$ are the width and height of the preset template box mapped into the feature map. $t_x, t_y$ are calculated directly as the offset of the bounding box center from the upper-left corner of the grid cell:
$t_x = G_x - C_x$ (6)
$t_y = G_y - C_y$ (7)
$t_w, t_h$ are the logarithms of the ratios of the width and height of the box containing the object to the width and height of the template box:
$t_w = \log(G_w / P_w)$ (8)
$t_h = \log(G_h / P_h)$ (9)
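Expressions (6) to (9) can be written out as a short Python sketch. It assumes that all quantities are already expressed in feature-map units (each grid cell has width and height 1) and that the responsible grid cell is the one containing the ground-truth center, so its upper-left corner is obtained by rounding down. The function name `encode_targets` is hypothetical.

```python
import math


def encode_targets(gx, gy, gw, gh, pw, ph):
    """Compute the learning targets of expressions (6)-(9).

    gx, gy, gw, gh: ground-truth center and size on the feature map.
    pw, ph: template (anchor) box width and height on the feature map.
    """
    # Assumption: the cell containing the center has its upper-left corner
    # at the integer part of the center coordinates (cells have size 1).
    cx, cy = math.floor(gx), math.floor(gy)
    tx = gx - cx              # (6)
    ty = gy - cy              # (7)
    tw = math.log(gw / pw)    # (8)
    th = math.log(gh / ph)    # (9)
    return (cx, cy), (tx, ty, tw, th)
```

For example, on the 13 × 13 feature map a 116 × 90 template box spans 116 / 32 = 3.625 by 90 / 32 = 2.8125 cells, so a ground-truth box of exactly that size yields scaling targets of 0.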
As can be seen from expressions (2) to (5), the position of the bounding box ($b_x, b_y, b_w, b_h$) is calculated from ($t_x, t_y, t_w, t_h$). A sigmoid function compresses $t_x$ and $t_y$ into the interval [0, 1], which effectively ensures that the target center stays within the grid cell performing the prediction and prevents excessive offset. To obtain a more stable model, the predicted position of the bounding box is also constrained to [0, 1], that is, $b_x, b_y, b_w, b_h$ are divided by the width $w$ and height $h$ of the feature map, respectively:
$b_x = (\sigma(t_x) + C_x) / w$ (10)
$b_y = (\sigma(t_y) + C_y) / h$ (11)
Multiplying the four values $b_x, b_y, b_w, b_h$ obtained after division by $w$ and $h$ by the width and height of the network input image (e.g. 416 × 416) gives the position and size of the bounding box in the coordinate system of the input image (416 × 416), i.e. the desired target box can be output.
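The decoding chain described above (sigmoid offset plus cell corner, normalization by the feature-map size, then scaling to the input image) can be sketched as follows. The width and height use the exponential form b_w = P_w·e^{t_w}, b_h = P_h·e^{t_h}, which is the inverse of (8) and (9). The function names and argument order are illustrative assumptions.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph, grid_w, grid_h,
               img_w=416, img_h=416):
    """Map network outputs to a box (center x, center y, width, height)
    in input-image pixels."""
    bx = sigmoid(tx) + cx      # center x on the feature map
    by = sigmoid(ty) + cy      # center y on the feature map
    bw = pw * math.exp(tw)     # width on the feature map
    bh = ph * math.exp(th)     # height on the feature map
    # Normalize by the feature-map size, then scale to the input image.
    return (bx / grid_w * img_w, by / grid_h * img_h,
            bw / grid_w * img_w, bh / grid_h * img_h)
```

With all four outputs equal to 0, cell (6, 6) of the 13 × 13 feature map and the 116 × 90 template box (3.625 × 2.8125 cells), the decoded box is centered at (208, 208) with size 116 × 90 pixels.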
When the YOLO neural network predicts bounding boxes, logistic regression is used to assign each template box (anchor) an objectness score, i.e. how likely the enclosed region is to be an object. This step is performed before prediction so that unnecessary anchors are removed and the amount of calculation is reduced.
Thus the YOLO neural network operates on only one anchor prior, namely the best prior: logistic regression is used to find, among the 9 anchor priors, the one with the highest objectness score. Logistic regression here models the mapping from a prior to its objectness score with a curve.
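As a minimal sketch of the selection just described, the following applies the logistic (sigmoid) function to one raw score per prior and keeps the prior with the highest objectness score. The raw scores are assumed to come from the network; the function names are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def best_prior(raw_scores):
    """Return (index, objectness score) of the highest-scoring prior.

    raw_scores: one raw (pre-sigmoid) objectness value per anchor prior,
    e.g. 9 values for the 9 priors.
    """
    scores = [sigmoid(t) for t in raw_scores]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```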
The confidence of the YOLO neural network is defined in terms of the probability $Pr(object)$ that the bounding box contains a target and the accuracy of the bounding box. When the bounding box is background (i.e. contains no object), $Pr(object) = 0$; when the bounding box contains an object, $Pr(object) = 1$. The accuracy of the bounding box is represented by the IOU (intersection over union) of the predicted box and the actual box (ground truth), recorded as $IOU_{pred}^{truth}$. Confidence is defined as follows:
$Confidence = Pr(object) \times IOU_{pred}^{truth}$
From the above formula, the confidence is the product of two factors, so the accuracy of the prediction box is also reflected in it.
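The two factors of the confidence can be illustrated with a small Python sketch. Boxes are assumed to be given as corner coordinates (x_min, y_min, x_max, y_max); the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence(pr_object, predicted_box, truth_box):
    """Confidence as the product Pr(object) * IOU(predicted, truth)."""
    return pr_object * iou(predicted_box, truth_box)
```

A background box (Pr(object) = 0) therefore always has confidence 0, while a box that contains a dish has a confidence equal to its IOU with the ground truth.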
Based on the above theory, the specific process of dish identification by the YOLO neural network is shown in fig. 4. First, image preprocessing is carried out: the size of the input image of the YOLO neural network is set to 416 × 416, and the image is divided into a corresponding number of grid cells according to the size of the output feature map. Taking the 13 × 13 scale in fig. 4 as an example, the original input image is divided into 13 × 13 grid cells, and each grid cell corresponds to one vector of the 3-dimensional output tensor of 13 × 13 × 47, shown as a cuboid in fig. 4. For the gray grid cell where the center point of the dish is located, 3 prediction boxes of different sizes are output. Of the 47 output values, the first part corresponds to the identified dish categories, of which there are 32 in the example of identifying 32 types of dishes; the number of corresponding prediction boxes is 3; and the last 12 parameters are bx, by, bw and bh of the 3 bounding boxes respectively. Finally, target bounding box prediction and category judgment are performed according to whether the center point of the dish falls within the grid cell, and the dish identification result is output.
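The grid assignment and the 47-value layout described above can be sketched in Python. The order of the three parts (32 category values, 3 prediction-box values, 12 box parameters) follows the text; grouping the 12 parameters as consecutive (bx, by, bw, bh) quadruples is an assumption, as are the function names.

```python
def center_cell(center_x, center_y, grid=13, input_size=416):
    """Return (row, col) of the grid cell containing a dish center given
    in input-image pixels."""
    cell = input_size / grid  # 32 pixels per cell for the 13 x 13 grid
    col = min(int(center_x // cell), grid - 1)
    row = min(int(center_y // cell), grid - 1)
    return row, col


def split_cell_vector(vec, num_classes=32, num_boxes=3):
    """Split one 47-value grid-cell vector into its three parts."""
    assert len(vec) == num_classes + num_boxes + 4 * num_boxes  # 32 + 3 + 12
    class_part = vec[:num_classes]
    box_part = vec[num_classes:num_classes + num_boxes]
    coords = vec[num_classes + num_boxes:]
    # Assumed grouping: (bx, by, bw, bh) for box 0, then box 1, then box 2.
    boxes = [tuple(coords[4 * i:4 * i + 4]) for i in range(num_boxes)]
    return class_part, box_part, boxes
```

For instance, a dish centered at pixel (208, 208) of the 416 × 416 input falls in cell (6, 6) of the 13 × 13 grid.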
In the process of dish identification by the YOLO neural network, feature extraction of the dish image is first carried out through the Darknet-53 network; the extracted feature maps of three different sizes are then divided into grids of different sizes, and bounding box prediction and category judgment are carried out on dishes of different sizes.
After clustering, the number of priors of the YOLO neural network used in the dish identification process is determined as k = 9; dishes in the input image are predicted through preset template boxes of different sizes, yielding 9 corresponding bounding boxes of different sizes.
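The text does not detail the clustering procedure. A minimal sketch, assuming the k-means variant commonly used for YOLO template boxes (boxes compared by the IOU of their widths and heights rather than by Euclidean distance), is given below; the function names, the seed and the iteration limit are all illustrative assumptions.

```python
import random


def wh_iou(box, centroid):
    """IOU of two (width, height) boxes aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster labeled (width, height) pairs into k template boxes."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        # Assign every box to the centroid it overlaps most.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        # Move each centroid to the mean width and height of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

Calling `kmeans_anchors(boxes, 9)` on the labeled widths and heights of a training set would return 9 template boxes ordered by area, which could then be split three per feature-map scale.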
The three feature maps of different sizes select corresponding prediction box sizes according to the extent of their receptive fields, with bounding boxes of 3 sizes each, wherein:
the output feature map of size 13 × 13 has the largest receptive field and is therefore suitable for detecting large dishes, such as a large plate of boiled fish; its preset template boxes, mapped to the input image (416 × 416), give prediction box sizes of 116 × 90, 156 × 198 and 373 × 326 respectively;
the output feature map of size 26 × 26 has a medium receptive field and is used for detecting medium-sized dishes, such as a medium portion of fish-flavored shredded pork; the corresponding prediction box sizes are 30 × 61, 62 × 45 and 59 × 119 respectively;
the output feature map of size 52 × 52 has the smallest receptive field and is used for detecting small dishes, such as a small bowl of rice; the corresponding prediction box sizes are 10 × 13, 16 × 30 and 33 × 23 respectively.
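The correspondence between feature-map sizes and template boxes listed above can be captured in a small lookup table. The matching rule in the sketch (pick the template box whose width and height overlap the dish box most) is an illustrative assumption and is not stated in the text; the names `ANCHORS`, `wh_iou` and `best_template` are hypothetical.

```python
# Template box sizes (width, height) in input-image pixels, per feature map.
ANCHORS = {
    13: [(116, 90), (156, 198), (373, 326)],   # large dishes
    26: [(30, 61), (62, 45), (59, 119)],       # medium-sized dishes
    52: [(10, 13), (16, 30), (33, 23)],        # small dishes
}


def wh_iou(a, b):
    """IOU of two (width, height) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def best_template(box_w, box_h):
    """Return (feature-map size, template box) that best fits a dish box."""
    candidates = [
        (wh_iou((box_w, box_h), anchor), grid, anchor)
        for grid, anchors in ANCHORS.items()
        for anchor in anchors
    ]
    _, grid, anchor = max(candidates)
    return grid, anchor
```

Under this rule a 300 × 300 pixel dish is matched to the 373 × 326 template box of the 13 × 13 feature map, and a 12 × 14 pixel dish to the 10 × 13 template box of the 52 × 52 feature map.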
Finally, the YOLO neural network uses logistic regression to find the template box with the highest objectness score among the 9 template boxes, i.e. it outputs the predicted bounding box closest to the real dish.
The technical effects of the present invention will be described in detail with reference to experiments.
The loss curve and the IOU curve of the present invention during one training run are shown in fig. 6 and fig. 7.
The effect of the dish identification method of the present invention is shown in fig. 8 and 9.
TABLE 1 Performance of the dish identification method of the present invention
It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.