CN112488301A - Food inversion method based on multitask learning and attention mechanism - Google Patents

Food inversion method based on multitask learning and attention mechanism Download PDF

Info

Publication number
CN112488301A
Authority
CN
China
Prior art keywords
food
food material
model
menu
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011426511.1A
Other languages
Chinese (zh)
Other versions
CN112488301B (en)
Inventor
孙成林
白洪涛
蔡芷薇
何丽莉
曹英晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202011426511.1A priority Critical patent/CN112488301B/en
Publication of CN112488301A publication Critical patent/CN112488301A/en
Application granted granted Critical
Publication of CN112488301B publication Critical patent/CN112488301B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/285 - Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a food inversion method based on multitask learning and an attention mechanism, which comprises the following steps: step 1, collecting food data and constructing a menu data set; step 2, establishing and training a food material text model based on an attention mechanism, which takes a food picture as input and outputs the corresponding food material text; step 3, establishing and training a menu generation model, which takes the food picture and the food material text as input and outputs the menu text corresponding to the food picture; step 4, converting the food material texts and menu texts into corresponding food material vectors and menu vectors respectively, and establishing and training a multitask convolutional neural network model. Inputting a food picture to be detected into the multitask convolutional neural network model then yields the food classification, calorie value, food material vector and menu vector corresponding to that picture.

Description

Food inversion method based on multitask learning and attention mechanism
Technical Field
The invention relates to the technical field of image recognition, in particular to a food inversion method based on multi-task learning and attention mechanism.
Background
In recent years we have witnessed a number of outstanding achievements in visual recognition tasks, including image classification, entity recognition and image semantic segmentation. However, compared with general image recognition tasks, food image understanding faces a more difficult challenge: foods and their components undergo various cutting and cooking operations, their shape, texture and color vary widely, and the food materials in a dish often occlude one another. The challenges faced by food image analysis therefore go beyond a pure computer vision task.
One early food material identification model was PFD (pairwise local feature distribution), which uses the results of food material prediction for food classification. In PFD, pixels are labeled with food material categories based on the appearance of the surrounding image patch. The spatial relationship between pixel pairs is then modeled as a multi-dimensional histogram over the co-occurrence of their labels and geometric properties, e.g., distance and orientation. Using these histograms, PFD showed impressive food recognition performance. However, PFD has almost no scalability in food material categories: it uses only 8 categories, which clearly cannot meet real-life application requirements where food materials are diverse.
Recipe generation based on food images has been designed as a retrieval task: by computing similarities between food images in an embedding space, the system retrieves the corresponding menu from an existing data set. However, the performance of such systems depends heavily on the size and diversity of the retrieval data set, as well as on the quality of the embedding vectors learned by the network. Moreover, such a system cannot retrieve recipe information that is absent from the data set.
Regarding food calorie estimation, the current mainstream method is to predict the calories of a food item from its category and volume. Depth-camera-based methods photograph the food with a depth camera to predict its quantity and thereby obtain a predicted calorie value. However, a depth camera is special equipment that people can hardly use in daily life.
DietCam is a mobile application that estimates food calories from multiple pictures. It performs semantic segmentation and image recognition on food images, reconstructs the 3D volume of the food, and predicts calories from that volume; the 3D reconstruction is performed by SIFT-based keypoint matching and homography estimation. The food calorie prediction system proposed by Pouladzadeh et al. requires pictures taken from both the top and the side of the food item, with the user's thumb as a size reference; it estimates the food volume by multiplying the height predicted from the top-view image by the width predicted from the side view. Methods that estimate food volume from multiple images usually require camera calibration or controlled shooting angles, making the workflow complex and difficult for users.
The calorie value of a food depends mainly on its type, volume, food materials and cooking method. Foods of the same category can contain different calories because they use different food materials and cooking methods. Therefore, the food calorie prediction task cannot be fully solved by identifying only the food category and volume, and prediction accuracy remains to be improved.
Disclosure of Invention
The invention designs and develops a food inversion method based on multitask learning and an attention mechanism, aiming to solve the dependence of retrieval-based menu generation models on the data set, and the low calorie prediction accuracy caused by ignoring factors such as the food materials and cooking method of the food.
The technical scheme provided by the invention is as follows:
a food inversion method based on multitask learning and attention mechanism comprises the following steps:
step 1, collecting food data and constructing a menu data set;
step 2, establishing and training a food material text model based on an attention mechanism, and obtaining a corresponding food material text by inputting a food picture;
step 3, establishing and training a menu generation model, and obtaining a menu text corresponding to the food picture by inputting the food picture and the food material text;
step 4, converting the food material texts and the menu texts into corresponding food material vectors and menu vectors respectively, and establishing and training a multitask convolutional neural network model;
and inputting a food picture to be detected in the multitask convolutional neural network model to further obtain the food classification, the calorie value, the food material vector and the menu vector corresponding to the food picture to be detected.
Preferably, in the step 2, the process of establishing the food material text model through the Transformer model includes:
taking the feature vector of the food picture as input, and outputting a sequence for generating food materials L = (l_0, …, l_k, …, l_K), where l_k represents one food material in the sequence.
Preferably, in the step 2, representing the generated food materials corresponding to the food picture by a list structure comprises:
determining a dictionary containing N food material elements as D = {d_1, d_2, …, d_N};
selecting K elements from the dictionary D to generate a food material list L = (l_0, …, l_K), l_k ∈ D;
encoding L as a K × N dimensional binary matrix L, where L_{i,j} = 1 when d_j is selected and L_{i,j} = 0 otherwise;
the training data of the food material text model comprises M pairs of food images and food material lists {(x^(i), L^(i))}_{i=1..M};
the optimization target of the food material text model is
L̃ = argmax_L p(L | x; θ_I, θ_L),
where L̃ is the target matrix predicted from image x, and θ_I and θ_L are the learnable parameters of the image encoder and the food material decoder, respectively;
decomposing p(L̃ | x; θ_I, θ_L) into K conditional terms:
p(L̃ | x; θ_I, θ_L) = ∏_{k=0..K} p(l̃_k | x, l̃_{<k}; θ_I, θ_L),
and specifying p(l̃_k | x, l̃_{<k}) as the probability distribution for food material classification.
Preferably, in the step 2, the food material text model is established through a Transformer model, and the data is optimized through an Adam optimizer: setting β1 = 0.9, β2 = 0.99, ε = 1e-8, and a learning rate of 0.001, where the learning rate of the pre-trained residual network layers is 0.0001; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; batch_size is set to 128 and num_workers is set to 4.
Preferably, in the step 3, the process of establishing a menu text model through a Transformer model includes:
taking the feature vector of the food picture and the feature vector of the food material text as input, and outputting a sequence for generating a menu R = (r_1, …, r_t, …, r_T), where r_t is a word in the sequence.
Preferably, in the step 3, the menu text model is established through a Transformer model, and the data is optimized through an Adam optimizer: β1 = 0.9, β2 = 0.99, ε = 1e-8, with an initial learning rate of 0.001 that decays every ten rounds by a factor of 0.99; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; batch_size is set to 128 and num_workers is set to 4.
Preferably, in the step 4, the establishing and training of the multitask convolutional neural network model comprises the following steps:
step 4.1, collecting sample data and constructing a training sample set and a verification test sample set;
step 4.2, building a multitask convolutional neural network model,
step 4.3, obtaining the loss function for training the multitask convolutional neural network model:
L = (1/N) Σ_{i=1..N} (L_cal + λ_cat·L_cat + λ_ing·L_ing + λ_dir·L_dir),
where L_cal, L_cat, L_ing, L_dir are respectively the loss functions of the four tasks of calorie prediction, food classification, food material prediction and menu prediction, λ_cat, λ_ing and λ_dir are respectively the weight values of the loss functions of food classification, food material prediction and menu prediction, and N is the total number of learning data;
step 4.4, training the multitask convolutional neural network model:
at the first training, initializing each weight value to 1;
inputting the picture feature vectors in the training sample set into the multitask convolutional neural network model, taking the model outputs as the predicted food classification, calorie value, food material text vector and menu text vector, and computing the loss function between these predictions and the corresponding true values; training stops when the loss is minimized, giving the trained multitask convolutional neural network model;
in the multi-task convolutional neural network model training process, the loss value of each iteration is saved, and finally the reciprocal of the average loss value of all iterations is used as the weight of each task loss function;
and 4.5, testing the prediction accuracy of the trained multitask convolution neural network model on a test data set.
Preferably, in the step 4.2, a VGG16 model is used as a basic network model for building the multitask convolutional neural network model.
Preferably, in said step 4.3, the calorie prediction loss function L_cal is
L_cal = λ_re·L_re + λ_ab·L_ab,
where L_ab is the absolute error, L_re is the relative error, and λ_re, λ_ab are the weight values of the calorie prediction loss function;
the loss function L_cat of the food classification is
L_cat = −Σ_{k=1..n} g_cat(k)·log y_cat(k),
where y_cat(k) is the predicted value of unit k for food image x and g_cat(k) is a binary value: g_cat(k) = 1 when unit k is the correct category, and g_cat(k) = 0 otherwise; n is the number of food categories;
the loss function L_ing of the food material prediction is
L_ing = Σ_k (y_ing(k) − g_ing(k))²,
where y_ing(k) is the k-th dimension predicted value of the model and g_ing(k) is the k-th dimension actual value;
the loss function L_dir of the menu vector is
L_dir = Σ_k (y_dir(k) − g_dir(k))²,
where y_dir(k) is the k-th dimension predicted value of the model and g_dir(k) is the k-th dimension target value.
Preferably, the food material samples are respectively converted into corresponding food material vectors through a Word2vec model, the vector v_ing_j being
v_ing_j = Σ_{k=1..K} tfidf_{k,j}·word2vec(w_k),
where K is the number of food materials, word2vec(w_k) is the real-valued vector corresponding to w_k obtained through Word2Vec, and tfidf_{k,j} is the tf-idf value of w_k in sample r_j;
and the menu samples are respectively converted into corresponding menu vectors through the Word2vec model, the vector v_dir_j being
v_dir_j = Σ_{t=1..T} tfidf_{t,j}·word2vec(v_t),
where T is the number of menu words in the sample, word2vec(v_t) is the real-valued vector corresponding to v_t obtained through Word2Vec, and tfidf_{t,j} is the tf-idf value of v_t in sample r_j.
Compared with the prior art, the invention has the following beneficial effects:
1. Based on a deep learning method, a generative menu prediction model is designed, removing the dependence of traditional retrieval-based menu prediction models on a dish-name-to-menu comparison data set; that is, even when the menu corresponding to a food image does not exist in the database, the trained model can still generate a reasonable menu text from the image information;
2. Based on the multitask convolutional neural network, the calorie value is predicted directly from the food image without first computing the food volume in the image, which effectively improves the accuracy of calorie prediction; no special shooting equipment is needed, reducing both the complexity of the model and the usage threshold for users;
3. Based on the deep learning method, the corresponding menu text is generated directly from the food image without the user inputting other auxiliary information, reducing the operational complexity for users; given that the food materials and cooking methods of the same dish may differ across regions, the model can learn these differences from food images, achieving higher menu generation accuracy than a retrieval-based menu prediction model.
Drawings
Fig. 1 is a schematic diagram of a recipe generation model according to the present invention.
FIG. 2 is a schematic diagram of a calorie prediction model according to the present invention.
Fig. 3 is a flow chart of the general implementation of the system according to the present invention.
Fig. 4 is an overall framework diagram of the system according to the present invention.
FIG. 5 is a schematic diagram of a multitasking convolutional neural network according to the present invention.
FIG. 6 is a schematic view of a multi-modal attention model according to the present invention.
Fig. 7 is a schematic view of a food material encoder according to the present invention.
Fig. 8 is a schematic diagram of a menu decoder according to the present invention.
FIG. 9 is a schematic diagram of a calorie prediction model of the single-task convolutional neural network according to the present invention.
FIG. 10 is a schematic diagram of a calorie prediction model of the multitask convolutional neural network according to the present invention.
Detailed Description
The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.
The invention provides a food inversion method based on multitask learning and an attention mechanism, which comprises the following steps:
step one, collecting food data and constructing a menu data set;
preferably, in this implementation, a web crawler is used to crawl 227,310 samples containing food images, food categories, contained food materials, corresponding recipes and calorie labels from commercial recipe websites; the three selected recipe websites are:
(1)http://allrecipes.com/;
(2)http://www.lettuceclub.net/recipe/;
(3)http://www.orangepage.net/;
in step one, cross-domain noise data and cross-category noise data are excluded as follows:
1. removing samples whose corresponding image is smaller than 80 KB;
2. removing samples whose recipe contains more than 8 or fewer than 3 steps;
3. removing food categories with fewer than 100 corresponding samples, together with those samples;
the invention standardizes the dish names and food material names of all samples. Since the main targets of the method are a fine-grained food classification task and a food material identification task, overly broad food categories and food material names are deleted or merged: for example, the invention retains the different specific types of pasta and cake in the data set as separate categories, but removes the broader names "pasta" and "cake" from the data set. Likewise, because food material naming is not always consistent, synonymous food material names are unified; for example, alternative names for the food material "tomato" are all replaced by the single term "tomato";
preferably, in the present embodiment, the food material names are unified as shown in table 1;
Table 1. Food material name unification table
Preferably, in the present invention, the data set contains 227,310 samples in total, covering 281 fine-grained food categories, 1,520 food materials and 28,552 menu words; the unit of the calorie value is kilocalories (kcal) per serving;
in this research, images are resized to 256 × 256 and randomly cropped to 224 × 224 when input into a model for training;
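A minimal sketch of this preprocessing in Python with TensorFlow (the framework used later in this description) follows; the function name, decode step and the [0, 1] normalization are illustrative assumptions, not taken from the patent:

```python
import tensorflow as tf

def preprocess(image_bytes, training=True):
    # resize to 256x256, then a random 224x224 crop at training time
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.image.resize(image, [256, 256])
    if training:
        image = tf.image.random_crop(image, size=[224, 224, 3])
    else:
        # a deterministic center crop for evaluation (an assumption)
        image = tf.image.central_crop(image, central_fraction=224 / 256)
    return tf.cast(image, tf.float32) / 255.0
```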
Step two, establishing and training an encoder-decoder model based on an attention mechanism to obtain the food material texts corresponding to the food pictures;
in food material identification, an attention-based encoder-decoder model is adopted, and comprises an image encoder and a food material decoder:
the image encoder is used for encoding the food image into a feature vector; in the present implementation, a 50-layer residual network is preferably selected, and the image encoding dimension generated by the network is 512;
the food material decoder is an encoder-decoder component composed of Transformer modules, built with the Python 3.5.6 programming language on the TensorFlow 2.0.0 framework under Windows 10; it decodes the image feature vectors generated by the image encoder to generate food material texts.
The invention adopts a Transformer structure that takes a food image x^(i) as input; the goal is to generate through the model a sequence of food materials L = (l_0, …, l_k, …, l_K), where l_k represents one food material in the sequence.
The food material decoder consists of 4 Transformer modules and a softmax nonlinear layer, where each module contains 2 attention layers and a linear layer; the first attention layer performs self-attention over the outputs of previous time steps, and the second attention layer adjusts the output of the self-attention;
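As one possible reading of this structure, the following sketch shows a single decoder module in TensorFlow/Keras: self-attention over the already generated tokens, a second attention layer that conditions on the image encoding, and a linear layer. The head count and the residual/normalization placement are assumptions; note also that tf.keras.layers.MultiHeadAttention appeared in releases later than the TensorFlow 2.0.0 named above.

```python
import tensorflow as tf
from tensorflow.keras import layers

class DecoderModule(layers.Layer):
    """One of the 4 decoder modules: 2 attention layers + a linear layer."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.self_attn = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.cross_attn = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.linear = layers.Dense(d_model)
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.norm3 = layers.LayerNormalization()

    def call(self, tokens, image_encoding, causal_mask=None):
        # first attention layer: self-attention over previous outputs
        x = self.norm1(tokens + self.self_attn(tokens, tokens,
                                               attention_mask=causal_mask))
        # second attention layer: adjust using the image encoding
        x = self.norm2(x + self.cross_attn(x, image_encoding))
        return self.norm3(x + self.linear(x))
```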
the invention adopts a list structure to represent the food material text corresponding to a food image, the length of the list is variable, and certain sequence relation exists among list elements.
Next, a dictionary containing N food material elements is defined as D = {d_1, d_2, …, d_N}; by selecting K elements from the dictionary D, a food material list L = (l_0, …, l_K), l_k ∈ D, can be generated.
L is encoded as a K × N dimensional binary matrix L, where L_{i,j} = 1 when d_j is selected and L_{i,j} = 0 otherwise; thus, the training data of the encoder-decoder model comprises M pairs of food images and food material lists {(x^(i), L^(i))}_{i=1..M}.
The optimization goal of the model is:
L̃ = argmax_L p(L | x; θ_I, θ_L),
where L̃ is the target matrix predicted from image x, and θ_I and θ_L are the learnable parameters of the image encoder and the food material decoder, respectively. Since L is a list, p(L̃ | x; θ_I, θ_L) can be factorized into K conditional terms:
p(L̃ | x; θ_I, θ_L) = ∏_{k=0..K} p(l̃_k | x, l̃_{<k}; θ_I, θ_L),
and p(l̃_k | x, l̃_{<k}) is specified as the probability distribution for food material classification;
the image encoder and the food material decoder are trained jointly, and the probability distribution and the network model are adjusted through an Adam optimizer; preferably, in this embodiment, early stopping is used with the validation loss as the monitored index, and training stops if the validation loss does not decrease within 50 rounds;
in this embodiment, the food material identification model is trained with the self-constructed food data set; during training, the invention adopts data enhancement, performing random cropping (crop) and specified rescaling (rescale) on the input sample images; the Adam optimizer (β1 = 0.9, β2 = 0.99, ε = 1e-8) is selected with a learning rate of 0.001, where the learning rate of the pre-trained residual network layers is 0.0001; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; the batch_size of model training is set to 128, and num_workers is set to 4;
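A minimal sketch of this training configuration follows; the two learning rates are shown with two optimizers, which is one common way to realize a per-layer-group rate, though the patent does not state how it was implemented:

```python
import tensorflow as tf

# Adam with the stated hyperparameters; 1e-3 for the decoder,
# 1e-4 for the pre-trained residual layers.
decoder_opt = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9,
                                       beta_2=0.99, epsilon=1e-8)
backbone_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9,
                                        beta_2=0.99, epsilon=1e-8)

# Early stopping with patience 50; the text monitors validation loss here
# and a validation IoU metric elsewhere.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=50,
                                              restore_best_weights=True)
BATCH_SIZE, NUM_WORKERS, MAX_EPOCHS = 128, 4, 200
```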
step three, establishing and training a menu generation model so as to obtain a menu text corresponding to the food image to be detected;
based on the Tensorflow 2.0.0 framework on the Windows 10 system, the Python3.5.6 programming language is used to input the food image and the food material list as the model at the same time for generating the menu text.
The menu decoder takes the food image encoding e_I and the food material encoding e_L as input; the goal is to generate the menu sequence R = (r_1, …, r_t, …, r_T) through the model, where r_t refers to a word in the sequence;
a 512-dimensional food image vector e_I is obtained through the 50-layer residual network encoder of step two; after the food material decoder of step two generates the food material sequence L = (l_0, …, l_k, …, l_K), an embedding layer maps the food material text into a 512-dimensional vector e_L;
The menu decoder consists of 16 Transformer modules and a softmax nonlinear layer, where each module contains 2 attention layers and a linear layer; the first attention layer performs self-attention over the outputs of previous time steps, and the second attention layer adjusts the output of the self-attention;
the model contains two inputs: image feature vector
Figure BDA0002825095470000102
And food material coding
Figure BDA0002825095470000101
Wherein K is the number of food materials, deIs the dimension of the vector; the strategy adopted by the invention is to combine two attention layers to handle the problem of two modality input, one of which accepts the image encoding eIThe other layer receives the food material code eLThe outputs of the two attention layers are combined by a summation method.
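A sketch of this two-modality strategy follows: one attention layer attends over the image encoding e_I, the other over the food material encoding e_L, and the two outputs are combined by summation. Shapes follow the text (d_e = 512, K food materials); the head count and normalization are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TwoModalityAttention(layers.Layer):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.attn_image = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.attn_food = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.norm = layers.LayerNormalization()

    def call(self, x, e_img, e_ing):
        # x: decoder states; e_img: (batch, 1, 512); e_ing: (batch, K, 512)
        out_img = self.attn_image(x, e_img)   # attend over the image encoding
        out_ing = self.attn_food(x, e_ing)    # attend over the food materials
        return self.norm(x + out_img + out_ing)  # combined by summation
```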
In this embodiment, the invention trains the encoder-decoder model with the self-built food data set; during training, data enhancement is adopted, performing random cropping (crop) and specified rescaling (rescale) on the input sample images; the Adam optimizer (β1 = 0.9, β2 = 0.99, ε = 1e-8) is selected with an initial learning rate of 0.001, decayed every ten rounds with a decay factor of 0.99; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; the batch_size of model training is set to 128, and num_workers is set to 4;
Step four, establishing and training a multitask convolutional neural network model so as to obtain the food category and calorie value corresponding to the food image to be detected;
in the invention, food calorie estimation is designed as a regression problem: a food image is input, and the model outputs the corresponding calorie value;
the method assumes that a given food image contains only one food and that the food calorie prediction is given per serving; for the food material prediction task, the food material information is converted into word vectors by Word2Vec and used for training the multitask convolutional neural network model; in addition, for the menu prediction task, the sentence texts of the preparation steps are also converted into vectors for model training.
The multitask convolutional neural network architecture designed by the invention is mainly based on VGG16 and trains the tasks of food calorie prediction, food classification, food material prediction and menu prediction simultaneously; the fully connected layer (fc6) of the network is shared by all tasks, with a transition layer (fc7) branching to each task, so each task has its own transition layer (fc7) and output layer (fc8);
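A minimal sketch of this topology in Keras follows; the output sizes reuse the data set statistics given earlier, and the exact fc8c/fc8d dimensions are assumptions beyond matching the Word2Vec vector sizes:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_multitask_vgg16(n_classes=281, d_ing=500, d_dir=500):
    backbone = tf.keras.applications.VGG16(include_top=False,
                                           input_shape=(224, 224, 3))
    x = layers.Flatten()(backbone.output)
    fc6 = layers.Dense(4096, activation="relu", name="fc6")(x)  # shared by all tasks

    def head(suffix, units, activation=None):
        fc7 = layers.Dense(4096, activation="relu", name="fc7" + suffix)(fc6)
        return layers.Dense(units, activation=activation, name="fc8" + suffix)(fc7)

    outputs = {
        "calorie": head("a", 1),                      # regression, kcal per serving
        "category": head("b", n_classes, "softmax"),  # food classification
        "ingredient": head("c", d_ing),               # food material vector
        "recipe": head("d", d_dir),                   # menu vector
    }
    return tf.keras.Model(backbone.input, outputs)
```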
Let L_cal, L_cat, L_ing, L_dir be the loss functions of the four tasks of calorie prediction, food classification, food material prediction and menu prediction, respectively, and let N_mul be the total number of learning data; the loss function of the entire model is expressed as:
L = (1/N_mul) Σ_{i=1..N_mul} (L_cal + λ_cat·L_cat + λ_ing·L_ing + λ_dir·L_dir);
let L_ab be the absolute loss and L_re the relative loss; then L_cal is defined as:
L_cal = λ_re·L_re + λ_ab·L_ab,
where λ_re, λ_ab, λ_cat, λ_ing and λ_dir are the weight values of the respective loss functions; when the model is trained for the first time, each weight value is initialized to 1; during training, the loss value of each iteration is saved, and finally the reciprocal of the average loss value over all iterations is used as the weight of each task loss function;
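A sketch of this weighting scheme, under the reading that each task's weight becomes the reciprocal of its average recorded loss (all names here are illustrative assumptions):

```python
import numpy as np

TASKS = ["cal", "cat", "ing", "dir"]
weights = {t: 1.0 for t in TASKS}        # initialized to 1 on the first run
loss_history = {t: [] for t in TASKS}

def combined_loss(batch_losses):
    """batch_losses: dict of per-task scalar losses for one iteration."""
    for t in TASKS:
        loss_history[t].append(float(batch_losses[t]))
    return sum(weights[t] * batch_losses[t] for t in TASKS)

def reweight():
    """Reciprocal of the average loss over all iterations becomes the weight."""
    for t in TASKS:
        weights[t] = 1.0 / np.mean(loss_history[t])
```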
The food classification model connects a 4096-dimensional fc7b layer and an fc8b layer after the fc6 fully connected layer of the VGG16 model, each unit of fc8b corresponding to one food category.
Let y_cat(k) be the predicted value of unit k for food image x and g_cat(k) the true value; then L_cat is defined as:
L_cat = −Σ_{k=1..n} g_cat(k)·log y_cat(k),
where g_cat(k) is a binary value: g_cat(k) = 1 when unit k is the correct category, and g_cat(k) = 0 otherwise; n is the number of food categories; in the present embodiment, preferably, when the food classification task contains 20 kinds of food, for example, n is set to 20.
The model for the food calorie prediction task comprises a 4096-dimensional fc7a layer and a one-dimensional fc8a output layer that outputs the predicted calorie value. Since the food calorie is a real value, the task is treated as a regression problem, for which MSE (mean square error) is usually selected as the loss function; the present invention instead defines the loss function of the calorie prediction task as:
L_cal = λ_re·L_re + λ_ab·L_ab,
where L_ab is the absolute error, L_re is the relative error, and λ_re, λ_ab are the weight values of the loss function. The absolute error is the absolute value of the difference between the predicted and actual calorie values; the relative error is the ratio of the absolute error to the actual value. Since both are important indexes, the invention combines L_ab and L_re into L_cal so that the model is trained to reduce both errors simultaneously. Let y_cal be the calorie prediction for image x and g_cal the actual calorie value; then L_ab and L_re are defined as:
L_ab = |y_cal − g_cal|;
L_re = |y_cal − g_cal| / g_cal.
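A direct sketch of this loss in TensorFlow (the weight values default to the initialization of 1 stated above):

```python
import tensorflow as tf

def calorie_loss(y_cal, g_cal, lambda_re=1.0, lambda_ab=1.0):
    l_ab = tf.abs(y_cal - g_cal)   # absolute error
    l_re = l_ab / g_cal            # relative error (ratio to the actual value)
    return lambda_re * l_re + lambda_ab * l_ab
```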
To handle the food material prediction task, this research uses Word2Vec to convert each food material word of a sample into a vector; since each recipe contains several food materials, the weighted sum of all the food material vectors is computed to represent the food material information of the sample, i.e., a linear combination of the Word2Vec vectors of all food materials contained in the food, and this computed vector is used as the representation of the food material information. When food material vectors are used as training data, identifying each individual food material from a food image is difficult; however, the aim here is not to identify the food materials from the image but to predict the calories of the food. The method therefore expects the food material prediction task to improve the accuracy of food classification and food calorie prediction through the multitask convolutional neural network model, obtaining the benefit of synchronous multitask learning, and trains the food material prediction model for this purpose.
The invention uses a Word2vec model pre-trained on a large food corpus, with food data preprocessing such as removing low-frequency words and subsampling high-frequency words; in this embodiment, Word2Vec is trained using the Skip-gram model with negative sampling;
For each sample, only the top N_max food material words by tf-idf value are taken, where N_max is the average number of food material words per sample; finally, the food material vector of a sample is computed from the tf-idf values and the Word2Vec vectors. Let w_k be a food material word in sample r_j; the food material vector v_ing_j corresponding to sample r_j is defined as:
v_ing_j = Σ_{k=1..K} tfidf_{k,j}·word2vec(w_k),
where K is the number of food materials, word2vec(w_k) is the real-valued vector corresponding to w_k obtained through Word2Vec, and tfidf_{k,j} is the tf-idf value of w_k in sample r_j;
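A sketch of this computation follows; word2vec is assumed to behave as a dict-like map from word to a 500-dimensional numpy vector, and tfidf as a per-sample dict of values (both names are illustrative):

```python
import numpy as np

def food_material_vector(words, tfidf, word2vec, n_max=12, dim=500):
    # keep the n_max food material words with the highest tf-idf values
    top = sorted(words, key=lambda w: tfidf.get(w, 0.0), reverse=True)[:n_max]
    v = np.zeros(dim)
    for w in top:
        if w in word2vec:
            v += tfidf[w] * word2vec[w]   # tf-idf weighted Word2Vec sum
    return v
```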
In the invention, the learning of food material information is converted into a food material vector prediction task; the task model comprises a 4096-dimensional fc7c layer and an output layer (fc8c) whose dimension equals the food material vector dimension d_I. Let y_ing(k) be the k-th dimension predicted value of the model and g_ing(k) the k-th dimension actual value; then L_ing is defined as:
L_ing = Σ_k (y_ing(k) − g_ing(k))²;
In addition to food material prediction, this research also uses menu (preparation step) prediction as additional information for multitask learning. As in the food material prediction task, each word in the menu sentence text is converted into a word vector through Word2Vec, and the corresponding menu vector is obtained by weighted summation. When generating the menu vector, only nouns, verbs and adjectives in the menu sentences are used, favoring words with higher tf-idf values. For each food sample, only the top N_max words by tf-idf value in the menu sentences are used; in the experiments, N_max is set to the average number of words contained in the menu text of each sample. Finally, the menu vector of each sample is computed from the tf-idf weights and the Word2Vec vectors. Let v_t be a word of the menu text in sample r_j; the menu vector v_dir_j corresponding to sample r_j is defined as:
v_dir_j = Σ_{t=1..T} tfidf_{t,j}·word2vec(v_t),
where T is the number of menu words in the sample, word2vec(v_t) is the real-valued vector corresponding to v_t obtained through Word2Vec, and tfidf_{t,j} is the tf-idf value of v_t in sample r_j.
The model is trained on menu information through a menu vector prediction task; the menu vector prediction task model consists of a 4096-dimensional fc7d layer and an output layer (fc8d) whose dimension corresponds to the menu vector dimension. Let y_dir(k) be the k-th dimension predicted value of the model and g_dir(k) the k-th dimension target value; then L_dir is defined as:
L_dir = Σ_k (y_dir(k) − g_dir(k))².
This research extends the VGG-16 model to realize the multitask convolutional neural network, using batch normalization in place of dropout at the fc6 and fc7 layers; for the layers other than batch normalization and fc8, the initial parameters are set to the VGG16 parameters pre-trained on ImageNet for the 1000-class classification task. To optimize the CNN parameters, the research uses SGD with a momentum of 0.9 and a mini-batch size of 8.
For testing, the invention saves 10 models at 100-iteration intervals over the last 1,000 training iterations, and the average of the predictions of these models is taken as the final prediction.
The present invention uses 70% of the data in the food data set for training and the remaining 30% for validation and testing; the learning rate is set to 0.001 for 50,000 iterations and then changed to 0.0001 for another 20,000 iterations. To train the model to predict food material vectors and menu vectors, the invention trains Word2Vec on about 8,710,000 cooking-step sentences, with a word vector dimension n of 500. Regarding food materials, only the food material words whose tf-idf values rank in the top 12 within a data set sample are used to create the food material vector. Since N_max = 44, i.e., the average word count of the menu sentences in each sample is 44, only the words whose tf-idf values rank in the top 44 within the menu sentences of each sample are used. Then, to account for time information in a simple way, the menu text is divided into m sentence groups in temporal order, m menu vectors are created, and finally the divided vectors are concatenated.
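A sketch of this temporal split, reusing the weighted-sum idea above (the division into m equal parts is an assumption; the patent only states that the menu text is divided into m sentence groups in temporal order and the vectors concatenated):

```python
import numpy as np

def menu_vector(sentences, tfidf, word2vec, m=4, n_max=44, dim=500):
    # divide the menu sentences into m groups in temporal order
    groups = np.array_split(sentences, m)
    parts = []
    for group in groups:
        words = [w for s in group for w in s.split()]
        top = sorted(words, key=lambda w: tfidf.get(w, 0.0), reverse=True)[:n_max]
        v = np.zeros(dim)
        for w in top:
            if w in word2vec:
                v += tfidf[w] * word2vec[w]   # tf-idf weighted Word2Vec sum
        parts.append(v)
    return np.concatenate(parts)   # m concatenated menu vectors
```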
Examples
The system directly predicts the calorie value of the food from a single food image, requires no other manually input information from the user, imposes no requirements on the shooting angle, and needs no special equipment such as a depth camera, making operation simpler and more convenient for the user.
As shown in fig. 9 and 10, the model adopts a multitask convolutional neural network and learns the four tasks of calorie prediction, food classification, food material prediction and menu prediction simultaneously during training, effectively improving the accuracy of food classification and calorie prediction; the experimental results show that, compared with the calorie prediction model of the single-task convolutional neural network (correlation coefficient 0.7217), the correlation coefficient of the proposed model (0.7679) is improved by 0.0462.
The invention constructs a food data set which comprises 281 fine-grained food categories and 1520 food materials, and can alleviate the problem of limitation of the food and food material categories to a certain extent.
The model takes the food image and the food material text as input simultaneously, effectively improving the accuracy of menu prediction; as shown in Table 2, the experimental results show that the test-set perplexity of the model is about 0.18 lower than that of the menu prediction model with food-material-only input, and about 1.40 lower than that of the menu prediction model with food-image-only input.
Table 2. Test-set perplexity results

Model | Perplexity
Food material single-input model | 8.81
Image single-input model | 9.53
Generative model | 8.65
Unlike traditional query-style recipe generation systems, whose prediction accuracy depends excessively on the data set, this research designs the recipe generation task as a text generation problem.
As shown in Table 3, the experimental results show that, compared with the traditional query-style recipe prediction models, the Intersection-over-Union (IoU) of the model on the food material identification task is improved by more than 10 points, and the F1 score by more than 15 points.
Table 3. IoU and F1 results on the food material identification task

Model | IoU | F1
Image-menu query model | 18.01 | 29.67
Image-food material query model | 17.58 | 27.93
Generative model | 30.59 | 45.12
As shown in Table 4, compared with the traditional query-style recipe prediction models, the precision and recall based on food material information are significantly improved when the model processes the recipe generation task.
Table 4. Precision and recall on food material information

Model | Recall | Precision
Image-menu query model | 29.83 | 27.62
Image-food material query model | 28.75 | 29.16
Generative model | 70.30 | 73.94
While embodiments of the invention have been described above, the invention is not limited to the applications set forth in the description and the embodiments; it is fully applicable in the various fields to which the invention pertains, and further modifications may readily be made by those skilled in the art. The invention is therefore not limited to the details shown and described herein, provided such modifications do not depart from the general concept defined by the appended claims and their equivalents.

Claims (10)

1. A food inversion method based on multitask learning and attention mechanism is characterized by comprising the following steps:
step 1, collecting food data and constructing a menu data set;
step 2, establishing and training a food material text model based on an attention mechanism, and obtaining a corresponding food material text by inputting a food picture;
step 3, establishing and training a menu generation model, and obtaining a menu text corresponding to the food picture by inputting the food picture and the food material text;
step 4, converting the food material texts and the menu texts into corresponding food material vectors and menu vectors respectively, and establishing and training a multitask convolutional neural network model;
and inputting a food picture to be detected in the multitask convolutional neural network model to further obtain the food classification, the calorie value, the food material vector and the menu vector corresponding to the food picture to be detected.
2. The multitask learning and attention mechanism-based food inversion method according to claim 1, wherein in the step 2, the process of establishing the food material text model through the Transformer model comprises the following steps:
taking the feature vector of the food picture as input, and outputting a sequence for generating food materials L = (l_0, …, l_k, …, l_K), where l_k represents one food material in the sequence.
3. The food inversion method based on multitask learning and attention mechanism according to claim 2, wherein in the step 2, the generated food materials corresponding to the food picture are represented by a list structure as follows:
determining a dictionary containing N food material elements as D = {d_1, d_2, …, d_N};
selecting K elements from the dictionary D to generate a food material list L = (l_0, …, l_K), l_k ∈ D;
encoding L as a K × N dimensional binary matrix L, where L_{i,j} = 1 when d_j is selected and L_{i,j} = 0 otherwise;
the training data of the food material text model comprises M pairs of food images and food material lists {(x^(i), L^(i))}_{i=1..M};
the optimization target of the food material text model is
L̃ = argmax_L p(L | x; θ_I, θ_L),
where L̃ is the target matrix predicted from image x, and θ_I and θ_L are the learnable parameters of the image encoder and the food material decoder, respectively;
decomposing p(L̃ | x; θ_I, θ_L) into K conditional terms:
p(L̃ | x; θ_I, θ_L) = ∏_{k=0..K} p(l̃_k | x, l̃_{<k}; θ_I, θ_L),
and specifying p(l̃_k | x, l̃_{<k}) as the probability distribution for food material classification.
4. The food inversion method based on multitask learning and attention mechanism according to claim 3, wherein in the step 2, the food material text model is established through a Transformer model and the data is optimized through an Adam optimizer: setting β1 = 0.9, β2 = 0.99, ε = 1e-8 and a learning rate of 0.001, where the learning rate of the pre-trained residual network layers is 0.0001; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; batch_size is set to 128 and num_workers is set to 4.
5. The multitask learning and attention mechanism-based food inversion method according to claim 1, wherein in the step 3, the process of establishing a menu text model through a Transformer model comprises the following steps:
taking the feature vector of the food picture and the feature vector of the food material text as input, and outputting a sequence for generating a menu R = (r_1, …, r_t, …, r_T), where r_t is a word in the sequence.
6. The food inversion method based on multitask learning and attention mechanism according to claim 3, wherein in the step 3, the menu text model is established through a Transformer model and the data is optimized through an Adam optimizer: β1 = 0.9, β2 = 0.99, ε = 1e-8, with an initial learning rate of 0.001 that decays every ten rounds by a factor of 0.99; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; batch_size is set to 128 and num_workers is set to 4.
7. The multitask learning and attention mechanism based food inversion method according to claim 1, wherein in the step 4, establishing and training a multitask convolutional neural network model comprises the following steps:
step 4.1, collecting sample data and constructing a training sample set and a verification test sample set;
step 4.2, building a multitask convolutional neural network model,
step 4.3, obtaining the loss function for training the multitask convolutional neural network model:
L = (1/N) Σ_{i=1..N} (L_cal + λ_cat·L_cat + λ_ing·L_ing + λ_dir·L_dir),
where L_cal, L_cat, L_ing, L_dir are respectively the loss functions of the four tasks of calorie prediction, food classification, food material prediction and menu prediction, λ_cat, λ_ing and λ_dir are respectively the weight values of the loss functions of food classification, food material prediction and menu prediction, and N is the total number of learning data;
step 4.4, training the multitask convolutional neural network model:
at the first training, initializing each weight value to 1;
inputting the picture feature vectors in the training sample set into the multitask convolutional neural network model, taking the model outputs as the predicted food classification, calorie value, food material text vector and menu text vector, and computing the loss function between these predictions and the corresponding true values; training stops when the loss is minimized, giving the trained multitask convolutional neural network model;
in the multi-task convolutional neural network model training process, the loss value of each iteration is saved, and finally the reciprocal of the average loss value of all iterations is used as the weight of each task loss function;
and 4.5, testing the prediction accuracy of the trained multitask convolution neural network model on a test data set.
8. The multitask learning and attention mechanism based food inversion method according to claim 7, characterized in that in step 4.2, a VGG16 model is used as a base network model for building the multitask convolutional neural network model.
9. The food inversion method based on multitask learning and attention mechanism according to claim 7, wherein in the step 4.3, the calorie prediction loss function L_cal is
L_cal = λ_re·L_re + λ_ab·L_ab,
where L_ab is the absolute error, L_re is the relative error, and λ_re, λ_ab are the weight values of the calorie prediction loss function;
the loss function L_cat of the food classification is
L_cat = −Σ_{k=1..n} g_cat(k)·log y_cat(k),
where y_cat(k) is the predicted value of unit k for food image x and g_cat(k) is a binary value: g_cat(k) = 1 when unit k is the correct category, and g_cat(k) = 0 otherwise; n is the number of food categories;
the loss function L_ing of the food material prediction is
L_ing = Σ_k (y_ing(k) − g_ing(k))²,
where y_ing(k) is the k-th dimension predicted value of the model and g_ing(k) is the k-th dimension actual value;
the loss function L_dir of the menu vector is
L_dir = Σ_k (y_dir(k) − g_dir(k))²,
where y_dir(k) is the k-th dimension predicted value of the model and g_dir(k) is the k-th dimension target value.
10. The food inversion method based on multitask learning and attention mechanism according to claim 7, wherein the food material samples are respectively converted into corresponding food material vectors v_ing_j through a Word2vec model as
v_ing_j = Σ_{k=1..K} tfidf_{k,j}·word2vec(w_k),
where K is the number of food materials, word2vec(w_k) is the real-valued vector corresponding to w_k obtained through Word2Vec, and tfidf_{k,j} is the tf-idf value of w_k in sample r_j;
and the menu samples are respectively converted into corresponding menu vectors v_dir_j through the Word2vec model as
v_dir_j = Σ_{t=1..T} tfidf_{t,j}·word2vec(v_t),
where T is the number of menu words in the sample, word2vec(v_t) is the real-valued vector corresponding to v_t obtained through Word2Vec, and tfidf_{t,j} is the tf-idf value of v_t in sample r_j.
CN202011426511.1A 2020-12-09 2020-12-09 Food inversion method based on multitask learning and attention mechanism Active CN112488301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011426511.1A CN112488301B (en) 2020-12-09 2020-12-09 Food inversion method based on multitask learning and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011426511.1A CN112488301B (en) 2020-12-09 2020-12-09 Food inversion method based on multitask learning and attention mechanism

Publications (2)

Publication Number Publication Date
CN112488301A true CN112488301A (en) 2021-03-12
CN112488301B CN112488301B (en) 2024-04-16

Family

ID=74940621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011426511.1A Active CN112488301B (en) 2020-12-09 2020-12-09 Food inversion method based on multitask learning and attention mechanism

Country Status (1)

Country Link
CN (1) CN112488301B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113791882A (en) * 2021-08-25 2021-12-14 北京百度网讯科技有限公司 Multitask deployment method and device, electronic equipment and storage medium
CN114898360A (en) * 2022-03-31 2022-08-12 中南林业科技大学 Food material image classification model establishing method based on attention and depth feature fusion
CN115422703A (en) * 2022-07-19 2022-12-02 南京航空航天大学 Surface thermal infrared emissivity inversion method based on MODIS data and Transformer network
CN117933250A (en) * 2024-03-22 2024-04-26 南京泛美利机器人科技有限公司 New menu generation method based on improved generation countermeasure network


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150169972A1 (en) * 2013-12-12 2015-06-18 Aliphcom Character data generation based on transformed imaged data to identify nutrition-related data or other types of data
US20170277863A1 (en) * 2016-03-24 2017-09-28 Anand Subra Real-time or just-in-time online assistance for individuals to help them in achieving personalized health goals
US20170301001A1 (en) * 2016-04-15 2017-10-19 Wal-Mart Stores, Inc. Systems and methods for providing content-based product recommendations
US20180232689A1 (en) * 2017-02-13 2018-08-16 Iceberg Luxembourg S.A.R.L. Computer Vision Based Food System And Method
CN107563439A (en) * 2017-08-31 2018-01-09 湖南麓川信息科技有限公司 A kind of model for identifying cleaning food materials picture and identification food materials class method for distinguishing
CN110059654A (en) * 2019-04-25 2019-07-26 台州智必安科技有限责任公司 A kind of vegetable Automatic-settlement and healthy diet management method based on fine granularity identification
AU2019100969A4 (en) * 2019-08-29 2019-10-03 Hongming Dai Chinese Food Recognition and Search System
CN110659420A (en) * 2019-09-25 2020-01-07 广州西思数字科技有限公司 Personalized catering method based on deep neural network Monte Carlo search tree
CN111276242A (en) * 2020-01-20 2020-06-12 吉林大学 Disease diagnosis and disease state evaluation modeling method for patients in intensive care unit of hospital
CN111429234A (en) * 2020-04-16 2020-07-17 电子科技大学中山学院 Deep learning-based commodity sequence recommendation method
CN111651674A (en) * 2020-06-03 2020-09-11 北京妙医佳健康科技集团有限公司 Bidirectional searching method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIQING MIN et al.: "Ingredient-guided cascade multi-attention network for food recognition", MM '19: Proceedings of the 27th ACM International Conference on Multimedia, 15 October 2019, pages 1331-1339 *
CAI ZHIWEI: "Research on a food recognition model based on multi-task learning and attention mechanism" (基于多任务学习与注意力机制的食品识别模型研究), China Master's Theses Full-text Database, Engineering Science and Technology I, vol. 2022, no. 01, 15 January 2022, pages 024-303 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113791882A (en) * 2021-08-25 2021-12-14 北京百度网讯科技有限公司 Multitask deployment method and device, electronic equipment and storage medium
CN113791882B (en) * 2021-08-25 2023-10-20 北京百度网讯科技有限公司 Multi-task deployment method and device, electronic equipment and storage medium
CN114898360A (en) * 2022-03-31 2022-08-12 中南林业科技大学 Food material image classification model establishing method based on attention and depth feature fusion
CN114898360B (en) * 2022-03-31 2024-04-26 中南林业科技大学 Food material image classification model establishment method based on attention and depth feature fusion
CN115422703A (en) * 2022-07-19 2022-12-02 南京航空航天大学 Surface thermal infrared emissivity inversion method based on MODIS data and Transformer network
CN115422703B (en) * 2022-07-19 2023-09-19 南京航空航天大学 Surface thermal infrared emissivity inversion method based on MODIS data and transducer network
CN117933250A (en) * 2024-03-22 2024-04-26 南京泛美利机器人科技有限公司 New menu generation method based on improved generation countermeasure network

Also Published As

Publication number Publication date
CN112488301B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN110059198B (en) Discrete hash retrieval method of cross-modal data based on similarity maintenance
CN107516110B (en) Medical question-answer semantic clustering method based on integrated convolutional coding
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN109614471B (en) Open type problem automatic generation method based on generation type countermeasure network
CN112488301B (en) Food inversion method based on multitask learning and attention mechanism
CN112232087B (en) Specific aspect emotion analysis method of multi-granularity attention model based on Transformer
CN113435203B (en) Multi-modal named entity recognition method and device and electronic equipment
CN112667818B (en) GCN and multi-granularity attention fused user comment sentiment analysis method and system
CN111291188B (en) Intelligent information extraction method and system
CN111753189A (en) Common characterization learning method for few-sample cross-modal Hash retrieval
CN111737578A (en) Recommendation method and system
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN114743020A (en) Food identification method combining tag semantic embedding and attention fusion
US11915343B2 (en) Color representations for textual phrases
CN111753116A (en) Image retrieval method, device, equipment and readable storage medium
Estevez-Velarde et al. AutoML strategy based on grammatical evolution: A case study about knowledge discovery from text
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Wei et al. Semantic pixel labelling in remote sensing images using a deep convolutional encoder-decoder model
CN111651594B (en) Case item classification method and medium based on key value memory network
CN108804544A (en) Internet video display multi-source data fusion method and device
CN115408551A (en) Medical image-text data mutual detection method, device, equipment and readable storage medium
Guo et al. Matching visual features to hierarchical semantic topics for image paragraph captioning
CN114356990A (en) Base named entity recognition system and method based on transfer learning
Lauren et al. A low-dimensional vector representation for words using an extreme learning machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant