CN112488301A - Food inversion method based on multitask learning and attention mechanism - Google Patents

Food inversion method based on multitask learning and attention mechanism Download PDF

Info

Publication number
CN112488301A
Authority
CN
China
Prior art keywords
food
food material
model
menu
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011426511.1A
Other languages
Chinese (zh)
Other versions
CN112488301B (en)
Inventor
孙成林
白洪涛
蔡芷薇
何丽莉
曹英晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202011426511.1A priority Critical patent/CN112488301B/en
Publication of CN112488301A publication Critical patent/CN112488301A/en
Application granted granted Critical
Publication of CN112488301B publication Critical patent/CN112488301B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/285 - Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a food inversion method based on multitask learning and an attention mechanism, which comprises the following steps: step 1, collecting food data and constructing a menu data set; step 2, establishing and training a food material text model based on an attention mechanism, which takes a food picture as input and outputs the corresponding food material text; step 3, establishing and training a menu generation model, which takes the food picture and the food material text as input and outputs the menu text corresponding to the food picture; step 4, converting the food material texts and menu texts into corresponding food material vectors and menu vectors respectively, and establishing and training a multitask convolutional neural network model. Inputting a food picture to be detected into the multitask convolutional neural network model then yields the food classification, calorie value, food material vector and menu vector corresponding to that picture.

Description

Food inversion method based on multitask learning and attention mechanism
Technical Field
The invention relates to the technical field of image recognition, in particular to a food inversion method based on multi-task learning and attention mechanism.
Background
In recent years we have witnessed a number of outstanding achievements in visual recognition tasks, including image classification, entity recognition and image semantic segmentation. However, compared with general image recognition tasks, food image understanding faces a more difficult challenge: foods and their components undergo various cutting and cooking operations, their shape, texture and color vary widely, and the food materials in a dish often occlude one another. The challenges faced by food image analysis therefore go beyond a pure computer vision task.
One early food material identification model was PFD (pairwise local feature distribution), which uses the results of food material prediction for food classification. In PFD, pixels are labeled with food material categories based on the appearance of the surrounding image patch. The spatial relationship between pixel pairs is then modeled as a multi-dimensional histogram over the co-occurrence of their labels and geometric properties, e.g., distance and orientation. Using these histograms, PFD showed impressive food recognition performance. However, PFD has almost no scalability in food material categories: it uses only 8 categories, which clearly cannot meet real-life application requirements where food materials are diverse.
Recipe generation based on food images has been designed as a retrieval task: by computing similarities between food images in an embedding space, the system retrieves the corresponding menu from an existing data set. However, the performance of such systems depends heavily on the size and diversity of the retrieval data set, as well as on the quality of the embedding vectors learned by the network. Moreover, such a system cannot retrieve recipe information that is absent from the data set.
Regarding food calorie estimation, the current mainstream method is to predict the calories of a food item from its category and volume. Depth-camera-based methods photograph the food with a depth camera to predict its quantity and thereby obtain a predicted calorie value. However, a depth camera is special equipment that people can hardly use in daily life.
DietCam is a mobile application that estimates food calories from multiple pictures. It performs semantic segmentation and image recognition on food images, reconstructs the 3D volume of the food, and predicts calories from that volume; the 3D reconstruction is performed by SIFT-based keypoint matching and homography estimation. The food calorie prediction system proposed by Pouladzadeh et al. requires pictures taken from both the top and the side of the food item, with the user's thumb as a size reference; it estimates the food volume by multiplying the height predicted from the top-view image by the width predicted from the side view. Methods that estimate food volume from multiple images usually require camera calibration or controlled shooting angles, making the workflow complex and difficult for users.
The calorie value of a food depends mainly on its type, volume, food materials and cooking method. Foods of the same category can contain different calories because they use different food materials and cooking methods. Therefore, the food calorie prediction task cannot be fully solved by identifying only the food category and volume, and prediction accuracy remains to be improved.
Disclosure of Invention
The invention designs and develops a food inversion method based on multitask learning and an attention mechanism, aiming to solve the dependence of retrieval-based menu generation models on the data set, and the low calorie prediction accuracy caused by ignoring factors such as the food materials and cooking method of the food.
The technical scheme provided by the invention is as follows:
a food inversion method based on multitask learning and attention mechanism comprises the following steps:
step 1, collecting food data and constructing a menu data set;
step 2, establishing and training a food material text model based on an attention mechanism, and obtaining a corresponding food material text by inputting a food picture;
step 3, establishing and training a menu generation model, and obtaining a menu text corresponding to the food picture by inputting the food picture and the food material text;
step 4, converting the food material texts and the menu texts into corresponding food material vectors and menu vectors respectively, and establishing and training a multitask convolutional neural network model;
and inputting a food picture to be detected in the multitask convolutional neural network model to further obtain the food classification, the calorie value, the food material vector and the menu vector corresponding to the food picture to be detected.
Preferably, in the step 2, the process of establishing the food material text model through the Transformer model includes:
taking the feature vector of the food picture as input, and outputting a sequence for generating food materials L = (l_0, …, l_k, …, l_K), where l_k represents one food material in the sequence.
Preferably, in the step 2, representing the generated food materials corresponding to the food picture by a list structure comprises:
determining a dictionary containing N food material elements as D = {d_1, d_2, …, d_N};
selecting K elements from the dictionary D to generate a food material list L = (l_0, …, l_K), l_k ∈ D;
encoding L as a K × N dimensional binary matrix L, where L_{i,j} = 1 when d_j is selected and L_{i,j} = 0 otherwise;
the training data of the food material text model comprises M pairs of food images and food material lists {(x^(i), L^(i))}_{i=1..M};
the optimization target of the food material text model is
L̃ = argmax_L p(L | x; θ_I, θ_L),
where L̃ is the target matrix predicted from image x, and θ_I and θ_L are the learnable parameters of the image encoder and the food material decoder, respectively;
decomposing p(L̃ | x; θ_I, θ_L) into K conditional terms:
p(L̃ | x; θ_I, θ_L) = ∏_{k=0..K} p(l̃_k | x, l̃_{<k}; θ_I, θ_L),
and specifying p(l̃_k | x, l̃_{<k}) as the probability distribution for food material classification.
Preferably, in the step 2, the food material text model is established through a Transformer model, and the data is optimized through an Adam optimizer: setting β1 = 0.9, β2 = 0.99, ε = 1e-8, and a learning rate of 0.001, where the learning rate of the pre-trained residual network layers is 0.0001; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; batch_size is set to 128 and num_workers is set to 4.
Preferably, in the step 3, the process of establishing a menu text model through a Transformer model includes:
taking the feature vector of the food picture and the feature vector of the food material text as input, and outputting a sequence for generating a menu R = (r_1, …, r_t, …, r_T), where r_t is a word in the sequence.
Preferably, in the step 3, the menu text model is established through a Transformer model, and the data is optimized through an Adam optimizer: β1 = 0.9, β2 = 0.99, ε = 1e-8, with an initial learning rate of 0.001 that decays every ten rounds by a factor of 0.99; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; batch_size is set to 128 and num_workers is set to 4.
Preferably, in the step 4, the establishing and training of the multitask convolutional neural network model comprises the following steps:
step 4.1, collecting sample data and constructing a training sample set and a verification test sample set;
step 4.2, building a multitask convolutional neural network model,
step 4.3, obtaining the loss function for training the multitask convolutional neural network model:
L = (1/N) Σ_{i=1..N} (L_cal + λ_cat·L_cat + λ_ing·L_ing + λ_dir·L_dir),
where L_cal, L_cat, L_ing, L_dir are respectively the loss functions of the four tasks of calorie prediction, food classification, food material prediction and menu prediction, λ_cat, λ_ing and λ_dir are respectively the weight values of the loss functions of food classification, food material prediction and menu prediction, and N is the total number of learning data;
step 4.4, training the multitask convolutional neural network model:
at the first training, initializing each weight value to 1;
inputting the picture feature vectors in the training sample set into the multitask convolutional neural network model, taking the model outputs as the predicted food classification, calorie value, food material text vector and menu text vector, and computing the loss function between these predictions and the corresponding true values; training stops when the loss is minimized, giving the trained multitask convolutional neural network model;
in the multi-task convolutional neural network model training process, the loss value of each iteration is saved, and finally the reciprocal of the average loss value of all iterations is used as the weight of each task loss function;
and 4.5, testing the prediction accuracy of the trained multitask convolution neural network model on a test data set.
Preferably, in the step 4.2, a VGG16 model is used as a basic network model for building the multitask convolutional neural network model.
Preferably, in said step 4.3, the calorie prediction loss function L_cal is
L_cal = λ_re·L_re + λ_ab·L_ab,
where L_ab is the absolute error, L_re is the relative error, and λ_re, λ_ab are the weight values of the calorie prediction loss function;
the loss function L_cat of the food classification is
L_cat = −Σ_{k=1..n} g_cat(k)·log y_cat(k),
where y_cat(k) is the predicted value of unit k for food image x and g_cat(k) is a binary value: g_cat(k) = 1 when unit k is the correct category, and g_cat(k) = 0 otherwise; n is the number of food categories;
the loss function L_ing of the food material prediction is
L_ing = Σ_k (y_ing(k) − g_ing(k))²,
where y_ing(k) is the k-th dimension predicted value of the model and g_ing(k) is the k-th dimension actual value;
the loss function L_dir of the menu vector is
L_dir = Σ_k (y_dir(k) − g_dir(k))²,
where y_dir(k) is the k-th dimension predicted value of the model and g_dir(k) is the k-th dimension target value.
Preferably, the food material samples are respectively converted into corresponding food material vectors through a Word2vec model, the vector v_ing_j being
v_ing_j = Σ_{k=1..K} tfidf_{k,j}·word2vec(w_k),
where K is the number of food materials, word2vec(w_k) is the real-valued vector corresponding to w_k obtained through Word2Vec, and tfidf_{k,j} is the tf-idf value of w_k in sample r_j;
and the menu samples are respectively converted into corresponding menu vectors through the Word2vec model, the vector v_dir_j being
v_dir_j = Σ_{t=1..T} tfidf_{t,j}·word2vec(v_t),
where T is the number of menu words in the sample, word2vec(v_t) is the real-valued vector corresponding to v_t obtained through Word2Vec, and tfidf_{t,j} is the tf-idf value of v_t in sample r_j.
Compared with the prior art, the invention has the following beneficial effects:
1. Based on a deep learning method, a generative menu prediction model is designed, removing the dependence of traditional retrieval-based menu prediction models on a dish-name-to-menu comparison data set; that is, even when the menu corresponding to a food image does not exist in the database, the trained model can still generate a reasonable menu text from the image information;
2. Based on the multitask convolutional neural network, the calorie value is predicted directly from the food image without first computing the food volume in the image, which effectively improves the accuracy of calorie prediction; no special shooting equipment is needed, reducing both the complexity of the model and the usage threshold for users;
3. Based on the deep learning method, the corresponding menu text is generated directly from the food image without the user inputting other auxiliary information, reducing the operational complexity for users; given that the food materials and cooking methods of the same dish may differ across regions, the model can learn these differences from food images, achieving higher menu generation accuracy than a retrieval-based menu prediction model.
Drawings
Fig. 1 is a schematic diagram of a recipe generation model according to the present invention.
FIG. 2 is a schematic diagram of a calorie prediction model according to the present invention.
Fig. 3 is a flow chart of the general implementation of the system according to the present invention.
Fig. 4 is an overall framework diagram of the system according to the present invention.
FIG. 5 is a schematic diagram of a multitasking convolutional neural network according to the present invention.
FIG. 6 is a schematic view of a multi-modal attention model according to the present invention.
Fig. 7 is a schematic view of a food material encoder according to the present invention.
Fig. 8 is a schematic diagram of a menu decoder according to the present invention.
FIG. 9 is a schematic diagram of a calorie prediction model of the single-task convolutional neural network according to the present invention.
FIG. 10 is a schematic diagram of a calorie prediction model of the multitask convolutional neural network according to the present invention.
Detailed Description
The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.
The invention provides a food inversion method based on multitask learning and an attention mechanism, which comprises the following steps:
step one, collecting food data and constructing a menu data set;
preferably, in this implementation, a web crawler is used to crawl 227,310 samples containing food images, food categories, contained food materials, corresponding recipes and calorie labels from commercial recipe websites; the three selected recipe websites are:
(1)http://allrecipes.com/;
(2)http://www.lettuceclub.net/recipe/;
(3)http://www.orangepage.net/;
in step one, cross-domain noise data and cross-category noise data are excluded as follows:
1. removing samples whose corresponding image is smaller than 80 KB;
2. removing samples whose recipe contains more than 8 or fewer than 3 steps;
3. removing food categories with fewer than 100 corresponding samples, together with those samples;
the invention standardizes the dish names and food material names of all samples. Since the main targets of the method are a fine-grained food classification task and a food material identification task, overly broad food categories and food material names are deleted or merged: for example, the invention retains the different specific types of pasta and cake in the data set as separate categories, but removes the broader names "pasta" and "cake" from the data set. Likewise, because food material naming is not always consistent, synonymous food material names are unified; for example, alternative names for the food material "tomato" are all replaced by the single term "tomato";
preferably, in the present embodiment, the food material names are unified as shown in table 1;
Table 1. Food material name unification table
Preferably, in the present invention, the data set contains 227,310 samples in total, covering 281 fine-grained food categories, 1,520 food materials and 28,552 menu words; the unit of the calorie value is kilocalories (kcal) per serving;
in this research, images are resized to 256 × 256 and randomly cropped to 224 × 224 when input into a model for training;
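A minimal sketch of this preprocessing in Python with TensorFlow (the framework used later in this description) follows; the function name, decode step and the [0, 1] normalization are illustrative assumptions, not taken from the patent:

```python
import tensorflow as tf

def preprocess(image_bytes, training=True):
    # resize to 256x256, then a random 224x224 crop at training time
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.image.resize(image, [256, 256])
    if training:
        image = tf.image.random_crop(image, size=[224, 224, 3])
    else:
        # a deterministic center crop for evaluation (an assumption)
        image = tf.image.central_crop(image, central_fraction=224 / 256)
    return tf.cast(image, tf.float32) / 255.0
```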
Step two, establishing and training an encoder-decoder model based on an attention mechanism to obtain the food material texts corresponding to the food pictures;
in food material identification, an attention-based encoder-decoder model is adopted, and comprises an image encoder and a food material decoder:
the image encoder is used for encoding the food image into a feature vector; in the present implementation, a 50-layer residual network is preferably selected, and the image encoding dimension generated by the network is 512;
the food material decoder is an encoder-decoder component composed of Transformer modules, built with the Python 3.5.6 programming language on the TensorFlow 2.0.0 framework under Windows 10; it decodes the image feature vectors generated by the image encoder to generate food material texts.
The invention adopts a Transformer structure that takes a food image x^(i) as input; the goal is to generate through the model a sequence of food materials L = (l_0, …, l_k, …, l_K), where l_k represents one food material in the sequence.
The food material decoder consists of 4 Transformer modules and a softmax nonlinear layer, where each module contains 2 attention layers and a linear layer; the first attention layer performs self-attention over the outputs of previous time steps, and the second attention layer adjusts the output of the self-attention;
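As one possible reading of this structure, the following sketch shows a single decoder module in TensorFlow/Keras: self-attention over the already generated tokens, a second attention layer that conditions on the image encoding, and a linear layer. The head count and the residual/normalization placement are assumptions; note also that tf.keras.layers.MultiHeadAttention appeared in releases later than the TensorFlow 2.0.0 named above.

```python
import tensorflow as tf
from tensorflow.keras import layers

class DecoderModule(layers.Layer):
    """One of the 4 decoder modules: 2 attention layers + a linear layer."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.self_attn = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.cross_attn = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.linear = layers.Dense(d_model)
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.norm3 = layers.LayerNormalization()

    def call(self, tokens, image_encoding, causal_mask=None):
        # first attention layer: self-attention over previous outputs
        x = self.norm1(tokens + self.self_attn(tokens, tokens,
                                               attention_mask=causal_mask))
        # second attention layer: adjust using the image encoding
        x = self.norm2(x + self.cross_attn(x, image_encoding))
        return self.norm3(x + self.linear(x))
```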
the invention adopts a list structure to represent the food material text corresponding to a food image, the length of the list is variable, and certain sequence relation exists among list elements.
Next, a dictionary containing N food material elements is defined as D = {d_1, d_2, …, d_N}; by selecting K elements from the dictionary D, a food material list L = (l_0, …, l_K), l_k ∈ D, can be generated.
L is encoded as a K × N dimensional binary matrix L, where L_{i,j} = 1 when d_j is selected and L_{i,j} = 0 otherwise; thus, the training data of the encoder-decoder model comprises M pairs of food images and food material lists {(x^(i), L^(i))}_{i=1..M}.
The optimization goal of the model is:
L̃ = argmax_L p(L | x; θ_I, θ_L),
where L̃ is the target matrix predicted from image x, and θ_I and θ_L are the learnable parameters of the image encoder and the food material decoder, respectively. Since L is a list, p(L̃ | x; θ_I, θ_L) can be factorized into K conditional terms:
p(L̃ | x; θ_I, θ_L) = ∏_{k=0..K} p(l̃_k | x, l̃_{<k}; θ_I, θ_L),
and p(l̃_k | x, l̃_{<k}) is specified as the probability distribution for food material classification;
the image encoder and the food material decoder are trained jointly, and the probability distribution and the network model are adjusted through an Adam optimizer; preferably, in this embodiment, early stopping is used with the validation loss as the monitored index, and training stops if the validation loss does not decrease within 50 rounds;
in this embodiment, the food material identification model is trained with the self-constructed food data set; during training, the invention adopts data enhancement, performing random cropping (crop) and specified rescaling (rescale) on the input sample images; the Adam optimizer (β1 = 0.9, β2 = 0.99, ε = 1e-8) is selected with a learning rate of 0.001, where the learning rate of the pre-trained residual network layers is 0.0001; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; the batch_size of model training is set to 128, and num_workers is set to 4;
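A minimal sketch of this training configuration follows; the two learning rates are shown with two optimizers, which is one common way to realize a per-layer-group rate, though the patent does not state how it was implemented:

```python
import tensorflow as tf

# Adam with the stated hyperparameters; 1e-3 for the decoder,
# 1e-4 for the pre-trained residual layers.
decoder_opt = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9,
                                       beta_2=0.99, epsilon=1e-8)
backbone_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9,
                                        beta_2=0.99, epsilon=1e-8)

# Early stopping with patience 50; the text monitors validation loss here
# and a validation IoU metric elsewhere.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=50,
                                              restore_best_weights=True)
BATCH_SIZE, NUM_WORKERS, MAX_EPOCHS = 128, 4, 200
```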
step three, establishing and training a menu generation model so as to obtain a menu text corresponding to the food image to be detected;
based on the Tensorflow 2.0.0 framework on the Windows 10 system, the Python3.5.6 programming language is used to input the food image and the food material list as the model at the same time for generating the menu text.
The menu decoder takes the food image encoding e_I and the food material encoding e_L as input; the goal is to generate the menu sequence R = (r_1, …, r_t, …, r_T) through the model, where r_t refers to a word in the sequence;
a 512-dimensional food image vector e_I is obtained through the 50-layer residual network encoder of step two; after the food material decoder of step two generates the food material sequence L = (l_0, …, l_k, …, l_K), an embedding layer maps the food material text into a 512-dimensional vector e_L;
The menu decoder consists of 16 Transformer modules and a softmax nonlinear layer, where each module contains 2 attention layers and a linear layer; the first attention layer performs self-attention over the outputs of previous time steps, and the second attention layer adjusts the output of the self-attention;
the model contains two inputs: image feature vector
Figure BDA0002825095470000102
And food material coding
Figure BDA0002825095470000101
Wherein K is the number of food materials, deIs the dimension of the vector; the strategy adopted by the invention is to combine two attention layers to handle the problem of two modality input, one of which accepts the image encoding eIThe other layer receives the food material code eLThe outputs of the two attention layers are combined by a summation method.
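A sketch of this two-modality strategy follows: one attention layer attends over the image encoding e_I, the other over the food material encoding e_L, and the two outputs are combined by summation. Shapes follow the text (d_e = 512, K food materials); the head count and normalization are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TwoModalityAttention(layers.Layer):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.attn_image = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.attn_food = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.norm = layers.LayerNormalization()

    def call(self, x, e_img, e_ing):
        # x: decoder states; e_img: (batch, 1, 512); e_ing: (batch, K, 512)
        out_img = self.attn_image(x, e_img)   # attend over the image encoding
        out_ing = self.attn_food(x, e_ing)    # attend over the food materials
        return self.norm(x + out_img + out_ing)  # combined by summation
```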
In this embodiment, the invention trains the encoder-decoder model with the self-built food data set; during training, data enhancement is adopted, performing random cropping (crop) and specified rescaling (rescale) on the input sample images; the Adam optimizer (β1 = 0.9, β2 = 0.99, ε = 1e-8) is selected with an initial learning rate of 0.001, decayed every ten rounds with a decay factor of 0.99; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; the batch_size of model training is set to 128, and num_workers is set to 4;
Step four, establishing and training a multitask convolutional neural network model so as to obtain the food category and calorie value corresponding to the food image to be detected;
in the invention, food calorie estimation is designed as a regression problem: a food image is input, and the model outputs the corresponding calorie value;
the method assumes that a given food image contains only one food and that the food calorie prediction is given per serving; for the food material prediction task, the food material information is converted into word vectors by Word2Vec and used for training the multitask convolutional neural network model; in addition, for the menu prediction task, the sentence texts of the preparation steps are also converted into vectors for model training.
The multitask convolutional neural network architecture designed by the invention is mainly based on VGG16 and trains the tasks of food calorie prediction, food classification, food material prediction and menu prediction simultaneously; the fully connected layer (fc6) of the network is shared by all tasks, with a transition layer (fc7) branching to each task, so each task has its own transition layer (fc7) and output layer (fc8);
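A minimal sketch of this topology in Keras follows; the output sizes reuse the data set statistics given earlier, and the exact fc8c/fc8d dimensions are assumptions beyond matching the Word2Vec vector sizes:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_multitask_vgg16(n_classes=281, d_ing=500, d_dir=500):
    backbone = tf.keras.applications.VGG16(include_top=False,
                                           input_shape=(224, 224, 3))
    x = layers.Flatten()(backbone.output)
    fc6 = layers.Dense(4096, activation="relu", name="fc6")(x)  # shared by all tasks

    def head(suffix, units, activation=None):
        fc7 = layers.Dense(4096, activation="relu", name="fc7" + suffix)(fc6)
        return layers.Dense(units, activation=activation, name="fc8" + suffix)(fc7)

    outputs = {
        "calorie": head("a", 1),                      # regression, kcal per serving
        "category": head("b", n_classes, "softmax"),  # food classification
        "ingredient": head("c", d_ing),               # food material vector
        "recipe": head("d", d_dir),                   # menu vector
    }
    return tf.keras.Model(backbone.input, outputs)
```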
Let L_cal, L_cat, L_ing, L_dir be the loss functions of the four tasks of calorie prediction, food classification, food material prediction and menu prediction, respectively, and let N_mul be the total number of learning data; the loss function of the entire model is expressed as:
L = (1/N_mul) Σ_{i=1..N_mul} (L_cal + λ_cat·L_cat + λ_ing·L_ing + λ_dir·L_dir);
let L_ab be the absolute loss and L_re the relative loss; then L_cal is defined as:
L_cal = λ_re·L_re + λ_ab·L_ab,
where λ_re, λ_ab, λ_cat, λ_ing and λ_dir are the weight values of the respective loss functions; when the model is trained for the first time, each weight value is initialized to 1; during training, the loss value of each iteration is saved, and finally the reciprocal of the average loss value over all iterations is used as the weight of each task loss function;
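A sketch of this weighting scheme, under the reading that each task's weight becomes the reciprocal of its average recorded loss (all names here are illustrative assumptions):

```python
import numpy as np

TASKS = ["cal", "cat", "ing", "dir"]
weights = {t: 1.0 for t in TASKS}        # initialized to 1 on the first run
loss_history = {t: [] for t in TASKS}

def combined_loss(batch_losses):
    """batch_losses: dict of per-task scalar losses for one iteration."""
    for t in TASKS:
        loss_history[t].append(float(batch_losses[t]))
    return sum(weights[t] * batch_losses[t] for t in TASKS)

def reweight():
    """Reciprocal of the average loss over all iterations becomes the weight."""
    for t in TASKS:
        weights[t] = 1.0 / np.mean(loss_history[t])
```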
The food classification model connects a 4096-dimensional fc7b layer and an fc8b layer after the fc6 fully connected layer of the VGG16 model, each unit of fc8b corresponding to one food category.
Let y_cat(k) be the predicted value of unit k for food image x and g_cat(k) the true value; then L_cat is defined as:
L_cat = −Σ_{k=1..n} g_cat(k)·log y_cat(k),
where g_cat(k) is a binary value: g_cat(k) = 1 when unit k is the correct category, and g_cat(k) = 0 otherwise; n is the number of food categories; in the present embodiment, preferably, when the food classification task contains 20 kinds of food, for example, n is set to 20.
The model for the food calorie prediction task comprises a 4096-dimensional fc7a layer and a one-dimensional fc8a output layer that outputs the predicted calorie value. Since the food calorie is a real value, the task is treated as a regression problem, for which MSE (mean square error) is usually selected as the loss function; the present invention instead defines the loss function of the calorie prediction task as:
L_cal = λ_re·L_re + λ_ab·L_ab,
where L_ab is the absolute error, L_re is the relative error, and λ_re, λ_ab are the weight values of the loss function. The absolute error is the absolute value of the difference between the predicted and actual calorie values; the relative error is the ratio of the absolute error to the actual value. Since both are important indexes, the invention combines L_ab and L_re into L_cal so that the model is trained to reduce both errors simultaneously. Let y_cal be the calorie prediction for image x and g_cal the actual calorie value; then L_ab and L_re are defined as:
L_ab = |y_cal − g_cal|;
L_re = |y_cal − g_cal| / g_cal.
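A direct sketch of this loss in TensorFlow (the weight values default to the initialization of 1 stated above):

```python
import tensorflow as tf

def calorie_loss(y_cal, g_cal, lambda_re=1.0, lambda_ab=1.0):
    l_ab = tf.abs(y_cal - g_cal)   # absolute error
    l_re = l_ab / g_cal            # relative error (ratio to the actual value)
    return lambda_re * l_re + lambda_ab * l_ab
```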
To handle the food material prediction task, this research uses Word2Vec to convert each food material word of a sample into a vector; since each recipe contains several food materials, the weighted sum of all the food material vectors is computed to represent the food material information of the sample, i.e., a linear combination of the Word2Vec vectors of all food materials contained in the food, and this computed vector is used as the representation of the food material information. When food material vectors are used as training data, identifying each individual food material from a food image is difficult; however, the aim here is not to identify the food materials from the image but to predict the calories of the food. The method therefore expects the food material prediction task to improve the accuracy of food classification and food calorie prediction through the multitask convolutional neural network model, obtaining the benefit of synchronous multitask learning, and trains the food material prediction model for this purpose.
The invention uses a Word2vec model pre-trained on a large food corpus, with food data preprocessing such as removing low-frequency words and subsampling high-frequency words; in this embodiment, Word2Vec is trained using the Skip-gram model with negative sampling;
For each sample, only the top N_max food material words by tf-idf value are taken, where N_max is the average number of food material words per sample; finally, the food material vector of a sample is computed from the tf-idf values and the Word2Vec vectors. Let w_k be a food material word in sample r_j; the food material vector v_ing_j corresponding to sample r_j is defined as:
v_ing_j = Σ_{k=1..K} tfidf_{k,j}·word2vec(w_k),
where K is the number of food materials, word2vec(w_k) is the real-valued vector corresponding to w_k obtained through Word2Vec, and tfidf_{k,j} is the tf-idf value of w_k in sample r_j;
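A sketch of this computation follows; word2vec is assumed to behave as a dict-like map from word to a 500-dimensional numpy vector, and tfidf as a per-sample dict of values (both names are illustrative):

```python
import numpy as np

def food_material_vector(words, tfidf, word2vec, n_max=12, dim=500):
    # keep the n_max food material words with the highest tf-idf values
    top = sorted(words, key=lambda w: tfidf.get(w, 0.0), reverse=True)[:n_max]
    v = np.zeros(dim)
    for w in top:
        if w in word2vec:
            v += tfidf[w] * word2vec[w]   # tf-idf weighted Word2Vec sum
    return v
```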
In the invention, the learning of food material information is converted into a food material vector prediction task; the task model comprises a 4096-dimensional fc7c layer and an output layer (fc8c) whose dimension equals the food material vector dimension d_I. Let y_ing(k) be the k-th dimension predicted value of the model and g_ing(k) the k-th dimension actual value; then L_ing is defined as:
L_ing = Σ_k (y_ing(k) − g_ing(k))²;
In addition to food material prediction, this research also uses menu (preparation step) prediction as additional information for multitask learning. As in the food material prediction task, each word in the menu sentence text is converted into a word vector through Word2Vec, and the corresponding menu vector is obtained by weighted summation. When generating the menu vector, only nouns, verbs and adjectives in the menu sentences are used, favoring words with higher tf-idf values. For each food sample, only the top N_max words by tf-idf value in the menu sentences are used; in the experiments, N_max is set to the average number of words contained in the menu text of each sample. Finally, the menu vector of each sample is computed from the tf-idf weights and the Word2Vec vectors. Let v_t be a word of the menu text in sample r_j; the menu vector v_dir_j corresponding to sample r_j is defined as:
v_dir_j = Σ_{t=1..T} tfidf_{t,j}·word2vec(v_t),
where T is the number of menu words in the sample, word2vec(v_t) is the real-valued vector corresponding to v_t obtained through Word2Vec, and tfidf_{t,j} is the tf-idf value of v_t in sample r_j.
The model is trained on menu information through a menu vector prediction task; the menu vector prediction task model consists of a 4096-dimensional fc7d layer and an output layer (fc8d) whose dimension corresponds to the menu vector dimension. Let y_dir(k) be the k-th dimension predicted value of the model and g_dir(k) the k-th dimension target value; then L_dir is defined as:
L_dir = Σ_k (y_dir(k) − g_dir(k))².
This research extends the VGG-16 model to realize the multitask convolutional neural network, using batch normalization in place of dropout at the fc6 and fc7 layers; for the layers other than batch normalization and fc8, the initial parameters are set to the VGG16 parameters pre-trained on ImageNet for the 1000-class classification task. To optimize the CNN parameters, the research uses SGD with a momentum of 0.9 and a mini-batch size of 8.
For testing, the invention saves 10 models at 100-iteration intervals over the last 1,000 training iterations, and the average of the predictions of these models is taken as the final prediction.
The present invention uses 70% of the data in the food data set for training and the remaining 30% for validation and testing; the learning rate is set to 0.001 for 50,000 iterations and then changed to 0.0001 for another 20,000 iterations. To train the model to predict food material vectors and menu vectors, the invention trains Word2Vec on about 8,710,000 cooking-step sentences, with a word vector dimension n of 500. Regarding food materials, only the food material words whose tf-idf values rank in the top 12 within a data set sample are used to create the food material vector. Since N_max = 44, i.e., the average word count of the menu sentences in each sample is 44, only the words whose tf-idf values rank in the top 44 within the menu sentences of each sample are used. Then, to account for time information in a simple way, the menu text is divided into m sentence groups in temporal order, m menu vectors are created, and finally the divided vectors are concatenated.
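A sketch of this temporal split, reusing the weighted-sum idea above (the division into m equal parts is an assumption; the patent only states that the menu text is divided into m sentence groups in temporal order and the vectors concatenated):

```python
import numpy as np

def menu_vector(sentences, tfidf, word2vec, m=4, n_max=44, dim=500):
    # divide the menu sentences into m groups in temporal order
    groups = np.array_split(sentences, m)
    parts = []
    for group in groups:
        words = [w for s in group for w in s.split()]
        top = sorted(words, key=lambda w: tfidf.get(w, 0.0), reverse=True)[:n_max]
        v = np.zeros(dim)
        for w in top:
            if w in word2vec:
                v += tfidf[w] * word2vec[w]   # tf-idf weighted Word2Vec sum
        parts.append(v)
    return np.concatenate(parts)   # m concatenated menu vectors
```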
Examples
The system directly predicts the calorie value of the food from a single food image, requires no other manually input information from the user, imposes no requirements on the shooting angle, and needs no special equipment such as a depth camera, making operation simpler and more convenient for the user.
As shown in fig. 9 and 10, the model adopts a multitask convolutional neural network and learns the four tasks of calorie prediction, food classification, food material prediction and menu prediction simultaneously during training, effectively improving the accuracy of food classification and calorie prediction; the experimental results show that, compared with the calorie prediction model of the single-task convolutional neural network (correlation coefficient 0.7217), the correlation coefficient of the proposed model (0.7679) is improved by 0.0462.
The invention constructs a food data set which comprises 281 fine-grained food categories and 1520 food materials, and can alleviate the problem of limitation of the food and food material categories to a certain extent.
The model takes the food image and the food material text as input simultaneously, effectively improving the accuracy of menu prediction; as shown in Table 2, the experimental results show that the test-set perplexity of the model is about 0.18 lower than that of the menu prediction model with food-material-only input, and about 1.40 lower than that of the menu prediction model with food-image-only input.
Table 2. Test-set perplexity results

Model | Perplexity
Food material single-input model | 8.81
Image single-input model | 9.53
Generative model | 8.65
Unlike traditional query-style recipe generation systems, whose prediction accuracy depends excessively on the data set, this research designs the recipe generation task as a text generation problem.
As shown in Table 3, the experimental results show that, compared with the traditional query-style recipe prediction models, the Intersection-over-Union (IoU) of the model on the food material identification task is improved by more than 10 points, and the F1 score by more than 15 points.
Table 3. IoU and F1 results on the food material identification task

Model | IoU | F1
Image-menu query model | 18.01 | 29.67
Image-food material query model | 17.58 | 27.93
Generative model | 30.59 | 45.12
As shown in Table 4, compared with the traditional query-style recipe prediction models, the precision and recall based on food material information are significantly improved when the model processes the recipe generation task.
Table 4. Precision and recall on food material information

Model | Recall | Precision
Image-menu query model | 29.83 | 27.62
Image-food material query model | 28.75 | 29.16
Generative model | 70.30 | 73.94
While embodiments of the invention have been described above, the invention is not limited to the applications set forth in the description and the embodiments; it is fully applicable in the various fields to which the invention pertains, and further modifications may readily be made by those skilled in the art. The invention is therefore not limited to the details shown and described herein, provided such modifications do not depart from the general concept defined by the appended claims and their equivalents.

Claims (10)

1. A food inversion method based on multitask learning and attention mechanism is characterized by comprising the following steps:
step 1, collecting food data and constructing a menu data set;
step 2, establishing and training a food material text model based on an attention mechanism, and obtaining a corresponding food material text by inputting a food picture;
step 3, establishing and training a menu generation model, and obtaining a menu text corresponding to the food picture by inputting the food picture and the food material text;
step 4, converting the food material texts and the menu texts into corresponding food material vectors and menu vectors respectively, and establishing and training a multitask convolutional neural network model;
and inputting a food picture to be detected in the multitask convolutional neural network model to further obtain the food classification, the calorie value, the food material vector and the menu vector corresponding to the food picture to be detected.
2. The multitask learning and attention mechanism-based food inversion method according to claim 1, wherein in the step 2, the process of establishing the food material text model through the Transformer model comprises the following steps:
taking the feature vector of the food picture as input, and outputting a sequence for generating food materials L = (l_0, …, l_k, …, l_K), where l_k represents one food material in the sequence.
3. The food inversion method based on multitask learning and attention mechanism according to claim 2, wherein in the step 2, the generated food materials corresponding to the food picture are represented by a list structure as follows:
determining a dictionary containing N food material elements as D = {d_1, d_2, …, d_N};
selecting K elements from the dictionary D to generate a food material list L = (l_0, …, l_K), l_k ∈ D;
encoding L as a K × N dimensional binary matrix L, where L_{i,j} = 1 when d_j is selected and L_{i,j} = 0 otherwise;
the training data of the food material text model comprises M pairs of food images and food material lists {(x^(i), L^(i))}_{i=1..M};
the optimization target of the food material text model is
L̃ = argmax_L p(L | x; θ_I, θ_L),
where L̃ is the target matrix predicted from image x, and θ_I and θ_L are the learnable parameters of the image encoder and the food material decoder, respectively;
decomposing p(L̃ | x; θ_I, θ_L) into K conditional terms:
p(L̃ | x; θ_I, θ_L) = ∏_{k=0..K} p(l̃_k | x, l̃_{<k}; θ_I, θ_L),
and specifying p(l̃_k | x, l̃_{<k}) as the probability distribution for food material classification.
4. The food inversion method based on multitask learning and attention mechanism according to claim 3, wherein in the step 2, the food material text model is established through a Transformer model and the data is optimized through an Adam optimizer: setting β1 = 0.9, β2 = 0.99, ε = 1e-8 and a learning rate of 0.001, where the learning rate of the pre-trained residual network layers is 0.0001; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; batch_size is set to 128 and num_workers is set to 4.
5. The multitask learning and attention mechanism-based food inversion method according to claim 1, wherein in the step 3, the process of establishing a menu text model through a Transformer model comprises the following steps:
taking the feature vector of the food picture and the feature vector of the food material text as input, and outputting a sequence for generating a menu R = (r_1, …, r_t, …, r_T), where r_t is a word in the sequence.
6. The food inversion method based on multitask learning and attention mechanism according to claim 3, wherein in the step 3, the menu text model is established through a Transformer model and the data is optimized through an Adam optimizer: β1 = 0.9, β2 = 0.99, ε = 1e-8, with an initial learning rate of 0.001 that decays every ten rounds by a factor of 0.99; the maximum number of training rounds is 200, early stopping is used with a patience of 50, and early stopping is executed if the IoU metric on the validation data does not improve after 50 rounds; batch_size is set to 128 and num_workers is set to 4.
7. The multitask learning and attention mechanism based food inversion method according to claim 1, wherein in the step 4, establishing and training a multitask convolutional neural network model comprises the following steps:
step 4.1, collecting sample data and constructing a training sample set and a verification test sample set;
step 4.2, building a multitask convolutional neural network model,
step 4.3, obtaining the loss function for training the multitask convolutional neural network model:
L = (1/N) Σ_{i=1..N} (L_cal + λ_cat·L_cat + λ_ing·L_ing + λ_dir·L_dir),
where L_cal, L_cat, L_ing, L_dir are respectively the loss functions of the four tasks of calorie prediction, food classification, food material prediction and menu prediction, λ_cat, λ_ing and λ_dir are respectively the weight values of the loss functions of food classification, food material prediction and menu prediction, and N is the total number of learning data;
step 4.4, training the multitask convolutional neural network model:
at the first training, initializing each weight value to 1;
inputting the picture feature vectors in the training sample set into the multitask convolutional neural network model, taking the model outputs as the predicted food classification, calorie value, food material text vector and menu text vector, and computing the loss function between these predictions and the corresponding true values; training stops when the loss is minimized, giving the trained multitask convolutional neural network model;
in the multi-task convolutional neural network model training process, the loss value of each iteration is saved, and finally the reciprocal of the average loss value of all iterations is used as the weight of each task loss function;
and 4.5, testing the prediction accuracy of the trained multitask convolution neural network model on a test data set.
8. The multitask learning and attention mechanism based food inversion method according to claim 7, characterized in that in step 4.2, a VGG16 model is used as a base network model for building the multitask convolutional neural network model.
9. The food inversion method based on multitask learning and attention mechanism according to claim 7, wherein in the step 4.3, the calorie prediction loss function L_cal is
L_cal = λ_re·L_re + λ_ab·L_ab,
where L_ab is the absolute error, L_re is the relative error, and λ_re, λ_ab are the weight values of the calorie prediction loss function;
the loss function L_cat of the food classification is
L_cat = −Σ_{k=1..n} g_cat(k)·log y_cat(k),
where y_cat(k) is the predicted value of unit k for food image x and g_cat(k) is a binary value: g_cat(k) = 1 when unit k is the correct category, and g_cat(k) = 0 otherwise; n is the number of food categories;
the loss function L_ing of the food material prediction is
L_ing = Σ_k (y_ing(k) − g_ing(k))²,
where y_ing(k) is the k-th dimension predicted value of the model and g_ing(k) is the k-th dimension actual value;
the loss function L_dir of the menu vector is
L_dir = Σ_k (y_dir(k) − g_dir(k))²,
where y_dir(k) is the k-th dimension predicted value of the model and g_dir(k) is the k-th dimension target value.
10. The food inversion method based on multitask learning and attention mechanism according to claim 7, wherein the food material samples are respectively converted into corresponding food material vectors v_ing_j through a Word2vec model as
v_ing_j = Σ_{k=1..K} tfidf_{k,j}·word2vec(w_k),
where K is the number of food materials, word2vec(w_k) is the real-valued vector corresponding to w_k obtained through Word2Vec, and tfidf_{k,j} is the tf-idf value of w_k in sample r_j;
and the menu samples are respectively converted into corresponding menu vectors v_dir_j through the Word2vec model as
v_dir_j = Σ_{t=1..T} tfidf_{t,j}·word2vec(v_t),
where T is the number of menu words in the sample, word2vec(v_t) is the real-valued vector corresponding to v_t obtained through Word2Vec, and tfidf_{t,j} is the tf-idf value of v_t in sample r_j.
CN202011426511.1A 2020-12-09 2020-12-09 Food inversion method based on multitask learning and attention mechanism Active CN112488301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011426511.1A CN112488301B (en) 2020-12-09 2020-12-09 Food inversion method based on multitask learning and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011426511.1A CN112488301B (en) 2020-12-09 2020-12-09 Food inversion method based on multitask learning and attention mechanism

Publications (2)

Publication Number Publication Date
CN112488301A true CN112488301A (en) 2021-03-12
CN112488301B CN112488301B (en) 2024-04-16

Family

ID=74940621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011426511.1A Active CN112488301B (en) 2020-12-09 2020-12-09 Food inversion method based on multitask learning and attention mechanism

Country Status (1)

Country Link
CN (1) CN112488301B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113791882A (en) * 2021-08-25 2021-12-14 北京百度网讯科技有限公司 Multitask deployment method and device, electronic equipment and storage medium
CN114898360A (en) * 2022-03-31 2022-08-12 中南林业科技大学 Food material image classification model establishing method based on attention and depth feature fusion
CN115422703A (en) * 2022-07-19 2022-12-02 南京航空航天大学 Surface thermal infrared emissivity inversion method based on MODIS data and Transformer network
CN117933250A (en) * 2024-03-22 2024-04-26 南京泛美利机器人科技有限公司 New menu generation method based on improved generation countermeasure network


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150169972A1 (en) * 2013-12-12 2015-06-18 Aliphcom Character data generation based on transformed imaged data to identify nutrition-related data or other types of data
US20170277863A1 (en) * 2016-03-24 2017-09-28 Anand Subra Real-time or just-in-time online assistance for individuals to help them in achieving personalized health goals
US20170301001A1 (en) * 2016-04-15 2017-10-19 Wal-Mart Stores, Inc. Systems and methods for providing content-based product recommendations
US20180232689A1 (en) * 2017-02-13 2018-08-16 Iceberg Luxembourg S.A.R.L. Computer Vision Based Food System And Method
CN107563439A (en) * 2017-08-31 2018-01-09 湖南麓川信息科技有限公司 A kind of model for identifying cleaning food materials picture and identification food materials class method for distinguishing
CN110059654A (en) * 2019-04-25 2019-07-26 台州智必安科技有限责任公司 A kind of vegetable Automatic-settlement and healthy diet management method based on fine granularity identification
AU2019100969A4 (en) * 2019-08-29 2019-10-03 Hongming Dai Chinese Food Recognition and Search System
CN110659420A (en) * 2019-09-25 2020-01-07 广州西思数字科技有限公司 Personalized catering method based on deep neural network Monte Carlo search tree
CN111276242A (en) * 2020-01-20 2020-06-12 吉林大学 Disease diagnosis and disease state evaluation modeling method for patients in intensive care unit of hospital
CN111429234A (en) * 2020-04-16 2020-07-17 电子科技大学中山学院 Deep learning-based commodity sequence recommendation method
CN111651674A (en) * 2020-06-03 2020-09-11 北京妙医佳健康科技集团有限公司 Bidirectional searching method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIQING MIN et al.: "Ingredient-guided cascade multi-attention network for food recognition", MM '19: Proceedings of the 27th ACM International Conference on Multimedia, 15 October 2019, pages 1331-1339 *
CAI ZHIWEI: "Research on a food recognition model based on multi-task learning and attention mechanism" (基于多任务学习与注意力机制的食品识别模型研究), China Master's Theses Full-text Database, Engineering Science and Technology I, vol. 2022, no. 01, 15 January 2022, pages 024-303 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113791882A (en) * 2021-08-25 2021-12-14 北京百度网讯科技有限公司 Multitask deployment method and device, electronic equipment and storage medium
CN113791882B (en) * 2021-08-25 2023-10-20 北京百度网讯科技有限公司 Multi-task deployment method and device, electronic equipment and storage medium
CN114898360A (en) * 2022-03-31 2022-08-12 中南林业科技大学 Food material image classification model establishing method based on attention and depth feature fusion
CN114898360B (en) * 2022-03-31 2024-04-26 中南林业科技大学 Food material image classification model establishment method based on attention and depth feature fusion
CN115422703A (en) * 2022-07-19 2022-12-02 南京航空航天大学 Surface thermal infrared emissivity inversion method based on MODIS data and Transformer network
CN115422703B (en) * 2022-07-19 2023-09-19 南京航空航天大学 Surface thermal infrared emissivity inversion method based on MODIS data and transducer network
CN117933250A (en) * 2024-03-22 2024-04-26 南京泛美利机器人科技有限公司 New menu generation method based on improved generation countermeasure network

Also Published As

Publication number Publication date
CN112488301B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN110059198B (en) Discrete hash retrieval method of cross-modal data based on similarity maintenance
CN107516110B (en) Medical question-answer semantic clustering method based on integrated convolutional coding
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN109614471B (en) Open type problem automatic generation method based on generation type countermeasure network
CN112488301B (en) Food inversion method based on multitask learning and attention mechanism
CN112232087B (en) Specific aspect emotion analysis method of multi-granularity attention model based on Transformer
CN113435203B (en) Multi-modal named entity recognition method and device and electronic equipment
CN112667818B (en) GCN and multi-granularity attention fused user comment sentiment analysis method and system
CN111291188B (en) Intelligent information extraction method and system
CN111753189A (en) Common characterization learning method for few-sample cross-modal Hash retrieval
CN111737578A (en) Recommendation method and system
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN114743020A (en) Food identification method combining tag semantic embedding and attention fusion
US11915343B2 (en) Color representations for textual phrases
CN111753116A (en) Image retrieval method, device, equipment and readable storage medium
Estevez-Velarde et al. AutoML strategy based on grammatical evolution: A case study about knowledge discovery from text
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Wei et al. Semantic pixel labelling in remote sensing images using a deep convolutional encoder-decoder model
CN111651594B (en) Case item classification method and medium based on key value memory network
CN108804544A (en) Internet video display multi-source data fusion method and device
CN115408551A (en) Medical image-text data mutual detection method, device, equipment and readable storage medium
Guo et al. Matching visual features to hierarchical semantic topics for image paragraph captioning
CN114356990A (en) Base named entity recognition system and method based on transfer learning
Lauren et al. A low-dimensional vector representation for words using an extreme learning machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant