CN114463740A - Food nutrition assessment method and system based on visual analysis - Google Patents
- Publication number
- CN114463740A CN114463740A CN202210023062.9A CN202210023062A CN114463740A CN 114463740 A CN114463740 A CN 114463740A CN 202210023062 A CN202210023062 A CN 202210023062A CN 114463740 A CN114463740 A CN 114463740A
- Authority
- CN
- China
- Prior art keywords
- food
- depth image
- neural network
- convolutional neural
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/24 — Pattern recognition; analysing; classification techniques
- G06N3/045 — Neural networks; architecture; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06T17/20 — Three-dimensional [3D] modelling; finite element generation, e.g. wire-frame surface description, tessellation
- G06T7/62 — Image analysis; analysis of geometric attributes of area, perimeter, diameter or volume
- G16H20/60 — ICT specially adapted for therapies or health-improving plans relating to nutrition control, e.g. diets
- G06T2207/10024 — Image acquisition modality; color image
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20016 — Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
- G06T2207/20212 — Image combination
- G06T2207/20224 — Image subtraction
Abstract
The invention relates to a food nutrition assessment method and system based on visual analysis. The method comprises the following steps: S1: acquiring RGB images and depth images of food before and after eating; S2: acquiring the category and visible area of the food in the RGB image using Mask R-CNN, and marking the food area in the depth image; S3: constructing and training a 3D convolutional neural network, inputting the marked depth image into the trained network, and predicting the opposite-view depth image; S4: registering the depth image and the opposite-view depth image into the same world coordinates to obtain a three-dimensional point cloud of the food; S5: applying the point cloud to each food object and meshing the food objects using a convex-hull-based algorithm to calculate the food volume; S6: calculating the food mass and nutrient amount from the food volume and category, and subtracting the post-meal nutrient amount from the pre-meal amount to obtain the accurate food intake. The method effectively solves the problem of diet assessment in daily life.
Description
Technical Field
The invention relates to the field of computer vision and computer graphics, in particular to a food nutrition assessment method and system based on visual analysis.
Background
A healthy diet and balanced nutrition are key to preventing life-threatening diseases such as obesity, cardiovascular disease and cancer. According to WHO statistics, 39 million children under 5 years of age were overweight or obese in 2020. Fortunately, obesity and many chronic diseases can be prevented through dietary assessment, which can monitor daily food intake and help control eating habits. Thus, dietary assessment has become a focus of widespread attention in computer vision, medicine, nutrition, and health.
In nutritional epidemiology, detailed food information is needed to help the dietician assess the dietary behavior of participants. Dietary intake is traditionally assessed with techniques such as dietary records, 24-hour dietary recall (24HR), and food frequency questionnaires (FFQ). In the dietary-record method, respondents record each food and drink consumed over one or more days. To accomplish this task, each respondent must receive detailed training to adequately describe the foods and amounts consumed, including the food name, preparation method, and the recipe and portion size of mixed dishes. The 24-hour dietary recall is a typical method of measuring daily dietary information: the idea is to list all food intake over 24 hours in a special format. However, it is not always easy for a person to remember the actual food content and amount consumed, and in real life seeing an expert every 24 hours is difficult and in many cases infeasible. The food-frequency method focuses on describing eating patterns or habits rather than caloric intake. It asks the interviewee to report how often they typically consume each food on a list over a specific period. Because information is collected only by frequency, there is little detail about other characteristics of the food eaten, such as the cooking method. A total nutrient intake estimate is obtained by summing, over all foods, the product of the reported frequency and the nutrient content of the designated (or assumed) food, yielding an estimated daily intake of nutrients, dietary components, and food groups. In most cases, the goal of the food-frequency method is a rough estimate of total daily intake over a specified period. These conventional manual recording methods are complicated and cumbersome, and contain many biases and errors.
Therefore, there is a need to develop objective diet assessment techniques to solve the problem of inaccurate and subjective measures.
Recent advances in artificial intelligence (AI), particularly computer vision and machine learning, have paved the way for more powerful automatic dietary assessment. With the widespread use of portable devices (e.g., smartphones) and progress in computer vision, food monitoring applications based on automated food image processing have proliferated. Such applications not only relieve the burden of recording food but also provide an immediate dietary assessment, showing great potential for effective diet monitoring and control. Notably, the two pieces of information required for food nutrition assessment, the food name and the food volume, are both obtained from the food image. Existing methods have made great progress in image recognition, but accurate and convenient estimation of food volume remains a challenge. Food-volume measurement techniques include model-based and stereo-based techniques, among others. However, model-based techniques typically involve varying degrees of user intervention, and stereo-based techniques require the user to take food images from multiple angles, increasing the user's burden. Thus, there is as yet no good way to monitor a user's dietary intake conveniently and in a timely manner.
Disclosure of Invention
In order to solve the technical problems, the invention provides a food nutrition assessment method and system based on visual analysis.
The technical solution of the invention is as follows: a method for food nutrition assessment based on visual analysis, comprising:
step S1: acquiring RGB images and depth images of food before and after eating, wherein the shooting angles of the RGB images and the depth images are kept consistent;
step S2: acquiring the food category and the visual area of the RGB image by using a Mask R-CNN neural network, and marking the corresponding food area in the depth image to obtain a marked depth image;
step S3: constructing and training a 3D convolutional neural network, inputting the marked depth image into the trained 3D convolutional neural network, and predicting the opposite-view depth image of the depth image; wherein the 3D convolutional neural network comprises: an initial layer, an encoder, a fully connected layer, and a decoder;
step S4: registering the depth image and the depth image of the opposite view into the same world coordinate to obtain a complete three-dimensional point cloud of a target object;
step S5: applying the point cloud to each tagged food object, meshing the food objects using a convex hull-based algorithm to calculate a food volume;
step S6: calculating the food mass according to the food volume and the food category, comparing it with a food nutrition table to calculate the food nutrient amounts before and after eating, and subtracting the post-meal amount from the pre-meal amount to obtain the accurate food intake.
Compared with the prior art, the invention has the following advantages:
the invention discloses a food nutrition assessment method based on visual analysis, which can predict depth images of food at opposite visual angles, relieve the common problem of food occlusion in real life, reduce the burden of a user for shooting food images from multiple angles and effectively solve the problem of diet assessment in daily life.
Drawings
FIG. 1 is a flow chart of a method for food nutrition assessment based on visual analysis in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a 3D convolutional neural network structure according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of calculating the volume of food based on the convex hull algorithm according to the embodiment of the present invention;
fig. 4 is a block diagram of a food nutrition evaluation system based on visual analysis according to an embodiment of the present invention.
Detailed Description
The food nutrition assessment method based on visual analysis can predict the depth images of food at opposite visual angles, relieves the common problem of food occlusion in real life, reduces the burden of a user on shooting food images from multiple angles, and effectively solves the problem of diet assessment in daily life.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings.
Example one
As shown in fig. 1, a food nutrition assessment method based on visual analysis according to an embodiment of the present invention includes the following steps:
step S1: acquiring RGB images and depth images of food before and after eating, wherein the shooting angles of the RGB images and the depth images are kept consistent;
step S2: acquiring the food category and the visual area of the RGB image by using a Mask R-CNN neural network, and marking the corresponding food area in the depth image to obtain a marked depth image;
step S3: constructing and training a 3D convolutional neural network, inputting the marked depth image into the trained 3D convolutional neural network, and predicting the opposite-view depth image of the depth image; wherein the 3D convolutional neural network comprises: an initial layer, an encoder, a fully connected layer, and a decoder;
step S4: registering the depth image and the depth image of the opposite view into the same world coordinate to obtain a complete three-dimensional point cloud of the target object;
step S5: applying a point cloud to each marked food object, meshing the food objects using a convex hull-based algorithm to calculate a food volume;
step S6: calculating the food mass according to the food volume and the food category, comparing it with a food nutrition table to calculate the food nutrient amounts before and after eating, and subtracting the post-meal amount from the pre-meal amount to obtain the accurate food intake.
In one embodiment, the step S1: the method comprises the following steps of acquiring RGB images and depth images of food before and after eating, wherein the shooting angles of the RGB images and the depth images are kept consistent, and the method specifically comprises the following steps:
the user selects a portable device with a depth sensor or depth camera to obtain RGB food images and depth images before and after eating. The user can shoot from any angle and does not need to place a reference card (or a reference mark) beside the food, thereby greatly reducing the burden of the user. In the process, the shooting angles of the RGB image and the depth image are required to be kept consistent.
In one embodiment, the step S2: acquiring the food category and the visual area of the RGB image by using a Mask R-CNN neural network, marking the corresponding food area in the depth image to obtain a marked depth image, which specifically comprises the following steps:
This step involves food recognition and food segmentation, so after the food image is acquired, a visual analysis of the food is performed first. The key to visual analysis is obtaining a compact and expressive feature representation. The embodiment of the invention adopts the general object instance segmentation framework Mask R-CNN to handle food segmentation and recognition simultaneously. Mask R-CNN is an extension of Faster R-CNN that can effectively detect objects in an image while generating a high-quality segmentation mask for each instance. Thereby, the food category and food visible-area information can be acquired.
Firstly, the RGB image is input into the ROI Align layer of Mask R-CNN, which maps each ROI to its corresponding position, rounds to the set output size, divides the original image region into different parts (sections), and passes the resulting sections through a max-pooling layer. In the embodiment of the invention, the ROI Align layer uses a 54-layer standard residual network (ResNet) as the feature extractor for the section features. The extracted features of the different sections are fed sequentially into two convolutional layers, with 3×3 and 1×1 kernels respectively, to extract section features again; a branch is inserted at the 3×3 convolutional layer to compute the classification boxes, and a feature pyramid network (FPN) is connected after the second convolutional layer as the structure for predicting the segmentation mask. The food category and food visible-area information are finally obtained.
In addition, the corresponding food area in the depth image needs to be marked to obtain a marked depth image.
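The marking described above can be sketched in a few lines of numpy. This is an illustrative sketch rather than the patented implementation; the function name and the toy data are hypothetical:

```python
import numpy as np

def mark_food_regions(depth, masks):
    """Transfer per-instance food masks (such as those produced by a
    Mask R-CNN) onto the aligned depth image: pixels outside every mask
    are zeroed, and an instance map records which food object (1-based
    index) each remaining pixel belongs to."""
    marked = np.zeros_like(depth)
    instance_map = np.zeros(depth.shape, dtype=np.int32)  # 0 = background
    for idx, mask in enumerate(masks, start=1):
        marked[mask] = depth[mask]
        instance_map[mask] = idx
    return marked, instance_map

# toy 4x4 depth map with a single 2x2 "food" region
depth = np.arange(16, dtype=float).reshape(4, 4)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
marked, instance_map = mark_food_regions(depth, [mask])
```

Because the RGB and depth images are taken from the same angle (step S1), the mask can be applied to the depth image directly, without any cross-modal registration.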
Since the food image is photographed from a single angle, food occlusion inevitably occurs: only the front of the food is visible, and the shape of its back cannot be observed. To solve this problem, the embodiment of the present invention designs a 3D convolutional neural network for predicting the depth image of the back of the food.
As shown in the schematic structural diagram of the 3D convolutional neural network of fig. 2, in one embodiment, the step S3: constructing and training a 3D convolutional neural network, inputting the marked depth image into the trained 3D convolutional neural network, and predicting to obtain a view depth image of the depth image; wherein, 3D convolution neural network includes: initial layer, encoder, full connection layer and decoder specifically include:
step S31: constructing a 3D convolutional neural network, comprising:
initial layer: formed by connecting convolutional layers with different kernel sizes; in the embodiment of the invention, 3 initial layers are used to process the input image, and each initial layer is formed by connecting 4 convolutional layers with different kernel sizes. The initial layers are intended to handle food objects acquired at different distances: since food objects may vary in size, initial layers with convolution kernels of different sizes capture the details of the food object more conveniently and efficiently.
encoder: composed of several convolutional layers with different kernels; in the embodiment of the invention, 1 convolutional layer with 3×3 kernels and 3 convolutional layers with 2×2 kernels are connected together as the encoder;
fully connected layers: features are shared through the fully connected layers and feature dimensions are aligned; the embodiment of the invention shares the vector features output by the encoder through 2 fully connected layers and aligns the feature dimensions;
decoder: formed by inverting the encoder and appending several convolutional layers; as shown in fig. 2, the decoder is formed by inverting the encoder and adding 2 convolutional layers with 3×3 kernels and 1 with a 5×5 kernel.
The 3D convolutional neural network of the embodiment of the present invention takes as input a depth image of size 480 × 640 and outputs the corresponding 480 × 640 opposite-view depth image, i.e. the depth image as it would be captured from the opposite perspective.
Step S32: training the 3D convolutional neural network with a public dataset of two-dimensional views and corresponding three-dimensional models to obtain a trained 3D convolutional neural network; the 3D convolutional neural network outputs the depth image of the occluded opposite side of the input image, namely the opposite-view depth image; the loss function of the 3D convolutional neural network is defined as shown in equation (1):

L = (1 / (w · h)) · Σ_u Σ_v (d'(u, v) − d(u, v))² + λ · b²  (1)

where d'(u, v) is a pixel value of the predicted opposite-view depth image, d(u, v) is the corresponding pixel value of the ground-truth depth image, w and h respectively represent the width and height of the image, λ weights the regularization term in the loss function, and b is an offset;
the embodiment of the invention trains the 3D convolutional neural network constructed in the step S31 by using the disclosed two-dimensional view and the corresponding three-dimensional model data set, and simultaneously utilizes the loss function of the formula (1) to control the convergence speed of the network, thereby finally obtaining the trained 3D convolutional neural network.
Step S33: and inputting the marked depth image into a trained 3D convolutional neural network to generate a depth image of the opposite view of the depth image.
The marked depth image obtained in step S2 is input into the trained 3D convolutional neural network to generate the opposite-view depth image of the depth image.
After the depth image of the food and its opposite-view depth image are obtained, the corresponding three-dimensional point cloud can be calculated. Existing volume-estimation methods obtain an extrinsic calibration matrix through a reference marker, fuse the synthesized points with the initial points, and obtain a complete point cloud through a transformation matrix. The embodiment of the invention provides a method for reconstructing the three-dimensional point cloud without a reference marker, so no additional reference marker needs to be placed.
In one embodiment, the step S4: registering the depth image and the depth image of the opposite view into the same world coordinate to obtain a complete three-dimensional point cloud of the target object, which specifically comprises the following steps:
step S41: moving the origin of the world coordinates to the center of the camera that captured the depth image, and re-projecting the depth image into world coordinates by the following equation (2):

z · [u, v, 1]ᵀ = K · [X, Y, Z]ᵀ,  K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]  (2)

where u, v represent coordinates in the depth image and X, Y, Z represent coordinates in world coordinates; z is a scalar, representing depthmap(u, v); K is the camera matrix, in which f_x, f_y are the focal-length parameters and c_x, c_y are the principal-point offsets, i.e. the position of the principal point relative to the image (projection) plane;
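Equation (2) amounts to the standard pinhole back-projection; a minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into 3D coordinates per equation (2):
    X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z,
    where u indexes columns and v indexes rows."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with valid depth
```

Pixels with zero depth (no sensor return, or masked-out background from step S2) are dropped, so only the marked food surface contributes points.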
step S42: the 180-degree camera rotation and the translation are performed through the rotation and translation matrices respectively; the rotation about the y-axis can be written as the following equation (3):

R_y(θ) = [[cos θ, 0, sin θ], [0, 1, 0], [−sin θ, 0, cos θ]]  (3)

where θ is the rotation angle of the camera about the y-axis (here θ = 180°);
step S43: the food point cloud is synthesized by registering to the same world coordinates through the following equation (4):

[X′, Y′, Z′]ᵀ = R · [X, Y, Z]ᵀ + T  (4)

where R is the rotation matrix and T is the translation vector.
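Steps S42 and S43 can be sketched as follows: the opposite-view points are rotated 180° about the y-axis and merged with the front-view cloud. The translation is left as a free parameter here, standing in for whatever offset the registration determines:

```python
import numpy as np

def rotation_y(theta):
    """Rotation matrix about the y-axis, as in equation (3)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def register_opposite_view(front_pts, back_pts, translation):
    """Apply equation (4): rotate the opposite-view points 180 degrees
    about the y-axis, translate them into the front view's world frame,
    and concatenate the two clouds into one food point cloud."""
    R = rotation_y(np.pi)
    back_world = back_pts @ R.T + translation
    return np.vstack([front_pts, back_world])
```

At θ = 180° the rotation maps (X, Y, Z) to (−X, Y, −Z), which is why a point seen from the back lands on the far side of the front-view cloud.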
In one embodiment, the step S5: applying a point cloud to each marked food object, meshing the food objects using a convex hull-based algorithm to calculate a food volume, comprising:
step S51: layering food point clouds from bottom to top according to preset equal intervals, and storing each layer of point cloud as an independent unit;
in the embodiment of the invention, the layering equal interval is set to be 5 cm.
Step S52: performing z-axis projection on the layered food point clouds, and then performing outer convex hull construction on each layer of food point cloud by using a convex hull algorithm to obtain convex hull outlines;
step S53: setting a side-length threshold L_limit; for each edge of the convex-hull contour longer than L_limit, constructing a circle with that edge as its diameter and selecting the points inside the circle as candidate boundary points; among the candidates, the point forming the maximum angle with the two endpoints of the diameter is taken as a new boundary point, thereby shrinking the food boundary and eliminating gaps;
step S54: repeating the steps S52-S53 until all the side lengths are less than the threshold value, and stopping iterative computation;
step S55: the volume of each layer is calculated and the volume of all layers is added to obtain the volume of the food.
As shown in fig. 3, a schematic flow chart for calculating the volume of food based on the convex hull algorithm is shown.
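The layered volume computation of steps S51, S52, and S55 can be approximated with SciPy's convex hull. This simplified sketch skips the boundary-shrinking refinement of steps S53 and S54, so it somewhat overestimates the volume of concave shapes:

```python
import numpy as np
from scipy.spatial import ConvexHull

def layered_hull_volume(points, layer_height=0.05):
    """Slice a food point cloud into horizontal layers (5 cm by default,
    matching the embodiment), project each layer onto the xy-plane, and
    sum (2D convex-hull area) x (layer height) over all layers."""
    z = points[:, 2]
    total = 0.0
    for z0 in np.arange(z.min(), z.max(), layer_height):
        layer = points[(z >= z0) & (z < z0 + layer_height)]
        if len(layer) < 3:
            continue  # too few points to form a 2D hull
        hull = ConvexHull(layer[:, :2])  # in 2D, .volume is the area
        total += hull.volume * layer_height
    return total
```

For convex prisms the approximation is exact up to the slicing resolution, which is why the refinement of steps S53 and S54 matters mainly for concave foods.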
In one embodiment, the step S6: calculating the food mass according to the food volume and the food category, comparing it with a food nutrition table to calculate the food nutrient amounts before and after eating, and subtracting the two to obtain the accurate food intake, specifically comprises:
From the food type obtained in step S2 and the food volume obtained in step S5, the food mass M of the food item can be calculated according to formula (5):
M=ρV (5)
where V is the food volume and ρ is the food density, obtainable from a food density table;
According to the food mass M, the nutritional information of the food can be determined from a food nutrition table, as shown in the food nutrition table example of Table 1 below; the food nutrient amount N is calculated using the following formula (6):

N = N_T · M / M_T  (6)

where N_T and M_T are respectively the nutrient amount and the food mass looked up from the table.
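Equations (5) and (6) combine into a straightforward lookup-and-scale computation. The density and nutrition tables below are hypothetical placeholder values for illustration, not figures taken from the patent:

```python
# hypothetical lookup tables: density in g/cm^3, nutrition per tabulated mass
FOOD_DENSITY = {"rice": 0.80, "apple": 0.75}
FOOD_NUTRITION = {"rice": {"mass_g": 100.0, "kcal": 130.0},
                  "apple": {"mass_g": 100.0, "kcal": 52.0}}

def food_mass(volume_cm3, category):
    """Equation (5): M = rho * V."""
    return FOOD_DENSITY[category] * volume_cm3

def nutrient_amount(mass_g, category, nutrient="kcal"):
    """Equation (6): N = N_T * M / M_T, scaling the tabulated nutrient
    amount N_T (given for the tabulated mass M_T) to the measured mass M."""
    row = FOOD_NUTRITION[category]
    return row[nutrient] * mass_g / row["mass_g"]

def intake(volume_before, volume_after, category, nutrient="kcal"):
    """Pre-meal minus post-meal nutrient amount = actual intake."""
    before = nutrient_amount(food_mass(volume_before, category), category, nutrient)
    after = nutrient_amount(food_mass(volume_after, category), category, nutrient)
    return before - after
```

Running the pipeline twice, once on the pre-meal images and once on the post-meal images, and differencing the results is what turns a static nutrition estimate into an intake measurement.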
Table 1: food nutrition table example
After the pre-meal and post-meal food nutrient amounts are calculated, the user's actual food intake is obtained by subtracting the post-meal amount from the pre-meal amount.
The food nutrition assessment method based on visual analysis can predict the depth images of food at opposite visual angles, relieves the common problem of food occlusion in real life, reduces the burden of a user on shooting food images from multiple angles, and effectively solves the problem of diet assessment in daily life.
Example two
As shown in fig. 4, an embodiment of the present invention provides a food nutrition evaluation system based on visual analysis, including the following modules:
an RGB image and depth image obtaining module 71, configured to obtain RGB images and depth images of food before and after eating, where shooting angles of the RGB images and the depth images are kept consistent;
a food category and visual area obtaining module 72, configured to obtain a food category and a visual area of the RGB image by using a Mask R-CNN neural network, and mark a corresponding food area in the depth image to obtain a marked depth image;
an opposite-view depth image prediction module 73, configured to construct and train a 3D convolutional neural network, input the marked depth image into the trained 3D convolutional neural network, and predict the opposite-view depth image of the depth image; wherein the 3D convolutional neural network comprises: an initial layer, an encoder, a fully connected layer, and a decoder;
a food three-dimensional point cloud obtaining module 74, configured to register the depth image and the depth image of the opposite view into the same world coordinate, so as to obtain a complete three-dimensional point cloud of a target object;
a food volume calculation module 75, configured to apply the point cloud to each marked food object and mesh the food objects using a convex-hull-based algorithm to calculate the food volume;
and a food intake calculation module 76, configured to calculate the food mass according to the food volume and the food category, compare it with a food nutrition table to calculate the food nutrient amounts before and after eating, and subtract the two to obtain the accurate food intake.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.
Claims (5)
1. A method for food nutrition assessment based on visual analysis, comprising:
step S1: acquiring RGB images and depth images of food before and after eating, wherein the shooting angles of the RGB images and the depth images are kept consistent;
step S2: acquiring the food category and the visual area of the RGB image by using a Mask R-CNN neural network, and marking the corresponding food area in the depth image to obtain a marked depth image;
step S3: constructing and training a 3D (three-dimensional) convolutional neural network, inputting the marked depth image into the trained 3D convolutional neural network, and predicting the opposite-view depth image of the depth image; wherein the 3D convolutional neural network comprises: an initial layer, an encoder, a fully connected layer, and a decoder;
step S4: registering the depth image and the depth image of the opposite view into the same world coordinate to obtain a complete three-dimensional point cloud of a target object;
step S5: applying the point cloud to each tagged food object, meshing the food objects using a convex hull-based algorithm to calculate a food volume;
step S6: and calculating the food mass according to the food volume and the food category, comparing the food mass with a food nutrition table, calculating the food nutrition amount before and after eating, and subtracting the food nutrition amount before and after eating to obtain the accurate food intake.
2. The visual analysis-based food nutrition assessment method according to claim 1, wherein said step S3: constructing and training a 3D convolutional neural network, inputting the marked depth image into the trained 3D convolutional neural network, and predicting the opposite-view depth image of the depth image; wherein the 3D convolutional neural network comprises: an initial layer, an encoder, a fully connected layer, and a decoder, specifically comprises:
step S31: constructing a 3D convolutional neural network, comprising:
initial layer: formed by connecting convolutional layers with different kernel sizes;
encoder: composed of several convolutional layers with different kernels;
fully connected layers: features are shared through the fully connected layers and feature dimensions are aligned;
decoder: formed by inverting the encoder and appending several convolutional layers;
step S32: training the 3D convolutional neural network with a public dataset of two-dimensional views and corresponding three-dimensional models to obtain a trained 3D convolutional neural network; the 3D convolutional neural network outputs the depth image of the occluded opposite side of the input image, namely the opposite-view depth image; the loss function of the 3D convolutional neural network is defined as shown in equation (1):

L = (1 / (w · h)) · Σ_u Σ_v (d'(u, v) − d(u, v))² + λ · b²  (1)

where d'(u, v) is a pixel value of the predicted opposite-view depth image, d(u, v) is the corresponding pixel value of the ground-truth depth image, w and h represent the width and height of the image respectively, λ weights the regularization term in the loss function, and b is an offset;
step S33: inputting the marked depth image into the trained 3D convolutional neural network to generate the opposite-view depth image of the depth image.
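As an illustration of the loss defined in step S32, a minimal NumPy sketch follows. The function name `depth_prediction_loss` and the default values of λ and b are hypothetical, and the squared-error form is an assumption consistent with the symbols listed in the claim (pixel-wise error averaged over the w × h image plus a regularization term on the offset b):

```python
import numpy as np

def depth_prediction_loss(d_pred, d_true, lam=0.01, b=0.0):
    """Mean squared error between the predicted opposite-view depth image
    and the reference depth image, plus a simple regularization term.

    d_pred, d_true: (h, w) depth maps; lam: regularization weight; b: offset.
    """
    h, w = d_true.shape
    data_term = np.sum((d_pred - d_true) ** 2) / (w * h)
    reg_term = lam * b ** 2
    return data_term + reg_term
```

In practice this scalar would be minimized by back-propagation through the encoder-decoder network of step S31.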
3. The visual analysis-based food nutrition assessment method according to claim 1, wherein said step S4: registering the depth image and the opposite-view depth image into the same world coordinate system to obtain a complete three-dimensional point cloud of the target object, specifically comprises:
step S41: moving the origin of the world coordinate system to the center of the camera that captured the depth image, and re-projecting the depth image into world coordinates according to equation (2):
Z·[u, v, 1]ᵀ = K·[X, Y, Z]ᵀ, where K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]  (2)
wherein u, v denote the coordinates in the depth image and X, Y, Z denote the coordinates in world coordinates; Z is a scalar representing depthmap(u, v); K is the camera matrix, wherein fx, fy denote the focal length parameters and cx, cy are the principal point offsets, i.e., the position of the principal point relative to the image plane (projection plane);
step S42: performing a 180-degree camera rotation and a translation through the rotation and translation matrices; for a rotation of angle θ about the y-axis the rotation matrix is Ry(θ) = [[cos θ, 0, sin θ], [0, 1, 0], [−sin θ, 0, cos θ]], which for θ = 180° can be simplified to equation (3):
Ry(180°) = [[−1, 0, 0], [0, 1, 0], [0, 0, −1]]  (3)
wherein θ is the angle through which the camera is rotated about the y-axis;
step S43: registering the two views into the same world coordinates through equation (4) to synthesize the food point cloud:
P = Pfront ∪ (R·Pback + t)  (4)
wherein Pfront and Pback denote the point clouds re-projected from the depth image and the opposite-view depth image respectively, R is the rotation matrix and t is the translation vector.
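The re-projection of step S41 and the registration of steps S42-S43 can be sketched as follows. The function names are hypothetical, and the sketch assumes the standard pinhole camera model with intrinsics fx, fy, cx, cy as described in the claim:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Re-project a depth map into 3D coordinates (pinhole model):
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth(u, v)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def rotate_y(points, theta):
    """Rotate an (n, 3) point array about the y-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return points @ R.T

def merge_point_clouds(front, back, theta=np.pi, t=np.zeros(3)):
    """Register the opposite-view cloud into the front view's world frame
    (180-degree rotation about y plus translation t) and concatenate."""
    return np.vstack([front, rotate_y(back, theta) + t])
```

With theta = π the rotation reduces to (X, Y, Z) → (−X, Y, −Z), matching the simplified matrix above.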
4. The visual analysis-based food nutrition assessment method according to claim 1, wherein said step S5: applying the point cloud to each marked food object and meshing the food objects using a convex-hull-based algorithm to calculate the food volume, specifically comprises:
step S51: layering the food point clouds from bottom to top according to preset equal intervals, and storing each layer of point cloud as an independent unit;
step S52: performing z-axis projection on the layered food point clouds, and then performing outer convex hull construction on each layer of the food point clouds by using a convex hull algorithm to obtain convex hull outlines;
step S53: setting a side length threshold Llimit; for each edge of the convex hull contour whose length exceeds the threshold Llimit, taking that edge as the diameter of a circle and selecting the points inside the circle as suspected boundary points; among the suspected boundary points, finding the point that forms the maximum angle with the endpoints of the diameter as a new boundary point, thereby shrinking the food boundary and eliminating gaps;
step S54: repeating steps S52-S53 until all edge lengths are less than the threshold, then stopping the iterative computation;
step S55: calculating the volume of each layer and summing the volumes of all layers to obtain the food volume.
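Steps S51-S55 can be approximated by the self-contained sketch below (function names hypothetical). It omits the boundary-shrinking refinement of steps S53-S54 and simply slices the cloud into layers, projects each layer onto the x-y plane, and sums convex-hull cross-section area times layer thickness:

```python
import numpy as np

def convex_hull_2d(points):
    """Andrew's monotone chain; points: (n, 2) array -> hull vertices."""
    pts = sorted(map(tuple, points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(hull):
    """Shoelace formula for the area of a simple polygon."""
    a = 0.0
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        a += x1 * y2 - x2 * y1
    return abs(a) / 2.0

def layered_volume(cloud, dz):
    """Slice the (n, 3) cloud into horizontal layers of thickness dz
    (step S51), project each layer onto x-y and take its convex hull
    (step S52), then sum area * dz over all layers (step S55)."""
    z = cloud[:, 2]
    vol = 0.0
    for z0 in np.arange(z.min(), z.max() + 1e-9, dz):
        layer = cloud[(z >= z0) & (z < z0 + dz)]
        if len(layer) >= 3:
            vol += polygon_area(convex_hull_2d(layer[:, :2])) * dz
    return vol
```

A production implementation would add the S53 refinement so that concave food boundaries are not overestimated by the convex hull.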
5. A food nutrition assessment system based on visual analysis, comprising the following modules:
the RGB image and depth image acquisition module is used for acquiring RGB images and depth images of food before and after eating, wherein the shooting angles of the RGB images and the depth images are kept consistent;
the food category and visual region acquisition module is used for obtaining the food category and visible region of the RGB image using a Mask R-CNN neural network, and marking the corresponding food region in the depth image to obtain a marked depth image;
the opposite-view depth image prediction module is used for constructing and training a 3D convolutional neural network, inputting the marked depth image into the trained 3D convolutional neural network, and predicting the opposite-view depth image of the depth image, wherein the 3D convolutional neural network comprises an initial layer, an encoder, fully-connected layers and a decoder;
the food three-dimensional point cloud acquisition module is used for registering the depth image and the opposite-view depth image into the same world coordinate system to obtain a complete three-dimensional point cloud of the target object;
the food volume calculation module is used for applying the point cloud to each marked food object and meshing the food objects using a convex-hull-based algorithm to calculate the food volume;
and the food intake calculation module is used for calculating the food mass according to the food volume and the food category, comparing the result with a food nutrition table to obtain the food nutrition amounts before and after eating, and subtracting the post-meal amount from the pre-meal amount to obtain the accurate food intake.
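The food intake calculation module (and step S6 of claim 1) can be illustrated by the sketch below. The nutrition values, density table, and function name are hypothetical placeholders; the sketch assumes mass = volume × density and a per-100-g nutrition table:

```python
# Hypothetical nutrition table (per 100 g) and densities (g/cm^3); real values
# would come from the food nutrition table referenced in the claims.
NUTRITION_PER_100G = {"rice": {"kcal": 130.0, "protein_g": 2.7},
                      "chicken": {"kcal": 165.0, "protein_g": 31.0}}
DENSITY_G_PER_CM3 = {"rice": 0.85, "chicken": 1.05}

def food_intake(category, volume_before_cm3, volume_after_cm3):
    """Convert volumes to masses, look up per-100-g nutrition, and return
    the intake as the pre-meal amount minus the post-meal amount."""
    density = DENSITY_G_PER_CM3[category]
    per100 = NUTRITION_PER_100G[category]

    def nutrients(volume_cm3):
        mass_g = volume_cm3 * density
        return {k: v * mass_g / 100.0 for k, v in per100.items()}

    before = nutrients(volume_before_cm3)
    after = nutrients(volume_after_cm3)
    return {k: before[k] - after[k] for k in per100}
```

For example, a plate of rice measured at 200 cm³ before eating and 100 cm³ after corresponds to an intake of 85 g of rice, whose calories and protein are read off the table proportionally.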
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210023062.9A CN114463740A (en) | 2022-01-10 | 2022-01-10 | Food nutrition assessment method and system based on visual analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114463740A true CN114463740A (en) | 2022-05-10 |
Family
ID=81408753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210023062.9A Pending CN114463740A (en) | 2022-01-10 | 2022-01-10 | Food nutrition assessment method and system based on visual analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114463740A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117078955A (en) * | 2023-08-22 | 2023-11-17 | 海啸能量实业有限公司 | Health management method based on image recognition |
CN117078955B (en) * | 2023-08-22 | 2024-05-17 | 海口晓建科技有限公司 | Health management method based on image recognition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10657709B2 (en) | Generation of body models and measurements | |
WO2021057810A1 (en) | Data processing method, data training method, data identifying method and device, and storage medium | |
Lo et al. | Point2volume: A vision-based dietary assessment approach using view synthesis | |
Xu et al. | Image-based food volume estimation | |
CN108334899A (en) | Quantify the bone age assessment method of information integration based on hand bone X-ray bone and joint | |
CA2949844C (en) | System and method for identifying, analyzing, and reporting on players in a game from video | |
Fang et al. | Single-view food portion estimation based on geometric models | |
US20210174505A1 (en) | Method and system for imaging and analysis of anatomical features | |
CN113065558A (en) | Lightweight small target detection method combined with attention mechanism | |
WO2020103417A1 (en) | Bmi evaluation method and device, and computer readable storage medium | |
Wang et al. | Multi-level nested pyramid network for mass segmentation in mammograms | |
CN108597582B (en) | Method and device for executing fast R-CNN neural network operation | |
Chen et al. | Model-based measurement of food portion size for image-based dietary assessment using 3D/2D registration | |
CN104598871B (en) | A kind of facial age computational methods based on correlation regression | |
CN104778374A (en) | Automatic dietary estimation device based on image processing and recognizing method | |
CN101976444A (en) | Pixel type based objective assessment method of image quality by utilizing structural similarity | |
CN110516716A (en) | Non-reference picture quality appraisement method based on multiple-limb similarity network | |
CN112016497A (en) | Single-view Taijiquan action analysis and assessment system based on artificial intelligence | |
CN108898269A (en) | Electric power image-context impact evaluation method based on measurement | |
Sun et al. | Image adaptation and dynamic browsing based on two-layer saliency combination | |
CN114463740A (en) | Food nutrition assessment method and system based on visual analysis | |
CN111968135A (en) | Three-dimensional abdomen CT image multi-organ registration method based on full convolution network | |
Konstantakopoulos et al. | An automated image-based dietary assessment system for mediterranean foods | |
CN114565976A (en) | Training intelligent test method and device | |
Shermila et al. | Estimation of protein from the images of health drink powders |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||