CN114463346A - Complex environment rapid tongue segmentation device based on mobile terminal - Google Patents

Complex environment rapid tongue segmentation device based on mobile terminal

Info

Publication number
CN114463346A
CN114463346A (application CN202111583094.6A)
Authority
CN
China
Prior art keywords
tongue
net
segmentation
module
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111583094.6A
Other languages
Chinese (zh)
Inventor
黄宗海
温川飙
宋海贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Traditional Chinese Medicine
Original Assignee
Chengdu University of Traditional Chinese Medicine
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Traditional Chinese Medicine filed Critical Chengdu University of Traditional Chinese Medicine
Priority to CN202111583094.6A priority Critical patent/CN114463346A/en
Publication of CN114463346A publication Critical patent/CN114463346A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of medical image processing, and in particular to a device for rapid tongue segmentation in complex environments based on a mobile terminal, which performs the following steps: S1, acquiring a test image containing a tongue body; S2, inputting the test image into a trained OET-NET model and outputting the tongue image segmented from the test image. The OET-NET model has the following architecture: the skip connections in the U-Net model are replaced with residual soft connection modules, a saliency map fusion module is added after each level of the U-Net model, and the loss function is replaced with a pixel-weighted cross-entropy loss function. The lightweight OET-NET model can be readily deployed on a variety of mobile devices and offers good extensibility for subsequent research on mobile intelligent tongue diagnosis.

Description

Complex environment rapid tongue segmentation device based on mobile terminal
Technical Field
The invention relates to the field of medical image processing, in particular to a complex environment rapid tongue segmentation device based on a mobile terminal.
Background
Deploying intelligent medical algorithm models on mobile terminal devices allows users to conveniently monitor their health information in real time. In mobile intelligent tongue diagnosis, the tongue pictures uploaded by users are collected with different devices, so their resolutions differ and the illumination and environmental conditions vary. Users often cannot provide high-quality tongue pictures, and tongue-like objects mixed into the picture reduce segmentation accuracy. At the same time, a mobile intelligent tongue diagnosis system must occupy little storage space and incur little time overhead, so in addition to segmentation accuracy, the inference time and the total number of model parameters must also be considered. Fig. 1 shows six tongue segmentation cases in which accurate segmentation is difficult to achieve: (a) tongue-like image information present in the environment; (b) a tongue with special coating color and coating texture; (c) an overexposed tongue image; (d) a dimly lit tongue image; (e) a tongue occupying too large a proportion of the image; (f) a tongue occupying too small a proportion of the image. These situations cannot be ignored when tongue images are acquired with mobile terminal devices and require targeted processing.
Current tongue segmentation approaches fall into four categories: threshold segmentation, template matching, region segmentation, and neural networks. Threshold segmentation first defines the luminance of a color image, then binarizes the image with the Otsu method, removes non-target regions, and finally obtains the final segmentation result with mathematical morphology. This method imposes strict requirements on the environment, and it is difficult to acquire qualifying images with mobile terminal devices.
Template matching is the most common method for tongue segmentation on mobile terminal devices; Li et al. proposed tongue segmentation using template matching at the 2009 Industrial Electronics Society conference, but the approach has certain drawbacks. It requires initial positioning of the tongue, and the user must place the tongue entirely within a preset template. The method alleviates, to some extent, the problems of fixed-distance tongue acquisition and interference from tongue-like objects in complex environments, but it requires dynamic optimization of an energy function to approximate the true contour of the target. During optimization the fitted curve is complex, so the computational cost becomes excessive; even when the curve complexity is reduced by converting to different color spaces, the requirement of rapid tongue segmentation is hard to meet. In addition, the accuracy of template matching depends on the initial template, and a poor template easily prevents the algorithm from converging.
Region segmentation algorithms are often combined with template matching: region segmentation sets the initial template, and template matching then optimizes an energy function so that the curve approaches the target. The method is overly sensitive to noise and easily loses important contours in low-contrast images, causing over-segmentation. Because of environmental and device factors, images acquired by mobile terminals cannot satisfy the requirements of region segmentation algorithms well.
Neural networks better overcome the influence of environmental factors on tongue segmentation by learning features at different scales. Early on, Li incorporated transfer learning into a trained algorithm to improve tongue segmentation accuracy. More recently, neural-network tongue segmentation has mostly improved fixed models to obtain higher segmentation accuracy. Zhou et al. obtained an end-to-end model using dilated (atrous) convolutions and achieved good tongue segmentation results. Huang et al. combined an encoder-decoder with a fully convolutional neural network to achieve accurate tongue segmentation. Zhou et al. used a modified U-Net model for morphological tongue segmentation. In recent years, improved segmentation methods based on ResNet and DeepLab v3 have also achieved good results in tongue segmentation. However, most of the tongue image datasets used in the above methods were acquired in standardized environments, so robustness is low. Some very deep networks segment well, but their parameter counts are too large and their inference times too long, which reduces their usability on mobile terminals.
Disclosure of Invention
The invention aims to acquire and analyze tongue images on mobile devices, to overcome the problems in the prior art that images obtained from mobile devices contain much noise and little effective information, and to overcome the difficulty that a mobile-terminal tongue image dataset can hardly match the training requirements of existing models. To this end the training model is improved, and a device for rapid tongue segmentation in complex environments based on a mobile terminal, built on a lightweight U-Net, is provided.
In order to achieve the above purpose, the invention provides the following technical solution:
a complex environment fast tongue segmentation device based on a mobile terminal, the device performs the following steps:
s1, acquiring a test image including a tongue body;
s2, inputting the test image into the well-trained OET-NET model, and outputting a tongue image segmented from the test image; the OET-NET model has the following architecture: and replacing a residual connecting module in the U-Net model with a residual soft connecting module, adding a saliency map fusion module after each convolution layer of the U-Net model, and replacing a loss function with a pixel-weighted cross-entropy loss function.
Preferably, the OET-NET model comprises an encoding module, a residual soft connection module, a decoding module and a saliency map fusion module, wherein
the encoding module performs multiple convolution and pooling operations on the input picture to obtain a high-dimensional feature map of the tongue image;
the residual soft connection module fuses the initial feature map of each stage with the encoded feature map, then applies a 1×1 soft-connection-coefficient convolution block, and finally fuses the noise-reduced feature block with the high-level features;
the decoding module repeatedly upsamples the encoded feature map and restores the original features until the resolution is consistent with that of the input image;
the saliency map fusion module feeds supervision information into the intermediate layers.
Preferably, the pixel-weighted cross entropy loss function is:
L_seg = -(1/N) Σ_{i=1..N} α_i [ y_i log p(y_i) + (1 - y_i) log(1 - p(y_i)) ]
N is the training set size, y_i denotes the label map of the i-th image, and p(y_i) denotes the prediction map of the i-th image. α_i is the penalty weight, whose value differs according to the proportion of the tongue and is calculated as follows:
α_i = β, if S_i ≤ ε;   α_i = 1, if S_i > ε
S_i is the proportion of the whole picture occupied by the tongue body in the i-th image; ε is a preset proportion threshold used to distinguish images with a small tongue proportion from images with a large tongue proportion; β is a penalty factor used to define the penalty weight.
Preferably, the resolution of the test image of the tongue includes, but is not limited to: 320 × 240, 1920 × 1440, 3000 × 4000, 3840 × 5120, 6000 × 8000.
Preferably, the penalty weight in the OET-NET model training process is set to be 100, and the threshold value is set to be 0.05.
Compared with the prior art, the invention has the beneficial effects that:
1. The device of the invention stores a rapid tongue segmentation algorithm model that, on the basis of a lightweight U-Net, improves U-Net segmentation accuracy from the perspectives of contextual feature fusion and multi-resolution feature fusion. For contextual feature fusion, the skip connection module of U-Net is replaced with a residual soft connection module to reduce the noise passed directly to the decoding module through skip connections. For multi-resolution feature fusion, multi-resolution saliency maps are obtained from the bottom layer to the top layer through saliency map fusion, and the effective information of each layer is retained by fusing them in the last layer. Mobile terminal devices have limited memory; the lightweight OET-NET model can be readily deployed on a variety of mobile devices and offers good extensibility for subsequent research on mobile intelligent tongue diagnosis.
2. In addition, the position and size of the tongue vary greatly among tongue images acquired by different devices, which easily leads to overfitting when the amount of training data is small. In the invention, a new focal loss function is established according to the tongue proportion in each image: the target segmentation region of an image with a low tongue proportion is weighted to increase its loss value and strengthen attention to the target region. To better retain the segmentation information at each resolution, the loss value of the saliency map at each layer is calculated and used to optimize the overall model.
3. Comparing the model of the invention with other models yields the three-dimensional plot shown in Fig. 2, which visually compares OET-NET with the other models on several indices. The OET-NET tongue segmentation model has a parameter size of 7.75 MB, processes an image in 59 ms, and reaches an MIoU of 96.98%, showing strong competitiveness. Applied on a mobile terminal, the model can infer the segmentation result in a short time and save considerable time; fast inference of the mobile-terminal tongue segmentation model is the basis of a good user experience. Its segmentation of tongue image data acquired under different devices and environments outperformed all models verified.
Description of the drawings:
FIG. 1 is a diagram illustrating six situations in tongue segmentation in the prior art where accurate segmentation is difficult to achieve;
FIG. 2 is a three-dimensional image comparing the segmentation effect, model parameter magnitude and inference time of the OET-NET of the present invention with other segmentation models;
fig. 3 is a flowchart illustrating steps executed by the apparatus for complex environment fast tongue segmentation based on the mobile terminal according to embodiment 1 of the present invention;
FIG. 4 is a schematic structural diagram of an OET-NET model in example 1 of the present invention;
fig. 5 is a schematic structural diagram of a residual error soft-link module in embodiment 1 of the present invention;
fig. 6 is a schematic diagram of a saliency map fusion module in embodiment 1 of the present invention;
fig. 7 is a comparison diagram of the segmented tongue image in the natural environment obtained after the complex environment fast tongue segmentation apparatus based on the mobile terminal performs steps S1 and S2 according to embodiment 1 of the present invention, and the segmented tongue image output by other models;
fig. 8 is a three-dimensional graph in which MIoU is used as an index of segmentation accuracy, the number of model parameters is used as an index of model size, and the time for segmenting a picture is used as an index of inference speed in embodiment 1 of the present invention;
FIG. 9 is a graph showing the loss variation per batch using the OET-NET module in example 2 of the present invention;
FIG. 10 is a first example of the actual segmentation result of each model in embodiment 2 of the present invention;
FIG. 11 is a second example of the actual segmentation result of each model in embodiment 2 of the present invention;
FIG. 12 is a third example of the actual segmentation result of each model in embodiment 2 of the present invention;
fig. 13 is an example four of the actual segmentation result of each model in embodiment 2 of the present invention;
FIG. 14 is a fifth example of the actual segmentation result of each model in embodiment 2 of the present invention;
fig. 15 shows three actual segmentation results of different mobile-end devices in embodiment 2 of the present invention;
FIG. 16 is a first diagram illustrating the actual segmentation effect of the corresponding module in the ablation experiment in example 2 of the present invention;
FIG. 17 is a second diagram of the actual segmentation effect of the corresponding module in the ablation experiment in embodiment 2 of the present invention;
FIG. 18 is a third diagram of the actual segmentation effect of the corresponding module in the ablation experiment in embodiment 2 of the present invention;
FIG. 19 is a fourth diagram of the actual segmentation effect of the corresponding module in the ablation experiment in embodiment 2 of the present invention;
fig. 20 is a graph five showing the actual segmentation effect of the corresponding module in the ablation experiment in embodiment 2 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to test examples and specific embodiments. It should be understood that the scope of the above-described subject matter is not limited to the following examples, and any techniques implemented based on the disclosure of the present invention are within the scope of the present invention.
Example 1
The invention provides a device for rapid tongue segmentation in complex environments based on a mobile terminal; a flowchart of the steps executed by the device is shown in Fig. 3, and the executed steps comprise:
S1, a test image including the tongue body is acquired.
S2, the test image is input into the trained OET-NET model, and the tongue image segmented from the test image is output. The OET-NET model has the following architecture: the skip connection module in the U-Net model is replaced with a residual soft connection module, a saliency map fusion module is added after each convolution layer of the U-Net model, and the loss function is replaced with a pixel-weighted cross-entropy loss function.
Step S1 specifically comprises: acquiring a tongue image captured by the mobile device under natural conditions and denoting it I. The image background and the distance to the capture device are unconstrained, so a tongue image under any environmental condition can be used as input. The image fed into the encoding module can be captured by any device or uploaded from a memory card.
The structure of the OET-NET model in step S2 is shown in Fig. 4 and mainly comprises four modules, namely an encoding module, a residual soft connection module, a decoding module, and a saliency map fusion module.
The encoding module:
The encoding module first obtains an image of the tongue captured by the mobile device under natural conditions, denoted I. The image background and the distance to the capture device are unconstrained, so a tongue image under any environmental condition can be used as input; the image fed into the encoding module can be captured by any device or uploaded from a memory card. The input picture undergoes multiple levels of downsampling and encoding to obtain a low-resolution deep feature map rich in semantic information.
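For illustration only, a minimal TensorFlow/Keras sketch of one such encoding stage (two 3×3 convolutions followed by 2×2 max pooling, in line with the shallow two-layer 3×3 encoding stages described below) is given here; the channel counts, activation choice, and function names are assumptions rather than the patented implementation:

```python
from tensorflow.keras import layers

def encoder_stage(x, filters):
    """One encoding stage: two 3x3 convolutions followed by 2x2 max pooling.
    Returns the pre-pooling feature map (kept for the residual soft connection)
    and the downsampled output passed to the next, deeper stage."""
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(f)
    p = layers.MaxPooling2D(2)(f)
    return f, p
```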
The residual soft connection module:
In the traditional U-Net, the last convolutional encoding layer and the first deconvolutional decoding layer of the same stage are connected by a skip connection, so that some of the information lost by pooling during encoding can be passed directly to the decoding module. The information from the encoding module, however, consists of low-level features, while the features in the decoding module are high-level; splicing the two directly may introduce a large semantic gap. Directly fusing two incompatible kinds of features can make the whole tongue segmentation model diverge during training and adversely affect the result. Meanwhile, low-level features have higher resolution and contain more position and detail information, but because they pass through fewer convolutions they carry less semantics and more noise; high-level features have stronger semantic information but very low resolution and poor perception of detail. In an open environment there is even more noise, and rigidly fusing low-level features with high-level features passes much of that noise directly to the higher levels. Therefore more environmental noise needs to be filtered when the connection is made. The invention designs an improved residual soft connection module to replace the original skip connection, as shown in Fig. 5. The module is first a feature fusion part: the initial feature map of each stage is fused with the encoded feature map. Because the encoding layer is a shallow two-layer 3×3 network, the semantic gap between the features before and after encoding is small, and the fused features retain more tongue position information. The feature fusion is followed by a 1×1 soft-connection-coefficient convolution block, which reduces the semantic gap between low-level and high-level features; this trainable convolutional feature extraction reduces the noise transmitted from the low level to the high level. Finally, the noise-reduced feature block is fused with the high-level features.
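A minimal sketch of how such a residual soft connection could be written in TensorFlow/Keras follows; it mirrors the three steps described above (fusion of the stage's initial and encoded feature maps, a 1×1 soft-connection-coefficient convolution, fusion with the high-level features), but the use of concatenation for "fusion" and the concrete layer choices are assumptions, not the patented implementation:

```python
from tensorflow.keras import layers

def residual_soft_connection(stage_input, encoded, decoder_feat, filters):
    """Residual soft connection replacing the U-Net skip connection:
    1) fuse the stage's initial feature map with its encoded feature map;
    2) apply a trainable 1x1 'soft connection coefficient' convolution that
       suppresses low-level noise and narrows the semantic gap;
    3) fuse the noise-reduced block with the high-level decoder features."""
    fused = layers.Concatenate()([stage_input, encoded])                 # step 1
    soft = layers.Conv2D(filters, 1, padding="same", activation="relu")(fused)  # step 2
    return layers.Concatenate()([soft, decoder_feat])                    # step 3
```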
The decoding module:
The decoding module repeatedly upsamples the encoded feature map and restores the original features until the resolution matches that of the input image. During this process, the deeply encoded feature map is fused with the features delivered to the same layer through the residual soft connection module, which better preserves the detail information needed for image segmentation.
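Continuing the sketch, one decoding stage might be written as below, reusing the residual_soft_connection function from the previous sketch; transposed convolution for upsampling and the two refining 3×3 convolutions are assumptions:

```python
from tensorflow.keras import layers

def decoder_stage(deep_feat, stage_input, encoded, filters):
    """One decoding stage: upsample the deep feature map, fuse it with the
    same-level features delivered by the residual soft connection, then
    refine the result with two 3x3 convolutions."""
    up = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(deep_feat)
    x = residual_soft_connection(stage_input, encoded, up, filters)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```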
The saliency map fusion module:
The quality of tongue pictures in an open environment is hard to guarantee, so acquiring detail features at every level is all the more important for accurately segmenting the tongue. The overall principle of the saliency map fusion module is shown in Fig. 6. The module feeds supervision information into the intermediate layers, so that an independent network is formed from the input to each side-output layer, and the multiple side-output layers form multiple such networks. The whole network is thus trained with multi-scale, multi-level features, and the rich hierarchical features it learns effectively address the detail segmentation problem in tongue segmentation. In this module, five saliency maps are first obtained through 1×1 convolutions; the five maps are then fused, and a fused mapping is obtained through a 1×1 convolution and a sigmoid function.
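A hedged sketch of this saliency map fusion (a 1×1 convolution per level, upsampling to the input resolution, concatenation, and a final 1×1 convolution with sigmoid) could look like the following; the per-level upsampling factors and the placement of the sigmoid on the side outputs are assumptions:

```python
from tensorflow.keras import layers

def saliency_fusion(level_feats, up_factors):
    """Saliency map fusion sketch: each of the five decoder-level feature maps
    is reduced to a single-channel saliency map by a 1x1 convolution, upsampled
    to the input resolution, and concatenated; a final 1x1 convolution with
    sigmoid produces the fused prediction. Side outputs are supervised too."""
    side_logits = []
    for feat, factor in zip(level_feats, up_factors):   # e.g. factors 1, 2, 4, 8, 16
        s = layers.Conv2D(1, 1, padding="same")(feat)   # per-level saliency map
        if factor > 1:
            s = layers.UpSampling2D(factor, interpolation="bilinear")(s)
        side_logits.append(s)
    fused = layers.Concatenate()(side_logits)
    fused = layers.Conv2D(1, 1, padding="same", activation="sigmoid")(fused)
    side_maps = [layers.Activation("sigmoid")(s) for s in side_logits]
    return side_maps, fused   # each side map and the fusion get their own loss
```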
Each saliency map is compared with the target output to obtain a loss value, which is then used to iterate the whole model. The loss functions commonly used in image segmentation models are cross entropy, mean squared error, and Dice loss, but these loss functions give every pixel equal weight. In tongue pictures acquired by different devices in an open environment, the tongue occupies very different proportions of the whole image. For a picture with a small tongue proportion, the computed loss is small even when the segmentation contains obvious errors, which leads to under-segmentation. Many pictures are taken with the rear camera of a mobile phone, where the ratio of tongue to non-tongue pixels differs greatly. The focal loss is designed precisely for large differences between positive and negative samples, so the focal loss is chosen as the basis for constructing a new loss function for the model. The focal loss function is shown in equation (1).
FL(p_t) = -α (1 - p_t)^γ log(p_t)    (1)
γ is a factor that adjusts the model's under-segmentation; it is a number greater than 0, and when it is 1 the loss reduces to a cross-entropy function. α is a factor that balances positive and negative samples, and the improved loss function focuses more on how to balance severely imbalanced positive and negative samples. The segmentation loss function is weighted according to the pixel area of the tongue: for an image with a small tongue, the proportion of positive samples is small, so such an image should be given a larger penalty weight to make it more sensitive to segmentation errors. The pixel-weighted loss function is calculated as:
L_seg = -(1/N) Σ_{i=1..N} α_i [ y_i log p(y_i) + (1 - y_i) log(1 - p(y_i)) ]
N is the training set size, y_i denotes the target segmentation map of the i-th image, and p(y_i) denotes the prediction map of the i-th image. α_i is the penalty weight, whose value differs according to the proportion of the tongue and is calculated as follows:
α_i = β, if S_i ≤ ε;   α_i = 1, if S_i > ε
S_i is the proportion of the whole picture occupied by the tongue body in the i-th image; ε is a preset proportion threshold used to distinguish images with a small tongue proportion from images with a large tongue proportion; β is a penalty factor used to define the penalty weight. When β is 1 and ε is 0, the loss function reduces to the binary cross-entropy loss function. Training the model experimentally requires minimizing L_seg.
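For illustration, a sketch of this area-weighted loss under the above definitions is given below. It uses the values β = 100 and ε = 0.05 quoted later in the description; the focal exponent γ = 2 and the per-pixel focal form are assumptions rather than the exact patented formula:

```python
import tensorflow as tf

def area_weighted_loss(y_true, y_pred, beta=100.0, eps=0.05, gamma=2.0):
    """Area-weighted loss sketch: images whose tongue occupies no more than
    `eps` of the frame receive the extra penalty weight `beta`."""
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
    # S_i: proportion of tongue pixels in each image of the batch
    s = tf.reduce_mean(y_true, axis=[1, 2, 3])
    alpha = tf.where(s <= eps, beta * tf.ones_like(s), tf.ones_like(s))
    # focal-style binary cross-entropy per pixel
    pt = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    ce = -tf.pow(1.0 - pt, gamma) * tf.math.log(pt)
    per_image = tf.reduce_mean(ce, axis=[1, 2, 3])     # average over pixels
    return tf.reduce_mean(alpha * per_image)           # average over the batch
```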
The concrete implementation of the loss function algorithm of the OET-NET model is shown in Table 1:
Table 1. Concrete implementation of the loss algorithm
(Table 1 is reproduced as an image in the original publication and is not rendered here.)
After the above processing, the tongue image can be segmented accurately in the natural environment. Specifically, in Fig. 7, P1 and P2 show a light-purple tongue while the clothes worn by the person in the image are also purplish, which disturbs the model segmentation to a certain extent; the yellow coating on the tongue surface in P3 also reduces segmentation accuracy. The improved tongue segmentation model nevertheless eliminates these environmental noise interferences and fuses the available information, so it segments the tongue well in complex environments. Meanwhile, the model size and inference speed must also be considered: taking MIoU as the index of segmentation accuracy, the number of model parameters as the index of model size, and the time to segment one picture as the index of inference speed, a three-dimensional plot like Fig. 8 is drawn. In this plot, a model whose indices lie far from the origin while staying close to the XOY plane performs better, and it is not difficult to see that the new tongue segmentation model improves tongue segmentation accuracy under natural conditions at the cost of very little extra space and time.
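For illustration, such a three-dimensional comparison plot could be produced with matplotlib as sketched below; only the OET-NET point uses figures quoted in the text, and the other models' entries would have to be filled in from Table 2:

```python
import matplotlib.pyplot as plt

# (MIoU %, parameters in MB, inference time in ms/image); OET-NET values from the text.
models = {"OET-NET": (96.98, 7.75, 59)}

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for name, (miou, params_mb, ms) in models.items():
    ax.scatter(params_mb, ms, miou)
    ax.text(params_mb, ms, miou, name)
ax.set_xlabel("Parameters (MB)")
ax.set_ylabel("Inference time (ms/image)")
ax.set_zlabel("MIoU (%)")
plt.show()
```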
Example 2
Example 2 experiments were carried out using the method of the present invention.
1. Creation of data sets
All data come from a tongue image dataset acquired at free community clinics. The ethics approval number of the Affiliated Hospital of Chengdu University of Traditional Chinese Medicine is 2021KL-027. The acquisition devices are mobile phones of different models, and the images have five different resolutions: 320 × 240, 1920 × 1440, 3000 × 4000, 3840 × 5120, and 6000 × 8000. Image acquisition was performed by physicians to ensure that every image contains all tongue diagnostic information and that an accurate tongue diagnosis can be made from it. The tongue image dataset contains 1481 tongue images, which were segmented and labeled with LabelMe.
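As a sketch of how such LabelMe polygon annotations could be converted into binary training masks (the label name "tongue" and the use of PIL are assumptions, not details from the patent):

```python
import json
import numpy as np
from PIL import Image, ImageDraw

def labelme_to_mask(json_path, tongue_label="tongue"):
    """Convert a LabelMe polygon annotation into a binary tongue mask."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        if shape["label"] == tongue_label:
            polygon = [tuple(p) for p in shape["points"]]
            draw.polygon(polygon, outline=1, fill=1)   # tongue pixels = 1
    return np.array(mask, dtype=np.uint8)
```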
2. Training process
The OET-NET model takes U-Net as its backbone network, and the dataset is randomly divided into an 80% training set and a 20% validation set. OET-NET and the other comparison models are built with TensorFlow 2.0.2 and trained on an NVIDIA Tesla-100 GPU. The Adam optimizer is used as the optimization function for OET-NET, with the decay weight and learning rate both set to 10^-5, for a total of 100 epochs with a batch size of 12. There are therefore about 9.8k batch iterations in total. The penalty weight in the weighted loss function is set to 100 and the threshold to 0.05; the variation of the loss value during the iterations is shown in Fig. 9. After 78 batches of training, the model loss remained stable.
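An illustrative training configuration matching the quoted hyperparameters is sketched below; build_oet_net, train_ds, and val_ds are assumed helpers that are not defined in the patent text, the input size is an assumption, and the single-output loss omits the per-level saliency map losses for brevity:

```python
import tensorflow as tf

model = build_oet_net(input_shape=(256, 256, 3))          # assumed builder and input size
# `decay` keyword as accepted by the TF 2.x (legacy) optimizers
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-5)
model.compile(
    optimizer=optimizer,
    loss=lambda yt, yp: area_weighted_loss(yt, yp, beta=100.0, eps=0.05),
    metrics=[tf.keras.metrics.BinaryAccuracy()],
)
model.fit(train_ds.batch(12), validation_data=val_ds.batch(12), epochs=100)
```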
3. Tongue segmentation experiment
In the experiments, models are evaluated with the mean intersection over union (MIoU), the number of model parameters, the inference time, and the number of floating point operations (FLOPs), to better explore the potential of the models in mobile device applications.
1) MIoU: commonly used to compute the ratio of the intersection to the union of two sets, it is one of the gold-standard indices for measuring the semantic segmentation accuracy of a model. In the experiments it is computed between the ground-truth image and the prediction map. The calculation formula is:
MIoU = (1/(k+1)) Σ_{i=0..k} [ p_ii / ( Σ_{j=0..k} p_ij + Σ_{j=0..k} p_ji - p_ii ) ]
k + 1 is the number of pixel classes, including the empty class; p_ij is the number of pixels of class i predicted as class j, i.e. false positives; p_ii is the number of true positives; p_ji is the number of false negatives.
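A straightforward NumPy sketch of this MIoU computation from a confusion matrix (binary tongue/background case) might be:

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes=2):
    """MIoU following the formula above: per-class IoU averaged over classes."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true.astype(int).flatten(), y_pred.astype(int).flatten()):
        conf[t, p] += 1
    ious = []
    for i in range(num_classes):
        tp = conf[i, i]                       # true positives p_ii
        fp = conf[:, i].sum() - tp            # false positives (sum of p_ji over j != i)
        fn = conf[i, :].sum() - tp            # false negatives (sum of p_ij over j != i)
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious))
```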
2) Total parameters: the total number of parameters contained in the model intuitively reflects the model's size. It is one of the important evaluation indices for judging whether a model can be used on miniature devices.
3) Inference time: it reflects the time the model needs to infer a result; the shorter the inference time, the faster the model produces its result.
4) Floating point operations (FLOPs): they measure the amount of computation in the neural network's forward propagation; the smaller the value, the faster the computation can be. The calculation formulas are:
FLOPs_conv = 2 × H × W × (C_in × K² + 1) × C_out
FLOPs_fc = (2 × I - 1) × O
The FLOPs are calculated differently for convolutional and fully connected layers. In the formula for a convolutional layer, H and W represent the spatial dimensions of the feature map, C_in and C_out the numbers of input and output channels, and K the size of the convolution kernel. In the formula for a fully connected layer, I and O represent the input and output dimensions, respectively.
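A small Python sketch of these formulas follows; the bias term and the factor of 2 for multiply-add pairs follow the common convention reconstructed above, since the original formulas are reproduced only as an image:

```python
def conv_flops(h, w, c_in, c_out, k):
    """FLOPs of a convolutional layer on an H x W feature map:
    2 * H * W * (C_in * K^2 + 1) * C_out."""
    return 2 * h * w * (c_in * k * k + 1) * c_out

def fc_flops(i, o):
    """FLOPs of a fully connected layer with input dim I and output dim O:
    (2 * I - 1) * O."""
    return (2 * i - 1) * o
```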
The tongue segmentation results of OET-NET are compared with those of U-Net, U²-Net, U²-Net+, SE U-Net, Attention U-Net, ResNet, and DeepLabV3 in Table 2; all results were obtained after 100 epochs of training. The open-environment tongue segmentation task requires an accurate, lightweight, and fast model. Among the three basic frameworks U-Net, DeepLabV3, and ResNet, DeepLabV3 and ResNet give better segmentation results than the U-Net network. To keep the evaluation objective, the inference time and the total number of parameters are compared together with the MIoU. Comparing the index differences among the three models shows that U-Net is only slightly inferior to the other two in MIoU; since MIoU is a segmentation accuracy index, this accuracy gap can be narrowed by optimizing the model, whereas the drawbacks of DeepLabV3 and ResNet, namely excessive parameters and slow inference, are hard to compensate. In the U-Net models improved with attention mechanisms, segmentation of a small dataset with large variation overfits, and the segmentation MIoU of both attention-based models decreases. U²-Net and its simplified version U²-Net+ perform well on tumor segmentation and greatly improve the segmentation MIoU, but their inference is slow. OET-NET adapts well to the image resolutions acquired by different devices and to the interference of complex environments. Compared with U²-Net, which achieves the best segmentation among the compared models, our model increases the MIoU by 0.22. In terms of the number of parameters, FLOPs, and inference time, the added residual soft connection module and saliency map fusion module do not add too many parameters or too much computation to the original backbone model.
Table 2. Comparison of indices between models
(Table 2 is reproduced as an image in the original publication and is not rendered here.)
Figs. 10-14 show the actual segmentation effect of each model. Compared with existing methods, OET-NET obtains better semantic segmentation results. In Figs. 10-14 the tongue occupies 6.52%, 7.28%, 11.92%, 2.38%, and 4.92% of the image, respectively. The five different image scenes display each model's segmentation under various open-environment interference factors, such as weak tongue-like object interference, strong tongue-like object interference, tongue coating interference, tongue proportion, illumination, and shooting angle.
Fig. 10 illustrates a first example of the actual segmentation results of each model; the panels in Fig. 10 correspond to: (a) label visualization; (b) real image; (c) U-Net; (d) DeepLabV3; (e) ResNet; (f) U²-Net; (g) U²-Net+; (h) SE U-Net; (i) Attention U-Net; (j) OET-NET. The tongue is light purple, with a small patch of similar color on the left of the background, which interferes only weakly with the segmentation. The difficulty in segmenting the details of this image lies in the concave portion at the top of the tongue and in the smooth segmentation of the tongue edge. Observing the segmentation results of each model, it is easy to see that OET-NET captures the details more accurately than the other models: its segmentation of the indented portion of the upper tongue is relatively accurate, while its smooth segmentation of the edges is similar to the other models.
Fig. 11 shows a second example of the actual segmentation results of each model; the panels in Fig. 11 correspond to: (a) label visualization; (b) real image; (c) U-Net; (d) DeepLabV3; (e) ResNet; (f) U²-Net; (g) U²-Net+; (h) SE U-Net; (i) Attention U-Net; (j) OET-NET. The person in Fig. 11 wears a light purple shirt whose color looks very similar to the tongue, which strongly interferes with the tongue segmentation task. Because the region to the lower left of the tongue is the shirt, segmenting the details of this part is challenging. U-Net, DeepLabV3, and U²-Net show under-segmentation on such images. Owing to the limited amount of data in the training dataset, the over-segmentation of the attention-based U-Net models is evident: both models wrongly segment the collar and parts of the face. Because it reduces noise transmission and accumulates detailed features that preserve more positional information, OET-NET is able to exclude the parts of the image that resemble the tongue.
In an open environment, the interference factors for tongue segmentation come not only from the environment but also from the tongue texture and tongue coating, and variations in lighting make the interference even less controllable. Fig. 12 shows a third example of the actual segmentation results of each model; the panels in Fig. 12 correspond to: (a) label visualization; (b) real image; (c) U-Net; (d) DeepLabV3; (e) ResNet; (f) U²-Net; (g) U²-Net+; (h) SE U-Net; (i) Attention U-Net; (j) OET-NET. The tongue in Fig. 12 occupies the largest proportion among these five scenarios, so the details of the tongue are displayed more clearly in the image. In this picture the tongue is covered with a yellow coating similar in color to the participant's skin. Under this influence, U-Net and ResNet fail to include the yellow-coating portion in the tongue and show obvious under-segmentation, while Attention U-Net is over-segmented, classifying parts of the skin as tongue coating.
Fig. 13 shows a fourth example of the actual segmentation results of each model; the panels in Fig. 13 correspond to: (a) label visualization; (b) real image; (c) U-Net; (d) DeepLabV3; (e) ResNet; (f) U²-Net; (g) U²-Net+; (h) SE U-Net; (i) Attention U-Net; (j) OET-NET. In Fig. 13 the tongue occupies the smallest proportion among the pictures described here, so environmental factors have a greater influence on the segmentation. In this scenario the detail segmentation of the tongue edges is very challenging because of the image exposure: under exposure, the characteristics of the lower edge of the tongue closely resemble those of non-tongue regions. Except for OET-NET, every model mis-segments the tongue tip under the influence of illumination and clothing.
Fig. 14 shows a fifth example of the actual segmentation results of each model; the panels in Fig. 14 correspond to: (a) label visualization; (b) real image; (c) U-Net; (d) DeepLabV3; (e) ResNet; (f) U²-Net; (g) U²-Net+; (h) SE U-Net; (i) Attention U-Net; (j) OET-NET. Fig. 14 shows that environmental factors can reduce the segmentation quality of a low-proportion tongue image. Under the influence of the shadow the tongue casts onto the chin, U-Net and U²-Net+ are sensitive to these interference factors and under-segmentation remains; the tongue-tip segmentation of U²-Net is also deficient. Under all interference conditions, OET-NET effectively overcomes the interference factors and obtains a good segmentation result.
Meanwhile, the accuracy of OET-NET on tongue images acquired with different mobile phones was verified. The HONOR 20, HUAWEI Mate 40E, HUAWEI NOVA 8, iPhone 12, iPhone X, OPPO R15X, and Xiaomi Mix 2 were selected for the validation experiments, with MIoU as the accuracy index. The results are shown in Fig. 15, which presents three actual segmentation results for the different mobile devices; the phones used are: (a) HONOR 20; (b) HUAWEI Mate 40E; (c) HUAWEI NOVA 8; (d) iPhone 12; (e) iPhone X; (f) OPPO R15X; (g) Xiaomi Mix 2. It can be seen that, across the different mobile devices, the segmentation results of the OET-NET model in the three unified environments maintain a high MIoU.
4. Ablation experiment
The effectiveness of the newly designed loss function, the residual soft connection block, and the saliency map fusion module is verified through ablation experiments, and the contribution of the three optimizations to the overall model is discussed. The ablation results are shown in Table 3. Compared with U-Net, optimizing the loss function, adding the residual soft connection blocks, and adding the saliency map fusion module after each stage each improve the tongue segmentation accuracy of the model in an open environment. The improved loss function further refines the detail segmentation accuracy. When saliency map fusion and residual soft connection are used at the same time, the MIoU of the model drops to 94.12%. Combining the three optimizations fuses more features, reduces noise transmission, and strengthens attention to the segmented regions, which further improves the segmentation performance of the model; the MIoU reaches a maximum of 96.98%. The following analyzes the segmentation results on actual tongue images as the different modules are fused into the model.
Table 3. Ablation test results
(Table 3 is reproduced as an image in the original publication and is not rendered here.)
Fig. 16 is the first figure of the actual segmentation effect of the corresponding modules in the ablation experiment; the panels correspond to: (a) real image; (b) U-Net; (c) U-Net with the area-weighted loss function; (d) U-Net with the saliency map fusion module; (e) U-Net with the residual soft connection module; (f) U-Net with the area-weighted loss function and the residual soft connection module; (g) U-Net with the saliency map fusion module and the residual soft connection module; (h) U-Net with the saliency map fusion module and the area-weighted loss function; (i) OET-NET. The tongue in Fig. 16 can already be segmented fairly accurately with U-Net, but defects remain: the existing models still need improvement in segmenting the indented portion of the upper edge of the tongue. As with attention mechanisms, on datasets with a small sample size, excessive transmission of low-level detail tends to result in over-segmentation; this occurs when saliency map fusion is combined with residual soft connection. Nevertheless, the results show that saliency map fusion provides more detailed optimization of the protruding part at the upper-right edge of the tongue, while residual soft connection provides more detailed optimization of the lower-right edge of the tongue body. With the constraint of the loss function, OET-NET integrates these finer segmentations and better improves the segmentation accuracy of the tongue in the image.
Fig. 17 is the second figure of the actual segmentation effect of the corresponding modules in the ablation experiment; the panels correspond to: (a) real image; (b) U-Net; (c) U-Net with the area-weighted loss function; (d) U-Net with the saliency map fusion module; (e) U-Net with the residual soft connection module; (f) U-Net with the area-weighted loss function and the residual soft connection module; (g) U-Net with the saliency map fusion module and the residual soft connection module; (h) U-Net with the saliency map fusion module and the area-weighted loss function; (i) OET-NET. Fig. 17 illustrates how each module handles interference both from tongue-like objects in the environment and from the tongue coating itself; the optimization task is mainly the lower-left edge of the tongue coating. It can be seen that each of the three modules used independently eliminates, to a certain extent, the interference of external objects and ensures the completeness of the tongue segmentation. Of the two modules that enhance detail features, residual soft connection shows the better detail segmentation.
Fig. 18 is the third figure of the actual segmentation effect of the corresponding modules in the ablation experiment; the panels correspond to: (a) real image; (b) U-Net; (c) U-Net with the area-weighted loss function; (d) U-Net with the saliency map fusion module; (e) U-Net with the residual soft connection module; (f) U-Net with the area-weighted loss function and the residual soft connection module; (g) U-Net with the saliency map fusion module and the residual soft connection module; (h) U-Net with the saliency map fusion module and the area-weighted loss function; (i) OET-NET. Fig. 18 shows how each OET-NET module segments a picture with tongue coating interference in which the tongue occupies a large proportion of the frame. Adding the saliency map fusion or residual soft connection module segments the center of the tongue coating, and as the transmission of detail features increases, the tongue segmentation becomes more accurate. Notably, both the original and the improved loss function improve the segmentation compared with using the two modules separately. Using saliency map fusion and residual soft connection simultaneously leads to over-segmentation in this image, but under the constraint of the improved loss function, the model that fuses all three optimizations performs best.
Fig. 19 is the fourth figure of the actual segmentation effect of the corresponding modules in the ablation experiment; the panels correspond to: (a) real image; (b) U-Net; (c) U-Net with the area-weighted loss function; (d) U-Net with the saliency map fusion module; (e) U-Net with the residual soft connection module; (f) U-Net with the area-weighted loss function and the residual soft connection module; (g) U-Net with the saliency map fusion module and the residual soft connection module; (h) U-Net with the saliency map fusion module and the area-weighted loss function; (i) OET-NET. Fig. 19 shows the segmentation of a tongue image with a small tongue proportion under abnormal lighting. Using saliency map fusion alone passes back part of the erroneous illumination information and causes extreme under-segmentation. Saliency map fusion combined with residual soft connection reduces the transfer of illumination interference and segments the edge details of the tongue image more accurately; in the open-environment tongue segmentation task it is therefore necessary to integrate the two for accurate segmentation under varied lighting. Meanwhile, because the tongue occupies a small proportion of this image, the improved loss function is significant as a constraint on the tongue classification: even when only the loss function is changed, the segmentation performance of the model improves. Integrating the three elements reduces noise transmission, enhances details, and improves the constraints for tongue images of different proportions, so the tongue image is segmented better even under high exposure.
Fig. 20 is the fifth figure of the actual segmentation effect of the corresponding modules in the ablation experiment; the panels correspond to: (a) real image; (b) U-Net; (c) U-Net with the area-weighted loss function; (d) U-Net with the saliency map fusion module; (e) U-Net with the residual soft connection module; (f) U-Net with the area-weighted loss function and the residual soft connection module; (g) U-Net with the saliency map fusion module and the residual soft connection module; (h) U-Net with the saliency map fusion module and the area-weighted loss function; (i) OET-NET. Fig. 20 mainly shows each module's detailed segmentation of the shadow at the lower tongue margin and of the boundary between the upper lip and the tongue. Saliency map fusion segments the lower-edge shadow well, and residual soft connection makes the segmentation of the lip-tongue junction smoother. Because the tongue proportion in this image is slightly below the threshold of 0.05, the loss function also contributes to the segmentation result.
The invention designs a model that can accurately segment tongue images in the natural environment, and the results show that the method achieves the best segmentation effect. Because the backbone of the model is U-Net, the differences in total parameter count and inference time are very small. Tongue images captured in natural environments have different tongue proportions, and a tongue proportion that is too small easily distorts the model's loss; this is one of the reasons for building a new loss function on top of the focal loss.
In image segmentation, attention mechanisms are often used to focus on a particular target. Surprisingly, the final segmentation of both attention-improved models is worse than that of U-Net itself. Under open environmental conditions, the pictures are diverse in both spatial content and color channels; differences in color temperature alone create significant differences between pictures. The present study is based on a model built from 1481 images collected at volunteer clinics; even if the image dataset were expanded in a short time, it would still be a drop in the bucket relative to the complex environmental variation, so the attention mechanism is not suitable for this task. Compared with the U²-Net modules, OET-NET does not introduce residual U-blocks but instead adds lightweight optimization modules to the U-Net model and optimizes the loss function. With comparable segmentation quality, it therefore has a clear advantage in inference time.
Many models can already segment the tongue accurately in scenes with appropriate illumination and small environmental disturbance, but most of these methods are not robust enough. The open environment is not as constrained as the standard environment, and the variation in illumination makes segmenting the details of the tongue edges more challenging. OET-NET uses a residual soft connection block and a convolution block to address this, and its superiority in actual segmentation confirms the choice. U²-Net directly concatenates high-level and low-level features, passing more noise during processing, and its effect is worse than that of OET-NET.
The optimized loss function mainly addresses the uneven tongue proportions across images. In addition to redundant environmental details and position information, the residual soft connection also filters redundant environmental noise. The saliency map fusion module optimizes the edge-detail segmentation of the tongue segmentation algorithm; compared with the residual soft connection module, saliency map fusion acts as a deeper feature fusion module and, as a single module, is better at overall segmentation optimization. However, the saliency map fusion module sits in the decoder of the model and therefore lacks low-level position information, so the two modules must be combined to achieve finer detail segmentation. Like an attention mechanism, the combination of residual soft connection and saliency map fusion can produce over-segmentation on small datasets. This is effectively prevented by the improved loss function's iterative constraint on position and image proportion; for this reason a loss is also attached after each saliency map fusion. The improved loss function can effectively alleviate under-segmentation. All three modules contribute to the model optimization, and the optimized model based on the three can better segment tongue images in an open environment. In the experiments OET-NET shows a small parameter count, fast inference, and high accuracy; comparison and verification across different mobile phone models indicate that it has great application potential on mobile devices, so intelligent tongue diagnosis can be performed anytime and anywhere.
An objective intelligent diagnosis system is constructed based on the algorithm.
While there have been shown and described what are at present considered the fundamental principles and essential features of the invention and its advantages, it will be apparent to those skilled in the art that the invention can be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, the embodiments do not include only one independent technical solution, and such description is only for clarity, and those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments that can be understood by those skilled in the art.

Claims (5)

1. A device for rapid tongue segmentation in complex environments based on a mobile terminal, characterized in that the device performs the following steps:
s1, acquiring a test image including a tongue body;
s2, inputting the test image into the well-trained OET-NET model, and outputting a tongue image segmented from the test image; the OET-NET model has the following architecture: the skip-join module in the U-Net model is replaced with a residual soft-join module, and a saliency map fusion module is added after each level of the U-Net model, and the penalty function is replaced with a pixel-weighted cross-entropy penalty function.
2. The device of claim 1, wherein the OET-NET model comprises an encoding module, a residual soft-connection module, a decoding module and a saliency map fusion module, wherein:
the encoding module performs convolution and pooling operations on the input picture multiple times to obtain a high-dimensional feature map of the tongue picture;
the residual soft-connection module fuses the initial feature map of each level with the encoded feature map, then applies a 1×1 soft-connection-coefficient convolution block, and finally fuses the noise-reduced feature block with the high-level features;
the decoding module continuously up-samples the encoded feature map to restore the original features until their resolution is consistent with that of the input image;
the saliency map fusion module feeds supervisory information into the middle layers.
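As a non-authoritative sketch, the two modules of claim 2 that depart from a plain U-Net might be written as follows in PyTorch; the class names ResidualSoftConnection and SaliencyMapFusion, the additive fusion, and the sigmoid gating are assumptions, since the claim fixes only the modules and their roles, not the exact operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualSoftConnection(nn.Module):
    """Sketch of the residual soft-connection module: fuse a level's initial
    feature map with its encoded feature map, pass the result through a 1x1
    soft-connection-coefficient convolution, then fuse the noise-reduced
    features with the decoder's high-level features."""

    def __init__(self, channels: int):
        super().__init__()
        self.soft_coeff = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 soft-connection coefficients

    def forward(self, initial_feat, encoded_feat, high_level_feat):
        fused = initial_feat + encoded_feat                        # fuse initial and encoded feature maps
        denoised = torch.sigmoid(self.soft_coeff(fused)) * fused   # suppress redundant environmental noise
        return denoised + high_level_feat                          # fuse with high-level decoder features


class SaliencyMapFusion(nn.Module):
    """Sketch of the saliency map fusion module: a side output after a decoder
    level, up-sampled to the input resolution so that supervisory information
    (a loss term) can be fed into the middle layers."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_map = nn.Conv2d(channels, 1, kernel_size=1)        # project features to a saliency map

    def forward(self, feat, out_size):
        saliency = torch.sigmoid(self.to_map(feat))
        return F.interpolate(saliency, size=out_size, mode="bilinear", align_corners=False)
```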
3. The device according to claim 2, wherein the pixel-weighted cross-entropy loss function is:
L = -\frac{1}{N}\sum_{i=1}^{N}\alpha_i\left[y_i\log p(y_i)+(1-y_i)\log\left(1-p(y_i)\right)\right]
wherein N is the training set size, y_i denotes the label map of the i-th image, p(y_i) denotes the prediction map of the i-th image, and α_i is the penalty weight, whose value varies with the proportion of the tongue body in the image and is calculated as follows:
\alpha_i = \begin{cases}\beta, & S_i < \varepsilon \\ 1, & S_i \ge \varepsilon\end{cases}
wherein S_i is the proportion of the whole picture occupied by the tongue body in the i-th picture; ε is a preset proportion threshold used to distinguish images with a small tongue proportion from those with a large tongue proportion; and β is a penalty factor used to define the penalty weight.
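The loss of claim 3 can be sketched in PyTorch as below; the function name pixel_weighted_cross_entropy and the piecewise definition of α_i are assumptions based on the symbol definitions above, with the default values of β and ε taken from claim 5.

```python
import torch


def pixel_weighted_cross_entropy(pred, label, tongue_ratio, beta=100.0, eps=0.05):
    """Sketch of the pixel-weighted cross-entropy loss.
    pred:         prediction maps p(y_i), shape (N, 1, H, W), values in (0, 1)
    label:        label maps y_i, same shape, values in {0, 1}
    tongue_ratio: S_i, proportion of the i-th picture occupied by the tongue, shape (N,)
    beta, eps:    penalty factor and proportion threshold (claim 5 uses 100 and 0.05).
    The reading alpha_i = beta if S_i < eps else 1 is an assumption, since the
    formula image in the original filing is not reproduced here."""
    alpha = torch.where(tongue_ratio < eps,
                        torch.full_like(tongue_ratio, beta),
                        torch.ones_like(tongue_ratio))                # per-image penalty weight alpha_i
    bce = -(label * torch.log(pred.clamp_min(1e-7))
            + (1 - label) * torch.log((1 - pred).clamp_min(1e-7)))    # per-pixel cross entropy
    per_image = bce.flatten(1).mean(dim=1)                            # average over each image's pixels
    return (alpha * per_image).mean()                                 # weighted average over the batch
```

Under this reading, images in which the tongue occupies less than 5% of the frame are penalized 100 times more heavily, which matches the stated aim of handling uneven tongue proportions.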
4. The complex-environment rapid tongue segmentation device based on a mobile terminal according to any one of claims 1 to 3, wherein the resolution of the test image of the tongue body includes, but is not limited to: 320 × 240, 1920 × 1440, 3000 × 4000, 3840 × 5120, 6000 × 8000.
5. The device as claimed in claim 4, wherein, during training of the OET-NET model, the penalty factor β is set to 100 and the proportion threshold ε is set to 0.05.
CN202111583094.6A 2021-12-22 2021-12-22 Complex environment rapid tongue segmentation device based on mobile terminal Pending CN114463346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111583094.6A CN114463346A (en) 2021-12-22 2021-12-22 Complex environment rapid tongue segmentation device based on mobile terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111583094.6A CN114463346A (en) 2021-12-22 2021-12-22 Complex environment rapid tongue segmentation device based on mobile terminal

Publications (1)

Publication Number Publication Date
CN114463346A true CN114463346A (en) 2022-05-10

Family

ID=81405797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111583094.6A Pending CN114463346A (en) 2021-12-22 2021-12-22 Complex environment rapid tongue segmentation device based on mobile terminal

Country Status (1)

Country Link
CN (1) CN114463346A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316307A (en) * 2017-06-27 2017-11-03 北京工业大学 A kind of Chinese medicine tongue image automatic segmentation method based on depth convolutional neural networks
WO2019104767A1 (en) * 2017-11-28 2019-06-06 河海大学常州校区 Fabric defect detection method based on deep convolutional neural network and visual saliency
WO2020125319A1 (en) * 2018-12-19 2020-06-25 上海鹰瞳医疗科技有限公司 Glaucoma image recognition method and device and screening system
CN112168142A (en) * 2020-09-28 2021-01-05 成都中医药大学 Dysmenorrhea traditional Chinese medicine syndrome differentiation system based on DAELA-LSTM neural network
CN113781468A (en) * 2021-09-23 2021-12-10 河南科技大学 Tongue image segmentation method based on lightweight convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yao Fazhan; Li Zhi; Wang Lihui; Cheng Xinyu; Zhang Jian: "Skull stripping network for brain magnetic resonance images based on deep iterative fusion", Journal of Image and Graphics, no. 10, 16 October 2020 (2020-10-16) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943735A (en) * 2022-07-22 2022-08-26 浙江省肿瘤医院 Tongue picture image-based tumor prediction system and method and application thereof

Similar Documents

Publication Publication Date Title
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
CN111260653B (en) Image segmentation method and device, storage medium and electronic equipment
CN111627019A (en) Liver tumor segmentation method and system based on convolutional neural network
CN113496507A (en) Human body three-dimensional model reconstruction method
CN109086659B (en) Human behavior recognition method and device based on multi-channel feature fusion
CN114445670B (en) Training method, device and equipment of image processing model and storage medium
CN110827304B (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolution network and level set method
CN112215101A (en) Attention mechanism-based three-dimensional target identification method and system
CN110363072B (en) Tongue picture identification method, tongue picture identification device, computer equipment and computer readable storage medium
CN115620010A (en) Semantic segmentation method for RGB-T bimodal feature fusion
CN113052755A (en) High-resolution image intelligent matting method based on deep learning
CN112836625A (en) Face living body detection method and device and electronic equipment
CN112419326A (en) Image segmentation data processing method, device, equipment and storage medium
Huang et al. A novel tongue segmentation method based on improved U-Net
CN114897728A (en) Image enhancement method and device, terminal equipment and storage medium
CN114663371A (en) Image salient target detection method based on modal unique and common feature extraction
CN116205962A (en) Monocular depth estimation method and system based on complete context information
Zhang et al. MFFE: multi-scale feature fusion enhanced net for image dehazing
Wang et al. Semantic segmentation method of underwater images based on encoder-decoder architecture
CN114463346A (en) Complex environment rapid tongue segmentation device based on mobile terminal
Zhou et al. A superior image inpainting scheme using Transformer-based self-supervised attention GAN model
Guo et al. D3-Net: Integrated multi-task convolutional neural network for water surface deblurring, dehazing and object detection
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN113344933A (en) Glandular cell segmentation method based on multi-level feature fusion network
CN116258756B (en) Self-supervision monocular depth estimation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination