CN112561865B - Method, system and storage medium for training a detection model of permanent molar positions - Google Patents

Method, system and storage medium for training a detection model of permanent molar positions

Info

Publication number
CN112561865B
CN112561865B
Authority
CN
China
Prior art keywords
image
information
confidence coefficient
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011406995.3A
Other languages
Chinese (zh)
Other versions
CN112561865A (en)
Inventor
黄少宏
赵志广
范卫华
李菊红
易超
林良强
李剑波
武剑
朱佳
刘勇
严志文
邢玉林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Gree Health Technology Co ltd
Original Assignee
Shenzhen Gree Health Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Gree Health Technology Co ltd filed Critical Shenzhen Gree Health Technology Co ltd
Priority to CN202011406995.3A priority Critical patent/CN112561865B/en
Publication of CN112561865A publication Critical patent/CN112561865A/en
Application granted granted Critical
Publication of CN112561865B publication Critical patent/CN112561865B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30036Dental; Teeth

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a system and a storage medium for training a detection model of permanent molar positions. The method comprises the following steps: acquiring a plurality of oral dentition images as first images; performing first feature extraction on the first images to obtain first feature images; selectively activating the first feature images in the channel dimension and the spatial dimension respectively; performing second feature extraction on the selectively activated first feature images; predicting detection frames for the permanent molar positions according to the image features obtained by the second feature extraction; fusing the information of the detection frames to generate first confidence coefficients; taking the maximum first confidence coefficient as the prediction output result; and updating the parameters of the detection model by back-propagation according to the prediction output result. The detection model provided by the invention is targeted in application and adapts better to different application scenarios, which improves the accuracy of its detection results in application. The invention can be widely applied to the technical field of model training.

Description

Method, system and storage medium for training a detection model of permanent molar positions
Technical Field
The invention relates to the technical field of model training, in particular to a method, a system and a storage medium for training a detection model of permanent molar positions.
Background
Caries is a chronic disease in which bacteria (cariogenic bacteria) produce acid from carbohydrates in food, causing progressive destruction of the hard tissues of teeth. It is a major common disease of the oral cavity and one of the most common diseases in humans. School-age children of 6-12 years often prefer sweet, soft and sticky food that easily adheres to the teeth; at this stage their oral hygiene habits are often poor, or they have not mastered a cleaning method well enough to clean their teeth effectively, so they are a high-incidence population for caries. In addition, during the tooth-replacement period, the pits and fissures of newly erupted permanent teeth, especially newly erupted permanent molars, are often deep; bacteria accumulate there easily and are hard to clean out, and the acidic secretions they produce damage the hard tissues of the teeth, readily causing caries. Timely screening and early intervention in children's oral condition is therefore fundamental to the prevention and treatment of caries. Pit and fissure sealing is the best method recommended by the World Health Organization for preventing caries in permanent teeth; it has been widely promoted in many countries and regions, and in China it is widely promoted among school-age children. Pit and fissure sealing refers to a method in which, without damaging the dental tissue, a sealing material is applied to the pits and fissures of the occlusal and buccal surfaces of the dental crown; the material flows into and penetrates the pits and fissures, then cures and hardens into a protective barrier covering the deep pits and fissures, preventing cariogenic bacteria and acidic metabolites from eroding the tooth body and thereby preventing pit-and-fissure caries.
The first step in pit and fissure sealing is to screen out permanent molars that meet the indication. In the past, pit and fissure sealing projects relied on manual screening by dentists, which consumed a great deal of manpower, material and financial resources. To save these resources, online tooth detection and early-warning methods have been proposed. However, because current online detection models lack specificity and adaptability, the accuracy of their permanent molar recognition results in application is not high.
Disclosure of Invention
In order to solve one of the above technical problems, the present invention aims to provide a method, a system and a storage medium for training a detection model of permanent molar positions, which can improve the accuracy of the detection results of the detection model in application.
In a first aspect, embodiments of the present invention provide:
a method for training a detection model of permanent molar positions, comprising the following steps:
acquiring a plurality of oral dentition images as first images;
extracting first features of the first image to obtain a first feature image;
selectively activating the first characteristic image in a channel dimension and a space dimension respectively;
performing second feature extraction on the selectively activated first feature image;
predicting detection frames for the permanent molar positions according to the image features obtained by the second feature extraction;
fusing the information of the detection frames to generate a first confidence coefficient;
taking the maximum first confidence coefficient as a prediction output result;
and updating the parameters of the detection model by back-propagation according to the prediction output result.
Further, the acquiring a plurality of oral dentition images as the first image includes:
acquiring a plurality of oral dentition images;
processing the sizes of the oral dentition images into a preset size;
performing interlaced sampling on the size-processed oral dentition images to obtain a plurality of sub-images;
and taking the image formed by splicing the plurality of sub-images as the first image.
Further, the performing the first feature extraction on the first image to obtain a first feature image includes:
performing downsampling feature extraction on the first image to generate feature images with different sizes;
and taking the feature images of different sizes as the first feature images.
Further, the selectively activating the first feature image in the channel dimension includes:
carrying out average pooling and maximum pooling on the first characteristic image;
and selectively activating, in the channel dimension, the first feature images after the average pooling and the maximum pooling by adopting an attention mechanism.
Further, the selectively activating the first feature image in the spatial dimension specifically includes:
the first image processed by the convolution kernel attention mechanism is selectively activated in the spatial dimension.
Further, the second feature extraction is an abstract feature extraction.
Further, the fusing the information of the detection frame to generate a first confidence coefficient includes:
fusing the position information of the detection frame, the second confidence coefficient and the image characteristic information;
and generating a first confidence coefficient according to the fusion result.
In a second aspect, embodiments of the present invention provide:
a detection model training system for permanent molar positions, comprising:
the acquisition module is used for acquiring a plurality of oral cavity dentition images as first images;
the first feature extraction module is used for carrying out first feature extraction on the first image to obtain a first feature image;
the activation module is used for selectively activating the first characteristic image in the channel dimension and the space dimension respectively;
the second feature extraction module is used for carrying out second feature extraction on the first feature image after the selection activation;
the prediction module is used for predicting detection frames for the permanent molar positions according to the image features obtained by the second feature extraction;
the fusion module is used for fusing the information of the detection frame to generate a first confidence coefficient;
the confidence coefficient selection module is used for taking the maximum first confidence coefficient as a prediction output result;
and the parameter updating module is used for updating the parameters of the detection model by back-propagation according to the prediction output result.
In a third aspect, embodiments of the present invention provide:
a detection model training system for permanent molar positions, comprising:
at least one memory for storing a program;
at least one processor for loading the program to perform the method for training a detection model of permanent molar positions.
In a fourth aspect, embodiments of the present invention provide:
a storage medium having stored therein a processor-executable program which, when executed by a processor, is used to perform the method for training a detection model of permanent molar positions.
The embodiment of the invention has the following beneficial effects: first feature extraction is performed on a plurality of oral dentition images; the first feature images are then selectively activated in the channel dimension and the spatial dimension respectively; second feature extraction is performed on the selectively activated first feature images; detection frames for the permanent molar positions are predicted according to the image features obtained by the second feature extraction; the information of the detection frames is fused to generate first confidence coefficients; finally, the maximum first confidence coefficient is taken as the prediction output result, and the parameters of the detection model are updated by back-propagation according to the prediction output result. The detection model trained in this embodiment is therefore targeted in application and adapts better to different application scenarios, which improves the accuracy of its detection results in application.
Drawings
FIG. 1 is a flow chart of a method for training a model for detecting the position of a permanent molar in accordance with an embodiment of the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only; the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Referring to fig. 1, the embodiment of the invention provides a method for training a detection model of permanent molar positions. The method can be applied to a server or to the background processor of various platforms, so that in application a user only needs to perform the designated operations on a human-machine interface to complete permanent molar position detection.
The embodiment comprises the following steps:
s1, acquiring a plurality of oral cavity dentition images as first images; in this step, the oral cavity dentition image may be an image of the person to be detected acquired in the actual detection process, where the acquisition process may be that the person to be acquired is photographed by a mobile phone or other devices with a photographing function.
When the plurality of oral dentition images are used in the training process of the detection model, in some embodiments data enhancement such as flipping and rotation can be applied to the oral dentition images to provide more training samples and improve training precision, as in the sketch below.
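A minimal sketch of such an augmentation step. The transform set, flip probability and rotation range are illustrative assumptions, and for detection training the box annotations would have to be transformed alongside the images:

```python
import torchvision.transforms as T

# Hypothetical augmentation pipeline for oral dentition images:
# flipping ("overturning") and rotation, as described above.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),  # flip half of the samples
    T.RandomRotation(degrees=15),   # rotate within +/-15 degrees
])

# augmented = augment(image)  # image: one PIL image or CHW tensor
```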
In some embodiments, the step S1 may be implemented by:
acquiring a plurality of oral dentition images: the plurality of oral dentition images are taken by different mobile intelligent devices. Because the captured images are inconsistent in size while the input size of the neural network is fixed, the sizes of the plurality of oral dentition images are processed into the preset size. In this process the image size may be set to 512 x 512, i.e. the input image data size is c x w x h, where c = 3 is the number of input image channels, the channels representing in turn the color values of the red, green and blue components; w = 512 is the width of the image; and h = 512 is the height of the image. Then, interlaced sampling is performed on the size-processed oral dentition image to obtain a plurality of sub-images; for example, for an image of size 512 x 512, starting from each of the four points of the 2 x 2 region in the upper left corner of the image, the original image is sampled at a stride of 2, generating 4 sub-images of size 256 x 256. Finally, the image formed by splicing the plurality of sub-images is taken as the first image input to the detection model, so as to make full use of the information in the image.
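A sketch of this preprocessing under stated assumptions: resizing is done bilinearly, and "splicing" is read here as concatenating the four stride-2 sub-images along the channel axis (the patent does not fix the splicing axis):

```python
import torch
import torch.nn.functional as F

def make_first_image(img: torch.Tensor, size: int = 512) -> torch.Tensor:
    """img: (3, H, W) RGB tensor of one dentition photo.
    Resizes to 512 x 512, samples at stride 2 from each point of the
    top-left 2 x 2 region (4 sub-images of 256 x 256), and splices them."""
    img = F.interpolate(img.unsqueeze(0), size=(size, size),
                        mode="bilinear", align_corners=False).squeeze(0)
    subs = [img[:, i::2, j::2] for i in (0, 1) for j in (0, 1)]
    return torch.cat(subs, dim=0)  # (12, 256, 256) spliced first image
```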
S2, performing first feature extraction on the first image to obtain a first feature image.
In this step, first feature extraction is performed on the first image by the feature extraction module. The feature extraction module is formed by stacking 8 layers of network modules; every 2 of the 8 layers together form a downsampling feature extraction group, giving four groups in total, and each group's operation generates a feature map of the corresponding size. After feature extraction by the downsampling feature extraction groups, 4 feature images with sizes 512 x 512, 256 x 256, 128 x 128 and 64 x 64 are obtained, and these feature images of different sizes are taken as the first feature images.
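A sketch of this extractor with illustrative channel widths. Since four stride-2 groups applied to a 512 x 512 input would give 256/128/64/32 maps, the 512...64 sizes quoted above imply the first group keeps full resolution; that is the assumption made here:

```python
import torch.nn as nn

def group(c_in: int, c_out: int, stride: int) -> nn.Sequential:
    """One two-layer downsampling feature extraction group."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
    )

class FeatureExtractor(nn.Module):
    """Eight conv layers stacked as four two-layer groups; each group emits
    one feature map, and the four maps of different sizes together form the
    first feature images."""
    def __init__(self, c_in: int = 3, widths=(32, 64, 128, 256)):
        super().__init__()
        strides = (1, 2, 2, 2)  # first group keeps full resolution (assumption)
        groups, prev = [], c_in
        for w, s in zip(widths, strides):
            groups.append(group(prev, w, s))
            prev = w
        self.groups = nn.ModuleList(groups)

    def forward(self, x):
        feats = []
        for g in self.groups:
            x = g(x)
            feats.append(x)
        return feats  # list of four first feature images
```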
S3, selectively activating the first characteristic image in the channel dimension and the space dimension respectively;
in some embodiments, the first feature image is selectively activated in the channel dimension by:
carrying out average pooling and maximum pooling on the first characteristic image;
and selectively activating the first characteristic images after the average pooling and the maximum pooling in the channel dimension by adopting an attention mechanism.
In some embodiments, the first feature image is selectively activated in the spatial dimension in the following manner:
the first image processed by the convolution kernel attention mechanism is selectively activated in the spatial dimension.
In the above embodiment, the attention mechanism and the convolution kernel form an attention module. Specifically, selective activation in the channel dimension is performed first: the feature map is processed by maximum pooling and average pooling respectively, then the feature maps obtained by the two operations each selectively activate the corresponding feature channels, where the selection of the activated channels is a parameter the neural network learns through iteration; finally, the two features obtained after channel activation are superimposed and fused together as the channel-activated feature. Selective activation in the spatial dimension follows: this part uses the average-pooled and maximum-pooled feature maps together as input, enlarges the receptive field with a 7 x 7 convolution kernel, and maps the convolution output through a sigmoid activation function to the importance of each position in the feature map relative to the prediction target.
Specifically, in the selective activation of the channel dimension, the calculation is performed using equation 1:
$F_S = \sigma\big(\omega_1(\omega_0(F_{avg})) + \omega_1(\omega_0(F_{max}))\big)$ (Equation 1)

where $F_S$ denotes the features selectively activated along the channel dimension, $F_{avg}$ denotes the image features after average pooling, $F_{max}$ denotes the image features after maximum pooling, $\omega_0$ and $\omega_1$ are two different matrix parameters representing the weights for the different channels, and $\sigma$ denotes the sigmoid activation function.
In this step, the same two parameters are applied to feature maps processed in different ways, which gives the parameters stronger robustness; and mapping the feature map through the two parameters in succession allows more neurons to be included, so that features of as much complexity as possible can be fitted while the size of the feature map is kept unchanged. Finally, the activation function denoted by $\sigma$ maps from the real domain $\mathbb{R}$ to $[0,1]$ to normalize the output.
In the selective activation of the spatial dimension, the calculation is performed using equation 2:
$F'_S = \sigma\big(f^{7\times7}([F'_{avg}; F'_{max}])\big)$ (Equation 2)

where $f^{7\times7}$ denotes a $7\times7$ convolution operation, $F'_{avg}$ denotes the image features obtained by average pooling on the basis of the features obtained by selective activation of the channel dimension, $F'_{max}$ denotes the image features obtained by maximum pooling on the basis of the features obtained by selective activation of the channel dimension, and $\sigma$ denotes the sigmoid activation function.
In this step, the receptive field is enlarged by applying a larger convolution kernel, which helps the model infer the importance of the current position from the content of the current pixel and its neighborhood.
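A sketch of this attention module implementing Equations 1 and 2 above. The shared $\omega_0$/$\omega_1$ parameters, the dual pooling, and the 7 x 7 spatial convolution follow the text; the channel-reduction ratio and the ReLU between $\omega_0$ and $\omega_1$ are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SelectiveActivation(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.w0 = nn.Linear(channels, channels // reduction, bias=False)
        self.w1 = nn.Linear(channels // reduction, channels, bias=False)
        self.conv7 = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Equation 1: channel activation from average- and max-pooled
        # features, passed through the same w0/w1 parameters and a sigmoid.
        avg, mx = x.mean(dim=(2, 3)), x.amax(dim=(2, 3))
        ch = torch.sigmoid(self.w1(torch.relu(self.w0(avg))) +
                           self.w1(torch.relu(self.w0(mx))))
        x = x * ch.view(b, c, 1, 1)
        # Equation 2: spatial activation; per-position average and max over
        # the channel-activated features, mixed by a 7 x 7 convolution whose
        # sigmoid output scores each position's importance.
        sp = torch.cat([x.mean(dim=1, keepdim=True),
                        x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv7(sp))
```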
S4, performing second feature extraction on the selectively activated first feature images; in this step, the second feature extraction is an abstract feature extraction.
S5, predicting detection frames for the permanent molar positions according to the image features obtained by the second feature extraction; specifically, this step predicts the position and size of the detection frames on the first image.
And S6, fusing the information of the detection frames to generate a first confidence coefficient.
In some embodiments, step S6 may be specifically implemented by:
fusing the position information of the detection frame, the second confidence coefficient and the image characteristic information;
and generating a first confidence coefficient according to the fusion result.
In this embodiment, the second confidence coefficient is the confidence coefficient of the image itself, and the first confidence coefficient is the confidence coefficient obtained after fusing the second confidence coefficient with the features of the same image.
Detection frames predicted by existing models can deviate; for example, two teeth of the same type may appear on the same side, which is impossible in practice. Thus, in some embodiments, a detection frame screening mechanism incorporating multimodal information fusion is employed. Specifically, existing detection frame screening methods judge only by the confidence coefficient generated by the regression model and the overlapping area between different detection frames, and ignore the image feature information of the candidate detection frames. In this embodiment, the position information and the image features of a candidate frame are regarded as information of two different modalities, and fusing the information of the two modalities allows the detection frames to be screened effectively.
Specifically, when information of different modalities is fused, formula 3 and formula 4 are adopted:
[Equations 3 and 4 appear as images in the original publication.]

where $f_A$ denotes the image features after the confidence information has been fused in, $f_I$ denotes the confidence information after the image features have been fused in, $f_A^0$ and $f_I^0$ denote the original image features and the original confidence information respectively, and the remaining variables are parameters learned by the neural network.
On the basis of fusing the information of the different modalities, the confidence coefficient of the detection frame fusing the multimodal information is calculated in combination with the position information of the image; the specific calculation is given by Equations 5-8:
[Equations 5-7 appear as images in the original publication.]

$f_S = \sigma(W_S f_{total}) + b_S$ (Equation 8)

where $f_{total}$ denotes the features fused with the confidence information and the image feature information, $f_{loc}$ denotes the encoded detection frame position information, $f_{total,k}$ denotes the features obtained by further fusing the detection frame position information on the basis of $f_{total}$, $f_S$ denotes the new confidence coefficient finally obtained after the multimodal information is fused, and the remaining letters or letter combinations denote parameters required by the neural network.
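Because the bodies of Equations 3-7 are not reproduced in this text, only the final step, Equation 8, can be sketched directly; here $f_{total}$ is assumed to be an already-assembled vector fusing the image features, the second confidence coefficient, and the encoded frame position:

```python
import torch
import torch.nn as nn

class FusedConfidence(nn.Module):
    """Equation 8 as printed: f_S = sigma(W_S * f_total) + b_S.
    d_total, the width of the fused feature, is an illustrative assumption."""
    def __init__(self, d_total: int):
        super().__init__()
        self.w_s = nn.Linear(d_total, 1, bias=False)
        self.b_s = nn.Parameter(torch.zeros(1))

    def forward(self, f_total: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.w_s(f_total)) + self.b_s  # first confidence
```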
S7, taking the maximum first confidence coefficient as the prediction output result; in this step, the prediction output with the highest confidence is obtained for each region.
And S8, updating the parameters of the detection model by back-propagation according to the prediction output result, so that the detection model is targeted in application and adapts better to different application scenarios, improving the accuracy of its detection results in application.
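A minimal sketch of one training iteration covering S7 and S8, assuming a PyTorch model and a detection loss comparing the max-confidence prediction with annotated molar boxes (the loss itself is not specified by this embodiment):

```python
def train_step(model, optimizer, first_image, target, loss_fn):
    """One iteration: forward to the max-confidence prediction (S7), then
    update the detection model's parameters by back-propagation (S8)."""
    optimizer.zero_grad()
    prediction = model(first_image)     # prediction output result of S7
    loss = loss_fn(prediction, target)  # compare with annotated detection frames
    loss.backward()                     # propagate the error backward
    optimizer.step()                    # update the model parameters
    return loss.item()
```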
The embodiment is specifically applied and comprises the following steps:
color images acquired by the mobile intelligent device are acquired, and angles, light rays and covered area conditions of the images are different. In this embodiment, the size of the image is unified to 512×512.
The image dataset was divided into a training set and a test set at a ratio of 4:1, i.e. 3316 pictures in the dataset were used to train the model, and the remaining 829 pictures were used to verify the performance of the model. The final test results are shown in Table 1:
TABLE 1
Method           | AP   | AP50 | AP75 | AR   | AR50 | AR75 | AP_molar1 | AP_molar2 | Time (ms)
Baseline(D)      | 46.2 | 92.5 | 40.2 | 48.4 | 99.7 | 70.3 | 92.7      | 94.3      | 6.6
Baseline(O)      | 44.5 | 89.7 | 37.2 | 46.7 | 98.1 | 64.3 | -         | 90.1      | 6.1
Baseline(A+D)    | 46.7 | 93.2 | 40.8 | 48.6 | 99.7 | 70.8 | 93.5      | 95.4      | 8.9
Baseline(AN+D)   | 47.9 | 94.5 | 41.6 | 48.5 | 98.9 | 72.5 | 94.2      | 97.3      | 10.4
Baseline(AN+A+D) | 49.1 | 95.6 | 42.3 | 49.1 | 99.2 | 72.6 | 96.1      | 98.5      | 12.4
In Table 1, AP and AR are the evaluation indexes: a detection is regarded as correct when the IOU of the detection frame and the target frame is greater than the threshold. Baseline(D) denotes the results when the first and second permanent molars are detected simultaneously, Baseline(O) denotes the results when only the second permanent molars are detected, A denotes adding the attention mechanism, and AN denotes adding the A-NMS detection frame screening mechanism.
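For reference, the IOU criterion behind these indexes can be computed as follows (a standard formulation, not specific to this embodiment):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes. A detection
    frame counts as correct when its IOU with the target frame exceeds the
    metric's threshold (0.5 for AP50/AR50, 0.75 for AP75/AR75)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```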
In summary, the above embodiment limits the range of target detection by adding the attention mechanism, which greatly increases the adaptability and robustness of the method. Meanwhile, a detection frame screening mechanism is set up: the image features and position features of the target are combined to further select detection frames, and candidate frames whose image features match their position features are retained, improving the accuracy of the detection results on the same dataset.
Corresponding to the method of fig. 1, the embodiment of the invention provides a detection model training system for permanent molar positions, comprising:
the acquisition module is used for acquiring a plurality of oral cavity dentition images as first images;
the first feature extraction module is used for carrying out first feature extraction on the first image to obtain a first feature image;
the activation module is used for selectively activating the first characteristic image in the channel dimension and the space dimension respectively;
the second feature extraction module is used for carrying out second feature extraction on the first feature image after the selection activation;
the prediction module is used for predicting detection frames for the permanent molar positions according to the image features obtained by the second feature extraction;
the fusion module is used for fusing the information of the detection frame to generate a first confidence coefficient;
the confidence coefficient selection module is used for taking the maximum first confidence coefficient as a prediction output result;
and the parameter updating module is used for updating the parameters of the detection model by back-propagation according to the prediction output result.
The content of the method embodiment of the invention applies to this system embodiment; the specific functions of the system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those of the method.
The embodiment of the invention provides a detection model training system for permanent molar positions, comprising:
at least one memory for storing a program;
at least one processor for loading the program to perform the method for training a detection model of permanent molar positions.
The content of the method embodiment of the invention applies to this system embodiment; the specific functions of the system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those of the method.
Embodiments of the present invention provide a storage medium having stored therein a processor-executable program which, when executed by a processor, is adapted to perform the method for training a detection model of permanent molar positions.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (7)

1. A method for training a detection model of permanent molar positions, characterized by comprising the following steps:
acquiring a plurality of oral dentition images as first images;
extracting first features of the first image to obtain a first feature image;
selectively activating the first characteristic image in a channel dimension and a space dimension respectively;
performing second feature extraction on the selectively activated first feature image;
predicting detection frames for the permanent molar positions according to the image features obtained by the second feature extraction;
fusing the information of the detection frames to generate a first confidence coefficient;
taking the maximum first confidence coefficient as a prediction output result;
updating the parameters of the detection model by back-propagation according to the prediction output result;
the selectively activating the first feature image in the channel dimension includes:
carrying out average pooling and maximum pooling on the first characteristic image;
selectively activating, in the channel dimension, the first feature images after the average pooling and the maximum pooling by adopting an attention mechanism;
in the selective activation of channel dimensions, the calculation is performed using equation 1:
$F_S = \sigma\big(\omega_1(\omega_0(F_{avg})) + \omega_1(\omega_0(F_{max}))\big)$ (Equation 1)

where $F_S$ denotes the features selectively activated along the channel dimension, $F_{avg}$ denotes the image features after average pooling, $F_{max}$ denotes the image features after maximum pooling, $\omega_0$ and $\omega_1$ are two different matrix parameters representing the weights for the different channels, and $\sigma$ denotes the sigmoid activation function;
the selective activation of the first feature image in the spatial dimension is specifically:
selectively activating, in the spatial dimension, the first image processed by the convolution kernel attention mechanism;
in the selective activation of the spatial dimension, the calculation is performed using equation 2:
$F'_S = \sigma\big(f^{7\times7}([F'_{avg}; F'_{max}])\big)$ (Equation 2)

where $f^{7\times7}$ denotes a $7\times7$ convolution operation, $F'_{avg}$ denotes the image features obtained by average pooling on the basis of the features obtained by selective activation of the channel dimension, $F'_{max}$ denotes the image features obtained by maximum pooling on the basis of the features obtained by selective activation of the channel dimension, and $\sigma$ denotes the sigmoid activation function;
the fusing the information of the detection frame to generate a first confidence coefficient comprises the following steps:
fusing the position information of the detection frame, the second confidence coefficient and the image characteristic information;
generating a first confidence coefficient according to the fusion result, wherein the second confidence coefficient is the confidence coefficient of the image itself, and the first confidence coefficient is the confidence coefficient obtained after fusing the second confidence coefficient with the features of the same image;
when the information of the detection frames is fused, Equation 3 and Equation 4 are adopted to fuse the information of the different modalities:

[Equations 3 and 4 appear as images in the original publication.]

where $f_A$ denotes the image features after the second confidence information has been fused in, $f_I$ denotes the confidence information after the image features have been fused in, and $f_A^0$ and $f_I^0$ denote the original image features and the original confidence information respectively;
on the basis of fusing the information of the different modalities, the confidence coefficient of the detection frame fusing the multimodal information is calculated in combination with the position information of the image, the specific calculation being given by Equations 5-8:

[Equations 5-7 appear as images in the original publication.]

$f_S = \sigma(W_S f_{total}) + b_S$ (Equation 8)

where $f_{total}$ denotes the features fused with the second confidence information and the image feature information, $f_{loc}$ denotes the encoded detection frame position information, $f_{total,k}$ denotes the features obtained by further fusing the detection frame position information on the basis of $f_{total}$, and $f_S$ denotes the new confidence coefficient finally obtained after the multimodal information is fused, taken as the first confidence coefficient.
2. The method of claim 1, wherein the acquiring a plurality of oral dentition images as the first image comprises:
acquiring a plurality of oral dentition images;
processing the sizes of the oral dentition images into a preset size;
performing interlaced sampling on the size-processed oral dentition images to obtain a plurality of sub-images;
and taking the image formed by splicing the plurality of sub-images as the first image.
3. The method for training a detection model of permanent molar positions according to claim 1, wherein the performing the first feature extraction on the first image to obtain a first feature image comprises:
performing downsampling feature extraction on the first image to generate feature images with different sizes;
and taking the feature images of different sizes as the first feature images.
4. The method of claim 1, wherein the second feature extraction is an abstract feature extraction.
5. A detection model training system for permanent molar positions, characterized by comprising:
the acquisition module is used for acquiring a plurality of oral cavity dentition images as first images;
the first feature extraction module is used for carrying out first feature extraction on the first image to obtain a first feature image;
the activation module is used for selectively activating the first characteristic image in the channel dimension and the space dimension respectively;
the second feature extraction module is used for carrying out second feature extraction on the first feature image after the selection activation;
the prediction module is used for predicting detection frames for the permanent molar positions according to the image features obtained by the second feature extraction;
the fusion module is used for fusing the information of the detection frame to generate a first confidence coefficient;
the confidence coefficient selection module is used for taking the maximum first confidence coefficient as a prediction output result;
the parameter updating module is used for updating the parameters of the detection model by back-propagation according to the prediction output result;
the selectively activating the first feature image in the channel dimension includes:
carrying out average pooling and maximum pooling on the first characteristic image;
selectively activating, in the channel dimension, the first feature images after the average pooling and the maximum pooling by adopting an attention mechanism;
in the selective activation of channel dimensions, the calculation is performed using equation 1:
$F_S = \sigma\big(\omega_1(\omega_0(F_{avg})) + \omega_1(\omega_0(F_{max}))\big)$ (Equation 1)

where $F_S$ denotes the features selectively activated along the channel dimension, $F_{avg}$ denotes the image features after average pooling, $F_{max}$ denotes the image features after maximum pooling, $\omega_0$ and $\omega_1$ are two different matrix parameters representing the weights for the different channels, and $\sigma$ denotes the sigmoid activation function;
the selective activation of the first feature image in the spatial dimension is specifically:
selectively activating, in the spatial dimension, the first image processed by the convolution kernel attention mechanism;
in the selective activation of the spatial dimension, the calculation is performed using equation 2:
$F'_S = \sigma\big(f^{7\times7}([F'_{avg}; F'_{max}])\big)$ (Equation 2)

where $f^{7\times7}$ denotes a $7\times7$ convolution operation, $F'_{avg}$ denotes the image features obtained by average pooling on the basis of the features obtained by selective activation of the channel dimension, $F'_{max}$ denotes the image features obtained by maximum pooling on the basis of the features obtained by selective activation of the channel dimension, and $\sigma$ denotes the sigmoid activation function;
the fusing the information of the detection frame to generate a first confidence coefficient comprises the following steps:
fusing the position information of the detection frame, the second confidence coefficient and the image characteristic information;
generating a first confidence coefficient according to the fusion result, wherein the second confidence coefficient is the confidence coefficient of the image itself, and the first confidence coefficient is the confidence coefficient obtained after fusing the second confidence coefficient with the features of the same image;
when the information of the detection frames is fused, Equation 3 and Equation 4 are adopted to fuse the information of the different modalities:

[Equations 3 and 4 appear as images in the original publication.]

where $f_A$ denotes the image features after the second confidence information has been fused in, $f_I$ denotes the confidence information after the image features have been fused in, and $f_A^0$ and $f_I^0$ denote the original image features and the original confidence information respectively;
on the basis of fusing the information of the different modalities, the confidence coefficient of the detection frame fusing the multimodal information is calculated in combination with the position information of the image, the specific calculation being given by Equations 5-8:

[Equations 5-7 appear as images in the original publication.]

$f_S = \sigma(W_S f_{total}) + b_S$ (Equation 8)

where $f_{total}$ denotes the features fused with the second confidence information and the image feature information, $f_{loc}$ denotes the encoded detection frame position information, $f_{total,k}$ denotes the features obtained by further fusing the detection frame position information on the basis of $f_{total}$, and $f_S$ denotes the new confidence coefficient finally obtained after the multimodal information is fused, taken as the first confidence coefficient.
6. A detection model training system for permanent molar positions, characterized by comprising:
at least one memory for storing a program;
at least one processor for loading the program to perform the method for training a detection model of permanent molar positions according to any one of claims 1-4.
7. A storage medium having stored therein a processor-executable program which, when executed by a processor, is used to perform the method for training a detection model of permanent molar positions according to any one of claims 1-4.
CN202011406995.3A 2020-12-04 2020-12-04 Method, system and storage medium for training a detection model of permanent molar positions Active CN112561865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011406995.3A CN112561865B (en) Method, system and storage medium for training a detection model of permanent molar positions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011406995.3A CN112561865B (en) Method, system and storage medium for training a detection model of permanent molar positions

Publications (2)

Publication Number Publication Date
CN112561865A CN112561865A (en) 2021-03-26
CN112561865B true CN112561865B (en) 2024-03-12

Family

ID=75048486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011406995.3A Active CN112561865B (en) Method, system and storage medium for training a detection model of permanent molar positions

Country Status (1)

Country Link
CN (1) CN112561865B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343853B (en) * 2021-06-08 2024-06-14 深圳格瑞健康科技有限公司 Intelligent screening method and device for dental caries of children

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110246571A (en) * 2019-07-30 2019-09-17 深圳市倍康美医疗电子商务有限公司 Tooth data processing method
CN110349224A (en) * 2019-06-14 2019-10-18 众安信息技术服务有限公司 A kind of color of teeth value judgment method and system based on deep learning
CN110619350A (en) * 2019-08-12 2019-12-27 北京达佳互联信息技术有限公司 Image detection method, device and storage medium
CN110930421A (en) * 2019-11-22 2020-03-27 电子科技大学 Segmentation method for CBCT (Cone Beam computed tomography) tooth image
CN111882002A (en) * 2020-08-06 2020-11-03 桂林电子科技大学 MSF-AM-based low-illumination target detection method
CN111986217A (en) * 2020-09-03 2020-11-24 北京大学口腔医学院 Image processing method, device and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7717708B2 (en) * 2001-04-13 2010-05-18 Orametrix, Inc. Method and system for integrated orthodontic treatment planning using unified workstation
US9411910B2 (en) * 2010-07-12 2016-08-09 Centre De Recherche Medico Dentaire Am Inc. Dental analysis method and system
WO2018022752A1 (en) * 2016-07-27 2018-02-01 James R. Glidewell Dental Ceramics, Inc. Dental cad automation using deep learning
US10751548B2 (en) * 2017-07-28 2020-08-25 Elekta, Inc. Automated image segmentation using DCNN such as for radiation therapy
US20190313963A1 (en) * 2018-04-17 2019-10-17 VideaHealth, Inc. Dental Image Feature Detection

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349224A (en) * 2019-06-14 2019-10-18 众安信息技术服务有限公司 A kind of color of teeth value judgment method and system based on deep learning
CN110246571A (en) * 2019-07-30 2019-09-17 深圳市倍康美医疗电子商务有限公司 Tooth data processing method
CN110619350A (en) * 2019-08-12 2019-12-27 北京达佳互联信息技术有限公司 Image detection method, device and storage medium
CN110930421A (en) * 2019-11-22 2020-03-27 电子科技大学 Segmentation method for CBCT (Cone Beam computed tomography) tooth image
CN111882002A (en) * 2020-08-06 2020-11-03 桂林电子科技大学 MSF-AM-based low-illumination target detection method
CN111986217A (en) * 2020-09-03 2020-11-24 北京大学口腔医学院 Image processing method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Image Feature Point Extraction and Matching Based on the SIFT Algorithm; 王心醉, 董宁宁, 李欢利; Journal of Nanjing Medical University (Natural Science Edition), No. 02; full text *
Object Detection Algorithm Based on Feature Fusion and Center Prediction; 段志伟; Modern Computer, No. 09; full text *

Also Published As

Publication number Publication date
CN112561865A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
US20240087097A1 (en) Domain specific image quality assessment
US20210022833A1 (en) Generation of synthetic post treatment images of teeth
US20220218449A1 (en) Dental cad automation using deep learning
CN111563887B (en) Intelligent analysis method and device for oral cavity image
WO2020063986A1 (en) Method and apparatus for generating three-dimensional model, device, and storage medium
US10219703B2 (en) Method and system for intra-oral imagine using HDR imaging and highlight removal
US11887209B2 (en) Method for generating objects using an hourglass predictor
CN109816666B (en) Symmetrical full convolution neural network model construction method, fundus image blood vessel segmentation device, computer equipment and storage medium
KR20200110878A (en) A system and method for early diagnosing dental caries based on the deep learning
CN112561865B (en) Method, system and storage medium for training detection model of constant molar position
CN113793301B (en) Training method of fundus image analysis model based on dense convolution network model
CN113870270B (en) Fundus image cup and optic disc segmentation method under unified frame
US20220378548A1 (en) Method for generating a dental image
CN113379697A (en) Color image caries identification method based on deep learning
EP3821393B1 (en) Method for transforming a view of a 3d model of a dental arcade into a photorealistic view
FR3111066A1 (en) Method and device for three-dimensional reconstruction of a face with toothed part from a single image
CN116630621A (en) Image segmentation method integrating multi-scale features
CN116630230A (en) Focal region segmentation system based on ophthalmic ultra-wide angle image
CN113343853B (en) Intelligent screening method and device for dental caries of children
CN116993702A (en) Method and related device for realizing single frame data gum separation by deep learning
TWI843109B (en) Method for identifying medical image, computer device and computer readable storage medium
US20220198613A1 (en) Automatic Reduction of the Appearance of Mach Band Effect on Radiographs
CN112613459B (en) Method for detecting face sensitive area
EP4278953A1 (en) Assessing gum recession
CN117710493A (en) Tooth colorimetric method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 805, Building B, Second Unified Building, Houhai Neighborhood Committee, No.1 Nanyou Huaming Road, Nanshan District, Guangdong Province, 518000

Applicant after: Shenzhen Gree Health Technology Co.,Ltd.

Address before: 805, block B, No.2 Tongjian building, Houhai neighborhood committee, No.1 Huayou Huaming Road, Nanshan District, Shenzhen, Guangdong 518000

Applicant before: Shenzhen Gree Health Management Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant