CN114842238B - Identification method of embedded breast ultrasonic image - Google Patents

Identification method of embedded breast ultrasonic image

Info

Publication number
CN114842238B
Authority
CN
China
Prior art keywords
network
image
network model
feature
positive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210349097.1A
Other languages
Chinese (zh)
Other versions
CN114842238A (en)
Inventor
孙自强
龚任
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Shishang Medical Technology Co ltd
Original Assignee
Suzhou Shishang Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Shishang Medical Technology Co ltd
Priority to CN202210349097.1A
Publication of CN114842238A
Application granted
Publication of CN114842238B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30068Mammography; Breast
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an identification method for embedded breast ultrasound images, which comprises the following steps: constructing a target detection network model and a target classification network model for breast lesions based on dynamic ultrasound images, and training the two models respectively; pruning the trained target detection network model, and deploying the pruned target detection network model and the trained target classification network model as sub-networks into an embedded system to generate an embedded breast ultrasound image recognition network system; inputting the breast ultrasound image to be identified into the target detection sub-network for screening, and outputting a screening result; and inputting the positively detected video-sequence dynamic-image feature vector matrix into the target classification sub-network, outputting a classification result, and generating a breast ultrasound image recognition result according to the classification result. On the premise of maintaining a low false-negative rate, the method effectively reduces the false positive rate of breast ultrasound image identification, provides a reference for medical staff, helps them judge the condition more accurately, and reduces missed diagnoses.

Description

Identification method of embedded breast ultrasonic image
Technical Field
The invention relates to the technical field of artificial intelligence and ultrasonic medical image processing, in particular to an embedded breast ultrasonic image identification method.
Background
Ultrasound and molybdenum-target X-ray examination serve as the main auxiliary means of breast cancer screening in many countries and regions. Breast ultrasound is non-invasive, fast, highly repeatable and radiation-free, and can clearly display the changes of each layer of breast soft tissue as well as the shape, internal structure and adjacent tissues of tumors within it, which greatly facilitates breast disease screening. In recent years, with the rapid development of computer technology and the progress of medical technology, artificial intelligence has made substantial progress in the auxiliary examination of breast cancer. Combining artificial intelligence with ultrasound imaging for auxiliary breast cancer screening can effectively reduce excessive dependence on the ultrasound skills and experience of medical staff during screening, lower the threshold for implementing screening, break through the bottlenecks of professional shortage and uneven diagnostic levels, enable hierarchical and large-scale breast cancer screening, and provide medical staff with a reference that helps them judge the condition more accurately.
Because of the limitations of imaging equipment, scanning methods and other factors, ultrasound images are inevitably affected by noise, artifacts and the like during imaging. Moreover, in view of the infiltrative nature of breast tumors, lesion images have low contrast and resolution and blurred boundaries, making features difficult to extract; in particular, the information contained in a single frame is often too limited for a system to make an accurate judgment, so false identification and missed identification easily occur.
Intelligent detection and auxiliary diagnostic analysis of ultrasound images based on artificial intelligence and deep learning currently face the following main problems in clinical application:
1. Poor real-time performance. Deep learning is a computation-intensive technology with high requirements on CPUs, GPUs and the like. Most existing medical-image artificial intelligence systems and devices are therefore deployed and run on workstations or remote cloud platforms built on large-scale, high-performance GPU/CPU computing power. Their application scenarios are severely constrained by site, environment and the quality of the surrounding communication network, and problems such as slow response and large latency are common, greatly limiting medical staff's experience with, and efficient use of, such systems.
2. Existing methods for auxiliary lesion identification, localization and diagnostic analysis of breast ultrasound images are all based on static ultrasound images; for dynamic ultrasound image recognition, low speed and a high false positive rate are common problems.
3. Current medical-image AI systems basically adopt network models of large depth and width, deployed on remote clouds or on large high-end workstations or high-end ultrasound imaging equipment with strong edge computing capability, and their cost and poor portability severely restrict the wide application of such products in primary medical institutions. With the popularization of portable ultrasound imaging equipment in primary medical institutions, the demand for embedded AI neural network models that perform real-time target detection and auxiliary diagnosis on resource-limited hardware is growing.
Therefore, on the basis of existing ultrasound medical image processing technology, how to solve the problems that existing breast ultrasound image recognition has poor timeliness and accuracy, is limited by region, environment and communication, and can only be deployed and run on large-scale, high-performance GPU/CPU computing platforms has become a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above problems, the present invention provides an embedded breast ultrasound image recognition method to solve at least some of the above technical problems. The method can effectively reduce the false positive rate and false negative rate of breast lesion recognition in dynamic ultrasound images, provide a reference for medical staff, and help them judge the condition more accurately.
The embodiment of the invention provides an identification method of an embedded breast ultrasonic image, which comprises the following steps:
S1, constructing a target detection network model and a target classification network model for breast lesions based on dynamic ultrasound images, respectively acquiring a target detection network model dataset and a target classification network model dataset, and training the two models to obtain a trained target detection network model and a trained target classification network model;
S2, pruning the trained target detection network model, and deploying the pruned target detection network model and the trained target classification network model as sub-networks into an embedded system to generate an embedded breast ultrasound image recognition network system, which comprises a target detection sub-network and a target classification sub-network;
S3, inputting the breast ultrasound image to be identified into the target detection sub-network for screening, and outputting a screening result; the screening result is a positively detected video-sequence dynamic-image feature vector matrix, which comprises the image feature vector of the positively detected display frame and the image feature vectors of the n video frames preceding that frame; the value of n depends on the image definition and the scanning frame rate of the ultrasound images;
S4, inputting the positively detected video-sequence dynamic-image feature vector matrix into the target classification sub-network, outputting a classification result, and generating a breast ultrasound image recognition result according to the classification result; the classification result is a true positive feature vector or a false positive feature vector.
Further, in step S1, the target detection network model dataset is acquired as follows:
breast ultrasound image data of patients with clinical breast diagnoses, and breast-region and non-breast-region ultrasound image data of normal subjects, are collected separately to construct the target detection network model dataset; the breast ultrasound image data of patients with clinical breast diagnoses are annotated with target lesions, while the breast-region and non-breast-region ultrasound image data of normal subjects are annotated with feature classifications.
Further, in step S1, the target classification network model dataset is obtained as follows:
a preset number of collected original breast ultrasound image video sequences are input into the target detection network model for data screening, the image feature vector of each positively detected display frame and the image feature vectors of the n video frames preceding it are output, and a positively detected sample video-sequence image feature composite vector matrix is constructed; the value of n depends on the image definition and the scanning frame rate of the ultrasound images; the image feature vector includes the detection confidence score, the height and width of the detection box, and the coordinates of the center point of the detection box;
each positively detected display frame is marked to generate a video file containing the marks;
the result of manual review of the video file is acquired; if the review result is true positive, a first label is written to the corresponding positively detected sample video-sequence image feature composite vector matrix, and if the review result is false positive, a second label is written to that matrix;
and the above process is repeated to obtain a target classification network model dataset containing positive and negative samples.
Further, in step S1, training the target classification network model includes:
performing temporal and spatial feature enhancement on the positively detected sample video-sequence image feature composite vector matrix by learning the spatio-temporal variation features of the image feature vectors of its preceding n video frames; the spatio-temporal variation features include changes in the confidence score, changes in the height and width of the detection box, and changes in the coordinates of the detection box center point;
and training the target classification network model with a dataset composed of the feature-enhanced positively detected sample video-sequence image feature composite vector matrices.
Performing temporal and spatial feature enhancement on the positively detected sample video-sequence image feature composite vector matrix includes:
performing random transformation enhancement on the data composed of the positively detected sample video-sequence image feature composite vector matrices, including randomly and synchronously enlarging or shrinking the height and width of the detection box, and randomly translating the coordinates of the detection box center point.
Further, in step S2, pruning the trained target detection network model includes:
normalizing the BN layers of the trained target detection network model, introducing a scaling factor for each channel in the BN layers, and adding an L1 norm to constrain the scaling factors;
and scoring each channel in the BN layers according to its scaling factor and filtering out the channels whose scores are lower than a preset threshold, completing the pruning of the target detection network model, as sketched below.
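As an illustration of this channel-pruning idea, a minimal sketch follows, assuming a PyTorch model; the sparsity weight and the score threshold are illustrative values, not the patent's settings:

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, lam: float = 1e-4):
    """L1 norm on the BN scaling factors (gamma), added to the training loss
    so that sparse training drives unimportant channels towards zero."""
    gammas = [m.weight.abs().sum() for m in model.modules()
              if isinstance(m, nn.BatchNorm2d)]
    return lam * torch.stack(gammas).sum()

def low_score_channels(model: nn.Module, threshold: float = 1e-2):
    """Score every BN channel by |gamma| and list those below the threshold;
    these are the channels to be filtered out before fine-tuning."""
    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            scores = m.weight.detach().abs()
            plan[name] = torch.nonzero(scores < threshold).flatten().tolist()
    return plan
```

During sparse training the penalty is added to the detection loss; after training, the listed channels are removed and the slimmer network is fine-tuned, iterating until the desired model size is reached.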
Further, the step S4 includes:
S41, inputting the positively detected video-sequence dynamic-image feature vector matrix into the target classification sub-network, and outputting a normalized classification score;
S42, comparing the normalized classification score with a preset optimal threshold of the target classification sub-network; when the score is greater than the optimal threshold, the positively detected video-sequence dynamic-image feature vector matrix is judged to be a true positive feature vector, otherwise a false positive feature vector;
S43, generating the breast ultrasound image recognition result according to the true positive feature vector or the false positive feature vector.
Further, the network architecture of the target detection network model consists of an input module, a backbone network, a neck network and a head prediction network;
the input module performs data enhancement on the input dataset;
the backbone network is a convolutional neural network adopting a Focus + CSPNet + SPP serial structure; Focus slices the images of the dataset; CSPNet performs feature extraction on the sliced data to generate feature maps; SPP converts feature maps of arbitrary size into feature vectors of fixed size;
the neck network adopts an AF-FPN framework comprising an adaptive attention module and a feature enhancement module; the neck network aggregates the paths of the image features and passes the image features to the head prediction network;
the head prediction network outputs the breast ultrasound image recognition result, which includes the target object class, the confidence score, the bounding-box size information and the position-coordinate information of the breast ultrasound image to be identified.
Further, the data enhancement includes:
performing rotation, horizontal flipping, translation, scaling and affine transformation enhancement on the dataset;
randomly adding noise disturbance to the dataset, including randomly perturbing the gray value of each image pixel with salt-and-pepper noise and Gaussian noise;
performing Gaussian blur on the dataset;
and enhancing the contrast and brightness of the images in the dataset through Gamma transformation.
Further, the adaptive attention module obtains multiple context features of different scales through an adaptive average pooling layer; it generates a spatial weight map for each feature map output by the backbone network through a spatial attention mechanism, fuses the context features via the weight maps, and generates a new feature map containing multi-scale context information.
Further, the feature enhancement module consists of a multi-branch convolution layer and a branch pooling layer;
the multi-branch convolution layer includes dilated convolution layers, BN layers and ReLU activation layers, and provides corresponding receptive fields for the input feature map through dilated convolution;
the branch pooling layer fuses the image feature information from these receptive fields.
Further, the target classification network model adopts a logistic regression classifier model; the positively detected video-sequence dynamic-image feature vector matrix is input into the trained target classification network model and synchronized to the current positively detected display frame; using the spatio-temporal feature information of the video sequence within the preset time window of the positively detected video-sequence dynamic-image feature vector matrix, temporal and spatial feature enhancement is performed on the image of the current positively detected display frame, true positive / false positive classification discrimination is carried out on that frame, and a true positive feature vector or a false positive feature vector is output.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
The embodiment of the invention provides an identification method for embedded breast ultrasound images, which comprises the following steps: constructing a target detection network model and a target classification network model for breast lesions based on dynamic ultrasound images, respectively acquiring a target detection network model dataset and a target classification network model dataset, and training the two models; pruning the trained target detection network model, and deploying the pruned target detection network model and the trained target classification network model as sub-networks into an embedded system to generate an embedded breast ultrasound image recognition network system; inputting the breast ultrasound image to be identified into the target detection sub-network for screening, and outputting a screening result; and inputting the positively detected video-sequence dynamic-image feature vector matrix into the target classification sub-network, outputting a classification result, and generating a breast ultrasound image recognition result according to the classification result. On the premise of maintaining a low missed-diagnosis rate of the network, the method effectively reduces the false positive rate of the network model, provides a reference for medical staff, helps them judge the condition more accurately, effectively reduces missed diagnoses, and improves diagnostic efficiency.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a method for identifying an embedded breast ultrasound image according to an embodiment of the present invention;
fig. 2 is a diagram of a basic network structure according to an embodiment of the present invention;
FIG. 3 is a diagram of a target detection network model according to an embodiment of the present invention;
FIG. 4 is a diagram of an AF-FPN architecture provided in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a classification training result according to an embodiment of the present invention;
Fig. 6 is a flowchart of a method for pruning the target detection network through structured network channel pruning according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides an identification method of an embedded breast ultrasound image which, referring to fig. 1, comprises the following steps:
S1, constructing a target detection network model and a target classification network model for breast lesions based on dynamic ultrasound images, respectively acquiring a target detection network model dataset and a target classification network model dataset, and training the two models to obtain a trained target detection network model and a trained target classification network model;
S2, pruning the trained target detection network model, and deploying the pruned target detection network model and the trained target classification network model as sub-networks into an embedded system to generate an embedded breast ultrasound image recognition network system, which comprises a target detection sub-network and a target classification sub-network;
S3, inputting the breast ultrasound image to be identified into the target detection sub-network for screening, and outputting a screening result; the screening result is either a positively detected or a negatively detected video-sequence dynamic-image feature vector matrix;
S4, inputting the positively detected video-sequence dynamic-image feature vector matrix into the target classification sub-network, outputting a classification result, and generating a breast ultrasound image recognition result according to the classification result; the classification result is a true positive feature vector or a false positive feature vector.
This identification method for embedded breast ultrasound images can effectively improve the working efficiency of breast ultrasound auxiliary examination, provide a reference for medical staff, help them judge the condition more accurately, effectively reduce missed diagnoses, and improve diagnostic efficiency; on the premise of maintaining a low missed-diagnosis rate of the network, it effectively reduces the false positive rate of the network model.
The method provided in this embodiment will be described in detail below:
Step one: construct a target detection network dataset and a target classification network dataset, respectively.
Construct the target detection network dataset. The dataset consists of negative sample data and positive sample data. Breast-region and non-breast-region ultrasound image data of normal subjects of various age groups are collected, desensitized, and labelled with feature classifications to serve as the negative sample data of the target detection network. Breast ultrasound video recordings and lesion image data of patients with clinical breast diagnoses are collected, desensitized, and the target breast lesion images are annotated with target lesions to serve as the positive sample data of the target detection network.
Construct the target classification network dataset, as follows:
First, an originally collected breast ultrasound image video sequence is input in real time into the trained target detection network for screening, and for each sample detected as positive the network outputs the image feature vector information {t0, c0, w0, h0, x0, y0}, where t is the target object class, c the positive-detection confidence of the target object, w and h the normalized width and height of the target object detection box, and (x, y) the coordinates of its center point, together with the image feature vector information {t-1, c-1, w-1, h-1, x-1, y-1; ...; t-n, c-n, w-n, h-n, x-n, y-n} of the n video frames preceding the current positively detected frame (these frames are acquired according to the preceding-n-frames acquisition mode); n depends on the definition and scanning frame rate of the collected ultrasound images, with 0 ≤ n ≤ 10. Second, a positively detected video-sequence image feature composite vector matrix (feature composite vector matrix for short) of size (n+1)×6 is constructed as {t0, c0, w0, h0, x0, y0; t-1, c-1, w-1, h-1, x-1, y-1; ...; t-n, c-n, w-n, h-n, x-n, y-n}, and the matrix is written into a CSV file as n+1 rows and 6 columns, finally generating one target classification network sample.
Data enhancement is performed on the generated target classification network data set:
Random transformation enhancement is applied to the data composed of the feature composite vector matrices, including randomly and synchronously enlarging or shrinking the height and width of the detection box, randomly translating the coordinates of the detection box center point, and sparsifying the confidence scores.
The target classification network is trained with training data constructed from the positively detected video-sequence image feature composite vector matrices, and the image features of the positively detected frame are enhanced with the image feature vectors of the n video frames preceding it: the network is trained to learn how the confidence score and the position and size of the lesion change over the preceding n frames, performing temporal and spatial feature enhancement on the positively detected image features and thereby compensating for the incompleteness of single-frame image feature information.
The acquisition mode for the n frames preceding a positively detected frame is set in advance according to the video scanning speed (frame rate) of the ultrasound imaging equipment and can be divided into three grades: frame rate ≤ 50 fps, 50 fps < frame rate ≤ 150 fps, and frame rate > 150 fps.
When a breast ultrasound image video sequence is input, each positively detected frame is marked, and a video file containing the marks is generated and saved. The video file is manually reviewed by an experienced sonographer to screen out true positive and false positive sample data: if a sample is true positive, a first label (for example, label 1) is written to the corresponding feature composite vector matrix; otherwise, a second label (for example, label 0) is written. Repeating this process yields a batch of datasets containing positive and negative samples, recorded in the CSV file. These datasets are used for training, testing and optimizing the target classification network model.
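As an illustration only, the assembly and labelling of such a feature composite vector matrix could be sketched as follows (the field names, the CSV layout and the helper functions are assumptions, not the patent's implementation):

```python
import csv
import numpy as np

def build_feature_matrix(detections, n):
    """Stack the positively detected display frame with its previous n frames
    into an (n+1) x 6 matrix of rows [t, c, w, h, x, y]."""
    rows = detections[-(n + 1):]              # current frame plus the n frames before it
    return np.array([[d["t"], d["c"], d["w"], d["h"], d["x"], d["y"]] for d in rows])

def append_labelled_sample(csv_path, matrix, label):
    """Append one (n+1) x 6 sample followed by its review label:
    1 = confirmed true positive, 0 = false positive."""
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for row in matrix:
            writer.writerow(row)
        writer.writerow(["label", label])
```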
Step two: construct the basic network framework, which consists of a target detection network and a target classification network.
Referring to fig. 2, the basic network consists of a target detection network and a target classification network. Referring to fig. 3, the target detection network adopts a modified YOLOv5 as its algorithm model architecture; the YOLO architecture has low requirements on hardware equipment and low computational cost. The network architecture includes an input (Input) part, a backbone network (Backbone) part, a neck network (Neck) part and a head prediction network (Head Prediction) part.
The input part enriches the dataset by combining various data enhancement methods, strengthening image features and generalization capability. The backbone network part is a convolutional neural network that aggregates and forms training image features at different image granularities; it adopts a Focus + CSPNet + SPP serial structure, in which Focus performs the image slicing operation, CSPNet (cross-stage partial network) performs feature extraction, and SPP (spatial pyramid pooling network) converts feature maps of arbitrary size into feature vectors of fixed size, increasing the receptive field of the network. The neck network part uses the FPN + PANet framework to achieve path aggregation of the image features and passes them to the prediction layer. The final head prediction network part produces the target prediction result, generating and outputting the target class, confidence, bounding-box size and position-coordinate feature information.
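The Focus slicing operation referred to above is commonly implemented (for example, in YOLOv5) by sampling every other pixel into four sub-images and concatenating them along the channel dimension before the first convolution; a minimal sketch:

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Focus slicing: (B, C, H, W) -> (B, 4C, H/2, W/2), H and W assumed even.
    Spatial information is moved into channels before the first convolution."""
    return torch.cat(
        [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
        dim=1,
    )
```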
The constructed target detection network can realize dynamic identification and localization of breast lesions: similarly to the method used in step one to construct the target classification network dataset, the input ultrasound video frames or images are rapidly identified and localized, and a dynamic positively detected video-sequence image feature vector matrix (dynamic feature vector matrix) is produced, in which the image feature vector {t0, c0, w0, h0, x0, y0} of the current positively detected display frame and the image feature vector information {t-1, c-1, w-1, h-1, x-1, y-1; ...; t-n, c-n, w-n, h-n, x-n, y-n} of the preceding n frames form an (n+1)×6 dynamic feature vector matrix. This matrix has spatially and temporally enhanced characteristics and compensates for the incompleteness of single-frame image feature information.
The dynamic feature vector matrix is an output dynamically generated while the deployed target detection network runs and performs inference: the feature vectors are dynamic positive-detection results produced by the target detection network from the input dynamic video, and they are immediately fed into the subsequent target classification network for classification. This corresponds to the online running state (i.e., actual use after training is finished).
The "feature composite vector matrix", by contrast, refers to feature vector data obtained offline (i.e., during training): specific ultrasound video data are input into the target detection network to generate positively detected feature vectors, which are used to train the target classification network.
The constructed target detection network can therefore also be used as a data screening tool to generate the feature composite vector matrices that form the dataset of the target classification network.
YOLOv5s and YOLOv5m are selected as the two most basic network model frameworks; their small network scale and very fast detection make them convenient to apply on different edge-computing hardware platforms and embedded systems (including small edge-computing imaging devices or apparatuses). The main reasons for selecting YOLOv5 as the architecture of the target detection network model are:
1) Conventional CNNs typically require a large number of parameters and floating-point operations (FLOPs) to achieve satisfactory accuracy; for example, ResNet-50 has about 25.6 million parameters and requires roughly 4.1 billion floating-point operations to process a 224×224 image. Mobile devices with limited memory and computing resources (e.g., portable ultrasound devices) cannot support the deployment and inference of such large networks. YOLOv5, as a single-stage detector, has the notable advantages of a small computational load and a high recognition speed.
2) YOLOv5 can perform real-time target detection; it integrates target region prediction and target class prediction into a single-stage network architecture, making the architecture relatively simple, the inference and detection speed faster, and the network model scale smaller, so it can be deployed in embedded systems based on edge-computing hardware platforms with limited resources (AI computing power and storage space).
For the target detection network, a strict recall (true positive rate) strategy, i.e., a low missed-diagnosis rate, is adopted. To effectively reduce the false positive rate of the system, a target classification network is added after the target detection network (YOLOv5) to further classify the positive samples detected by the preceding YOLOv5 stage, i.e., to decide which are true positive detections and which are false positive detections.
To build the target classification network, a logistic classifier is adopted as the target classification network model. This classifier is modeled on the Bernoulli distribution; it has a simple structure, a small computational load (and is therefore fast) and is easy to implement, so the additional computing power it requires from the system is very small.
The network algorithm model of the target classification network adopts a logistic regression classifier model. The classifier uses a video-sequence-based target feature enhancement classification prediction method: the positively detected video-sequence dynamic-image feature vector matrix information output by the target detection network is synchronized to the current display frame, and the spatio-temporal feature information of the preceding video sequence is used to perform feature-enhanced inference on the current display frame image, so that the current positively detected sample is further discriminated as true positive or false positive.
This video-sequence image feature method, i.e., the video-sequence feature composite enhancement method or video-sequence-based target feature enhancement classification prediction method, synchronizes the information of the positively detected sample video-sequence feature matrix (the dynamic feature vector matrix) output by the target detection network (YOLOv5) to the current display frame (the positively detected image frame). The target classification network then performs feature-enhancement operations using the image features of the input display frame together with the spatio-temporal feature information of the preceding video sequence, for example the spatio-temporal changes of the breast lesion class, position and size, background changes and confidence changes, and further classifies the currently displayed positively detected image frame as true positive or false positive. The specific principle is as follows:
The input of the classifier is the positively detected sample video-sequence feature matrix X output by the preceding target detection network (YOLOv5), and the mathematical model of the classifier is:
h_\theta(x) = 1 / (1 + e^{-\theta^T x})
where h_\theta(x) is the network prediction function, \theta is the network weight parameter vector, and x is the input vector. The calculation yields a normalized score, which the target classification network compares with a preset optimal threshold (a first preset threshold) obtained by training and optimizing the classification network beforehand; if the score is greater than the first preset threshold, the positively detected sample is judged to be a true positive sample, otherwise a false positive sample.
The YOLOv5 network determines a positive detection according to whether the confidence of the image features is greater than a threshold set by the system. For most false positive detection samples, the feature confidence of the image is usually not very high, and many fall near the threshold. By combining the image features (class, confidence, size and position of the target object box) of the several video frames preceding the current image frame, the spatial and temporal characteristics of the current display frame are enhanced, making the threshold judgment multi-dimensional. The same video-sequence image feature method is used to train and optimize the target classification network model, giving it the ability to compare and analyze the spatial and temporal changes of the input multi-dimensional image feature vectors, for example the spatio-temporal changes of the position and size of the target object, background changes and confidence changes, strengthening the secondary interpretation of the current positively detected sample and thereby reducing the false positive rate of the system (while maintaining a high true positive rate).
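As an illustration, the second-stage true-positive/false-positive decision described above could be sketched as follows (flattening the (n+1)×6 matrix into the classifier input and the bias handling are assumptions; the weight vector theta and the optimal threshold come from training):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_positive_detection(feature_matrix, theta, threshold):
    """feature_matrix: (n+1) x 6 array of rows [t, c, w, h, x, y].
    theta: weight vector of length 1 + 6*(n+1) (bias first).
    Returns True for a true-positive interpretation, False for a false positive."""
    x = np.concatenate(([1.0], feature_matrix.ravel()))   # bias term + flattened features
    score = sigmoid(theta @ x)                             # normalized classification score
    return score > threshold
```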
Step three: for the constructed target detection network framework, an improved feature pyramid model is adopted to improve the network model's ability to identify and localize multi-scale changes of breast lesions across consecutive multi-frame images.
Referring to fig. 4, an improved feature pyramid model, AF-FPN, consisting of an adaptive attention module (AAM) and a feature enhancement module (FEM), is used to replace the original feature pyramid network (FPN) in the neck network part of the YOLOv5 architecture.
The neck is designed to make better use of the features extracted by the backbone, reprocessing and reasonably using the feature maps extracted by the backbone at different stages. The original YOLOv5 neck uses a feature pyramid network (FPN) and a path aggregation network (PANet) to aggregate the image features of the network. FPN is a commonly used multi-layer feature fusion method, but it can make the network concentrate on optimizing its low-level features, sometimes reducing detection accuracy for targets with various scale changes, and it is difficult to improve multi-scale detection accuracy while guaranteeing real-time detection. Breast lesions, from distension and nodules to various breast cancers, have very different size scales and visual characteristics, so a network model is needed that has strong multi-scale target recognition capability and can effectively trade off recognition speed against accuracy. With the improved AF-FPN framework, adaptive feature fusion and receptive-field enhancement preserve channel information to a large extent during feature transfer, and different receptive fields in each feature map are learned adaptively, enhancing the representation of the feature pyramid and effectively improving the accuracy of multi-scale target recognition. The improvement therefore raises the detection performance and speed of the YOLOv5 network for multi-scale targets while preserving real-time detection.
The AAM adaptive attention module obtains several context features of different scales through an adaptive average pooling layer; the pooling coefficient lies in [0.1, 0.5] and adapts to the target sizes in the dataset. A spatial weight map is then generated for each feature map through a spatial attention mechanism, the context features are fused via the weight maps, and a new feature map containing multi-scale context information is generated. The adaptive attention module reduces the feature channels and reduces the loss of context information in the high-level feature maps.
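One possible reading of this adaptive attention module is sketched below in PyTorch; the 1×1 projections, bilinear upsampling and softmax spatial weighting are assumptions made for illustration and may differ from the exact AF-FPN design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttentionModule(nn.Module):
    """Sketch of an AAM-style block: multi-scale adaptive average pooling
    provides context features that are fused via a learned spatial weight map."""
    def __init__(self, channels, pool_ratios=(0.1, 0.3, 0.5)):
        super().__init__()
        self.pool_ratios = pool_ratios
        self.proj = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in pool_ratios)
        self.attn = nn.Conv2d(channels * len(pool_ratios), len(pool_ratios), 3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        contexts = []
        for ratio, proj in zip(self.pool_ratios, self.proj):
            size = (max(1, int(h * ratio)), max(1, int(w * ratio)))
            c = proj(F.adaptive_avg_pool2d(x, size))          # pooled context feature
            contexts.append(F.interpolate(c, size=(h, w), mode="bilinear",
                                          align_corners=False))
        weights = torch.softmax(self.attn(torch.cat(contexts, dim=1)), dim=1)  # spatial weight map
        fused = sum(weights[:, i:i + 1] * contexts[i] for i in range(len(contexts)))
        return x + fused   # new feature map containing multi-scale context information
```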
The FEM feature enhancement module mainly uses dilated convolution to adaptively learn different receptive fields in each feature map according to the different scales of the detected target objects, improving the accuracy of multi-scale target detection and recognition. It can be divided into two parts: a multi-branch convolution layer and a branch pooling layer. Referring to fig. 4, the multi-branch convolution layer provides receptive fields of different sizes for the input feature map through dilated convolution, while an average pooling layer fuses the image feature information from the receptive fields of the three branches, improving multi-scale prediction accuracy.
The multi-branch convolution layer comprises a dilated convolution layer, a BN layer and a ReLU activation layer. The dilated convolutions in the three parallel branches have the same kernel size but different dilation rates: each kernel is 3×3, and the dilation rates d of the branches are 1, 3 and 5, respectively.
Dilated convolution supports an exponentially expanding receptive field without loss of resolution. Unlike standard convolution, in which the kernel elements are adjacent, the elements of a dilated convolution kernel are spaced apart, with the spacing determined by the dilation rate. The feature enhancement module uses dilated convolution to adaptively learn different receptive fields in each feature map, improving the accuracy of multi-scale target detection and recognition.
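A sketch of the multi-branch dilated-convolution part described above (3×3 kernels with dilation rates 1, 3 and 5, each followed by BN and ReLU, with the branch outputs merged by averaging); the channel arrangement is an assumption:

```python
import torch.nn as nn

class MultiBranchDilatedConv(nn.Module):
    """Three parallel 3x3 dilated convolutions (d = 1, 3, 5) giving different
    receptive fields; the branch-pooling step averages the branch outputs."""
    def __init__(self, channels, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]
        return sum(outs) / len(outs)   # average pooling across the three branches
```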
Step four: train and optimize the constructed target detection network and target classification network.
The five common steps of machine-learning network model training are: data, model, loss, optimizer and iterative training. The difference between the model output and the ground-truth label, i.e., the loss function, is obtained through forward propagation; the parameter gradients are obtained through back propagation, and the parameters are updated by the network optimizer according to the gradients. Optimization consists of repeated iterative training that minimizes the loss function, so that the loss keeps decreasing and the best model is obtained.
Training of the target detection network comprises the following steps:
performing data enhancement on the dataset input to the target detection network;
initializing the weight values;
evaluating and optimizing the loss function with an adaptive moment estimation algorithm;
and, for the network training learning-rate strategy, adopting a warm-up learning-rate method: during the warm-up stage the learning rate is iterated and updated by one-dimensional linear interpolation, and after the warm-up stage a cosine annealing algorithm is used.
The specific steps of weight initialization, loss-function evaluation and optimization training, and setting the training learning-rate strategy for the target detection network are as follows:
First, the weights of the target detection network (YOLOv5) are initialized. Specifically: the initial hyper-parameters of basic network training are set using a pre-trained weight model based on ultrasound image features, and the network hyper-parameters are optimized with a hyper-parameter evolution method, which uses a genetic algorithm (GA) to optimize the hyper-parameters. YOLOv5 has about 25 hyper-parameters for various training settings, saved in a yaml file. Correct weight initialization accelerates model convergence, while improper weight initialization makes the output of the output layer too large or too small and eventually causes gradient explosion or vanishing, so that the model cannot be trained.
Second, loss-function evaluation and optimization training are carried out for the target detection network (YOLOv5). Specifically: adaptive moment estimation (Adam), an adaptive-learning-rate gradient-descent optimization algorithm, is used to address the problems of excessive oscillation of the training gradient and unstable convergence, and to accelerate the convergence of the function. The learning rate of each parameter is dynamically adjusted using the first and second moment estimates of the gradient.
Finally, the training learning-rate strategy of the target detection network (YOLOv5) is set. Specifically: the learning-rate adjustment strategy in network training adopts a warm-up method; during the warm-up stage, the learning rate is iterated and updated by one-dimensional linear interpolation, and after the warm-up stage a cosine annealing algorithm is used. When a gradient descent algorithm is used to optimize the loss function, the learning rate should become smaller as the loss value approaches its global minimum, so that the solution approaches that point as closely as possible. Cosine annealing reduces the learning rate through a cosine function, whose value first decreases slowly as x increases, then decreases rapidly, and then decreases slowly again; this descent pattern cooperates with the learning rate to produce a better convergence effect.
The bounding-box coordinates in the label text files used for network training are normalized. During training of the network model, an early-stopping mechanism is adopted: the mAP and Loss values are dynamically monitored, a maximum number of epochs is set, and if training continues beyond the maximum set value without improvement in mAP and Loss performance, training stops automatically.
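A minimal sketch of the warm-up plus cosine-annealing learning-rate schedule described above (the warm-up length and the learning-rate values are illustrative, not the patent's settings):

```python
import math

def lr_at_epoch(epoch, total_epochs, warmup_epochs=3,
                base_lr=0.01, warmup_start_lr=0.001, min_lr=0.0002):
    """Linear interpolation during warm-up, cosine annealing afterwards."""
    if epoch < warmup_epochs:
        # one-dimensional linear interpolation from warmup_start_lr up to base_lr
        return warmup_start_lr + (base_lr - warmup_start_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```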
Training of the target classification network comprises the following steps:
performing data enhancement on the target classification network dataset composed of the video-sequence image feature composite vector matrices;
and training and optimizing the network function with a gradient ascent (or descent) optimization algorithm, and determining the optimal threshold.
Training and optimizing the target classification network with a dataset composed of feature composite vector matrices, and determining the optimal threshold, comprises the following specific steps:
The sample dataset is constructed and prepared with the method for constructing the target classification network dataset described above. The sample dataset stored in the CSV file is read, i.e., a dataset composed of (n+1)×6 feature composite vector matrices, where 0 ≤ n ≤ 10 and n is set in advance depending on the definition and scanning frame rate of the ultrasound images. Training and test sets are randomly split at a ratio of 8:2.
Training and optimizing the network function with the gradient ascent (or descent) algorithm and determining the optimal threshold includes:
The core of the logistic regression classification network is solving the classification problem; logistic regression maximizes a probability, and the mathematical tool used is the likelihood function. The mathematical model of the classifier network is:
h_\theta(x) = 1 / (1 + e^{-\theta^T x})
where h_\theta(x) is the network prediction function, \theta is the network weight parameter vector, and x is the input vector.
Assume that the probabilities of the outputs y = 1 and y = 0 are:
P(y = 1 | x; \theta) = h_\theta(x)
P(y = 0 | x; \theta) = 1 - h_\theta(x)
The likelihood function is the product of these probabilities:
L(\theta) = \prod_{i=1}^{m} P(y^{(i)} | x^{(i)}; \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}
This is the likelihood function for m observations, where m is the number of observed statistical samples.
The objective is to obtain the parameter estimate that maximizes this likelihood function, i.e., the network weights \theta_0, \theta_1, ..., \theta_n that maximize the above expression. Taking the logarithm of the likelihood function and differentiating it, the gradient with respect to \theta used by the optimization algorithm is:
\partial \ell(\theta) / \partial \theta_j = \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}
The learning rule for \theta is:
\theta_j := \theta_j + \alpha \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}
where j denotes the j-th attribute of a sample and m is the total number of samples; \alpha is the learning-rate step size and can be set freely.
The optimization procedure iterates the training repeatedly until the gradient has converged sufficiently, yielding the weight vector \theta. The threshold of the optimal operating point is determined from the network's ROC curve.
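A direct sketch of the update rule above, implemented as batch gradient ascent on the log-likelihood (the stopping tolerance and iteration limit are assumed values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.01, max_iters=10000, tol=1e-6):
    """X: (m, n) flattened feature-matrix samples, y: (m,) labels in {0, 1}.
    Iterates theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])         # prepend the bias attribute
    theta = np.zeros(n + 1)
    for _ in range(max_iters):
        grad = Xb.T @ (y - sigmoid(Xb @ theta))  # gradient of the log-likelihood
        theta += alpha * grad
        if np.linalg.norm(grad) < tol:           # gradient has converged sufficiently
            break
    return theta
```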
Referring to fig. 5, the results of classification training on the public Wisconsin Breast Cancer Dataset (569 cases: 212 malignant (M), 357 benign (B)) are shown; the classification effect is clearly visible in the figure.
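The public dataset referred to here is available, for example, through scikit-learn, so a rough reproduction of such a classification experiment might look as follows (the train/test split and solver settings are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 569 cases: 212 malignant (M), 357 benign (B)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```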
Step five: apart from conventional image enhancement techniques, the dataset of the target classification network is enhanced by a data-feature expansion method.
For the target classification network, the scale of the training and test datasets is expanded with a data-feature enhancement method. The specific steps are as follows:
1) For example, take the 10×6 feature matrix of a certain group of 10 video frames, denoted X:
X = [ t-9, c-9, w-9, h-9, x-9, y-9;
      t-8, c-8, w-8, h-8, x-8, y-8;
      ...;
      t0, c0, w0, h0, x0, y0 ]
2) Randomizing X:
keep columns 1 and 2 of X unchanged; multiply columns 3 and 4 of X (w and h, the width and height of the detection box) synchronously by the same random coefficient; translate column 5 of X (the x coordinate of the detection box center) by one common value; and translate column 6 of X (the y coordinate of the detection box center) by another common value, as sketched below.
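A sketch of this randomization; the ranges of the random coefficient and of the translations are assumed for illustration:

```python
import numpy as np

def randomize_feature_matrix(X, rng=np.random.default_rng()):
    """X: (n+1) x 6 float matrix with columns [t, c, w, h, x, y].
    Columns 1-2 (class, confidence) are kept; w and h are scaled by one shared
    random factor; the center coordinates are shifted by shared offsets."""
    X_aug = X.copy()
    scale = rng.uniform(0.9, 1.1)          # same factor for width and height
    dx, dy = rng.uniform(-0.05, 0.05, 2)   # same shifts applied to every frame
    X_aug[:, 2:4] *= scale
    X_aug[:, 4] += dx
    X_aug[:, 5] += dy
    return X_aug
```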
Further, because breast ultrasound images are characterized by low contrast and resolution, blurred boundaries, background noise and artifact interference, in addition to conventional image enhancement techniques such as rotation (Rotation), left-right flipping (Flip), translation, scaling and affine (perspective) transformation, the following image enhancement methods are also adopted for the object detection network (YOLOv5):
1) Randomly copy a portion of the samples and add random noise disturbance: the gray value of each image pixel is randomly perturbed using salt-and-pepper noise and Gaussian noise;
2) Randomly copy a portion of the samples and apply Gaussian blur using the Pillow library;
3) Contrast and brightness image enhancement is achieved through Gamma transformation.
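These three enhancements can be sketched as follows (using NumPy and Pillow; the noise amounts, blur radius and gamma value are illustrative assumptions):

import numpy as np
from PIL import Image, ImageFilter

def add_salt_pepper(img, amount=0.01, rng=np.random.default_rng()):
    """Flip a random fraction of pixels to black or white (salt-and-pepper noise)."""
    arr = np.array(img)
    mask = rng.random(arr.shape[:2])
    arr[mask < amount / 2] = 0
    arr[mask > 1 - amount / 2] = 255
    return Image.fromarray(arr)

def add_gaussian_noise(img, sigma=10.0, rng=np.random.default_rng()):
    """Perturb each pixel gray value with zero-mean Gaussian noise."""
    arr = np.array(img).astype(np.float32)
    arr += rng.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def gaussian_blur(img, radius=2):
    """Gaussian blur via Pillow's built-in filter."""
    return img.filter(ImageFilter.GaussianBlur(radius))

def gamma_transform(img, gamma=0.8):
    """Contrast and brightness enhancement via Gamma transformation."""
    arr = np.array(img).astype(np.float32) / 255.0
    return Image.fromarray((np.power(arr, gamma) * 255).astype(np.uint8))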
Step six: the basic network model is cut and compressed by means of a network pruning technique.
The basic network model is further cut, compressed and adapted to generate an embedded network algorithm model suitable for edge computing systems or devices with limited, low-power computing capability, so as to solve the problem of insufficient AI computing power at the embedded terminal.
When deep learning models are deployed in practice, they need to be compressed to cope with the limited AI computing power of embedded terminals; pruning is one such model compression technique.
In the breast ultrasound lesion recognition and localization system composed of the YOLOv5 target detection network and the logistic regression classifier, the resources and computation occupied by the classifier are negligible, so the scale of the network system essentially depends on the scale of the target detection network (YOLOv5). As is known, YOLOv5, especially YOLOv5s, is an excellent lightweight target detection network, but the model is sometimes still relatively large, especially for ultrasound images of higher resolution. If the network is to be deployed in an embedded system built on an edge computing platform with low power consumption, little memory and limited AI computing power, the network scale has to be reduced so that the system can run smoothly in the embedded environment.
This embodiment adopts a structured channel pruning (Channel Pruning) method to cut and compress the YOLOv5 network model. In the YOLOv5 network architecture, the batch normalization (BN) layer plus convolution layer is used extensively as the minimum network operation unit in the backbone and neck network parts. Although the BN layer acts as regularization and plays a positive role in accelerating convergence and avoiding overfitting during training, during network inference it adds several layers of operations, affects model performance and occupies additional resources.
Referring to fig. 6, a trainable scaling factor is introduced for each channel of the BN layers; the scaling factors are constrained by adding L1 regularization and are sparsified through sparsity training. The importance (score) of each input channel is then evaluated using the BN-layer scaling factor (i.e. its weight), channels whose score is below a threshold are filtered out, i.e. channels with small sparsity or small scaling factors are cut off, the pruned network is fine-tuned, and the process is iterated repeatedly, yielding a more compact and refined model. The specific steps are as follows:
1) Channel sparsity regularization training
Channel-wise sparsity can be used for pruning and sparsifying any classical CNN or fully connected layer, thereby improving the inference speed of the network. The network objective loss function is defined here as:
L = Σ_{(x,y)} l( f(x, W), y ) + λ·Σ_{γ∈Γ} g(γ)
where (x, y) denote the training data and labels and W are the trainable parameters of the network. The first term is the loss function of normal network training; the second term is the constraint term, and λ is the regularization coefficient — the larger λ, the stronger the constraint. If g(γ) = |γ| is chosen, this is L1 regularization (the L1 norm), which is widely used for sparsification. γ is a scaling factor multiplied onto the channel inputs. The network weights and the scaling factors are trained jointly with a gradient-descent optimization algorithm, and the scaling factors are constrained by the added L1 norm. The L1 norm drives part of the trained scaling factors (weights) towards 0, making the weights sparser; the sparse channels can then be cut off, so the network automatically identifies unimportant channels and removes them with almost no loss of accuracy. Channels with small sparsity or small scaling factors are thus cut off through sparsity training.
In the YOLOv5 network architecture, the batch normalization (BN) layer plus convolution layer is used extensively as the minimum network operation unit in the backbone and neck network parts. The transformation of the BN layer is:
z_hat = (z_in − μ) / sqrt(σ² + ε);  z_out = γ·z_hat + β
wherein z_in and z_out are the channel input activation vector and output activation of the BN layer, μ and σ² are the mean and variance of the mini-batch input activation features, and γ and β are the scaling factor and offset of the corresponding activation channel, respectively; γ represents the degree of activation of the corresponding channel. As the transformation shows, BN first normalizes, i.e. subtracts the mean from the mini-batch input feature vector and divides by the standard deviation, and finally performs an affine transformation with the learnable parameters γ and β to obtain the BN output. The per-channel scaling of BN's normalized activations fits the channel scaling-factor concept exactly, so the scaling factor γ of the BN layer is used as an importance factor: the smaller γ is, the smaller the corresponding activation, the smaller its influence on subsequent layers, and the less important the corresponding channel, which can therefore be cut off.
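The L1 constraint on the BN scaling factors is commonly imposed during training with a subgradient update on the BN weights; a minimal PyTorch sketch is given below (the helper name, the λ value and the subgradient form are assumptions, not the patent's exact implementation):

import torch
import torch.nn as nn

LAMBDA = 1e-4   # regularization coefficient λ, illustrative value

def add_l1_subgradient_on_bn(model, lam=LAMBDA):
    """After loss.backward(), add the subgradient of lam*|gamma| to every BN scaling factor."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.data.add_(lam * torch.sign(m.weight.data))

# Inside the training loop (sketch):
#   loss = criterion(model(images), targets)
#   loss.backward()
#   add_l1_subgradient_on_bn(model)   # sparsity constraint on BN scaling factors
#   optimizer.step()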
2) Model pruning and trimming
After sparsity regularization training, the BN scaling factors (weights) of the model become sparser. A pruning threshold is obtained according to a preset pruning rate; the scaling factors of channels below the pruning threshold are set to 0, i.e. those channels are pruned; the pruned network is then fine-tuned; and the process is iterated repeatedly to obtain a more compressed and refined target detection network model.
Since many weights are now close to 0, given a preset pruning rate P (a percentage), the pruning threshold can be obtained as:
θ = sort_P(M), where M is the set of channel importance scores, M = {γ_1, γ_2, ...} over all channels, and sort_P(·) sorts the values in ascending order and returns the value at position P. The value of the pruning rate P is chosen according to the computing power of the specific hardware platform and is usually 40–70%. If the pruning rate is 70%, the value at the 0.7 quantile of the sorted list M is the pruning threshold; setting the scaling factor γ of every channel below this threshold to 0 yields a more compact network model with fewer parameters and lower memory and computing-power requirements. After heavier pruning the model accuracy may drop; it can essentially be recovered by fine-tuning for a few training epochs.
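A minimal sketch of deriving the pruning threshold from the preset pruning rate P and zeroing the channels below it (PyTorch; this masking-only form is an illustration — in practice the pruned channels are actually removed and the network is fine-tuned afterwards):

import torch
import torch.nn as nn

def bn_prune_threshold(model, prune_rate=0.7):
    """Collect all BN scaling factors, sort ascending, return the value at position P."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    sorted_gammas, _ = torch.sort(gammas)
    index = int(prune_rate * len(sorted_gammas))
    return sorted_gammas[index]

def zero_pruned_channels(model, threshold):
    """Set gamma (and beta) of channels below the threshold to 0, i.e. prune them."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            mask = (m.weight.data.abs() >= threshold).float()
            m.weight.data.mul_(mask)
            m.bias.data.mul_(mask)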
According to the identification method of the embedded breast ultrasound image, the established basic network model is cut and compressed by the pruning technique to generate an embedded network algorithm model suitable for edge computing systems or devices with limited, low-power AI computing capability. The resulting embedded system can be widely deployed and run on various small, low-power, low-cost portable ultrasound imaging devices, greatly lowering the operational and technical thresholds of breast ultrasound assisted screening and initial screening work.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

1. The identification method of the embedded breast ultrasound image is characterized by comprising the following steps of:
S1, constructing a target detection network model and a target classification network model of a breast focus based on dynamic ultrasonic images, respectively acquiring a target detection network model data set and a target classification network model data set, and training the target detection network model and the target classification network model to obtain a trained target detection network model and a trained target classification network model;
In the step S1, training the target classification network model includes:
the positive detection sample video sequence image feature composite vector matrix is subjected to time feature enhancement and space feature enhancement by learning the space-time variation features of the front n frames of video sequence image feature vectors of the positive detection sample video sequence image feature composite vector matrix; the spatiotemporal variation features include: the scoring change of the confidence, the height and width change of the detection frame and the coordinate position change of the center point of the detection frame;
training the target classification network model through a data set consisting of the positive detection sample video sequence image feature composite vector matrix after feature enhancement;
S2, cutting the trained target detection network model, and deploying the cut target detection network model and the trained target classification network model into an embedded system as a sub-network to generate an embedded breast ultrasound image recognition network system; the embedded breast ultrasound image recognition network system comprises: a target detection sub-network and a target classification sub-network;
S3, inputting the breast ultrasonic image to be identified into the target detection sub-network for screening, and outputting a screening result; the screening result is a video sequence dynamic image feature vector matrix which is detected positively; the positive detected video sequence dynamic image feature vector matrix comprises the following components: detecting the image feature vector of the display frame positively, and detecting the image feature vector of the video sequence of the first n frames of the display frame positively; the value of n depends on the image definition and the scanning frame rate of the ultrasonic image;
S4, inputting the feature vector matrix of the video sequence dynamic image detected positively into the target classification sub-network, outputting a classification result, and generating a breast ultrasound image recognition result according to the classification result; the classification result is a true positive feature vector or a false positive feature vector;
The target classification network model adopts a logistic regression classifier model; inputting the feature vector matrix of the video sequence dynamic image detected by positive into the trained target classification network model, and synchronizing to the current positive detection display frame; and carrying out time feature enhancement and space feature enhancement on the image of the current positive detection display frame by utilizing the space-time feature information of the video sequence in the preset time period of the positive detection video sequence dynamic image feature vector matrix, carrying out true positive and false positive classification discrimination on the current positive detection display frame, and outputting a true positive feature vector or a false positive feature vector.
2. The method for identifying an embedded breast ultrasound image according to claim 1, wherein in step S1, the target classification network model dataset is obtained by:
Inputting a preset number of collected original mammary gland ultrasonic image data video sequences into the target detection network model for data screening, outputting image feature vectors of positive detection display frames and image feature vectors of the first n frames of video sequences of the positive detection display frames, and constructing a positive detection sample video sequence image feature composite vector matrix; the value of n depends on the image definition and the scanning frame rate of the ultrasonic image; the image feature vector includes: the confidence score of detection, the height and width of the detection frame and the coordinate position of the center point of the detection frame;
Marking each positive detection display frame to generate a video file containing marks;
Acquiring the manual rechecking result of the video file, and if the rechecking result is true positive, writing a first label on the corresponding positive detection sample video sequence image characteristic composite vector matrix; if the rechecking result is false positive, writing a second label on the corresponding positive detected sample video sequence image feature composite vector matrix;
and repeating the above process to obtain a target classification network model data set containing positive and negative samples.
3. The method for identifying an embedded breast ultrasound image according to claim 1, wherein in step S2, the trained object detection network model is cut, which comprises:
Normalizing the BN layer of the trained target detection network model, introducing a group of scaling factors into each channel in the BN layer, and adding an L1 norm to constrain the scaling factors;
And scoring each channel in the BN layer according to the scaling factor, and filtering out channels with the scoring lower than a preset threshold value to complete cutting of the target detection network model.
4. The method for identifying an embedded breast ultrasound image according to claim 1, wherein the step S4 comprises:
S41, inputting the feature vector matrix of the video sequence dynamic image detected positively into the target classification sub-network, and outputting classification normalization scores;
S42, comparing the classification normalization score with a preset optimal threshold value of the target classification sub-network, and when the classification normalization score is larger than the optimal threshold value, determining that the video sequence dynamic image feature vector matrix detected positively is a true positive feature vector; otherwise, the false positive feature vector is obtained;
S43, generating a breast ultrasound image recognition result according to the true positive feature vector or the false positive feature vector.
5. The method for identifying an embedded breast ultrasound image according to claim 1, wherein the network architecture of the target detection network model is composed of an input end, a backbone network and a head prediction network;
The input end carries out data enhancement on an input data set;
The backbone network is composed of a convolutional neural network; the convolutional neural network adopts a Focus + CSPNet + SPP series structure; the Focus module performs image slicing on the data set; the CSPNet performs feature extraction on the sliced data set to generate a feature map; the SPP converts a feature map of any size into a feature vector of fixed size;
The neck trunk network adopts an AF-FPN framework and comprises: an adaptive attention module and a feature enhancement module; the neck trunk network aggregates paths of image features and transmits the image features to the head prediction network;
The head prediction network outputs a breast ultrasound image recognition result; the identification result comprises: object class, confidence score, bounding box size feature information, and location coordinate feature information.
6. The method of claim 5, wherein the data enhancement comprises:
Performing rotation, left-right inversion, translation, scaling and affine transformation enhancement on the data set;
Randomly adding noise disturbance to the data set, comprising: randomly disturbing the gray value of each pixel of the image in the dataset by adopting salt-and-pepper noise and Gaussian noise;
performing Gaussian blur on the data set;
and carrying out contrast and brightness image enhancement on the data set through Gamma transformation.
7. The method of claim 5, wherein the adaptive attention module obtains a plurality of context features of different scales through an adaptive averaging pooling layer; and generating a spatial weight map for each feature map output by the backbone network through a spatial attention mechanism, fusing the context features through the weight map, and generating a new feature map containing multi-scale context information.
8. The method for identifying an embedded breast ultrasound image according to claim 5, wherein the feature enhancement module is composed of a multi-branch convolution layer and a branch pooling layer;
the multi-branch convolution layer includes: a hole convolution layer, a BN layer and a ReLU activation layer; the multi-branch convolution layer provides a corresponding receptive field for an input feature map through cavity convolution;
The branch pooling layer fuses image characteristic information from the receptive field.