WO2024001123A1 - Image recognition method and apparatus based on neural network model, and terminal device

Publication number: WO2024001123A1
Application number: PCT/CN2022/142418 (CN2022142418W)
Authority: WO (WIPO PCT)
Prior art keywords: module, image, features, feature, convolution
Other languages: French (fr), Chinese (zh)
Inventors: 张号逵, 胡文泽, 王孝宇
Original assignee: 深圳云天励飞技术股份有限公司 (Shenzhen Intellifusion Technologies Co., Ltd.)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content

Definitions

  • The present application belongs to the field of computer vision, and in particular relates to an image recognition method, apparatus, terminal device, and computer-readable storage medium based on a neural network model.
  • Image classification is the main support for target detection and semantic segmentation. Target detection, as the core of computer vision tasks, is the basis for scene understanding and cognition, with applications such as face recognition, pedestrian tracking, and automatic driving, while semantic segmentation paves the way for a complete understanding of the scene.
  • More and more applications obtain knowledge from image data through semantic segmentation, including autonomous driving, human-computer interaction, virtual reality, and medical image analysis. Research on image recognition tasks such as image classification and target detection has therefore become a hot topic in the field of computer vision.
  • A lightweight model refers to a model that can run smoothly on terminal devices with low computing power and has low computing overhead. Since convolutional neural networks typically have many parameters and heavy computation, while embedded and mobile terminal devices have limited computing power and storage, making neural network models lightweight has become a research hotspot in recent years.
  • The neural network models currently deployed for image recognition tasks are mainly convolutional neural network models.
  • Image recognition methods based on convolutional neural networks mostly rely on deep network structures to improve detection accuracy. Compressing a convolutional neural network model to make it lightweight therefore also affects the recognition accuracy of the network, so lightweight neural network models suffer a relative drop in accuracy.
  • Embodiments of the present application provide an image recognition method, apparatus, and terminal device based on a neural network model, which are beneficial to realizing a lightweight model while improving the accuracy of image recognition.
  • In a first aspect, embodiments of the present application provide an image recognition method based on a neural network model.
  • The above-mentioned neural network model extracts the global features of the image to be recognized through a MetaFormer structure based on pure convolution.
  • The above-mentioned image recognition method includes:
  • inputting the image to be recognized into the trained neural network model, and sequentially performing feature extraction and recognition on the image to be recognized through the neural network model to obtain the recognition result.
  • In a second aspect, embodiments of the present application provide an image recognition device.
  • The image recognition device includes:
  • an input module and a trained neural network model.
  • The above neural network model extracts the global features of the image to be recognized based on the purely convolutional MetaFormer structure.
  • The above-mentioned input module is used to input the image to be recognized into the trained neural network model.
  • The above neural network model is used to sequentially perform feature extraction and recognition on the image to be recognized to obtain recognition results.
  • In a third aspect, embodiments of the present application provide a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the steps of the image recognition method in the first aspect are implemented.
  • In a fourth aspect, embodiments of the present application provide a computer-readable storage medium.
  • The computer-readable storage medium stores a computer program.
  • When the computer program is executed by a processor, the steps of the above-mentioned neural network model-based image recognition method in the first aspect are implemented.
  • In a fifth aspect, embodiments of the present application provide a computer program product that, when run on a terminal device, causes the terminal device to execute the image recognition method based on a neural network model described in any one of the above first aspects.
  • The beneficial effects of the embodiments of the present application are as follows: the image to be recognized is input into the trained neural network model, and feature extraction and recognition are performed on it sequentially through the neural network model, thereby obtaining the recognition result.
  • The above neural network model extracts the global features of the image through the convolution-based MetaFormer structure, which makes it more effective in identifying the image to be recognized.
  • It enables the above-mentioned neural network model to pay attention to the global features of the image to be recognized and to capture more features, reducing the accuracy drop caused by making convolutional neural networks lightweight, and helping to improve the image recognition accuracy of the neural network model while realizing a lightweight model, thereby improving the accuracy of the neural network model.
  • Figure 1 is a schematic flow chart of an image recognition method based on a neural network model provided by an embodiment of the present application.
  • Figure 2 is a schematic structural diagram of a global feature extraction module provided by an embodiment of the present application.
  • Figure 3 is another schematic structural diagram of a global feature extraction module provided by an embodiment of the present application.
  • Figure 4 is a schematic flow chart of global feature extraction by a global feature sub-extraction module provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a neural network model used for image classification tasks provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a neural network model used for target detection tasks provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of the detection frame of the target detection task provided by the embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a neural network model used for semantic segmentation tasks provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of the convolution branch (segmentation module) of the semantic segmentation sub-model provided by the embodiment of the present application.
  • Figure 10 is a schematic structural diagram of an image recognition device provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • Figure 1 shows a schematic flow chart of an image recognition method based on a neural network model provided by an embodiment of the present application. The details are as follows:
  • the image to be recognized is input into the trained neural network model, and the features of the image to be recognized are extracted and recognized sequentially through the above neural network model to obtain the recognition result.
  • The above neural network model is constructed based on a lightweight convolutional neural network, and extracts the global features of the image to be recognized based on a purely convolutional MetaFormer structure.
  • The above MetaFormer structure is derived from the ViT (Vision Transformer) model for computer vision tasks.
  • The ViT model applies the self-attention-based Transformer model to image tasks, and realizes the self-attention mechanism through matrix operations.
  • On large data sets, the ViT model achieves the same or even better results than convolutional neural networks and is more cost-effective.
  • However, applying the ViT model to image tasks requires a large amount of computing resources and places high demands on hardware, so its hardware support is not friendly. Therefore, an improved, purely convolutional MetaFormer structure with good hardware support is used to extract the global features of the image for image tasks; it inherits the good hardware support and low computing overhead of convolutional neural networks while integrating the ViT model's ability to extract global features, improving the accuracy of the neural network model.
  • When making the neural network model lightweight, compressing the model lowers its recognition accuracy, because the model relies on a deep network structure to achieve high image recognition accuracy. Therefore, when building the neural network model, the improved, purely convolution-based MetaFormer structure is integrated into it to extract the global features of the image to be recognized, and the recognition results are obtained by performing feature extraction and image recognition on the image to be recognized through the neural network model containing the MetaFormer structure.
  • The features of the image to be recognized are extracted and recognized sequentially by the trained neural network model to obtain the recognition results.
  • The above neural network model extracts the global features of the image to be recognized based on the purely convolutional MetaFormer structure. Since the MetaFormer structure performs image recognition tasks by extracting global features and has good image recognition accuracy, the neural network model can be made lightweight while retaining good image recognition accuracy, thus improving the accuracy of the neural network model.
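  • Following the general MetaFormer formulation from the literature (a token-mixing sub-block and a feedforward sub-block, each wrapped in a residual connection around a normalization layer), the structure used here can be summarized as:

$$y = x + \operatorname{TokenMixer}(\operatorname{Norm}(x)), \qquad z = y + \operatorname{FFN}(\operatorname{Norm}(y))$$

  • In the present application the token mixer is realized with pure convolution instead of self-attention, which is what preserves the hardware friendliness and low computing overhead of convolutional networks.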
  • The trained neural network model is used to sequentially perform feature extraction and recognition on the image to be detected, thereby outputting the corresponding image recognition results. Since the above neural network model extracts the global features of the image to be recognized based on the convolutional MetaFormer structure, it can pay attention to the global characteristics of the image when performing image recognition tasks. This improves the image recognition accuracy of the model and avoids the accuracy loss caused by making it lightweight, thereby improving the recognition accuracy of the neural network model and the actual deployment effect of the image recognition applications based on it.
  • For the image recognition tasks of actually deployed applications, the neural network model needs to be trained accordingly.
  • For example, for pedestrian detection or pedestrian re-identification, the above neural network model needs to be trained so that it has the ability to detect and identify pedestrians and their location information.
  • The required standard annotated data set can be obtained from an image database as a training set, or an unlabeled data set can be used as a training set.
  • A verification set can also be used to verify the generalization ability of the above-mentioned neural network model, in order to evaluate the model and decide whether to stop training; and/or the generalization ability of the model can be evaluated through a test set.
  • The neural network model is trained according to the actual needs of the deployed application, so as to more accurately realize the image recognition tasks of different deployed applications.
  • The above-mentioned image to be recognized may be an image captured by a camera device, or may be an image frame in a video stream captured by a camera device.
  • In each application field, the images to be recognized are obtained according to the corresponding collection methods and collection rules. For example, in applications in the transportation field, such as the detection of traffic violations, it is necessary to collect real-time video through a camera installed at a fixed position, and obtain continuous image frames of the video stream as images to be detected to perform the corresponding image recognition tasks.
  • The corresponding collection methods and collection rules are used to obtain images to be recognized that meet the requirements of the image recognition task, so as to perform the image recognition task.
  • The neural network model of the above steps includes a feature extraction module and a recognition module, and the above feature extraction module includes a global feature extraction module built based on the convolutional MetaFormer structure.
  • The above-mentioned image to be recognized is recognized through the above-mentioned neural network model, and the recognition results are obtained, including:
  • Image recognition tasks include image classification, target detection, semantic segmentation, instance segmentation, etc.
  • The feature extraction module is used to extract features from the image to be recognized, and the extracted features are identified based on the features required by each image recognition task and the corresponding recognition method, obtaining the corresponding recognition results.
  • For example, pedestrian detection is applied in the fields of intelligent driving, intelligent monitoring, pedestrian analysis, and intelligent robots to determine whether the input picture or video image frame contains pedestrians.
  • Feature extraction is performed on the image to be recognized by the feature extraction module. Since each image recognition task requires different features of the image to be recognized and different methods for identifying those features, the recognition module obtains the features required by the corresponding image recognition task, identifies them, and obtains the corresponding recognition results, improving the recognition accuracy and efficiency of each image recognition task.
  • The above feature extraction module further includes a first convolution module.
  • Step A1 includes:
  • A11: Extract local features of the image to be recognized based on the first convolution module to obtain a first local feature image.
  • The above-mentioned first convolution module is constructed based on a lightweight convolutional neural network structure to reduce its computation, for example based on lightweight network structures such as MobileNetV2 or SqueezeNet.
  • MobileNetV2 is a lightweight neural network that replaces ordinary convolution with depthwise separable convolution, and introduces a linear bottleneck to improve the expressive ability of the model and avoid the loss of feature information caused by nonlinear transformations. It expands the feature map channels through an inverted residual structure to avoid gradient vanishing or explosion, which enriches the number of features and thereby improves accuracy.
  • SqueezeNet is a simplified lightweight convolutional neural network structure. It uses a customized convolution module to compress and then expand the number of channels of the data, and further compresses the parameters through Deep Compression to achieve an ultra-lightweight effect, suitable for terminal devices with weak computing power.
  • Depthwise separable convolution refers to dividing a general convolution into two steps: channel-wise (depthwise) convolution and point-wise convolution.
  • Depthwise separable convolution can be defined by two independent layers: a lightweight depthwise convolution for spatial filtering and a 1*1 pointwise convolution for feature generation.
  • The image is convolved channel by channel through the lightweight depthwise convolution: one convolution kernel is responsible for one channel, performing a convolution operation on each channel without changing the depth of the input feature image, and producing an output feature map with the same number of channels as the input.
  • A 1*1 convolution is then performed on the output feature map to raise or lower its dimension, weighting and combining the feature information of the channels in the depth direction without changing the spatial size of the feature map.
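  • As a minimal PyTorch sketch of the depthwise separable convolution just described (the channel sizes and input shape are illustrative assumptions, not values from the application):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (channel-by-channel) conv followed by a 1*1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # One 3*3 kernel per input channel (groups=in_ch): spatial filtering
        # that leaves the number of channels unchanged.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch, bias=False)
        # 1*1 convolution: weights and combines channel information, raising or
        # lowering the channel dimension without changing the spatial size.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)           # illustrative input
y = DepthwiseSeparableConv(32, 64)(x)    # -> (1, 64, 56, 56)
```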
  • The first local feature image extracted by the above-mentioned first convolution module is used as the input of the global feature extraction module, and global features are extracted from it through the above-mentioned convolution-based MetaFormer structure, obtaining the global features of the image to be recognized.
  • The lightweight network structure used to extract the first local features reduces the parameters and computation of the neural network model, thereby speeding up its image recognition rate, while the convolution-based MetaFormer structure extracts global features so that subsequent image recognition tasks can be carried out based on them, reducing the impact of the lightweight network structure on the accuracy of the neural network model and thereby improving recognition accuracy.
  • The above-mentioned feature extraction module also includes a second convolution module and a fusion module, and the above-mentioned second convolution module uses a lightweight convolutional neural network to extract second local features.
  • Step A1 also includes:
  • The above-mentioned second convolution module is constructed based on a lightweight convolutional neural network structure such as MobileNetV2 or SqueezeNet to reduce its computation, and extracts the second local features; its principle is consistent with that of the above-mentioned first convolution module and will not be described again here.
  • The above second local features and the above global features are fused to obtain the fusion features.
  • Specifically, the above-mentioned second local features and global features are connected in the channel direction to obtain fusion features describing both the local features and the global features, so as to improve the information representation ability of the feature image and thereby the accuracy of subsequent image recognition.
  • In this way, fusion features that can represent both the local features and the global features are obtained, which improves the information representation ability of the feature image, so that corresponding image recognition can subsequently be performed based on the fusion features (see the sketch below), yielding relatively accurate image recognition results and thereby improving the recognition accuracy of the neural network model.
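  • A one-line sketch of the channel-direction fusion described above, assuming (for illustration) that the second local features and the global features share the same spatial resolution:

```python
import torch

local_feats = torch.randn(1, 64, 28, 28)    # second local features (illustrative shape)
global_feats = torch.randn(1, 64, 28, 28)   # global features from the MetaFormer branch
# Connect along the channel direction: the fusion features describe both
# the local and the global information of the image to be recognized.
fused = torch.cat([local_feats, global_feats], dim=1)   # -> (1, 128, 28, 28)
```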
  • The above-mentioned global feature extraction module includes a first residual module, a global feature sub-extraction module, a first addition module, a second residual module, a feedforward network module, and a second addition module.
  • Step A12 specifically includes:
  • The above-mentioned first addition module is used to add and merge the input data of the above-mentioned first residual module and the output data of the above-mentioned global feature sub-extraction module;
  • the above-mentioned second addition module is used to add and merge the input data of the above-mentioned second residual module and the output data of the above-mentioned feedforward network module;
  • the above-mentioned global feature sub-extraction module includes a first branch, a second branch, and a merging module;
  • the above-mentioned first branch is used to extract local features of the input image through depthwise separable convolution;
  • the above-mentioned second branch is used to extract global features from the input image;
  • the above-mentioned merging module is used to merge the features output by the above-mentioned first branch and the above-mentioned second branch according to pixel position to obtain the global features.
  • The above-mentioned first residual module and second residual module apply a residual structure around the above-mentioned global feature sub-extraction module and feedforward network module, respectively, to alleviate gradient explosion and network degradation.
  • The network structure of the above-mentioned global feature extraction module is shown in Figure 2.
  • The above-mentioned first local feature image is used as the input of the first residual module; the first branch of the above-mentioned global feature sub-extraction module performs local feature extraction on the input first local feature image, the second branch performs global feature extraction on it, and the above-mentioned merging module merges the features output by the two branches to obtain the global features of the image to be recognized, as sketched below.
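  • A structural sketch of this block, assuming BN as the normalization (as in the Figure 3 variant described below) and a simple two-layer 1*1-convolution feedforward network; the expansion ratio and activation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GlobalFeatureExtractionBlock(nn.Module):
    """Residual token mixer plus residual feedforward, all convolutional."""
    def __init__(self, channels: int, token_mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(channels)   # normalization before the mixer
        self.mixer = token_mixer                # global feature sub-extraction module
        self.norm2 = nn.BatchNorm2d(channels)
        # Feedforward network module as two 1*1 convolutions (assumed design).
        self.ffn = nn.Sequential(
            nn.Conv2d(channels, 4 * channels, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(4 * channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First addition module: input of the first residual module plus the
        # output of the global feature sub-extraction module.
        x = x + self.mixer(self.norm1(x))
        # Second addition module: input of the second residual module plus the
        # output of the feedforward network module.
        return x + self.ffn(self.norm2(x))
```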
  • The above-mentioned first branch performs local feature extraction on the input first local feature image based on depthwise separable convolution to reduce the number of parameters and the operation cost.
  • For example, the first branch uses a convolution with a kernel size of 3*3 and a stride of 1, and the input edge is padded with a ring of zeros to keep the resolution of the convolved image unchanged; that is, the above first branch outputs a feature map with the same size as the input first local feature image.
  • The above-mentioned first branch can also use separable convolution, grouped convolution, or other conventional convolutions to extract local features from the input first local feature image.
  • The above-mentioned convolution operations are conventional and will not be described in detail here.
  • When the above-mentioned merging module combines the features output by the first branch and the second branch, the features are added element-wise by pixel position to obtain the global features, without increasing the feature dimension.
  • The resulting global features describe more information; that is, they contain the feature information of both the local features and the global features.
  • Local features are extracted through depthwise separable convolution, and the local features and the extracted global features are added to obtain a global feature image containing local features. Since the local features and global features are merged through addition, the dimension of the obtained feature map, that is, the number of channels, remains unchanged, but the amount of information it describes increases, thereby improving the accuracy of subsequent image recognition without increasing the amount of calculation, and thus the accuracy of the neural network model.
  • Another structure of the above-mentioned global feature extraction module is shown in Figure 3, in which a BN (Batch Normalization) layer is added.
  • The normalization layer normalizes the input image to prevent the distribution of intermediate-layer data from shifting during training of the neural network model, which would cause gradient vanishing or explosion, and it speeds up the training of the neural network model.
  • The above-mentioned second branch is specifically used to:
  • B1: Divide the input image into N groups, and convolve each group using a large convolution kernel that covers all pixels in the group, each group corresponding to one feature vector, giving N feature vectors in total, where N is a positive integer greater than 1.
  • B2: Divide each feature vector into N groups along the channel direction, then shuffle and rearrange them and combine them to generate N new feature vectors. The channel order of the original feature vectors is disrupted, allowing feature information to flow between different channels to achieve information exchange and fusion.
  • Each of the N new feature vectors generated is composed of N groups from different feature vectors; that is, the N groups that constitute a new feature vector come from N different feature vectors. This guarantees that the generated new feature vectors contain the feature information of the other feature vectors, so that each newly generated feature vector can characterize the entire input image.
  • B3: Rearrange the N new feature vectors to form a sparse feature map, filling the parts of the sparse feature map without actual data with zeros, so that the feature vector information can then be spread to each position.
  • B4: Convolve the sparse feature map to diffuse the feature information and obtain a dense feature map, i.e., the global features.
  • Local features are extracted from the input image based on the depthwise separable convolution of the first branch, steps B1-B4 are performed based on the second branch to extract global features from the input image, and the extracted local features and global features are merged to obtain the global features.
  • For example, as shown in Figure 4, the input image size is 8*8. The input image is processed by a depthwise separable convolution with a kernel size k of 3*3, a stride s of 1, and a padding p of 1 to extract local features. At the same time, taking every 4*4 pixels as a group, the input image is divided into 4 groups, and a large-stride convolution with a kernel size of 4*4, a stride of 4, and a padding of 0 is applied to the input image to obtain 4 feature vectors. Channel shuffling is performed on the 4 feature vectors, that is, they are scrambled and recombined to obtain 4 new feature vectors. These 4 new feature vectors are sparsified to form a sparse feature map, and the sparse feature map is then convolved (with a kernel size of 3*3, a stride s, and a padding of 2) to spread its information and obtain a dense feature map (i.e., the global features). Finally, the extracted local features and global features are added and merged by pixel position to obtain the output global features, as sketched below.
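  • A sketch of the two-branch global feature sub-extraction for this 8*8 worked example. The channel shuffle is implemented here as a transpose of channel groups against vector positions, and the diffusion kernel parameters are assumptions chosen to preserve the 8*8 resolution:

```python
import torch
import torch.nn as nn

class GlobalFeatureSubExtraction(nn.Module):
    """Two-branch token mixer, sketched for the 8*8 worked example (N = 4)."""
    def __init__(self, c: int):
        super().__init__()
        self.g = 4  # N: number of groups, equal to the number of feature vectors
        # First branch: 3*3 depthwise convolution, stride 1, padding 1.
        self.local = nn.Conv2d(c, c, 3, stride=1, padding=1, groups=c, bias=False)
        # B1: 4*4 convolution, stride 4, padding 0 -> one feature vector per
        # 4*4 group, i.e. a 2*2 grid of 4 vectors for an 8*8 input.
        self.group_conv = nn.Conv2d(c, c, 4, stride=4, padding=0, bias=False)
        # B4: a convolution spreads the sparse map into a dense one; the 5*5
        # kernel with padding 2 is an assumption chosen to keep the 8*8 size.
        self.diffuse = nn.Conv2d(c, c, 5, stride=1, padding=2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        assert h == 8 and w == 8 and c % self.g == 0  # worked-example shapes
        local = self.local(x)                  # first-branch local features
        v = self.group_conv(x)                 # (b, c, 2, 2): the 4 feature vectors
        # B2: channel shuffle. Split each vector into 4 channel groups, then
        # transpose groups against vector positions, so every new vector is
        # composed of one group from each of the 4 original vectors.
        t = v.flatten(2).view(b, self.g, c // self.g, self.g)
        t = t.transpose(1, 3).reshape(b, c, 2, 2)
        # B3: rearrange into a sparse 8*8 map, zero-filled where no data exists.
        sparse = x.new_zeros(b, c, h, w)
        sparse[:, :, ::4, ::4] = t
        dense = self.diffuse(sparse)           # B4: dense map = global features
        # Merging module: add the branches by pixel position, so the channel
        # count is unchanged while the information content increases.
        return local + dense
```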
  • The above-mentioned recognition module includes an image classification sub-model.
  • Step A2 includes:
  • Image classification refers to distinguishing targets of different categories based on the different features reflected in images.
  • The extracted fusion features are input into the trained image classification sub-model, and the image classification result is output.
  • For example, in face recognition, the extracted fusion features of the face image are classified through the image classification sub-model to obtain the category of the face image, that is, which specific person it belongs to.
  • The extracted fusion features are classified through the trained image classification sub-model to obtain the image classification results. Since the classification is based on the fusion features, which describe more semantic and detailed information, the image classification sub-model classifies targets more accurately, thereby improving the recognition accuracy of the neural network model.
  • The above image classification sub-model includes a third convolution module.
  • Step A21 includes:
  • classifying the extracted features to obtain the image classification results.
  • The above-mentioned third convolution module constructs a fully connected layer based on convolution to classify the extracted features.
  • In the neural network model used for image classification tasks (Figure 5), the feature extraction module extracts local features of the image to be recognized through the first convolution module C1 to obtain the first local features; the second convolution module C2 and the global feature extraction module E, stacked N1 and M1 times respectively, extract features from the first local feature image; and the fusion module F fuses the obtained second local features and global features to obtain the fusion features.
  • The third convolution module C (that is, the fully connected layer) receives the fusion features required for the image classification task. Based on the above fully connected layer, the extracted fusion features are mapped to the sample label space to obtain the probability that the image to be recognized belongs to each category, and the label with the largest probability is selected as the image classification result, thus achieving the image classification task.
  • In this way, the extracted features are classified through a convolution operation, and the classification result is obtained according to the sample labels, as sketched below.
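  • A sketch of such a convolutional classification head; the pooling step and the tensor sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ConvClassifierHead(nn.Module):
    """1*1 convolution acting as the fully connected layer over fusion features."""
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # collapse spatial dims (assumed)
        self.fc = nn.Conv2d(in_ch, num_classes, 1)   # map features to label space

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        logits = self.fc(self.pool(fused)).flatten(1)   # (batch, num_classes)
        return logits.softmax(dim=1)   # probability of belonging to each category

fused = torch.randn(1, 128, 7, 7)             # illustrative fusion features
probs = ConvClassifierHead(128, 10)(fused)
pred = probs.argmax(dim=1)                    # label with the largest probability
```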
  • Before performing the image classification task based on the neural network model including the above image classification sub-model, the method also includes:
  • training the neural network model including the above image classification sub-model for image classification.
  • The above-mentioned neural network model including the image classification sub-model is trained using common training methods to achieve the image classification task. For example, according to the actual image classification requirements of the deployed application, the corresponding training set is obtained, and the above-mentioned neural network model including the image classification sub-model is iteratively trained on it until the model meets the preset conditions, yielding the trained neural network model.
  • The above-mentioned neural network model including the image classification sub-model is thus trained for image classification, so as to realize the image classification task of the deployed application.
  • The above recognition module also includes a target detection sub-model.
  • Step A2 also includes:
  • A22: Based on the above target detection sub-model, perform target detection processing on the extracted features to obtain target detection results.
  • Since the task of target detection is to find all targets (objects) of interest in the image and determine their category and location, and the shapes and sizes of targets vary, multi-scale detection based on the SSD (Single Shot MultiBox Detector) algorithm is used: it receives extracted features of different sizes and detects them using a priori frames of different scales and aspect ratios, thereby outputting the confidence of each category in each detection frame and the offset of the detection frame relative to the a priori frame, that is, the position information of the detection frame.
  • The above SSD algorithm is a one-stage target detection algorithm. It performs dense sampling at different positions of the image, sets a priori frames of different scales and aspect ratios during sampling, and then uses a convolutional neural network to extract the image features in the a priori frames and directly classify and regress them, which has the advantage of fast detection speed.
  • The SSD algorithm uses the idea of grid division to scan the feature maps of different convolution layers, detecting targets of different sizes based on feature maps of different scales: small objects are detected based on large-scale feature maps, and large objects based on small-scale feature maps, thereby improving detection accuracy.
  • The target detection head applies a priori frames of different scales and aspect ratios to the above first local features and the above fusion features for target detection, obtaining the confidence of each category in each detection frame and the offset of the detection frame relative to the a priori frame.
  • For example, when the above neural network model is applied to face recognition applications such as access control systems, based on the above target detection sub-model, the extracted first local features and fusion features of the image to be recognized are detected using a priori frames of different scales and aspect ratios to obtain a detection frame containing the face; subsequent face recognition then operates on this detection frame, limiting the processing area of the algorithm from the entire image to the face area of the detection frame.
  • Multi-scale detection of the image to be detected can also be achieved by scaling the image to different scales, predicting pyramid features, etc.
  • The above-mentioned multi-scale detection methods are conventional and will not be described in detail here.
  • Target detection is performed based on feature maps of different scales and using a priori frames of different scales and aspect ratios, so targets of different sizes can be detected, thereby improving the accuracy of target detection and thus the accuracy of the above-mentioned neural network model.
  • The above step A22 specifically includes:
  • the above target detection sub-model detects the location of the target in the image and/or the category of the target based on the Non-Maximum Suppression (NMS) algorithm.
  • The neural network model used for target detection tasks, as shown in Figure 6, extracts local features of the image to be recognized through the first convolution module C1 to obtain the first local features; the second convolution module C2 and the global feature extraction module E, stacked N1 and M1 times respectively, extract features from the first local feature image; and the fusion module F fuses the obtained second local features and global features to obtain the fusion features. The target detection head then detects the extracted first local features and fusion features at different scales to obtain the detection frames and probability values of the targets in the image to be recognized, and overlapping detection frames are filtered out through the non-maximum suppression (NMS) algorithm to obtain the best detection frames, determining the location of the target in the image to be recognized and/or the category to which it belongs.
  • Non-maximum suppression is a refinement technique used to suppress targets that are not maximum values, thereby searching for the targets with local maximum (optimal) values.
  • Alternatively, the required detection frames can also be obtained based on template matching, clustering algorithms, and other methods, so as to obtain the corresponding detection results.
  • The overlapping detection frames are removed based on non-maximum suppression to obtain the best detection frames, thereby obtaining the location of the target in the image and/or the category to which it belongs; this reduces interference and makes the target detection results more accurate. A minimal sketch follows.
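  • A sketch of greedy non-maximum suppression as used above (torchvision.ops.nms provides an equivalent built-in; the IoU threshold is an illustrative assumption):

```python
import torch

def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thr: float = 0.5):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,).
    Returns the indices of the kept (best) detection frames."""
    order = scores.argsort(descending=True)   # highest-confidence first
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)                        # keep the current best frame
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the best frame with the remaining frames.
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Suppress frames that overlap the best one too much (non-maxima).
        order = rest[iou <= iou_thr]
    return torch.tensor(keep, dtype=torch.long)
```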
  • Before the above neural network model including the target detection sub-model performs the target detection task, the method also includes:
  • training the above-mentioned neural network model including the target detection sub-model using common training methods to achieve the target detection task. For example, according to the actual target detection requirements of the deployed application, a labeled training set is obtained, two or more neural network models including the above target detection sub-model are trained on it, and the resulting models are fused to obtain the trained neural network model.
  • The above-mentioned neural network model including the target detection sub-model is thus trained for target detection, so as to achieve the target detection task of the deployed application.
  • When the neural network model includes both an image classification sub-model and a target detection sub-model, the two sub-models can be trained separately and then fused to achieve the image classification and target detection tasks.
  • The above recognition module also includes a semantic segmentation sub-model.
  • Step A2 also includes:
  • Semantic segmentation combines image classification, target detection, and image segmentation technology. It segments the image into regions with certain semantic meaning and identifies the semantic category of each region, obtaining a segmented image with pixel-by-pixel semantic annotation.
  • For example, the semantic segmentation of faces usually involves classes such as skin, hair, eyes, mouth, nose, and background.
  • The fusion features of the image to be recognized are extracted through the neural network model.
  • Since the above-mentioned fusion features contain information such as the semantics and details of the image to be recognized, the fusion features are segmented through the trained semantic segmentation sub-model to obtain the semantic segmentation result, for example of the face.
  • The fusion features contain information such as semantics, location, and details, and pixel-by-pixel semantic segmentation is performed based on them, making the segmentation results of semantic segmentation more accurate.
  • The above-mentioned semantic segmentation sub-model includes a segmentation module, a merging module, and a fourth convolution module.
  • Step A23 includes:
  • performing multi-scale convolution processing on the extracted features based on the segmentation module to obtain feature maps of different sizes;
  • merging the above feature maps of different sizes along the channel direction through the merging module;
  • convolving the feature map output by the above merging module through the fourth convolution module to obtain the semantic segmentation result.
  • In the neural network model used for semantic segmentation tasks, as shown in Figure 8, local features of the image to be recognized are extracted through the first convolution module C1 to obtain the first local features; the second convolution module C2 and the global feature extraction module E, stacked N1 and M1 times respectively, extract features from the first local feature image; and the obtained second local features and global features are fused through the fusion module F. After the fusion features are obtained, they are subjected to multi-scale convolution processing through the convolution branches of the segmentation module to obtain feature maps of different sizes; the feature maps are merged along the channel direction through the merging module; and the merged feature map is subjected to 1*1 convolution processing through the fourth convolution module C to achieve cross-channel fusion of the feature map information and output the segmentation result.
  • The feature maps of different sizes make the segmentation of objects of different scales more accurate, which improves the accuracy of semantic segmentation, and the point convolution applied to the merged feature map fuses its information across channels, further improving the accuracy of semantic segmentation.
  • The above-mentioned segmentation module includes M parallel convolution branches, where the uppermost convolution branch adopts a 1*1 convolution and the other convolution branches adopt dilated convolutions with sequentially increasing dilation factors, M being a positive integer greater than 1.
  • When the above-mentioned step A23 performs multi-scale convolution processing on the extracted features based on the above-mentioned segmentation module, it includes:
  • Since semantic segmentation is a pixel-level classification task, guiding pixel classification through semantic information requires feature images that are high-resolution and rich in semantic information. Dilated convolution can effectively expand the receptive field of the semantic segmentation sub-model without increasing its parameters; therefore, semantic features at multiple scales are extracted using parallel dilated convolutions with different dilation factors.
  • For example, the segmentation module uses the convolution branch structure shown in Figure 9: the top convolution branch uses a 1*1 convolution, and the other three branches use 3*3 convolution kernels with dilation factors of 6, 12, and 18, respectively, to extract features from the fusion features, thereby obtaining feature maps of different scales with richer semantic features.
  • Since dilated convolution can expand the receptive field without increasing the number of parameters of the neural network model, using dilated convolutions with different dilation factors to extract features from the fusion features yields semantically rich feature maps of different scales, which can effectively improve the accuracy of semantic segmentation and thereby the recognition accuracy of the neural network model. A sketch of such a head follows.
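  • A sketch of a multi-scale segmentation head matching the Figure 9 description: one 1*1 branch plus three 3*3 dilated branches (dilations 6, 12, 18), merged along channels and fused by a final 1*1 convolution. The branch width and class count are assumptions; padding equal to the dilation keeps the spatial size, so "different sizes" is read here as different receptive-field scales:

```python
import torch
import torch.nn as nn

class MultiScaleSegmentationHead(nn.Module):
    """Parallel 1*1 and dilated 3*3 branches, channel merge, 1*1 fusion."""
    def __init__(self, in_ch: int, branch_ch: int, num_classes: int):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(in_ch, branch_ch, 1)])
        for d in (6, 12, 18):
            # padding = dilation keeps the spatial size of a 3*3 conv unchanged.
            self.branches.append(
                nn.Conv2d(in_ch, branch_ch, 3, padding=d, dilation=d))
        # Fourth convolution module: 1*1 conv for cross-channel fusion.
        self.fuse = nn.Conv2d(4 * branch_ch, num_classes, 1)

    def forward(self, fused_feats: torch.Tensor) -> torch.Tensor:
        maps = [b(fused_feats) for b in self.branches]   # multi-scale feature maps
        merged = torch.cat(maps, dim=1)                  # merge along channels
        return self.fuse(merged)                         # per-pixel class scores
```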
  • Before the above-mentioned neural network model including the semantic segmentation sub-model performs the semantic segmentation task, the method also includes:
  • performing semantic segmentation training on the neural network model including the above semantic segmentation sub-model.
  • The above-mentioned neural network model including the semantic segmentation sub-model is trained using conventional training methods to achieve the semantic segmentation task.
  • The recognition module of the above-mentioned neural network model may include one or more of the image classification sub-model, target detection sub-model, and semantic segmentation sub-model, but is not limited to these.
  • The specific functions of the recognition module in the neural network model are set according to the image recognition task requirements of the deployed application, and features are extracted by the above-mentioned MetaFormer-based feature extraction module to perform the corresponding image recognition task.
  • The sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • Figure 10 shows a structural block diagram of the image recognition device provided by an embodiment of the present application. For convenience of explanation, only the parts related to the embodiment of the present application are shown.
  • The device includes an input module 101 and a trained neural network model 102.
  • The above neural network model extracts global features of the image to be recognized based on a purely convolutional MetaFormer structure, wherein:
  • the input module 101 is used to input the image to be recognized into the trained neural network model;
  • the neural network model 102 is used to sequentially perform feature extraction and recognition on the above-mentioned image to be recognized to obtain recognition results.
  • The trained neural network model is used to sequentially perform feature extraction and recognition on the image to be detected, thereby outputting the corresponding image recognition results. Since the above neural network model extracts the global features of the image to be recognized based on the convolutional MetaFormer structure, it can pay attention to the global characteristics of the image when performing image recognition tasks, improving its image recognition accuracy and avoiding the accuracy loss caused by making the model lightweight, thereby improving the recognition accuracy of the neural network model and the actual deployment effect of the image recognition applications based on it.
  • the above image recognition device further includes:
  • Image acquisition module used to acquire images to be recognized.
  • The above-mentioned neural network model includes:
  • the feature extraction module is used to extract features from the image to be recognized through the above-mentioned neural network model.
  • the recognition module is used to recognize the image to be recognized through the above neural network model and obtain the recognition result.
  • the above feature extraction module includes:
  • the global feature extraction module is used to extract the global features of the image to be recognized based on the Meta Former structure of pure convolution.
  • the above feature extraction module also includes:
  • the first convolution module is used to extract local features of the above-mentioned image to be recognized to obtain a first local feature image.
  • the above-mentioned global feature extraction module is used to extract global features from the above-mentioned first local feature image to obtain global features.
  • the above feature extraction module also includes:
  • the second convolution module is used to extract local features from the above-mentioned first local feature image to obtain second local features.
  • the fusion module is used to fuse the above-mentioned second local features with the above-mentioned global features to obtain fusion features.
  • The above-mentioned global feature extraction module includes:
  • the first addition module, used to add and merge the input data of the first residual module and the output data of the above-mentioned global feature sub-extraction module;
  • the second addition module, used to add and merge the input data of the second residual module and the output data of the above-mentioned feedforward network module;
  • the global feature sub-extraction module, used to extract global features from the input data.
  • the above-mentioned global feature sub-extraction module includes:
  • the first branch is used to extract local features of the input image through depthwise separable convolution.
  • the second branch is used to extract global features from the input image.
  • the merging module is used to merge the features output by the above-mentioned first branch and the above-mentioned second branch according to the pixel position to obtain global features.
  • the above-mentioned second branch includes:
  • the convolution unit, used to perform a convolution operation on the input image to obtain N feature vectors, where N is a positive integer greater than 1;
  • the channel shuffling unit, used to perform a channel shuffling operation on the feature vectors to obtain N new feature vectors;
  • the sparsification unit, used to sparsely rearrange the above-mentioned N new feature vectors to obtain a sparse feature map;
  • the diffusion unit, used to diffuse the above sparse feature map into a dense feature map through a convolution operation and output it.
  • The above-mentioned recognition module includes:
  • the image classification sub-model, used to classify the extracted features to obtain image classification results.
  • The above image classification sub-model includes:
  • the third convolution unit, used to classify the extracted features to obtain image classification results.
  • the above-mentioned identification module also includes:
  • the target detection sub-model, used to perform target detection processing on the extracted features to obtain target detection results.
  • the above-mentioned target detection sub-model includes:
  • the detection unit, used to detect the location of the target and/or the category of the target in the image based on the Non-Maximum Suppression (NMS) algorithm.
  • the above-mentioned identification module also includes:
  • the semantic segmentation sub-model, used to segment the extracted features to obtain semantic segmentation results.
  • the above-mentioned semantic segmentation sub-model includes:
  • the multi-scale convolution unit is used to perform multi-scale convolution processing on the extracted features to obtain feature maps of different sizes.
  • the feature merging unit is used to merge the above-mentioned feature maps of different sizes along the channel direction.
  • the fourth convolution unit is used to perform convolution processing on the feature map output by the above-mentioned feature merging unit to obtain the semantic segmentation result.
  • the above-mentioned multi-scale convolution unit includes:
  • the convolution branch unit is used to extract features based on the parallel convolution branch.
  • Figure 11 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • The terminal device 11 of this embodiment includes: at least one processor 110 (only one is shown in Figure 11), a memory 111, and a computer program 112 stored in the memory 111 and executable on the at least one processor 110. When the processor 110 executes the computer program 112, the steps in any of the above method embodiments are implemented.
  • The computer program 112 can be divided into one or more modules/units, which are stored in the memory 111 and executed by the processor 110 to complete the present application.
  • The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 112 in the terminal device 11.
  • the computer program 112 can be divided into an input module 101 and a trained neural network model 102.
  • the above neural network model extracts global features of the image to be recognized based on a purely convolutional Meta Former structure.
  • The specific functions of each module are as follows:
  • the input module 101 is used to input the image to be recognized into the trained neural network model
  • the neural network model 102 is used to sequentially perform feature extraction and recognition on the above-mentioned images to be recognized to obtain recognition results.
  • The terminal device 11 may be a computing device such as a desktop computer, a notebook computer, a PDA, or a cloud server.
  • The terminal device may include, but is not limited to, a processor 110 and a memory 111.
  • Figure 11 is only an example of the terminal device 11 and does not constitute a limitation on it; the terminal device may include more or fewer components than shown, some components may be combined, or different components may be used; for example, it may also include input and output devices, network access devices, etc.
  • The processor 110 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc.
  • the memory 111 may be an internal storage unit of the terminal device 11 , such as a hard disk or memory of the terminal device 11 .
  • the memory 111 may also be an external storage device of the terminal device 11, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the terminal device 11.
  • the memory 111 may also include both an internal storage unit of the terminal device 11 and an external storage device.
  • the memory 111 is used to store operating systems, application programs, boot loaders, data and other programs, such as program codes of the computer programs.
  • the memory 111 can also be used to temporarily store data that has been output or is to be output.
  • the division into modules described above is merely a division by logical function; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
  • Each functional unit and module in the embodiment can be integrated into one processing unit, or each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or in the form of software functional units.
  • the specific names of each functional unit and module are only for the convenience of distinguishing them from each other and are not used to limit the scope of protection of the present application.
  • For the specific working processes of the units and modules in the above system, please refer to the corresponding processes in the foregoing method embodiments; they will not be described again here.
  • An embodiment of the present application also provides a network device.
  • the network device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor.
  • when the processor executes the computer program, the steps in any of the above method embodiments are implemented.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • when the computer program is executed by a processor, the steps in each of the above method embodiments are implemented.
  • Embodiments of the present application provide a computer program product.
  • when the computer program product is run on a terminal device, the steps in each of the above method embodiments are implemented.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • this application can implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium.
  • when the computer program is executed by a processor, the steps of each of the above method embodiments may be implemented.
  • the computer program includes computer program code, which may be in the form of source code, object code, executable file or some intermediate form.
  • the computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the camera device/terminal device, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunications signals, and software distribution media such as a USB flash drive, a mobile hard disk, a magnetic disk, or a CD.
  • In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals and telecommunications signals.
  • the disclosed devices/network devices and methods can be implemented in other ways.
  • the apparatus/network equipment embodiments described above are only illustrative.
  • the division of modules or units is only a division by logical function; in actual implementation, there may be other division methods, for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.


Abstract

The present application relates to the technical field of computer vision, and provides an image recognition method and apparatus based on a neural network model, and a terminal device, wherein the neural network model uses a pure-convolution-based Meta Former structure to extract global features of an image to be recognized. The image recognition method comprises: inputting an image to be recognized into a trained neural network model, and sequentially performing feature extraction and recognition on the image to be recognized using the neural network model, thereby obtaining a recognition result. The present application achieves a lightweight model while simultaneously improving image recognition accuracy.

Description

Image recognition method and apparatus based on neural network model, and terminal device
This application claims priority to the Chinese patent application No. 202210763998.5, filed with the China Patent Office on June 30, 2022 and entitled "Image recognition method, apparatus and terminal device based on neural network model", the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the field of computer vision technology, and in particular relates to an image recognition method and apparatus based on a neural network model, a terminal device, and a computer-readable storage medium.
Background
In recent years, neural networks have been widely used to solve image recognition tasks in computer vision such as image classification, target detection, and image segmentation. Image classification, as a basic task of computer vision, is the main support for target detection and semantic segmentation. Target detection, as the core of computer vision tasks, is the basis for scene understanding and cognition: in scenarios such as face recognition, pedestrian tracking, and autonomous driving, detecting the targets of interest is the prerequisite for understanding the scene. Semantic segmentation, in turn, paves the way for a complete understanding of the scene, and more and more applications obtain knowledge from image data through semantic segmentation, including autonomous driving, human-computer interaction, virtual reality, and medical image analysis. Therefore, research on image recognition tasks such as image classification and target detection has become a hot topic in the field of computer vision.
A lightweight model is one that can run smoothly on terminal devices with low computing power and has low computational overhead. Since convolutional neural networks have many parameters and require a large amount of computation, while terminal devices such as embedded and mobile devices have limited computing power and storage resources, making neural network models lightweight has become a research hotspot in recent years.
The neural network models currently deployed for image recognition tasks are mainly convolutional neural network models. Image recognition methods based on convolutional neural networks mostly rely on deep network structures to improve detection accuracy; compressing a convolutional neural network model to make it lightweight also affects the recognition accuracy of the network, so the accuracy of lightweight neural network models is relatively reduced.
Summary
Embodiments of the present application provide an image recognition method and apparatus based on a neural network model, and a terminal device, which are beneficial to improving the accuracy of image recognition while realizing a lightweight model.
In a first aspect, embodiments of the present application provide an image recognition method based on a neural network model. The neural network model extracts global features of an image to be recognized through a pure-convolution-based Meta Former structure. The image recognition method includes:
inputting the image to be recognized into the trained neural network model, and sequentially performing feature extraction and recognition on the image to be recognized through the neural network model to obtain a recognition result.
In a second aspect, embodiments of the present application provide an image recognition apparatus. The image recognition apparatus includes:
an input module and a trained neural network model, where the neural network model extracts global features of the image to be recognized based on a pure-convolution Meta Former structure.
The input module is used to input the image to be recognized into the trained neural network model.
The neural network model is used to sequentially perform feature extraction and recognition on the image to be recognized to obtain a recognition result.
In a third aspect, embodiments of the present application provide a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the image recognition method based on a neural network model described in the first aspect is implemented.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium. The computer-readable storage medium stores a computer program. When the computer program is executed by a processor, the steps of the image recognition method based on a neural network model described in the first aspect are implemented.
In a fifth aspect, embodiments of the present application provide a computer program product which, when run on a terminal device, causes the terminal device to execute the image recognition method based on a neural network model described in any one of the above first aspects.
Compared with the prior art, the beneficial effects of the embodiments of the present application are as follows: the image to be recognized is input into a trained neural network model, and feature extraction and recognition are sequentially performed on the image to be recognized through the neural network model to obtain a recognition result. Since the neural network model extracts global features of the image through a convolution-based Meta Former structure, when the neural network model is built on a lightweight convolutional neural network, it can attend to the global features of the image to be recognized during recognition and thus has richer features. This reduces the accuracy loss caused by lightweight convolutional neural networks, which is beneficial to improving the image recognition precision of the neural network model while keeping the model lightweight, thereby improving the accuracy of the neural network model.
Description of Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below.
Figure 1 is a schematic flow chart of an image recognition method based on a neural network model provided by an embodiment of the present application;
Figure 2 is a schematic structural diagram of a global feature extraction module provided by an embodiment of the present application;
Figure 3 is a schematic structural diagram of a global feature extraction module provided by an embodiment of the present application;
Figure 4 is a schematic flow chart of global feature extraction by a global feature sub-extraction module provided by an embodiment of the present application;
Figure 5 is a schematic structural diagram of a neural network model for image classification tasks provided by an embodiment of the present application;
Figure 6 is a schematic structural diagram of a neural network model for target detection tasks provided by an embodiment of the present application;
Figure 7 is a schematic structural diagram of detection frames in a target detection task provided by an embodiment of the present application;
Figure 8 is a schematic structural diagram of a neural network model for semantic segmentation tasks provided by an embodiment of the present application;
Figure 9 is a schematic structural diagram of the convolution branch (segmentation module) of the semantic segmentation sub-model provided by an embodiment of the present application;
Figure 10 is a schematic structural diagram of an image recognition apparatus provided by an embodiment of the present application;
Figure 11 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Detailed Description
In the following description, for the purpose of explanation rather than limitation, specific details such as particular system structures and technologies are set forth to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
In addition, in the description of this application and the appended claims, the terms "first", "second", "third", etc. are used only to distinguish the descriptions and cannot be understood as indicating or implying relative importance.
References in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure, or characteristic described in connection with that embodiment is included in one or more embodiments of the present application. Therefore, the phrases "in one embodiment", "in some embodiments", "in other embodiments", "in still other embodiments", etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless otherwise specifically emphasized.
Embodiment 1:
Figure 1 shows a schematic flow chart of an image recognition method based on a neural network model provided by an embodiment of the present invention, detailed as follows:
The image to be recognized is input into a trained neural network model, and feature extraction and recognition are sequentially performed on the image to be recognized through the neural network model to obtain a recognition result. The neural network model is built on a lightweight convolutional neural network and extracts global features of the image to be recognized through a pure-convolution-based Meta Former structure.
The Meta Former structure is derived from the ViT model used for computer vision tasks. The ViT model applies the self-attention-based Transformer model to image tasks and realizes the self-attention mechanism through matrix operations. Compared with the traditional convolutional-neural-network-based models used in image tasks, the ViT model achieves the same or even better results on large data sets at lower cost; however, applying the ViT model to image tasks requires a large amount of computing resources and imposes heavy hardware requirements, so its hardware support is not friendly. Therefore, extracting the global features of an image for image tasks through an improved pure-convolution Meta Former structure with good hardware support can inherit the good hardware support and low computational overhead of convolutional neural networks while incorporating the ViT model's ability to extract global features, improving the accuracy of the neural network model.
Specifically, when making a neural network model lightweight, since the model relies on a deep network structure to improve image recognition accuracy, compressing it lowers its recognition accuracy. Therefore, when building the neural network model, an improved pure-convolution-based Meta Former structure is integrated into it to extract the global features of the image to be recognized, so that feature extraction and image recognition are performed on the image to be recognized through this neural network model containing the Meta Former structure to obtain the recognition result.
In the embodiments of the present application, feature extraction and recognition are sequentially performed on the image to be recognized based on the trained neural network model to obtain the recognition result, where the neural network model extracts global features of the image to be recognized based on a pure-convolution Meta Former structure. Since the Meta Former structure performs image recognition tasks by extracting global features and has good image recognition precision, it allows the neural network model to be lightweight while retaining good image recognition precision, improving the accuracy of the neural network model.
In the embodiments of the present application, feature extraction and recognition are sequentially performed on the image to be detected through the trained neural network model, and the corresponding image recognition result is output. Since the neural network model extracts the global features of the image to be recognized with a convolution-based Meta Former structure, it can attend to these global features when performing image recognition tasks, which improves the image recognition precision of the model, avoids the accuracy drop caused by making the model lightweight, improves the recognition accuracy of the neural network model, and in turn improves the actual deployment effect of image recognition applications based on this model.
In some embodiments, before performing image recognition based on the above neural network model, the method further includes:
training the constructed neural network model.
Optionally, since the images to be recognized, the recognition targets, and the results of image recognition tasks differ across deployed applications, the neural network model is trained according to the image recognition requirements of the actually deployed application before performing the corresponding image recognition task. For example, to deploy a neural network model for a driverless car that detects pedestrians and identifies their positions for safe and reliable driving, the neural network model needs to be trained for pedestrian detection or pedestrian re-detection so that it has the ability to detect and identify pedestrians and their position information.
Optionally, when training the neural network model, a standard annotated data set can be obtained from an image database as the training set, or an unannotated data set can be used as the training set. When training the neural network model, a validation set can also be used to verify the generalization ability of the model, in order to evaluate the model and decide whether to stop training, and/or a test set can be used to evaluate the generalization ability of the model.
In the embodiments of the present application, before image recognition is performed based on the neural network model, the model is trained according to the actual needs of the deployed application, so that the image recognition tasks of different deployed applications can be carried out more accurately.
In some embodiments, before performing image recognition based on the above neural network model, the method further includes:
acquiring the image to be recognized.
Optionally, the image to be recognized may be an image captured by a camera device, or an image frame in a video stream captured by a camera device.
Optionally, since the image recognition tasks of different application fields may differ, the camera equipment used and the rules for collecting the images to be detected also differ. Therefore, the corresponding image to be recognized is obtained according to the collection method and collection rules of each application field. For example, in transportation applications such as traffic violation detection, real-time video needs to be collected through cameras installed at fixed positions, and consecutive image frames of the video stream are taken as the images to be detected for the corresponding image recognition task.
In the embodiments of the present application, according to the images required by the image recognition task in each application field, the corresponding collection method and collection rules are used to obtain images to be recognized that meet the requirements of the image recognition task.
In some embodiments, the above neural network model includes a feature extraction module and a recognition module, and the feature extraction module includes a global feature extraction module built on the convolution-based Meta Former structure.
Correspondingly, recognizing the image to be recognized through the neural network model to obtain a recognition result includes:
A1. performing feature extraction on the image to be recognized based on the feature extraction module.
A2. recognizing the extracted features based on the recognition module to obtain a recognition result.
Optionally, since image recognition tasks include image classification, target detection, semantic segmentation, instance segmentation, and so on, each task requires different features of the image to be recognized and different methods of recognizing the extracted features. Therefore, the feature extraction module extracts features from the image to be recognized, and the extracted features are recognized based on the features required by each image recognition task and the corresponding recognition method to obtain the corresponding recognition result. For example, pedestrian detection is used in fields such as intelligent driving, intelligent monitoring, pedestrian analysis, and intelligent robots to determine whether an input picture or video frame contains pedestrians. Since pedestrians in an image vary in size, the extracted features of different sizes must be used to detect pedestrians in the image to be recognized and to output their information and positions.
In the embodiments of the present application, feature extraction is performed on the image to be recognized based on the feature extraction module. Since each image recognition task requires different features and recognition methods, the recognition module obtains the features required by the corresponding task and recognizes them to obtain the corresponding recognition result, improving the recognition accuracy and efficiency of each image recognition task.
In some embodiments, the feature extraction module further includes a first convolution module.
Correspondingly, the above step A1 includes:
A11. performing local feature extraction on the image to be recognized based on the first convolution module to obtain a first local feature image.
Optionally, the first convolution module is built on a lightweight convolutional neural network structure, such as MobileNetV2 or SqueezeNet, to reduce its computational cost. MobileNetV2 is a lightweight neural network that replaces ordinary convolution with depthwise separable convolution and introduces linear bottlenecks to improve the expressive ability of the model and avoid the loss of feature information caused by nonlinear transformations; it expands the feature map channels through an inverted residual structure to avoid vanishing or exploding gradients, enriching the number of features and thereby improving accuracy. SqueezeNet is a streamlined lightweight convolutional neural network structure that defines its own convolution module to squeeze and expand the number of channels, and further compresses parameters through Deep Compression to achieve an ultra-lightweight effect, suitable for terminal devices with weak computing power.
The above depthwise separable convolution splits an ordinary convolution into two steps: channel-wise (depthwise) convolution and pointwise convolution. It can be defined by two independent layers: a lightweight depthwise convolution for spatial filtering and a 1*1 pointwise convolution for feature generation. The depthwise convolution convolves the image channel by channel, with one convolution kernel responsible for one channel; each channel is convolved without changing the depth of the input feature image, yielding an output feature map with the same number of channels as the input. The 1*1 convolution then raises or lowers the dimensionality of the output feature map and performs a weighted combination along the depth direction, combining the feature information of each channel without changing the size of the feature map.
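As a minimal sketch only, assuming a PyTorch implementation (PyTorch is not named in the present application) and placeholder channel counts, the depthwise separable convolution described above can be written as:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3*3 convolution followed by a pointwise 1*1 convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Depthwise: one 3*3 kernel per input channel (groups=in_ch),
        # so spatial filtering is performed channel by channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: the 1*1 convolution recombines channel information
        # and raises or lowers the channel dimension.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)           # N, C, H, W (illustrative sizes)
y = DepthwiseSeparableConv(32, 64)(x)    # -> (1, 64, 56, 56)
```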
A12. performing global feature extraction on the first local feature image based on the global feature extraction module to obtain global features.
Optionally, the first local feature image extracted by the first convolution module is used as the input of the global feature extraction module, and the global features of the image to be recognized are extracted from it through the convolution-based Meta Former structure.
In the embodiments of the present application, using a lightweight network structure to extract the first local features reduces the parameters and computation of the neural network model, speeding up image recognition, while the convolution-based Meta Former structure extracts global features so that subsequent image recognition tasks can be performed on them, reducing the accuracy loss caused by the lightweight network structure and improving recognition accuracy.
In some embodiments, the feature extraction module further includes a second convolution module and a fusion module, and the second convolution module uses a lightweight convolutional neural network to extract second local features.
Correspondingly, the above step A1 further includes:
A13. performing local feature extraction on the first local feature image based on the second convolution module to obtain second local features.
Optionally, the second convolution module is built on a lightweight convolutional neural network structure such as MobileNetV2 or SqueezeNet to reduce its computational cost when extracting the second local features. Its principle is the same as that of the first convolution module and will not be repeated here.
A14. fusing the second local features with the global features based on the fusion module to obtain fused features.
Optionally, the second local features and the global features are concatenated along the channel direction to obtain fused features that describe both local and global features, improving the information representation ability of the feature image and thus the accuracy of subsequent image recognition.
In the embodiments of the present application, fusing the second local features with the global features yields fused features that can represent both, improving the information representation ability of the feature image, so that relatively accurate results are obtained when subsequent image recognition tasks are performed on the fused features, thereby improving the recognition accuracy of the neural network model.
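As an illustration of the fusion step, the sketch below, assuming equal spatial sizes for the two feature maps (the shapes are placeholders, not values from the present application), concatenates the second local features and the global features along the channel direction:

```python
import torch

local_feat = torch.randn(1, 64, 28, 28)   # second local features (placeholder shape)
global_feat = torch.randn(1, 64, 28, 28)  # global features (placeholder shape)

# Concatenate along the channel dimension; the channel count doubles
# while the spatial size is unchanged.
fused = torch.cat([local_feat, global_feat], dim=1)
print(fused.shape)                         # torch.Size([1, 128, 28, 28])
```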
It should be pointed out that, in some embodiments, when extracting the first local features and the fused features, since image recognition tasks such as target detection need to recognize feature images at different scales, first local features and fused features at different scales need to be extracted to improve the recognition accuracy of the corresponding image recognition task.
In some embodiments, the global feature extraction module includes, from input to output, a first residual module, a global feature sub-extraction module, a first add-and-merge module, a second residual module, a feedforward network module, and a second add-and-merge module connected in sequence.
Correspondingly, the above step A12 specifically includes:
the first add-and-merge module is used to add and merge the input data of the first residual module and the output data of the global feature sub-extraction module;
the second add-and-merge module is used to add and merge the input data of the second residual module and the output data of the feedforward network module;
the global feature sub-extraction module includes a first branch, a second branch, and a merging module;
the first branch is used to extract local features from the input image through depthwise separable convolution;
the second branch is used to extract global features from the input image;
the merging module is used to merge the features output by the first branch and the second branch according to pixel position to obtain global features.
Here, the first residual module and the second residual module mean that residual structures are used around the global feature sub-extraction module and the feedforward network module, respectively, to solve the problems of gradient explosion and network performance degradation.
Optionally, the network structure of the global feature extraction module is shown in Figure 2. The first local feature image is used as the input of the first residual module; local features are extracted from it by the first branch of the global feature sub-extraction module, global features are extracted from it by the second branch, and the merging module merges the features output by the two branches to obtain the global features of the image to be recognized.
Optionally, the first branch performs local feature extraction on the input first local feature image based on depthwise separable convolution to reduce the number of parameters and the computational cost. For example, the first branch uses a convolution with a 3*3 kernel and a stride of 1, with the input edges zero-padded by one pixel to keep the resolution of the image unchanged after convolution; that is, the first branch outputs a feature map of the same size as the input first local feature image. The first branch can also use separable convolution, grouped convolution, or other conventional convolutions to extract local features from the input; these are conventional convolution operations and are not described in detail here.
Optionally, when merging the features output by the first branch and the second branch, the merging module adds them according to pixel position to obtain the global features, so that, without increasing the dimension of the feature image, the global features describe more information, i.e., feature information containing both local and global features.
In the embodiments of the present application, local features are extracted through depthwise separable convolution, and the local features are added to the extracted global features to obtain a global feature image containing local features. Since the local and global features are merged by addition, the dimension of the resulting feature map, i.e., the number of channels, remains unchanged, but the amount of information it describes increases, improving the accuracy of subsequent image recognition without increasing the amount of computation, and thereby improving the accuracy of the neural network model.
In some embodiments, the structure of the global feature extraction module is shown in Figure 3. In the first residual module and the second residual module, a BN (Batch Normalization) layer is added to the structure shown in Figure 2 to normalize the input, preventing the distribution of intermediate-layer data from changing during training, avoiding vanishing or exploding gradients, and speeding up the training of the neural network model.
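The block structure of Figures 2 and 3 can be sketched as follows, again assuming PyTorch; the token mixer below is a simple depthwise-plus-pointwise stand-in for the two-branch global feature sub-extraction module, and the 4x expansion ratio of the feedforward network is an assumption rather than a value from the present application:

```python
import torch
import torch.nn as nn

class GlobalFeatureBlock(nn.Module):
    """Residual block: BN -> token mixer -> add; BN -> FFN -> add."""
    def __init__(self, ch: int):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(ch)    # BN of the first residual module
        self.mixer = nn.Sequential(         # stand-in for the sub-extraction module
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),
            nn.Conv2d(ch, ch, 1, bias=False),
        )
        self.norm2 = nn.BatchNorm2d(ch)    # BN of the second residual module
        self.ffn = nn.Sequential(           # feedforward network module
            nn.Conv2d(ch, 4 * ch, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(4 * ch, ch, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))  # first add-and-merge module
        x = x + self.ffn(self.norm2(x))    # second add-and-merge module
        return x
```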
In some embodiments, when recognizing the image to be recognized, the second branch is specifically used to:
B1. perform a convolution operation on the input image to obtain N feature vectors, where N is a positive integer greater than 1.
Optionally, when performing the convolution operation on the input first local feature image, the input image is divided into N groups, and each group is convolved with a large convolution kernel that covers all pixels in the group, extracting one corresponding feature vector, for a total of N feature vectors.
B2. perform a channel shuffle operation on the feature vectors to obtain N new feature vectors.
Optionally, based on the above N feature vectors, each feature vector is divided into N groups along the channel direction; the groups are then scrambled, rearranged, and combined to generate N new feature vectors. The channel order of the original feature vectors is shuffled so that feature information flows between different channels, achieving the exchange and fusion of information.
It should be noted that each of the N new feature vectors consists of N parts taken from groups of different feature vectors; that is, the N groups that make up a new feature vector come from N different feature vectors, ensuring that each new feature vector contains feature information from the other feature vectors and can therefore characterize the features of the entire input image.
B3. sparsely rearrange the N new feature vectors to obtain a sparse feature map.
Optionally, the N new feature vectors are rearranged to form a sparse feature map, and the parts of the sparse feature map without actual data are filled with zeros so as to spread the feature vector information to each group.
B4. diffuse the sparse feature map into a dense feature map through a convolution operation and output it.
Optionally, the sparse feature map is convolved to diffuse its information into a dense feature map, so that the information at each pixel position of the resulting dense feature map describes all pixels of the input image.
Optionally, local features are extracted from the input image based on the depthwise separable convolution of the first branch, and steps B1-B4 above are performed in the second branch to extract global features from the input image; the extracted local and global features are then merged. For example, as shown in Figure 4, the input image size is 8*8. The input image is processed by depthwise separable convolution with a kernel of size k=3*3, a stride s of 1, and edge padding p of 1 to extract local features. Meanwhile, taking every 4*4 pixels as a group, the input image is divided into 4 groups, and a large-stride convolution with a 4*4 kernel, a stride of 4, and no edge padding is applied to the input image to obtain 4 feature vectors. The 4 feature vectors are channel-shuffled, that is, scrambled and recombined into 4 new feature vectors, which are then sparsified to form a sparse feature map. A convolution with a 4*3 kernel, a stride s, and edge padding of 2 is then applied to the sparse feature map to propagate and diffuse its information, yielding a dense feature map (i.e., the global features). Finally, the extracted local features and global features are added element-wise to obtain the final output global features.
In the embodiments of the present application, by performing convolution, channel shuffling, and sparsifying diffusion on the input image, global features are obtained in which every pixel describes the information of all pixels of the image to be recognized. Since this global feature extraction uses pure convolution operations throughout, the neural network model enjoys good hardware support while extracting global features based on the Meta Former structure to improve image recognition accuracy, making it easy to deploy on terminal devices.
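Of the steps B1-B4, the channel shuffle of step B2 is the least standard; a minimal sketch is given below, assuming the usual view/transpose/reshape realization (as popularized by ShuffleNet) and N=4 groups to match the 8*8 example above. How the shuffled vectors are then sparsely rearranged depends on the layout in Figure 4 and is not reproduced here:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channel groups so information flows between them (step B2)."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into N groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(b, c, h, w)                 # flatten back to (B, C, H, W)

# Four feature vectors of four channels each, stored as a single
# (B, C, 1, 1)-shaped tensor purely for illustration.
vecs = torch.randn(1, 16, 1, 1)
mixed = channel_shuffle(vecs, groups=4)
```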
In some embodiments, the recognition module includes an image classification sub-model.
Correspondingly, the above step A2 includes:
A21. classifying the extracted features based on the image classification sub-model to obtain an image classification result.
Image classification refers to distinguishing targets of different categories according to the different features reflected in an image.
Optionally, the extracted fused features are input into the trained image classification sub-model, which outputs the image classification result. For example, in face recognition, after the fused features of a face image are extracted through the feature extraction module, the image classification sub-model classifies them to obtain the category of the face image, i.e., which specific person it belongs to.
In the embodiments of the present application, the trained image classification sub-model classifies the extracted fused features to obtain the image classification result. Since the classification is based on fused features, which describe more semantic and detailed information, the classification results of the image classification sub-model are more accurate, improving the recognition accuracy of the neural network model.
In some embodiments, the image classification sub-model includes a third convolution module.
Correspondingly, the above step A21 includes:
classifying the extracted features based on the third convolution module to obtain an image classification result.
Optionally, the third convolution module builds a fully connected layer based on convolution to classify the extracted features.
Optionally, in the neural network model for the image classification task shown in Figure 5, the feature extraction module performs local feature extraction on the image to be recognized through the first convolution module C1 to obtain the first local features; the second convolution module C2 and the global feature extraction module E, stacked N1 and M1 times respectively, extract features from the first local feature image; and the fusion module F fuses the resulting second local features and global features to obtain fused features. The third convolution module C (i.e., the fully connected layer) takes the fused features required by the image classification task, maps them to the sample label space, obtains the probability that the image to be recognized belongs to each category, and selects the label with the largest probability value as the image classification result, thereby completing the image classification task.
In the embodiments of the present application, the extracted features are classified through convolution operations, and the classification result is obtained according to the sample labels.
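A convolution-based classification head of the kind described above can be sketched as follows; pooling the fused feature map to 1*1 before the 1*1 convolution is an assumption, and the channel count and number of classes are placeholders:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),              # collapse spatial dimensions
    nn.Conv2d(128, 10, kernel_size=1),    # 1*1 conv acting as the fully connected layer
    nn.Flatten(),                         # (N, classes, 1, 1) -> (N, classes)
)

logits = head(torch.randn(1, 128, 7, 7))  # -> (1, 10)
probs = logits.softmax(dim=1)             # probability of each category
pred = probs.argmax(dim=1)                # label with the largest probability value
```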
In some embodiments, before the neural network model with the above image classification sub-model performs an image classification task, the method further includes:
performing image classification training on the neural network model including the image classification sub-model.
Optionally, the neural network model including the image classification sub-model is trained with common training methods to implement the image classification task. For example, according to the actual image classification requirements of the deployed application, a corresponding training set is obtained, and the neural network model including the image classification sub-model is iteratively trained on it until it meets preset conditions, yielding the trained neural network model.
In the embodiments of the present application, image classification training is performed on the neural network model including the image classification sub-model according to the requirements of the actually deployed application, so as to implement the image classification task of the deployed application.
在一些实施例中,上述识别模块还包括目标检测子模型;In some embodiments, the above recognition module also includes a target detection sub-model;
相应的,上述步骤A2还包括:Correspondingly, the above step A2 also includes:
A22、基于上述目标检测子模型对提取到的特征进行目标检测处理,得到目标检测结果。A22. Based on the above target detection sub-model, perform target detection processing on the extracted features to obtain target detection results.
Optionally, since the task of target detection is to find all targets (objects) of interest in an image and determine their categories and positions, and targets vary in shape and size, multi-scale detection based on the SSD (Single Shot MultiBox Detector) algorithm receives the extracted features of different sizes and detects them with prior boxes of different scales and aspect ratios, thereby outputting, for each detection box, the confidence of each category and the offset of the detection box relative to its prior box, i.e. the position information of the detection box.
The above SSD algorithm is a one-stage target detection algorithm. It samples densely at different positions of the image, sets prior boxes of different scales and aspect ratios during sampling, and then directly classifies and regresses the image features extracted from the prior boxes by a convolutional neural network, which gives it the advantage of fast detection. To detect targets of different scales, the SSD algorithm adopts the idea of grid division and scans the feature maps of different convolution layers, detecting targets of different sizes from feature maps of different scales: small objects are detected on large-scale feature maps and large objects on small-scale feature maps, which improves detection accuracy.
Specifically, based on the above target detection sub-model, the extracted first local features and fusion features of the image to be recognized at different scales are received, and the target detection head applies prior boxes of different scales and aspect ratios to these features to obtain, for each detection box, the confidence of each category and the offset of the detection box relative to its prior box. For example, since face recognition needs to detect the face in the image to be recognized and then recognize the detected face, when the above neural network model is applied to a face recognition application such as an access control system, the extracted first local features and fusion features of the image to be recognized are detected with prior boxes of different scales and aspect ratios based on the above target detection sub-model to obtain a detection box containing the face; this detection box restricts the processing region of the subsequent face recognition algorithm from the whole image to the face region inside the box.
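For illustration only, the sketch below generates SSD-style prior boxes for a single feature map; the concrete scale and aspect-ratio values are assumptions, as the text above does not fix them.

```python
import itertools

def make_prior_boxes(fmap_h, fmap_w, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate SSD-style prior boxes (cx, cy, w, h in [0, 1]) for one
    feature map. One box per aspect ratio is centered on each cell, so a
    large feature map yields many small boxes and a small map few large ones."""
    boxes = []
    for i, j in itertools.product(range(fmap_h), range(fmap_w)):
        cx, cy = (j + 0.5) / fmap_w, (i + 0.5) / fmap_h   # cell center
        for ar in aspect_ratios:
            w, h = scale * ar ** 0.5, scale / ar ** 0.5   # area ~ scale^2
            boxes.append((cx, cy, w, h))
    return boxes

# e.g. a 38x38 map with small boxes for small objects,
# and a 5x5 map with large boxes for large objects
small = make_prior_boxes(38, 38, scale=0.1)
large = make_prior_boxes(5, 5, scale=0.6)
```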
Optionally, to determine the categories and positions of targets of varying shapes and sizes, multi-scale detection of the image to be detected can also be achieved by scaling the image at different ratios, by pyramid feature prediction, and the like. These multi-scale detection methods are conventional and are not described one by one here.
In this embodiment of the present application, since target detection is performed based on feature maps of different scales and with prior boxes of different scales and aspect ratios, targets of different sizes can be detected, which improves the accuracy of target detection and hence the accuracy of the above neural network model.
In some embodiments, the above step A22 specifically includes:
detecting, by the above target detection sub-model, the position of the target in the image and/or the category to which the target belongs based on the Non-Maximum Suppression (NMS) algorithm.
Optionally, during target detection a large number of detection boxes are generated at the same target position, and these boxes overlap one another; since usually only one detection box is needed per target, the redundant boxes degrade detection precision. Therefore, in the neural network model for the target detection task shown in Fig. 6, local features of the image to be recognized are first extracted by the first convolution module C1 to obtain the first local features; the second convolution module C2 and the global feature extraction module E, stacked N1 and M1 times respectively, extract features from the first local feature image; the fusion module F fuses the resulting second local features and global features to obtain the fusion features; the target detection head then detects the extracted first local features and fusion features at different scales to obtain the detection boxes of the target in the image to be recognized together with their probability values; and the NMS algorithm filters out the overlapping detection boxes to keep the best ones, thereby determining the position of the target in the image and/or the category to which it belongs. For example, for the vehicle shown in Fig. 7, many detection boxes arise while locating the vehicle, and the useless ones must be discarded by non-maximum suppression to obtain the vehicle detection result. Suppose there are six detection boxes whose probabilities of being a vehicle are, from small to large, A, B, C, D, E and F. Starting from the highest-probability box F, the overlap (IOU) between F and each of A, B, C, D and E is compared with a preset threshold; assuming the overlaps of B and D with F exceed the threshold, B and D are removed and F is marked as a kept box. Among the remaining boxes A, C and E, the highest-probability box E is selected, the overlaps of A and C with E are computed, the boxes whose overlap exceeds the threshold are removed, and E is marked as the second kept box. The process is repeated until all kept boxes are found.
The above non-maximum suppression is an edge refinement technique used to suppress targets that are not local maxima, so as to search out the (optimal) local-maximum targets.
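The walkthrough above corresponds to standard greedy NMS; a minimal sketch follows, in which the IoU helper and the 0.5 threshold are assumptions rather than values fixed by this application.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS as in the walkthrough: repeatedly keep the highest-scoring
    box and drop every remaining box whose IoU with it exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                  # current highest-probability box
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep  # indices of the retained detection boxes
```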
Optionally, when removing the above redundant detection boxes, the removal method can also be chosen according to actual requirements; for example, in multi-target detection the required detection boxes can be obtained by template matching, clustering algorithms and other methods, so as to obtain the corresponding detection results.
In this embodiment of the present application, since the same target generates multiple overlapping detection boxes, the overlapping boxes are removed by non-maximum suppression to obtain the best detection box and thereby the position of the target in the image and/or the category to which it belongs; interference items are reduced, making the target detection result more accurate.
In some embodiments, before the above neural network model including the target detection sub-model performs the target detection task, the method further includes:
performing target detection training on the above neural network model including the target detection sub-model.
Optionally, the above neural network model including the target detection sub-model is trained with common training methods to carry out the target detection task. For example, according to the actual target detection requirements of the deployed application, a labeled training set is obtained; two or more neural network models including the above target detection sub-model are trained on that training set, and the resulting models are fused to obtain the trained neural network model.
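The fusion scheme for the two or more trained models is not specified above; one common choice, shown here purely as an assumption, is to average the scores predicted by the individual models before post-processing.

```python
import torch

def ensemble_scores(models, image):
    """Average the class scores predicted by several independently trained
    detection models (one simple fusion scheme; the text above does not fix
    one). Each model is assumed to return per-prior-box class scores."""
    with torch.no_grad():
        scores = [m(image) for m in models]
    return torch.stack(scores).mean(dim=0)   # fused scores, same shape
```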
In this embodiment of the present application, the above neural network model including the target detection sub-model is trained for target detection according to the requirements of the actually deployed application, so as to carry out the target detection task of that application.
It should be pointed out that when the neural network model includes both an image classification sub-model and a target detection sub-model, the two sub-models can be trained separately and then fused to carry out the image classification and target detection tasks.
In some embodiments, the above recognition module further includes a semantic segmentation sub-model;
Correspondingly, the above step A2 further includes:
A23. Performing segmentation processing on the extracted features based on the above semantic segmentation sub-model to obtain a semantic segmentation result.
Semantic segmentation combines image classification, target detection and image segmentation techniques: an image is divided by some method into region blocks with certain semantic meaning, the semantic category of each semantic block is identified, and a segmented image with pixel-wise semantic annotation is obtained.
Optionally, the extracted fusion features are input into the trained semantic segmentation sub-model for segmentation processing, and the corresponding semantic segmentation result is output. For example, in face recognition tasks involving facial segmentation, semantic segmentation of a face usually involves classes such as skin, hair, eyes, mouth, nose and background. The neural network model extracts the fusion features of the image to be recognized; since these fusion features contain the semantics, details and other information of the image, the trained semantic segmentation sub-model segments the fusion features to obtain the semantic segmentation result of the face.
In this embodiment of the present application, since the fusion features contain semantic, positional and detail information, pixel-wise semantic segmentation based on the fusion features makes the segmentation result more accurate.
In some embodiments, the above semantic segmentation sub-model includes a segmentation module, a merging module and a fourth convolution module;
Correspondingly, the above step A23 includes:
performing multi-scale convolution processing on the extracted features based on the above segmentation module to obtain feature maps of different sizes;
merging the above feature maps of different sizes along the channel direction based on the above merging module;
performing convolution processing on the feature map output by the above merging module based on the above fourth convolution module to obtain a semantic segmentation result.
Optionally, since large objects are detected better on small-scale feature maps and small objects on large-scale feature maps, the neural network model for the semantic segmentation task shown in Fig. 8 is used: local features of the image to be recognized are extracted by the first convolution module C1 to obtain the first local features; the second convolution module C2 and the global feature extraction module E, stacked N1 and M1 times respectively, extract features from the first local feature image; the fusion module F fuses the resulting second local features and global features to obtain the fusion features; the convolution branches of the segmentation module then apply multi-scale convolution to the fusion features to obtain feature maps of different sizes; the merging module concatenates these feature maps along the channel direction; and the fourth convolution module C applies a 1*1 convolution to the merged feature map to achieve cross-channel fusion of the feature map information, thereby outputting the segmentation result.
In this embodiment of the present application, multi-scale convolution of the extracted fusion features and semantic segmentation on the merged feature maps of different sizes make the segmentation of objects of different scales more accurate, improving segmentation precision; and the pointwise convolution applied to the merged feature map fuses its information across channels, further improving the accuracy of semantic segmentation.
In some embodiments, the above segmentation module includes M parallel convolution branches; the uppermost branch uses a 1*1 convolution, and the other branches use dilated convolutions with successively increasing dilation factors, where M is a positive integer greater than 1;
Correspondingly, when the above step A23 performs multi-scale convolution processing on the extracted features based on the above segmentation module, it includes:
performing multi-scale convolution processing on the extracted features based on the above convolution branches.
Optionally, since semantic segmentation is a pixel-level classification task in which semantic information guides pixel classification, feature maps with high resolution and rich semantic information are needed, and dilated convolution can effectively enlarge the receptive field of the semantic segmentation sub-model without adding model parameters. Therefore, parallel dilated convolutions with different dilation factors are used to extract semantic features at multiple scales. For example, the segmentation module adopts the convolution branch structure shown in Fig. 9: the uppermost branch uses a 1*1 convolution, and the other three branches use 3*3 dilated convolutions with dilation factors 6, 12 and 18 to extract features from the fusion features, thereby obtaining feature maps of different scales with richer semantic features.
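The branch structure described above resembles the well-known ASPP module; under that assumption, a sketch follows, with the channel counts and class number chosen only for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleSegHead(nn.Module):
    """Parallel branches as described above: one 1x1 convolution plus three
    3x3 dilated convolutions (dilation 6, 12, 18), concatenated along the
    channel direction and fused by a final 1x1 (pointwise) convolution."""
    def __init__(self, in_ch, mid_ch=256, num_classes=19):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, kernel_size=1)] +
            [nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=d, dilation=d)
             for d in (6, 12, 18)]    # padding=d keeps the spatial size
        )
        # fourth convolution module: cross-channel fusion of the merged maps
        self.fuse = nn.Conv2d(4 * mid_ch, num_classes, kernel_size=1)

    def forward(self, fused_features):
        outs = [branch(fused_features) for branch in self.branches]
        merged = torch.cat(outs, dim=1)   # merge along the channel direction
        return self.fuse(merged)          # pixel-wise class scores
```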
In this embodiment of the present application, since dilated convolution enlarges the receptive field without increasing the number of parameters of the neural network model, extracting features from the fusion features with dilated convolutions of different dilation factors yields semantically rich feature maps at different scales, which effectively improves the precision of semantic segmentation and hence the recognition precision of the neural network model.
In some embodiments, before the above neural network model including the semantic segmentation sub-model performs the semantic segmentation task, the method further includes:
performing semantic segmentation training on the neural network model including the above semantic segmentation sub-model.
Optionally, the above neural network model including the semantic segmentation sub-model is trained with conventional training methods to carry out the semantic segmentation task.
It should be pointed out that the recognition module of the above neural network model may include one or more of the image classification sub-model, the target detection sub-model and the semantic segmentation sub-model, but is not limited thereto. In practical applications, the specific functions of the recognition module in the neural network model are set according to the image recognition task requirements of the deployed application, and features are extracted by the above feature extraction module based on the Meta Former structure so as to perform the corresponding image recognition task.
It should be understood that the step numbers in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Embodiment 2:
Corresponding to the image processing method described in the above embodiment, Fig. 10 shows a structural block diagram of the apparatus provided by an embodiment of the present application; for convenience of explanation, only the parts related to this embodiment are shown.
Referring to Fig. 10, the apparatus includes an input module 101 and a trained neural network model 102, the neural network model extracting global features of the image to be recognized based on a purely convolutional Meta Former structure. Specifically:
the input module 101 is configured to input the image to be recognized into the trained neural network model;
the neural network model 102 is configured to sequentially perform feature extraction and recognition on the image to be recognized to obtain a recognition result.
In this embodiment of the present application, the trained neural network model sequentially performs feature extraction and recognition on the image to be detected and outputs the corresponding image recognition result. Since the above neural network model extracts the global features of the image to be recognized based on a convolutional Meta Former structure, it can attend to those global features when performing image recognition tasks, which improves its image recognition precision and avoids the accuracy degradation caused by making the model lightweight, thereby raising the recognition accuracy of the neural network model and improving the actual deployment effect of image recognition applications based on it.
In some embodiments, the above image recognition apparatus further includes:
an image acquisition module, configured to acquire the image to be recognized.
In some embodiments, the above neural network model includes:
a feature extraction module, configured to perform feature extraction on the image to be recognized through the above neural network model;
a recognition module, configured to recognize the image to be recognized through the above neural network model to obtain a recognition result.
In some embodiments, the above feature extraction module includes:
a global feature extraction module, configured to extract global features of the image to be recognized based on a purely convolutional Meta Former structure.
In some embodiments, the above feature extraction module further includes:
a first convolution module, configured to perform local feature extraction on the above image to be recognized to obtain a first local feature image.
Correspondingly, the above global feature extraction module is configured to perform global feature extraction on the above first local feature image to obtain global features.
In some embodiments, the above feature extraction module further includes:
a second convolution module, configured to perform local feature extraction on the above first local feature image to obtain second local features;
a fusion module, configured to fuse the above second local features with the above global features to obtain fusion features.
In some embodiments, the above global feature extraction module includes:
a first addition module, configured to add and merge the input data of the first residual module and the output data of the above global feature sub-extraction module;
a second addition module, configured to add and merge the input data of the second residual module and the output data of the above feedforward network module;
a global feature sub-extraction module, configured to perform global feature extraction on input data.
The above global feature sub-extraction module includes:
a first branch, configured to perform local feature extraction on the input image through depthwise separable convolution;
a second branch, configured to perform global feature extraction on the input image;
a merging module, configured to merge the features output by the above first branch and second branch according to pixel position to obtain global features.
In some embodiments, the above second branch includes:
a convolution unit, configured to perform a convolution operation on the input image to obtain N feature vectors, where N is a positive integer greater than 1;
a channel shuffle unit, configured to perform a channel shuffle operation on the feature vectors to obtain new N feature vectors;
a sparsification unit, configured to sparsely rearrange the above new N feature vectors to obtain a sparse feature map;
a diffusion unit, configured to diffuse the above sparse feature map into a dense feature map through a convolution operation and output it.
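Of the four units above, the channel shuffle step has a standard form (as in ShuffleNet); a minimal sketch under that assumption follows, while the sparsification and diffusion steps are specific to this application and are not reproduced.

```python
import torch

def channel_shuffle(x, groups):
    """Channel shuffle in the ShuffleNet sense: split the channels into
    groups and interleave them so information mixes across groups. The
    sparsification/diffusion steps around this call are not shown."""
    b, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()   # interleave group members
    return x.view(b, c, h, w)

feats = torch.randn(1, 8, 16, 16)        # N = 8 feature channels
shuffled = channel_shuffle(feats, groups=2)
```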
In some embodiments, the above recognition module includes:
an image classification sub-model, configured to classify the extracted features to obtain an image classification result.
In some embodiments, the above image classification sub-model includes:
a third convolution unit, configured to classify the extracted features to obtain an image classification result.
In some embodiments, the above recognition module further includes:
a target detection sub-model, configured to perform target detection processing on the extracted features to obtain a target detection result.
In some embodiments, the above target detection sub-model includes:
a detection unit, configured to detect the position of the target in the image and/or the category to which the target belongs based on the Non-Maximum Suppression (NMS) algorithm.
In some embodiments, the above recognition module further includes:
a semantic segmentation sub-model, configured to segment the extracted features to obtain a semantic segmentation result.
In some embodiments, the above semantic segmentation sub-model includes:
a multi-scale convolution unit, configured to perform multi-scale convolution processing on the extracted features to obtain feature maps of different sizes;
a feature merging unit, configured to merge the above feature maps of different sizes along the channel direction;
a fourth convolution unit, configured to perform convolution processing on the feature map output by the above feature merging unit to obtain a semantic segmentation result.
In some embodiments, the above multi-scale convolution unit includes:
a convolution branch unit, configured to perform feature extraction on the above fusion features based on parallel convolution branches.
It should be noted that, since the information exchange and execution processes between the above apparatuses/units are based on the same concept as the method embodiments of the present application, reference may be made to the method embodiment section for their specific functions and technical effects, which are not repeated here.
Embodiment 3:
Fig. 11 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in Fig. 11, the terminal device 11 of this embodiment includes: at least one processor 110 (only one is shown in Fig. 11), a memory 111, and a computer program 112 stored in the memory 111 and executable on the at least one processor 110; when the processor 110 executes the computer program 112, the steps in any of the above method embodiments are implemented.
Exemplarily, the computer program 112 may be divided into one or more modules/units, which are stored in the memory 111 and executed by the processor 110 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 112 in the terminal device 11. For example, the computer program 112 may be divided into an input module 101 and a trained neural network model 102, the neural network model extracting global features of the image to be recognized based on a purely convolutional Meta Former structure, with the specific functions of the modules as follows:
the input module 101 is configured to input the image to be recognized into the trained neural network model;
the neural network model 102 is configured to sequentially perform feature extraction and recognition on the image to be recognized to obtain a recognition result.
The terminal device 11 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, the processor 110 and the memory 111. Those skilled in the art will understand that Fig. 11 is only an example of the terminal device 11 and does not constitute a limitation thereto; it may include more or fewer components than shown, combine certain components, or use different components, and may for example further include input/output devices, network access devices, and the like.
The so-called processor 110 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
In some embodiments the memory 111 may be an internal storage unit of the terminal device 11, such as its hard disk or memory. In other embodiments the memory 111 may also be an external storage device of the terminal device 11, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the terminal device 11. Further, the memory 111 may include both an internal storage unit and an external storage device of the terminal device 11. The memory 111 is used to store the operating system, application programs, a boot loader, data and other programs, such as the program code of the computer program; it may also be used to temporarily store data that has been or will be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is used only as an example; in practical applications, the above functions may be assigned to different functional units and modules as needed, i.e. the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or of a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
An embodiment of the present application further provides a network device, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor; when the processor executes the computer program, the steps in any of the above method embodiments are implemented.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps in each of the above method embodiments.
An embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in each of the above method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals and telecommunication signals.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are merely illustrative; the division into modules or units is only a logical functional division, and there may be other division manners in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the protection scope of the present application.

Claims (20)

  1. An image recognition method based on a neural network model, wherein the neural network model extracts global features of an image to be recognized through a purely convolutional Meta Former structure;
    the image recognition method comprises:
    inputting the image to be recognized into the trained neural network model, and sequentially performing feature extraction and recognition on the image to be recognized through the neural network model to obtain a recognition result.
  2. The image recognition method according to claim 1, wherein the neural network model comprises a feature extraction module and a recognition module, the feature extraction module comprising a global feature extraction module built on a purely convolutional MetaFormer structure;
    the sequentially performing feature extraction and recognition on the image to be recognized through the neural network model to obtain a recognition result comprises:
    performing feature extraction on the image to be recognized based on the feature extraction module;
    recognizing the extracted features based on the recognition module to obtain a recognition result.
  3. The image recognition method according to claim 2, wherein the feature extraction module further comprises a first convolution module;
    the performing feature extraction on the image to be recognized based on the feature extraction module comprises:
    performing local feature extraction on the image to be recognized based on the first convolution module to obtain a first local feature image;
    performing global feature extraction on the first local feature image based on the global feature extraction module to obtain global features.
  4. The image recognition method according to claim 3, wherein the feature extraction module further comprises a second convolution module and a fusion module, the second convolution module extracting second local features using a lightweight convolutional neural network;
    the performing feature extraction on the image to be recognized based on the feature extraction module further comprises:
    performing local feature extraction on the first local feature image based on the second convolution module to obtain second local features;
    fusing the second local features with the global features based on the fusion module to obtain fusion features.
  5. The image recognition method according to any one of claims 2 to 4, wherein the global feature extraction module comprises, connected in sequence from input end to output end, a first residual module, a global feature sub-extraction module, a first addition module, a second residual module, a feedforward network module and a second addition module;
    the first addition module is configured to add and merge the input data of the first residual module and the output data of the global feature sub-extraction module;
    the second addition module is configured to add and merge the input data of the second residual module and the output data of the feedforward network module;
    the global feature sub-extraction module comprises a first branch, a second branch and a merging module;
    the first branch is configured to perform local feature extraction on an input image through depthwise separable convolution;
    the second branch is configured to perform global feature extraction on the input image;
    the merging module is configured to merge the features output by the first branch and the second branch according to pixel position to obtain global features.
  6. The image recognition method according to claim 5, wherein the second branch is specifically configured to: perform a convolution operation on the input image to obtain N feature vectors; perform a channel shuffle operation on the feature vectors to obtain new N feature vectors; sparsely rearrange the new N feature vectors to obtain a sparse feature map; and diffuse the sparse feature map into a dense feature map through a convolution operation before outputting it, where N is a positive integer greater than 1.
  7. The image recognition method according to claim 5, wherein the recognition module comprises an image recognition sub-model;
    the recognizing the extracted features based on the recognition module to obtain a recognition result comprises:
    classifying the extracted features based on the image recognition sub-model to obtain an image classification result.
  8. The image recognition method according to claim 7, wherein the image recognition sub-model comprises a third convolution module;
    the classifying the extracted features based on the image recognition sub-model to obtain an image classification result comprises:
    classifying the extracted features based on the third convolution module to obtain an image classification result.
  9. The image recognition method according to claim 5, wherein the recognition module further comprises a target detection sub-model;
    the recognizing the extracted features based on the recognition module to obtain a recognition result comprises:
    performing target detection processing on the extracted features based on the target detection sub-model to obtain a target detection result.
  10. The image recognition method according to claim 9, wherein the target detection sub-model detects the position of a target in the image and/or the category to which the target belongs based on a non-maximum suppression algorithm.
  11. The image recognition method according to claim 5, wherein the recognition module further comprises a semantic segmentation sub-model;
    the recognizing the extracted features through the recognition module to obtain a recognition result comprises:
    segmenting the extracted features based on the semantic segmentation sub-model to obtain a semantic segmentation result.
  12. The image recognition method according to claim 11, wherein the semantic segmentation sub-model comprises a segmentation module, a merging module and a fourth convolution module;
    the segmenting the extracted features based on the semantic segmentation sub-model to obtain a semantic segmentation result comprises:
    performing multi-scale convolution processing on the extracted features based on the segmentation module to obtain feature maps of different sizes;
    merging the feature maps of different sizes along the channel direction based on the merging module;
    performing convolution processing on the feature map output by the merging module based on the fourth convolution module to obtain a semantic segmentation result.
  13. The image recognition method according to claim 12, wherein the segmentation module comprises M parallel convolution branches, the uppermost convolution branch using a 1*1 convolution and the other convolution branches using dilated convolutions with successively increasing dilation factors, where M is a positive integer greater than 1;
    the performing multi-scale convolution processing on the extracted features based on the segmentation module to obtain a semantic segmentation result comprises:
    performing multi-scale convolution processing on the extracted features based on the convolution branches.
  14. An image recognition apparatus, comprising:
    an input module and a trained neural network model, the neural network model extracting global features of an image to be recognized based on a purely convolutional Meta Former structure;
    the input module being configured to input the image to be recognized into the neural network model;
    the neural network model being configured to sequentially perform feature extraction and recognition on the image to be recognized to obtain a recognition result.
  15. The image recognition apparatus according to claim 14, wherein the neural network model comprises a feature extraction module and a recognition module, the feature extraction module comprising a global feature extraction module built on a convolutional Meta Former structure;
    the neural network model is specifically configured to: perform feature extraction on the image to be recognized based on the feature extraction module; and
    recognize the extracted features based on the recognition module to obtain a recognition result.
  16. The image recognition apparatus according to claim 15, wherein the feature extraction module further comprises a first convolution module;
    the performing feature extraction on the image to be recognized based on the feature extraction module comprises:
    performing local feature extraction on the image to be recognized based on the first convolution module to obtain a first local feature image;
    performing global feature extraction on the first local feature image based on the global feature extraction module to obtain global features;
    the recognition module comprises:
    an image classification unit, configured to classify the extracted features based on the image recognition sub-model to obtain an image classification result.
  17. The image recognition apparatus according to claim 16, wherein the feature extraction module further comprises a second convolution module and a fusion module, the second convolution module extracting second local features using a lightweight convolutional neural network;
    the second convolution module is configured to perform local feature extraction on the first local feature image to obtain second local features;
    the fusion module is configured to fuse the second local features with the global features to obtain fusion features.
  18. The image recognition apparatus according to claim 17, wherein the global feature extraction module comprises, connected in sequence from input end to output end, a first residual module, a global feature sub-extraction module, a first addition module, a second residual module, a feedforward network module and a second addition module;
    the first addition module is configured to add and merge the input data of the first residual module and the output data of the global feature sub-extraction module;
    the second addition module is configured to add and merge the input data of the second residual module and the output data of the feedforward network module;
    the global feature sub-extraction module comprises a first branch, a second branch and a merging module;
    the first branch is configured to perform local feature extraction on an input image through depthwise separable convolution;
    the second branch is configured to perform global feature extraction on the input image;
    the merging module is configured to merge the features output by the first branch and the second branch according to pixel position to obtain global features.
  19. A terminal device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 13.
  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 13.
PCT/CN2022/142418 2022-06-30 2022-12-27 Image recognition method and apparatus based on neural network model, and terminal device WO2024001123A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210763998.5 2022-06-30
CN202210763998.5A CN115187844A (en) 2022-06-30 2022-06-30 Image identification method and device based on neural network model and terminal equipment

Publications (1)

Publication Number Publication Date
WO2024001123A1 true WO2024001123A1 (en) 2024-01-04

Family

ID=83515943

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142418 WO2024001123A1 (en) 2022-06-30 2022-12-27 Image recognition method and apparatus based on neural network model, and terminal device

Country Status (2)

Country Link
CN (1) CN115187844A (en)
WO (1) WO2024001123A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523535A (en) * 2024-01-08 2024-02-06 浙江零跑科技股份有限公司 Traffic sign recognition method, terminal equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187844A (en) * 2022-06-30 2022-10-14 深圳云天励飞技术股份有限公司 Image identification method and device based on neural network model and terminal equipment
CN116704526B (en) * 2023-08-08 2023-09-29 泉州师范学院 Staff scanning robot and method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210303911A1 (en) * 2019-03-04 2021-09-30 Southeast University Method of segmenting pedestrians in roadside image by using convolutional network fusing features at different scales
CN110363233A (en) * 2019-06-28 2019-10-22 Xi'an Jiaotong University Fine-grained image recognition method and system based on a convolutional neural network with block detector and feature fusion
CN113034444A (en) * 2021-03-08 2021-06-25 Anhui Jianzhu University Pavement crack detection method based on MobileNet-PSPNet neural network model
CN114677359A (en) * 2022-04-07 2022-06-28 Changsha University of Science and Technology Seam clipping image detection method and system based on CNN
CN115187844A (en) * 2022-06-30 2022-10-14 Shenzhen Intellifusion Technologies Co., Ltd. Image identification method and device based on neural network model and terminal equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU WEIHAO, LUO, ZHOU, WANG, FENG, YAN: "MetaFormer Is Actually What You Need for Vision", CV CASE SELECTION, 27 January 2022 (2022-01-27), XP093122181, Retrieved from the Internet <URL:http://www.jianshu.com/p/ab5e3fbc5b69> [retrieved on 20240122] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523535A (en) * 2024-01-08 2024-02-06 Zhejiang Leapmotor Technology Co., Ltd. Traffic sign recognition method, terminal equipment and storage medium
CN117523535B (en) * 2024-01-08 2024-04-12 Zhejiang Leapmotor Technology Co., Ltd. Traffic sign recognition method, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN115187844A (en) 2022-10-14

Similar Documents

Publication Publication Date Title
WO2024001123A1 (en) Image recognition method and apparatus based on neural network model, and terminal device
US20230267735A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
CN111428664B (en) Computer vision real-time multi-person pose estimation method based on deep learning technology
CN111160350A (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN112487886A (en) Method and device for identifying face with shielding, storage medium and terminal
WO2024077781A1 (en) Convolutional neural network model-based image recognition method and apparatus, and terminal device
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
Nanni et al. Combining face and eye detectors in a high-performance face-detection system
WO2023040146A1 (en) Behavior recognition method and apparatus based on image fusion, and electronic device and medium
CN111353544A (en) Improved Mixed Pooling-YOLOv3-based target detection method
Hussain et al. A survey of traffic sign recognition systems based on convolutional neural networks
Hou et al. A cognitively motivated method for classification of occluded traffic signs
CN109002808B (en) Human behavior recognition method and system
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
Najibi et al. Towards the success rate of one: Real-time unconstrained salient object detection
Ahmad et al. Embedded deep vision in smart cameras for multi-view objects representation and retrieval
CN111881803B (en) Face recognition method based on improved YOLOv3
Wang et al. CDFF: a fast and highly accurate method for recognizing traffic signs
Qu et al. Improved YOLOv5-based for small traffic sign detection under complex weather
Li et al. Incremental learning of infrared vehicle detection method based on SSD
CN110659631A (en) License plate recognition method and terminal equipment
WO2024077785A1 (en) Image recognition method and apparatus based on convolutional neural network model, and terminal device
Xie et al. Mask wearing detection based on YOLOv5 target detection algorithm under COVID-19
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function
Grekov et al. Application of the YOLOv5 Model for the Detection of Microobjects in the Marine Environment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22949199

Country of ref document: EP

Kind code of ref document: A1