CN111507288A - Image detection method, image detection device, computer equipment and storage medium

Image detection method, image detection device, computer equipment and storage medium

Info

Publication number
CN111507288A
Authority
CN
China
Prior art keywords
video
feature
video image
feature data
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010321181.3A
Other languages
Chinese (zh)
Inventor
周康明
赵佳男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN202010321181.3A
Publication of CN111507288A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an image detection method, an image detection device, a computer device and a storage medium. The method comprises the following steps: acquiring a video image, wherein the video image comprises at least two video frames and each video frame contains an object to be detected; performing feature extraction on each video frame of the video image to obtain feature data corresponding to the video image; performing feature screening on the feature data corresponding to the video image to obtain target feature data; and inputting the target feature data into a classification model for classification, and determining the category of the object to be detected. The method can improve the accuracy of the image detection result.

Description

Image detection method, image detection device, computer equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image detection method, an image detection apparatus, a computer device, and a storage medium.
Background
In recent years, with the development of deep learning, the use of deep learning to detect targets in images has become increasingly widespread. Since a video image carries more information than a single picture, video images are increasingly preferred over single pictures when targets are detected with deep learning.
In the related art, when a target in a video image is detected with deep learning, feature extraction is generally performed on each frame of the video image, the features extracted from all frames are then pooled, that is, fused into a single fusion feature, and the fusion feature is finally detected to obtain the detection result.
However, the above technique suffers from the problem that the accuracy of the obtained detection result is not high.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide an image detection method, an image detection apparatus, a computer device, and a storage medium capable of improving the accuracy of detection results.
An image detection method, the method comprising:
acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected;
respectively extracting the characteristics of each video frame of the video image to obtain characteristic data corresponding to the video image;
carrying out feature screening processing on feature data corresponding to the video image to obtain target feature data;
and inputting the target characteristic data into a classification model for classification, and determining the class of the object to be detected.
In one embodiment, the above respectively performing feature extraction on each video frame of a video image to obtain feature data corresponding to the video image includes:
respectively extracting the characteristics of each video frame of the video image to obtain a characteristic image corresponding to each video frame;
respectively carrying out dimension reduction processing on the feature maps corresponding to the video frames to obtain one-dimensional feature vectors corresponding to the video frames;
and obtaining the characteristic data corresponding to the video image according to the one-dimensional characteristic vector corresponding to each video frame.
In one embodiment, the obtaining the feature data corresponding to the video image according to the one-dimensional feature vector corresponding to each video frame includes:
splicing the one-dimensional characteristic vectors corresponding to the video frames to obtain two-dimensional spliced characteristic vectors; the two dimensions of the two-dimensional splicing feature vector comprise a one-dimensional feature vector dimension of each video frame and a quantity dimension of each video frame;
splicing the two-dimensional splicing feature vectors and the number of the video images to obtain three-dimensional feature vectors, and determining the three-dimensional feature vectors as feature data corresponding to the video images; the third dimension of the three-dimensional feature vector is the number dimension of the video image.
In one embodiment, the performing the feature screening process on the feature data corresponding to the video image to obtain the target feature data includes:
carrying out affine transformation processing on the feature data corresponding to the video image to obtain feature data after affine transformation; the feature data after affine transformation is four-dimensional feature data;
and performing feature screening processing on the feature data after affine transformation by adopting a capsule network and a dynamic routing algorithm in the capsule network to obtain target feature data.
In one embodiment, the four dimensions of the affine-transformed feature data include: the number dimension of the video images, the one-dimensional feature vector dimension of each video frame, the number dimension of the neurons on the next layer of the convolutional layer in the capsule network and the dimension of the neurons on the next layer of the convolutional layer in the capsule network; wherein, the next layer of the convolution layer in the capsule network is the next layer of the convolution layer in which the characteristic data corresponding to the video image is positioned.
In one embodiment, the performing feature screening processing on the feature data after affine transformation by using the capsule network and a dynamic routing algorithm in the capsule network to obtain target feature data includes:
performing feature screening processing on feature data after affine transformation by adopting a capsule network and a dynamic routing algorithm in the capsule network to obtain initial feature data; the initial characteristic data is three-dimensional characteristic data;
normalizing the initial characteristic data to obtain target characteristic data; the target feature data is two-dimensional feature data.
In one embodiment, the three dimensions of the initial feature data include: the number dimension of the video images, the number dimension of the neurons on the next layer of the convolutional layer in the capsule network and the dimension of the neurons on the next layer of the convolutional layer in the capsule network; the two dimensions of the target feature data include: the number dimension of the video images and the number dimension of the neurons on the next layer of convolutional layers in the capsule network.
An image sensing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected;
the extraction module is used for respectively extracting the characteristics of each video frame of the video image to obtain the characteristic data corresponding to the video image;
the screening module is used for carrying out characteristic screening processing on the characteristic data corresponding to the video image to obtain target characteristic data;
and the classification module is used for inputting the target characteristic data into the classification model for classification and determining the class of the object to be detected.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected;
respectively extracting the characteristics of each video frame of the video image to obtain characteristic data corresponding to the video image;
carrying out feature screening processing on feature data corresponding to the video image to obtain target feature data;
and inputting the target characteristic data into a classification model for classification, and determining the class of the object to be detected.
A readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected;
respectively extracting the characteristics of each video frame of the video image to obtain characteristic data corresponding to the video image;
carrying out feature screening processing on feature data corresponding to the video image to obtain target feature data;
and inputting the target characteristic data into a classification model for classification, and determining the class of the object to be detected.
According to the image detection method, the image detection device, the computer device and the storage medium, a video image comprising at least two video frames is obtained, each video frame containing the object to be detected; feature extraction is performed on each video frame to obtain the feature data corresponding to the video image; feature screening is performed on this feature data to obtain target feature data; and the target feature data is input into a classification model to determine the category of the object to be detected. In the method, the feature screening process, which may also be called a feature fusion process, is applied to the features extracted from each video frame when the video image is classified, instead of blindly taking all features of every video frame into the subsequent classification. Features corresponding to frames of poor image quality can therefore be selectively removed while the features that contribute to the subsequent classification are retained, so the classification result, and hence the detection result, is more accurate. In addition, because the extracted features of each video frame undergo feature screening, a good detection result can be obtained regardless of the image quality of the individual frames of the original video image; that is, the method does not depend excessively on the image quality of the original video image, and therefore has a wider range of application.
Drawings
FIG. 1 is a diagram illustrating an internal structure of a computer device according to an embodiment;
FIG. 2 is a flow diagram illustrating an exemplary image detection method;
FIG. 3 is a flow chart illustrating an image detection method according to another embodiment;
FIG. 4 is a flow chart illustrating an image detection method according to another embodiment;
FIG. 5 is a block diagram showing the structure of an image detection apparatus according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In recent years, deep learning technology has matured and is now widely applied in fields such as CV (Computer Vision) and NLP (Natural Language Processing). In the CV field, image data is the most common and the easiest to obtain. With the rise of a large number of video websites, video applications and short-video apps, a large amount of video data is generated every day, and more and more video databases are available for deep learning research.
The image detection method provided by the application can be applied to computer equipment, and the computer equipment can be a terminal or a server. Taking a computer device as an example, the internal structure diagram thereof can be as shown in fig. 1. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement an image detection method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The execution subject of the embodiment of the present application may be a computer device, or may be an image detection apparatus, and the following description will be given taking a computer device as an execution subject.
In an embodiment, an image detection method is provided, and the embodiment relates to a specific process of how to obtain target feature data according to extracted features of each video frame of a video image and obtain a detection result according to the target feature data. As shown in fig. 2, the method may include the steps of:
s202, acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected.
The object to be detected may be a human, an animal, a plant, an object, or the like. Accordingly, the video image may be a video image obtained by image-capturing the person, animal, plant, object, or the like. The number of video frames captured by a video image may be determined according to actual situations, for example, 10 video frames, 20 video frames, and so on, where the video image includes video frames having time continuity, and for example, the second video frame in the video image is captured after the first video frame and is a frame image captured at two consecutive time instants. The video images herein may also be referred to as a picture sequence, and each video frame may also be referred to as each picture in the picture sequence. In addition, each video frame includes an object to be detected, but the posture and the like of the included object to be detected may be the same or different.
Specifically, the computer device may acquire images of the object to be detected through an image acquisition device connected to it (e.g., a video camera, a snapshot machine or a camera), which captures the video image of the object to be detected and transmits it to the computer device, so that the computer device obtains the video image. Alternatively, the computer device may read the video image from a database in which video images of the object to be detected are stored in advance, or obtain it from the cloud, or obtain it in other ways; the manner of obtaining the video image is not specifically limited in this embodiment.
And S204, respectively extracting the characteristics of each video frame of the video image to obtain the characteristic data corresponding to the video image.
When extracting features from each video frame of the video image, a neural network may be used, such as ResNet (residual network), SENet (Squeeze-and-Excitation Network), GoogLeNet (Google Inception Net), or another neural network; this embodiment mainly uses a ResNet50 network.
Specifically, after obtaining the video image, the computer device may input each video frame of the video image into the neural network, perform feature extraction on each video frame in the neural network to obtain feature data corresponding to each video frame, and then combine the feature data corresponding to each video frame to obtain the feature data of the video image. It should be noted that, here, the neural network may input one video image at a time, that is, input one picture sequence, or may input a plurality of picture sequences, that is, input a batch of picture sequences.
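As an illustrative sketch only (not the patent's reference implementation), per-frame feature extraction as described in S204 could look as follows; PyTorch/torchvision and the ResNet50 backbone mentioned above are assumed, and all function names are hypothetical.

```python
import torch
import torchvision.models as models

# Assumed setup: a ResNet50 backbone with its classifier head removed,
# so each frame yields a 2048-dimensional feature vector.
backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()
backbone.eval()

def extract_frame_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) tensor holding the T video frames of one video image.
    Returns a (T, C) tensor of per-frame feature vectors (C = 2048 for ResNet50)."""
    with torch.no_grad():
        return backbone(frames)
```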
And S206, performing characteristic screening processing on the characteristic data corresponding to the video image to obtain target characteristic data.
Feature screening here may also be referred to as feature fusion, that is, a selection is made from all the feature data corresponding to the video image and the selected features are combined to form the target feature data. The feature screening may be performed with a Dynamic Routing Algorithm (DRA) alone, with a dynamic routing algorithm combined with a capsule network, or with a dynamic routing algorithm combined with another neural network; this is not specifically limited in this embodiment.
Specifically, after obtaining the feature data corresponding to the video image, the computer device may use a preset algorithm and/or network to screen the feature data. During the screening, feature data with a small contribution may be removed, that is, features corresponding to pictures of poor quality may be selectively discarded, while features that contribute strongly to the subsequent task are retained; the screened features are then combined to obtain the target feature data. When feature screening is performed, the target feature data obtained after screening may include part of the features of each video frame in the video image, part of the features of some video frames, or all of the features of some video frames, so the features of pictures of poor quality can be removed completely or partially. This avoids the problem that the features of poor-quality pictures (for example, blurred pictures, or pictures in which the object to be detected is blocked) participate too much in the subsequent detection task and thereby degrade its detection accuracy.
It should be noted that, in the embodiment, when performing the feature screening on the feature data of the video image, the feature screening may be performed on an image of any quality of each video frame of the video image, that is, regardless of the image quality of each video frame of the original video image, the method of the embodiment may be used to perform the feature screening, so as to obtain more accurate target feature data.
And S208, inputting the target characteristic data into a classification model for classification, and determining the class of the object to be detected.
The classification model here may be a neural network model, such as a convolutional neural network model. Its structure may be several fully connected layers plus a softmax layer, only a softmax layer, or another network structure. In addition, taking the case where the object to be detected is a person as an example, the category of the object to be detected may be the person's name, sex, age, occupation, and the like.
Of course, before the target feature data is classified, the classification model may also be trained. The training process may be: obtain a feature data set of sample video images, where the feature data set comprises the feature data of training video images and the labelled category of the object to be detected corresponding to each piece of feature data, and each video frame of a training video image contains the object to be detected; then train an initial classification model on the feature data set of the sample video images to obtain the trained classification model.
Specifically, after obtaining the target feature data corresponding to the video image, the computer device may input the target feature data corresponding to the video image into the trained classification model, so as to obtain the class of the object to be detected corresponding to the target feature data. The categories may be represented by numbers, letters, words, and the like.
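A hedged sketch of a classification head consistent with the description above (a few fully connected layers followed by a softmax layer); the hidden size, the number of classes and all names are illustrative assumptions rather than values taken from the patent.

```python
import torch.nn as nn

class Classifier(nn.Module):
    """Maps target feature data of shape (B, N) to class probabilities."""
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256),   # fully connected layers (sizes assumed)
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, target_features):
        logits = self.net(target_features)
        return logits.softmax(dim=-1)     # softmax layer producing class probabilities
```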
In the image detection method, a video image comprising at least two video frames is obtained, each video frame containing the object to be detected; feature extraction is performed on each video frame to obtain the feature data corresponding to the video image; feature screening is performed on this feature data to obtain target feature data; and the target feature data is input into a classification model to determine the category of the object to be detected. In the method, the feature screening process, which may also be called a feature fusion process, is applied to the features extracted from each video frame when the video image is classified, instead of blindly taking all features of every video frame into the subsequent classification. Features corresponding to frames of poor image quality can therefore be selectively removed while the features that contribute to the subsequent classification are retained, so the classification result, and hence the detection result, is more accurate. In addition, because the extracted features of each video frame undergo feature screening, a good detection result can be obtained regardless of the image quality of the individual frames of the original video image; that is, the method does not depend excessively on the image quality of the original video image, and therefore has a wider range of application.
In another embodiment, another image detection method is provided, and this embodiment relates to a specific process of how to extract features from each video frame of a video image to obtain target feature data corresponding to the video image. On the basis of the above embodiment, as shown in fig. 3, the above S204 may include the following steps:
and S302, respectively extracting the characteristics of each video frame of the video image to obtain a characteristic diagram corresponding to each video frame.
And S304, respectively performing dimension reduction processing on the feature maps corresponding to the video frames to obtain one-dimensional feature vectors corresponding to the video frames.
And S306, obtaining the feature data corresponding to the video image according to the one-dimensional feature vector corresponding to each video frame.
In this embodiment, after the video image is obtained, each video frame may be input into a neural network to obtain the feature map corresponding to that frame. The obtained feature map is generally a two-dimensional vector comprising the value and the coordinate of each pixel point on the feature map. The two-dimensional feature map of each video frame may first be converted into a one-dimensional feature vector: during the conversion, the value and coordinate of each pixel point are represented by one feature value, and the feature values of all pixel points on the feature map are concatenated in a fixed order, giving the one-dimensional vector corresponding to the feature map, recorded as the one-dimensional feature vector. Performing this operation on every video frame yields the one-dimensional feature vector corresponding to each frame. The feature data corresponding to the video image may then be obtained from these one-dimensional feature vectors, optionally by the following steps A1 and A2:
a1, splicing the one-dimensional feature vectors corresponding to the video frames to obtain two-dimensional spliced feature vectors; the two dimensions of the two-dimensional stitching feature vector include a one-dimensional feature vector dimension of each video frame and a number dimension of each video frame.
A2, splicing the two-dimensional splicing feature vectors and the number of the video images to obtain three-dimensional feature vectors, and determining the three-dimensional feature vectors as feature data corresponding to the video images; the third dimension of the three-dimensional feature vector is the number dimension of the video image.
In steps A1 and A2, the one-dimensional feature vectors corresponding to the video frames may be combined according to their order in the video image to obtain a two-dimensional vector with two dimensions, the number of video frames and the one-dimensional feature vector, recorded as the two-dimensional stitched feature vector. Meanwhile, since a neural network may take as input one picture sequence (video image) or a batch of picture sequences, the number of video images may also be combined, as one dimension, with the two-dimensional stitched feature vector to obtain a three-dimensional feature vector, which may also be called a three-dimensional tensor, and this is used as the feature data corresponding to the video image.
By way of example, assuming that the video image S includes T pictures, i.e., T video frames, the video image may be represented by the following formula (1):

S = [x1, x2, ..., xi, ..., xT]    (1)

where xi represents the i-th video frame in the video image.

After feature extraction is performed on each video frame, the one-dimensional feature vector obtained for each frame has length C. Combining C with the number of video frames T gives the two-dimensional stitched feature vector (C, T); if the number of input video images is B, combining B with (C, T) gives a three-dimensional feature vector, recorded as (B, C, T).
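A minimal sketch, assuming PyTorch, of how the per-frame one-dimensional feature vectors might be stacked into the (C, T) and (B, C, T) tensors described above; the function names are hypothetical.

```python
import torch

def stack_video_features(per_frame: torch.Tensor) -> torch.Tensor:
    """per_frame: (T, C) one-dimensional feature vectors of the T video frames.
    Returns the two-dimensional stitched feature vector of shape (C, T)."""
    return per_frame.transpose(0, 1)

def batch_video_features(videos: list) -> torch.Tensor:
    """videos: list of B tensors of shape (C, T), one per video image.
    Returns the three-dimensional feature tensor (B, C, T)."""
    return torch.stack(videos, dim=0)
```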
The image detection method provided by the embodiment can be used for respectively extracting the features of each video frame of a video image to obtain the feature map corresponding to each video frame; respectively carrying out dimension reduction processing on the feature maps corresponding to the video frames to obtain one-dimensional feature vectors corresponding to the video frames; and obtaining the characteristic data corresponding to the video image according to the one-dimensional characteristic vector corresponding to each video frame. In this embodiment, the two-dimensional feature maps corresponding to the video frames may be subjected to data transformation to obtain one-dimensional feature vectors, so that data dimensionality of all the video frames may be reduced, that is, dimensionality of feature data of the video image may be reduced, which facilitates subsequent processing of the feature data of the video image.
In another embodiment, another image detection method is provided, and the embodiment relates to a specific process of how to filter the video image feature data to obtain the target feature data. On the basis of the above embodiment, as shown in fig. 4, the above S206 may include the following steps:
s402, carrying out affine transformation processing on the feature data corresponding to the video image to obtain feature data after affine transformation; the affine-transformed feature data is four-dimensional feature data.
S404, performing feature screening processing on the feature data after affine transformation by adopting a capsule network and a dynamic routing algorithm in the capsule network to obtain target feature data.
In this embodiment, a method combining a capsule network with a dynamic routing algorithm is chosen for screening the feature data corresponding to the video image. Before the feature data is screened with the dynamic routing algorithm, it generally needs to be vectorized, because the feature data of the video image is generally made up of scalar values at the pixel points of each video frame image; that is, the scalar value at each pixel point of each video frame image needs to be converted into a vector at the corresponding pixel point.
When vectorizing the feature data corresponding to the video image, each of the C channels in (B, C, T) may be regarded as a vector neuron (capsule) of the capsule network, with each feature vector having an initial dimension of T. (B, C, T) is then subjected to an affine transformation, after which the dimensionality of the feature data of the video image (i.e., the picture sequence) changes: the three-dimensional tensor becomes a four-dimensional tensor, as shown in the following formula (2):

(B, C, T) → (B, C, N_capsule,next, D_next)    (2)
in formula (2), the four dimensions of the affine-transformed feature data include: the number dimension of the video images, the one-dimensional feature vector dimension of each video frame, the number dimension of the neurons on the next layer of the convolutional layer in the capsule network and the dimension of the neurons on the next layer of the convolutional layer in the capsule network; wherein, the next layer of the convolution layer in the capsule network is the next layer of the convolution layer in which the characteristic data corresponding to the video image is positioned.
The specific affine transformation process is as follows:
1. Copy the three-dimensional feature vector (three-dimensional tensor) (B, C, T) along a new third dimension N_capsule,next times, and stitch the N_capsule,next copies of the (B, C, T) tensor together to form the four-dimensional tensor (B, C, N_capsule,next, T); here N_capsule,next is the number of vector neurons on the next convolutional layer in the capsule network and D_next is the dimension of the vector neurons on that layer, both of which can be fixed when the capsule network structure is designed.
2. For the subsequent matrix multiplication, expand the four-dimensional tensor (B, C, N_capsule,next, T) into the five-dimensional tensor (B, C, N_capsule,next, T, 1).
3. Construct a five-dimensional affine transformation tensor (B, C, N_capsule,next, D_next, T).
4. Perform matrix multiplication on the fourth and fifth dimensions of the two five-dimensional tensors (B, C, N_capsule,next, T, 1) and (B, C, N_capsule,next, D_next, T), i.e. multiply (D_next, T) by (T, 1), to obtain the five-dimensional tensor (B, C, N_capsule,next, D_next, 1), as shown in formula (3):
(B, C, N_capsule,next, D_next, T) · (B, C, N_capsule,next, T, 1) = (B, C, N_capsule,next, D_next, 1)    (3)
5. Compress the five-dimensional tensor (B, C, N_capsule,next, D_next, 1) into a four-dimensional tensor, as shown in formula (4):
(B, C, N_capsule,next, D_next, 1) → (B, C, N_capsule,next, D_next)    (4)
Through the above affine transformation, the feature data corresponding to the video image is converted into vector data, that is, into four-dimensional feature data. It can be seen that during the affine process the number of video frames T is fused into the final four-dimensional feature data; since T carries the temporal information of the video image (the video frames are temporally ordered), the resulting four-dimensional feature data fuses information across the pictures and therefore contains more information.
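A minimal sketch of affine transformation steps 1-5 above, assuming PyTorch and a learnable weight tensor W of shape (C, N_next, D_next, T) that is broadcast over the batch; this is an illustrative reading of the procedure, not the patent's reference implementation.

```python
import torch

def affine_transform(x: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """x: (B, C, T) feature data of a batch of video images.
    W: (C, N_next, D_next, T) affine transformation weights (assumed, learnable).
    Returns the four-dimensional tensor (B, C, N_next, D_next)."""
    B, C, T = x.shape
    _, N_next, D_next, _ = W.shape
    x = x.unsqueeze(2).expand(B, C, N_next, T)   # step 1: copy along a new dimension
    x = x.unsqueeze(-1)                          # step 2: (B, C, N_next, T, 1)
    W = W.unsqueeze(0)                           # step 3: broadcastable (1, C, N_next, D_next, T)
    u_hat = torch.matmul(W, x)                   # step 4: (D_next, T) x (T, 1) per capsule
    return u_hat.squeeze(-1)                     # step 5: (B, C, N_next, D_next)
```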
After the four-dimensional feature data is obtained, feature screening (or feature fusion) can be performed on it using the capsule network and the dynamic routing algorithm within it. The dynamic routing algorithm is the core component of the capsule network structure; its essence is to use an iterative idea to dynamically select, from the lower-layer vector neurons, the neurons that are important for the downstream task and combine them into the upper-layer neurons that participate in the downstream task, thereby refining and optimizing the features. That is, optionally, the capsule network and the dynamic routing algorithm in the capsule network may be used to perform feature screening on the affine-transformed feature data to obtain initial feature data: the four-dimensional tensor (B, C, N_capsule,next, D_next) obtained after the affine transformation is taken as the input of the dynamic routing algorithm, and the output of the algorithm, (B, N_capsule,next, D_next), is recorded as the initial feature data, which is three-dimensional feature data. The three dimensions of the initial feature data are the number dimension of the video images, the number dimension of the neurons on the next convolutional layer in the capsule network, and the dimension of the neurons on that layer.
The dynamic routing algorithm is an algorithm that can automatically build its own routing table and adjust it in time as the actual situation changes. The capsule network is a novel neural network structure: as a new framework that imitates the human visual system, it obtains translational equivariance in place of the original translation invariance, and can achieve broader generalization with less data under different viewing angles.
When the specific dynamic routing algorithm is used for feature screening and fusion, the process is as follows:
Assume that the i-th vector of the lower layer (i = 1, 2, ..., C) and the j-th vector of the upper layer (j = 1, 2, ..., N_capsule,next) are denoted u_i and v_j, respectively, where u_i represents the four-dimensional feature data obtained after the affine transformation. v_j can be expressed by the following formula (5):

v_j = squash(Σ_i p_ij · u_i)    (5)

where

p_ij = exp(b_i) / Σ_m exp(b_m)

m is any value from 1 to C, and the squash function is an activation function (also called a compression function) used to characterize the degree of significance, or the magnitude of the contribution, of the extracted features; it can be expressed by the following formula (6):

squash(x) = (||x||^2 / (1 + ||x||^2)) · (x / ||x||)    (6)

where x is the content in the parentheses of formula (5), i.e. Σ_i p_ij · u_i.

Combining formula (5) and formula (6) above gives formula (7):

v_j = squash(Σ_i [exp(u_i · v_j) / Σ_m exp(u_m · v_j)] · u_i)    (7)

As can be seen, both sides of equation (7) involve v_j (the coefficients b_i, and hence p_ij, depend on u_i · v_j), so v_j must be computed iteratively. The specific iterative calculation process is as follows:

A. Initialize the coefficients of u_i as b_i = 0;

B. Iterate r times, each iteration performing the following steps, as summarized in formula (8):

p_i = softmax(b_i);  s_j = Σ_i p_ij · u_i;  v_j = squash(s_j);  b_i = b_i + u_i · v_j    (8)

where p_ij is the scale factor with which the feature of u_i is extracted into v_j; since there are several u_i, each u_i has its own p_i, so there is also a set of vectors p_i, and p_ij is an element (coefficient) of the vector p_i. r is a hyper-parameter that can be set manually according to the actual situation; its value should not be too large, so as to avoid overfitting, and 3 or 4 is generally chosen. s_j is the value of Σ_i p_ij · u_i mentioned above.

Through the above algorithm steps, v_j can finally be calculated; combining all the v_j gives the three-dimensional initial feature data (B, N_capsule,next, D_next).
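A sketch of the dynamic routing iteration described above (squash activation, r routing iterations, softmax over the C lower-layer capsules as in the formulas), assuming PyTorch; the names and the default r = 3 follow the text but remain illustrative, not the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def squash(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    # formula (6): (||x||^2 / (1 + ||x||^2)) * (x / ||x||)
    norm_sq = (x ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * x / (norm_sq.sqrt() + eps)

def dynamic_routing(u_hat: torch.Tensor, r: int = 3) -> torch.Tensor:
    """u_hat: (B, C, N_next, D_next) affine-transformed feature data.
    Returns the initial feature data of shape (B, N_next, D_next)."""
    B, C, N_next, _ = u_hat.shape
    b = torch.zeros(B, C, N_next, 1, device=u_hat.device)          # b_i = 0
    for _ in range(r):
        p = F.softmax(b, dim=1)                                     # p_i = softmax(b_i) over the C capsules
        s = (p * u_hat).sum(dim=1)                                  # s_j = sum_i p_ij * u_i
        v = squash(s, dim=-1)                                       # v_j = squash(s_j)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1, keepdim=True)  # b_i += u_i . v_j
    return v
```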
Using the capsule network and the dynamic routing algorithm for feature screening or fusion allows features to be screened and combined dynamically, and information such as homogeneity, hue, posture, albedo, texture, deformation, velocity and object position can be retained; that is, more image information is preserved, and the screened feature data contributes more to the subsequent image detection task, so the accuracy of its detection result can be improved.
After the three-dimensional initial feature data is obtained, in order to facilitate data processing in downstream tasks, the initial feature data may optionally be normalized to obtain the target feature data; the target feature data is two-dimensional feature data, whose two dimensions are the number dimension of the video images and the number dimension of the neurons on the next convolutional layer in the capsule network. That is, the last dimension of the three-dimensional initial feature data (B, N_capsule,next, D_next) obtained above is normalized, where the normalization may use the L2 norm, so that the normalized target feature data is obtained, which can be represented by formula (9):

(B, N_capsule,next, D_next) → (B, N_capsule,next)    (9)
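A small sketch, assuming PyTorch, of one reading of formula (9): taking the L2 norm over the last (capsule-dimension) axis collapses (B, N_capsule,next, D_next) into the two-dimensional target feature data (B, N_capsule,next).

```python
import torch

def to_target_features(v: torch.Tensor) -> torch.Tensor:
    """v: (B, N_next, D_next) initial feature data.
    Returns the target feature data of shape (B, N_next) via the L2 norm."""
    return v.norm(p=2, dim=-1)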
after the target characteristic data are obtained, the target characteristic data can be classified to obtain the category of the object to be detected.
The image detection method provided by this embodiment can perform affine transformation processing on the feature data corresponding to the video image to obtain feature data after affine transformation; the feature data after affine transformation is four-dimensional feature data, and feature screening processing is carried out on the feature data after affine transformation by adopting a capsule network and a dynamic routing algorithm in the capsule network to obtain target feature data. In this embodiment, the capsule network and the dynamic routing algorithm may be used to perform the feature screening processing on the feature data of the video image, and the feature data screened by the capsule network and the dynamic routing algorithm also contributes more to the subsequent image detection task, so that the accuracy of the detection result of the subsequent image detection task may be improved.
In another embodiment, in order to facilitate a more detailed description of the technical solution of the present application, the following description is given in conjunction with a more detailed embodiment, and the method may include the following steps S1-S9:
s1, acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected.
And S2, respectively extracting the features of each video frame of the video image to obtain a feature map corresponding to each video frame.
And S3, performing dimension reduction processing on the feature maps corresponding to the video frames respectively to obtain one-dimensional feature vectors corresponding to the video frames.
And S4, performing splicing processing on the one-dimensional feature vectors corresponding to the video frames to obtain two-dimensional splicing feature vectors.
And S5, splicing the two-dimensional splicing feature vectors and the number of the video images to obtain three-dimensional feature vectors, and determining the three-dimensional feature vectors as feature data corresponding to the video images.
S6, performing affine transformation processing on the feature data corresponding to the video image to obtain feature data after affine transformation; the affine-transformed feature data is four-dimensional feature data.
And S7, performing feature screening processing on the feature data after affine transformation by adopting the capsule network and a dynamic routing algorithm in the capsule network to obtain initial feature data.
And S8, carrying out normalization processing on the initial characteristic data to obtain target characteristic data.
And S9, inputting the target characteristic data into the classification model for classification, and determining the class of the object to be detected.
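For orientation only, the S1-S9 flow above could be composed from the illustrative helpers sketched in the earlier examples (extract_frame_features, stack_video_features, batch_video_features, affine_transform, dynamic_routing, to_target_features, Classifier); the glue code below is an assumption-laden sketch, not the claimed implementation.

```python
import torch

def detect(frames: torch.Tensor, W: torch.Tensor, classifier) -> torch.Tensor:
    """frames: (T, 3, H, W) video frames of one video image.
    W: affine transformation weights of shape (C, N_next, D_next, T) (assumed).
    Returns class probabilities for the object to be detected."""
    per_frame = extract_frame_features(frames)                    # S2-S3: (T, C)
    x = batch_video_features([stack_video_features(per_frame)])   # S4-S5: (1, C, T)
    u_hat = affine_transform(x, W)                                # S6: (1, C, N_next, D_next)
    v = dynamic_routing(u_hat)                                    # S7: (1, N_next, D_next)
    target = to_target_features(v)                                # S8: (1, N_next)
    return classifier(target)                                     # S9: category probabilities
```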
It should be understood that although the steps in the flowcharts of fig. 2-4 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict order restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-4 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed sequentially; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided an image detection apparatus including: an obtaining module 10, an extracting module 11, a screening module 12 and a classifying module 13, wherein:
an obtaining module 10, configured to obtain a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected;
the extraction module 11 is configured to perform feature extraction on each video frame of the video image to obtain feature data corresponding to the video image;
the screening module 12 is configured to perform feature screening processing on feature data corresponding to the video image to obtain target feature data;
and the classification module 13 is configured to input the target feature data into a classification model for classification, and determine a category of the object to be detected.
For specific limitations of the image detection apparatus, reference may be made to the above limitations of the image detection method, which are not described herein again.
In another embodiment, another image detection apparatus is provided, and on the basis of the above embodiment, the above extraction module 11 may include an extraction unit, a dimension reduction unit, and a determination unit, where:
the extraction unit is used for respectively extracting the characteristics of each video frame of the video image to obtain a characteristic image corresponding to each video frame;
the dimension reduction unit is used for respectively carrying out dimension reduction processing on the feature maps corresponding to the video frames to obtain one-dimensional feature vectors corresponding to the video frames;
and the determining unit is used for obtaining the feature data corresponding to the video image according to the one-dimensional feature vector corresponding to each video frame.
The determining unit is further configured to perform stitching processing on the one-dimensional feature vectors corresponding to the video frames to obtain two-dimensional stitching feature vectors; the two dimensions of the two-dimensional splicing feature vector comprise a one-dimensional feature vector dimension of each video frame and a quantity dimension of each video frame; splicing the two-dimensional splicing feature vectors and the number of the video images to obtain three-dimensional feature vectors, and determining the three-dimensional feature vectors as feature data corresponding to the video images; the third dimension of the three-dimensional feature vector is the number dimension of the video image.
In another embodiment, another image detection apparatus is provided, and on the basis of the above embodiment, the above screening module 12 may include an affine unit and a screening unit, where:
the affine unit is used for carrying out affine transformation processing on the feature data corresponding to the video image to obtain feature data after affine transformation; the feature data after affine transformation is four-dimensional feature data;
and the screening unit is used for carrying out feature screening processing on the feature data after affine transformation by adopting the capsule network and a dynamic routing algorithm in the capsule network to obtain target feature data.
Optionally, the four dimensions of the feature data after affine transformation include: the number dimension of the video images, the one-dimensional feature vector dimension of each video frame, the number dimension of the neurons on the next layer of the convolutional layer in the capsule network and the dimension of the neurons on the next layer of the convolutional layer in the capsule network; wherein, the next layer of the convolution layer in the capsule network is the next layer of the convolution layer in which the characteristic data corresponding to the video image is positioned.
Optionally, the screening unit is further configured to perform feature screening processing on the feature data after affine transformation by using a capsule network and a dynamic routing algorithm in the capsule network to obtain initial feature data; the initial characteristic data is three-dimensional characteristic data; normalizing the initial characteristic data to obtain target characteristic data; the target feature data is two-dimensional feature data.
Optionally, the three dimensions of the initial feature data include: the number dimension of the video images, the number dimension of the neurons on the next layer of the convolutional layer in the capsule network and the dimension of the neurons on the next layer of the convolutional layer in the capsule network; the two dimensions of the target feature data include: the number dimension of the video images and the number dimension of the neurons on the next layer of convolutional layers in the capsule network.
For specific limitations of the image detection apparatus, reference may be made to the above limitations of the image detection method, which are not described herein again. The modules in the image detection device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected; respectively extracting the characteristics of each video frame of the video image to obtain characteristic data corresponding to the video image; carrying out feature screening processing on feature data corresponding to the video image to obtain target feature data; and inputting the target characteristic data into a classification model for classification, and determining the class of the object to be detected.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
respectively extracting the characteristics of each video frame of the video image to obtain a characteristic image corresponding to each video frame; respectively carrying out dimension reduction processing on the feature maps corresponding to the video frames to obtain one-dimensional feature vectors corresponding to the video frames; and obtaining the characteristic data corresponding to the video image according to the one-dimensional characteristic vector corresponding to each video frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
splicing the one-dimensional characteristic vectors corresponding to the video frames to obtain two-dimensional spliced characteristic vectors; the two dimensions of the two-dimensional splicing feature vector comprise a one-dimensional feature vector dimension of each video frame and a quantity dimension of each video frame; splicing the two-dimensional splicing feature vectors and the number of the video images to obtain three-dimensional feature vectors, and determining the three-dimensional feature vectors as feature data corresponding to the video images; the third dimension of the three-dimensional feature vector is the number dimension of the video image.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
carrying out affine transformation processing on the feature data corresponding to the video image to obtain feature data after affine transformation; the feature data after affine transformation is four-dimensional feature data; and performing feature screening processing on the feature data after affine transformation by adopting a capsule network and a dynamic routing algorithm in the capsule network to obtain target feature data.
In one embodiment, the four dimensions of the affine-transformed feature data include: the number dimension of the video images, the one-dimensional feature vector dimension of each video frame, the number dimension of the neurons on the next layer of the convolutional layer in the capsule network and the dimension of the neurons on the next layer of the convolutional layer in the capsule network; wherein, the next layer of the convolution layer in the capsule network is the next layer of the convolution layer in which the characteristic data corresponding to the video image is positioned.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing feature screening processing on feature data after affine transformation by adopting a capsule network and a dynamic routing algorithm in the capsule network to obtain initial feature data; the initial characteristic data is three-dimensional characteristic data; normalizing the initial characteristic data to obtain target characteristic data; the target feature data is two-dimensional feature data.
In one embodiment, the three dimensions of the initial feature data include: the number dimension of the video images, the number dimension of the neurons on the next layer of the convolutional layer in the capsule network and the dimension of the neurons on the next layer of the convolutional layer in the capsule network; the two dimensions of the target feature data include: the number dimension of the video images and the number dimension of the neurons on the next layer of convolutional layers in the capsule network.
In one embodiment, a readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected; performing feature extraction on each video frame of the video image to obtain feature data corresponding to the video image; performing feature screening processing on the feature data corresponding to the video image to obtain target feature data; and inputting the target feature data into a classification model for classification, and determining the category of the object to be detected.
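For the final step, a minimal sketch of feeding the target feature data into a classification model follows; the embodiment does not specify the form of the classification model, so the fully connected layer with softmax used here, together with the class count and feature length, is an assumption for illustration.

# Classifying the two-dimensional target feature data (sketch).
import torch
import torch.nn as nn

num_classes = 5                                  # illustrative number of object categories
classifier = nn.Linear(10, num_classes)          # 10 = assumed length of the target feature vector

target_features = torch.randn(4, 10)             # (B, num_out) from the feature screening step
probs = torch.softmax(classifier(target_features), dim=-1)
pred = probs.argmax(dim=-1)                      # predicted category per video image
print(pred)  # e.g. tensor([3, 1, 4, 0])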
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing feature extraction on each video frame of the video image to obtain a feature map corresponding to each video frame; performing dimension reduction processing on the feature map corresponding to each video frame to obtain a one-dimensional feature vector corresponding to each video frame; and obtaining the feature data corresponding to the video image according to the one-dimensional feature vectors corresponding to the video frames.
In one embodiment, the computer program when executed by the processor further performs the steps of:
concatenating the one-dimensional feature vectors corresponding to the video frames to obtain a two-dimensional concatenated feature vector; the two dimensions of the two-dimensional concatenated feature vector comprise the one-dimensional feature vector dimension of each video frame and a number-of-frames dimension; and stacking the two-dimensional concatenated feature vectors of the video images to obtain a three-dimensional feature vector, and determining the three-dimensional feature vector as the feature data corresponding to the video images; the third dimension of the three-dimensional feature vector is the number dimension of the video images.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing affine transformation processing on the feature data corresponding to the video image to obtain affine-transformed feature data; the affine-transformed feature data is four-dimensional feature data; and performing feature screening processing on the affine-transformed feature data by using a capsule network and the dynamic routing algorithm of the capsule network to obtain the target feature data.
In one embodiment, the four dimensions of the affine-transformed feature data include: the number dimension of the video images, the one-dimensional feature vector dimension of each video frame, the number dimension of the neurons in the layer following the convolutional layer in the capsule network, and the dimensionality of the neurons in the layer following the convolutional layer in the capsule network; here, the layer following the convolutional layer in the capsule network is the layer immediately after the convolutional layer at which the feature data corresponding to the video image is located.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing feature screening processing on the affine-transformed feature data by using a capsule network and the dynamic routing algorithm of the capsule network to obtain initial feature data; the initial feature data is three-dimensional feature data; and normalizing the initial feature data to obtain the target feature data; the target feature data is two-dimensional feature data.
In one embodiment, the three dimensions of the initial feature data include: the number dimension of the video images, the number dimension of the neurons in the layer following the convolutional layer in the capsule network, and the dimensionality of the neurons in the layer following the convolutional layer in the capsule network; the two dimensions of the target feature data include: the number dimension of the video images and the number dimension of the neurons in the layer following the convolutional layer in the capsule network.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program. The computer program can be stored on a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include non-volatile and/or volatile memory.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination shall be regarded as falling within the scope of this specification as long as it involves no contradiction.
The above-mentioned embodiments express only several implementations of the present application and are described in relatively specific detail, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An image detection method, characterized in that the method comprises:
acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected;
performing feature extraction on each video frame of the video image to obtain feature data corresponding to the video image;
performing feature screening processing on the feature data corresponding to the video image to obtain target feature data;
and inputting the target feature data into a classification model for classification, and determining the category of the object to be detected.
2. The method according to claim 1, wherein the performing feature extraction on each video frame of the video image to obtain feature data corresponding to the video image comprises:
performing feature extraction on each video frame of the video image to obtain a feature map corresponding to each video frame;
respectively performing dimension reduction processing on the feature maps corresponding to the video frames to obtain one-dimensional feature vectors corresponding to the video frames;
and obtaining the feature data corresponding to the video image according to the one-dimensional feature vector corresponding to each video frame.
3. The method according to claim 2, wherein obtaining the feature data corresponding to the video image according to the one-dimensional feature vector corresponding to each video frame comprises:
concatenating the one-dimensional feature vectors corresponding to the video frames to obtain a two-dimensional concatenated feature vector; the two dimensions of the two-dimensional concatenated feature vector comprise the one-dimensional feature vector dimension of each video frame and a number-of-frames dimension;
and stacking the two-dimensional concatenated feature vectors of the video images to obtain a three-dimensional feature vector, and determining the three-dimensional feature vector as the feature data corresponding to the video images; the third dimension of the three-dimensional feature vector is the number dimension of the video images.
4. The method according to any one of claims 1 to 3, wherein the performing feature screening processing on the feature data corresponding to the video image to obtain the target feature data comprises:
performing affine transformation processing on the feature data corresponding to the video image to obtain affine-transformed feature data; the affine-transformed feature data is four-dimensional feature data;
and performing feature screening processing on the affine-transformed feature data by using a capsule network and the dynamic routing algorithm of the capsule network to obtain the target feature data.
5. The method of claim 4, wherein the four dimensions of the affine-transformed feature data comprise: the number dimension of the video images, the one-dimensional feature vector dimension of each video frame, the number dimension of the neurons in the layer following the convolutional layer in the capsule network, and the dimensionality of the neurons in the layer following the convolutional layer in the capsule network; and the layer following the convolutional layer in the capsule network is the layer immediately after the convolutional layer at which the feature data corresponding to the video image is located.
6. The method according to claim 5, wherein the performing feature screening processing on the affine-transformed feature data by using a capsule network and the dynamic routing algorithm of the capsule network to obtain target feature data comprises:
performing feature screening processing on the affine-transformed feature data by using a capsule network and the dynamic routing algorithm of the capsule network to obtain initial feature data; the initial feature data is three-dimensional feature data;
and normalizing the initial feature data to obtain the target feature data; the target feature data is two-dimensional feature data.
7. The method of claim 6, wherein the three dimensions of the initial feature data comprise: the number dimension of the video images, the number dimension of the neurons in the layer following the convolutional layer in the capsule network, and the dimensionality of the neurons in the layer following the convolutional layer in the capsule network; and the two dimensions of the target feature data comprise: the number dimension of the video images and the number dimension of the neurons in the layer following the convolutional layer in the capsule network.
8. An image detection apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a video image; the video image comprises at least two video frames, and each video frame of the video image comprises an object to be detected;
the extraction module is used for performing feature extraction on each video frame of the video image to obtain the feature data corresponding to the video image;
the screening module is used for performing feature screening processing on the feature data corresponding to the video image to obtain target feature data;
and the classification module is used for inputting the target feature data into a classification model for classification and determining the category of the object to be detected.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202010321181.3A 2020-04-22 2020-04-22 Image detection method, image detection device, computer equipment and storage medium Pending CN111507288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010321181.3A CN111507288A (en) 2020-04-22 2020-04-22 Image detection method, image detection device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010321181.3A CN111507288A (en) 2020-04-22 2020-04-22 Image detection method, image detection device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111507288A true CN111507288A (en) 2020-08-07

Family

ID=71874519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010321181.3A Pending CN111507288A (en) 2020-04-22 2020-04-22 Image detection method, image detection device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111507288A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464706A (en) * 2020-10-14 2021-03-09 鲁班嫡系机器人(深圳)有限公司 Fruit screening and sorting method, device, system, storage medium and equipment
CN112950579A (en) * 2021-02-26 2021-06-11 北京金山云网络技术有限公司 Image quality evaluation method and device and electronic equipment
CN113111770A (en) * 2021-04-12 2021-07-13 杭州赛鲁班网络科技有限公司 Video processing method, device, terminal and storage medium
CN113792701A (en) * 2021-09-24 2021-12-14 北京市商汤科技开发有限公司 Living body detection method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070067A (en) * 2019-04-29 2019-07-30 北京金山云网络技术有限公司 The training method of video classification methods and its model, device and electronic equipment
CN110110668A (en) * 2019-05-08 2019-08-09 湘潭大学 A kind of gait recognition method based on feedback weight convolutional neural networks and capsule neural network
US20190370972A1 (en) * 2018-06-04 2019-12-05 University Of Central Florida Research Foundation, Inc. Capsules for image analysis
CN110796656A (en) * 2019-11-01 2020-02-14 上海联影智能医疗科技有限公司 Image detection method, image detection device, computer equipment and storage medium
CN110909655A (en) * 2019-11-18 2020-03-24 上海眼控科技股份有限公司 Method and equipment for identifying video event
CN110991563A (en) * 2019-12-23 2020-04-10 青岛大学 Capsule network random routing algorithm based on feature fusion

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190370972A1 (en) * 2018-06-04 2019-12-05 University Of Central Florida Research Foundation, Inc. Capsules for image analysis
CN110070067A (en) * 2019-04-29 2019-07-30 北京金山云网络技术有限公司 The training method of video classification methods and its model, device and electronic equipment
CN110110668A (en) * 2019-05-08 2019-08-09 湘潭大学 A kind of gait recognition method based on feedback weight convolutional neural networks and capsule neural network
CN110796656A (en) * 2019-11-01 2020-02-14 上海联影智能医疗科技有限公司 Image detection method, image detection device, computer equipment and storage medium
CN110909655A (en) * 2019-11-18 2020-03-24 上海眼控科技股份有限公司 Method and equipment for identifying video event
CN110991563A (en) * 2019-12-23 2020-04-10 青岛大学 Capsule network random routing algorithm based on feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jaewoong Choi et al.: "Attention Routing Between Capsules", 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 5 March 2020 (2020-03-05) *
王雪 (Wang Xue): "Research on hyperspectral remote sensing image classification based on deep generative adversarial models", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II, vol. 2020, no. 04, 15 April 2020 (2020-04-15), pages 028-3 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464706A (en) * 2020-10-14 2021-03-09 鲁班嫡系机器人(深圳)有限公司 Fruit screening and sorting method, device, system, storage medium and equipment
CN112950579A (en) * 2021-02-26 2021-06-11 北京金山云网络技术有限公司 Image quality evaluation method and device and electronic equipment
CN112950579B (en) * 2021-02-26 2024-05-31 北京金山云网络技术有限公司 Image quality evaluation method and device and electronic equipment
CN113111770A (en) * 2021-04-12 2021-07-13 杭州赛鲁班网络科技有限公司 Video processing method, device, terminal and storage medium
CN113111770B (en) * 2021-04-12 2022-09-13 杭州赛鲁班网络科技有限公司 Video processing method, device, terminal and storage medium
CN113792701A (en) * 2021-09-24 2021-12-14 北京市商汤科技开发有限公司 Living body detection method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110889325B (en) Multitasking facial motion recognition model training and multitasking facial motion recognition method
Zhou et al. Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN109902548B (en) Object attribute identification method and device, computing equipment and system
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN111507288A (en) Image detection method, image detection device, computer equipment and storage medium
CN107818554B (en) Information processing apparatus and information processing method
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
WO2019200735A1 (en) Livestock feature vector acquisition method, apparatus, computer device and storage medium
CN111723691B (en) Three-dimensional face recognition method and device, electronic equipment and storage medium
WO2020098257A1 (en) Image classification method and device and computer readable storage medium
WO2021238548A1 (en) Region recognition method, apparatus and device, and readable storage medium
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN111695463B (en) Training method of face impurity detection model and face impurity detection method
CN109801275B (en) Potato disease detection method and system based on image recognition
CN112884782B (en) Biological object segmentation method, apparatus, computer device, and storage medium
CN108428224B (en) Animal body surface temperature detection method and device based on convolutional neural network
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN113569598A (en) Image processing method and image processing apparatus
US20230401838A1 (en) Image processing method and related apparatus
CN111382791B (en) Deep learning task processing method, image recognition task processing method and device
Zhou et al. Adaptive weighted locality-constrained sparse coding for glaucoma diagnosis
CN113496148A (en) Multi-source data fusion method and system
Chen et al. Face super resolution based on parent patch prior for VLQ scenarios

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination