CN113869205A - Object detection method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN113869205A CN113869205A CN202111138313.XA CN202111138313A CN113869205A CN 113869205 A CN113869205 A CN 113869205A CN 202111138313 A CN202111138313 A CN 202111138313A CN 113869205 A CN113869205 A CN 113869205A
- Authority
- CN
- China
- Prior art keywords
- video frame
- decoding
- feature
- attention parameter
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present disclosure provides an object detection method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence, in particular to computer vision and deep learning technologies, applicable to smart city and intelligent traffic scenarios. The scheme is as follows: for a target video frame in a video to be detected, an encoder encodes the target video frame to obtain a first encoding feature; a preset first decoding feature is acquired, or the first decoding feature is determined from the feature obtained when a decoder decoded the video frame preceding the target video frame; the decoder decodes the first encoding feature and the first decoding feature to obtain a second decoding feature; and a fully connected layer performs object prediction according to the second decoding feature to obtain an annotation result for the target video frame, the annotation result comprising a detection box of the target object and the category of the target object in the detection box. Each object in the video to be detected is thus recognized based on deep learning techniques, which can improve the accuracy of the detection result.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to computer vision and deep learning technologies particularly applicable to smart city and intelligent traffic scenarios, and specifically to an object detection method and apparatus, an electronic device, and a storage medium.
Background
In smart city and intelligent traffic scenarios, accurately detecting objects or targets in video, such as vehicles, pedestrians, and other items, supports tasks such as abnormal event detection, fugitive tracking, and vehicle statistics. How to detect targets in video is therefore very important.
Disclosure of Invention
The present disclosure provides an object detection method, an apparatus, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided an object detection method including:
acquiring a video to be detected;
for a target video frame in the video to be detected, encoding the target video frame with an encoder in an object detection model to obtain a first encoding feature;
acquiring a preset first decoding feature, or determining the first decoding feature according to a feature obtained when a decoder in the object detection model decoded the video frame preceding the target video frame;
decoding the first encoding feature and the first decoding feature with the decoder to obtain a second decoding feature;
and performing object prediction with a fully connected layer in the object detection model according to the second decoding feature to obtain an annotation result for the target video frame, the annotation result comprising a detection box of a target object and the category of the target object in the detection box.
According to another aspect of the present disclosure, there is provided an object detecting apparatus including:
the first acquisition module is used for acquiring a video to be detected;
the encoding module, configured to encode a target video frame in the video to be detected with an encoder in an object detection model to obtain a first encoding feature;
the second acquisition module, configured to acquire a preset first decoding feature, or determine the first decoding feature according to a feature obtained when a decoder in the object detection model decoded the video frame preceding the target video frame;
the decoding module, configured to decode the first encoding feature and the first decoding feature with the decoder to obtain a second decoding feature;
and the prediction module, configured to perform object prediction with a fully connected layer in the object detection model according to the second decoding feature to obtain an annotation result for the target video frame, the annotation result comprising a detection box of a target object and the category of the target object in the detection box.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of object detection as set forth in the above aspect of the disclosure.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the object detection method set forth in the above aspect of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the object detection method set forth in the above-mentioned aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of an object detection method according to a first embodiment of the disclosure;
fig. 2 is a schematic flowchart of an object detection method according to a second embodiment of the disclosure;
fig. 3 is a schematic flowchart of an object detection method according to a third embodiment of the disclosure;
FIG. 4 is a schematic diagram of the basic principle of an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an object detection apparatus according to a fourth embodiment of the present disclosure;
FIG. 6 illustrates a schematic block diagram of an example electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In video object detection, missed detections and/or false detections of an object may occur, and they may persist across consecutive video frames. It is therefore very important to enhance the model's capability so as to reduce missed and false detections of objects.
In the related art, targets in a video can be detected with the following target detection techniques:
first, the DETR (Detection Transformer, visual version of Transformer) model: using a common backbone network backbone (such as a residual error network ResNet, etc.), extracting features of a video frame to generate a feature map, inputting the feature map into a model using a transform as a basic mechanism to perform encoding-decoding, and finally outputting a detection frame and a category of a target object in the detection frame.
Second, the Deformable DETR model: a modified version of the DETR model that uses an FPN (Feature Pyramid Network) to increase the number of features input to the Transformer and uses deformable convolution (conv) layers to accelerate the Transformer structure.
Third, relational modeling is performed on the detection candidate boxes (proposals) of preceding and following video frames in the video in an attention manner, so as to improve the accuracy of the final detection result.
Fourth, spatial and temporal information in the video is associated through a spatio-temporal encoding scheme so as to segment the targets in the video.
These in-video target detection methods mainly use a two-stage approach: an RPN (Region Proposal Network) extracts candidate boxes (proposals), and relation modeling over the temporal and spatial domains (mainly via attention, a GCN (Graph Convolutional Network), or non-local operations) is then applied to the proposals to improve the target detection effect. That is, conventional Transformer-based video object detection/segmentation methods mainly improve the detection/segmentation capability of the model by associating features between different video frames, but do not consider fusing information between different video frames at the query level.
In view of the above problems, the present disclosure provides an object detection method, an object detection apparatus, an electronic device, and a storage medium.
An object detection method, an apparatus, an electronic device, and a storage medium of the embodiments of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of an object detection method according to a first embodiment of the disclosure.
The object detection method is exemplified by being configured in an object detection apparatus, which can be applied to any electronic device, so that the electronic device can perform an object detection function.
The electronic device may be any device with computing capability, for example, a personal computer, a mobile terminal, a server, and the like, and the mobile terminal may be a hardware device with various operating systems, touch screens, and/or display screens, such as an in-vehicle device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and the like.
As shown in fig. 1, the object detection method may include the steps of:
Step 101: a video to be detected is acquired.

In this disclosure, the video to be detected may be a video acquired online, for example through web crawler technology; a video acquired offline; a video stream captured in real time; or an artificially synthesized video, among others. This disclosure is not limited in this respect.
Step 102: for a target video frame in the video to be detected, the target video frame is encoded with an encoder in an object detection model to obtain a first encoding feature.
In this embodiment of the present disclosure, the target video frame may be any video frame in the video to be detected, or it may be a key frame in the video to be detected, among other options; this disclosure is not limited in this respect.
In the embodiment of the present disclosure, the object detection model is configured to identify each target object in the target video frame and output a detection box for each target object and the category of the target object in the detection box, where the categories may include vehicle, person, and the like.
In the embodiment of the present disclosure, the structure of the object detection model is not limited, for example, the object detection model may be a model with a Transformer as a basic structure, or may also be a model with another structure, such as a model with a Transformer variant structure.
In the embodiment of the present disclosure, the object detection model is a trained model, for example, the initial object detection model may be trained based on a machine learning technique or a deep learning technique, so that the trained object detection model can learn a corresponding relationship between a video frame or an image and a labeling result, where the labeling result may include a detection frame of an object and a category of the object in the detection frame, and the object may include any object such as a vehicle, a person, an object, an animal, and the like.
In the embodiment of the present disclosure, for a target video frame in a video to be detected, an encoder in an object detection model may be used to encode the target video frame, so as to obtain a first encoding characteristic of the target video frame.
Step 103: a preset first decoding feature is acquired, or the first decoding feature is determined according to a feature obtained when the decoder in the object detection model decoded the video frame preceding the target video frame.

In a possible implementation manner of the embodiment of the present disclosure, when the target video frame is not the first video frame in the video to be detected, the feature obtained when the decoder in the object detection model decoded the video frame preceding the target video frame may be acquired, and the first decoding feature determined according to that feature. For example, the feature obtained by decoding the preceding video frame may be used directly as the first decoding feature.
In another possible implementation manner of the embodiment of the present disclosure, when the target video frame is the first video frame in the video to be detected, since there is no preceding video frame to reference, a preset first decoding feature may be acquired.
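A minimal sketch of this branch logic follows. It assumes the preset feature is, for example, a learned query embedding; the patent does not specify how the preset first decoding feature is obtained, so the names and shapes here are illustrative only.

```python
import numpy as np

def select_first_decoding_feature(frame_index, prev_decoder_output, preset_feature):
    # For the first frame there is no preceding frame to reference, so a
    # preset first decoding feature is used; for later frames the decoder
    # output obtained for the previous frame serves as the first decoding
    # feature. Names here are illustrative, not from the patent.
    if frame_index == 0 or prev_decoder_output is None:
        return preset_feature
    return prev_decoder_output

# usage: 4 queries of dimension 8, purely illustrative sizes
preset = np.zeros((4, 8))
prev = np.ones((4, 8))
```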
Step 104: the first encoding feature and the first decoding feature are decoded with the decoder to obtain a second decoding feature.
In embodiments of the present disclosure, the decoder may be employed to decode the first encoding feature and the first decoding feature to obtain the second decoding feature.
Step 105: object prediction is performed with a fully connected layer in the object detection model according to the second decoding feature to obtain an annotation result for the target video frame, the annotation result comprising a detection box of the target object and the category of the target object in the detection box.
In the disclosed embodiment, the target object may include any object such as a vehicle, a person, an object, an animal, and the like.
In the embodiment of the present disclosure, the FC (Fully Connected) layer in the object detection model may be adopted to perform object prediction according to the second decoding feature, so as to obtain the annotation result of the target video frame. The annotation result may include a detection box for each target object in the target video frame and the category of the target object in each detection box.
It is understood that the target video frame may contain at least one target object, for example multiple vehicles and/or multiple pedestrians, and the annotation result may therefore include at least one detection box together with the category of the target object in each detection box.
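A minimal sketch of such a fully connected prediction head, in the style of DETR-type detectors: one linear map per query to 4 box coordinates and one to class scores. The dimensions, the sigmoid on box coordinates, and the extra "no object" class are assumptions for illustration; the patent does not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, num_classes = 8, 3  # illustrative sizes, not from the patent

# Untrained stand-in weights for the FC head: one map to 4 box values,
# one to class logits (num_classes + 1 to allow a "no object" class).
W_box = rng.standard_normal((d_model, 4))
W_cls = rng.standard_normal((d_model, num_classes + 1))

def predict(decoded):
    # decoded: (num_queries, d_model) second decoding feature
    boxes = 1.0 / (1.0 + np.exp(-(decoded @ W_box)))  # sigmoid -> normalized boxes
    logits = decoded @ W_cls                          # per-query class scores
    return boxes, logits
```

Each query thus yields one candidate detection box plus a category score vector, matching the "at least one detection box per frame" behavior described above.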
As an application scenario, the video streams captured at intersections can be acquired as the video to be detected, so that each object (such as a vehicle or a pedestrian) in the video can be detected with the object detection model to obtain an annotation result, according to which vehicle statistics (such as traffic flow statistics and violation statistics), abnormal event detection (such as detecting vehicles and pedestrians violating traffic rules), fugitive tracking, and the like can then be performed.
With the object detection method of the embodiments of the present disclosure, an encoder in an object detection model encodes a target video frame in a video to be detected to obtain a first encoding feature; a preset first decoding feature is acquired, or the first decoding feature is determined from the feature obtained when a decoder in the object detection model decoded the video frame preceding the target video frame; the decoder then decodes the first encoding feature and the first decoding feature to obtain a second decoding feature; and finally a fully connected layer in the object detection model performs object prediction according to the second decoding feature to obtain an annotation result for the target video frame, the annotation result comprising a detection box of the target object and the category of the target object in the detection box. Each object in the video to be detected is thus recognized based on deep learning techniques, which can improve the accuracy of the detection result.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good morals.
In order to clearly illustrate how the second decoding characteristic is obtained in the above embodiments of the present disclosure, the present disclosure also proposes an object detection method.
Fig. 2 is a schematic flowchart of an object detection method according to a second embodiment of the disclosure.
As shown in fig. 2, the object detection method may include the steps of:
The execution process of steps 201 to 203 may refer to the execution process of any of the above embodiments, which is not described herein again.
In the embodiment of the present disclosure, the first encoding feature may be input to a first fully connected layer and a second fully connected layer in the decoder, respectively; the feature vector output by the first fully connected layer may be used as a first attention parameter, and the feature vector output by the second fully connected layer as a second attention parameter. For example, the first attention parameter may be the Key parameter (or Key feature) in the attention mechanism, and the second attention parameter may be the Value parameter (or Value feature).
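A minimal sketch of these two fully connected layers, with plain NumPy weight matrices standing in for trained layers; the dimension `d_model` and the random weights are placeholders, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # illustrative size; the patent does not specify dimensions

# Two fully connected layers in the decoder, here as weight matrices plus
# biases: the first maps the encoder output to the Key attention parameter,
# the second maps it to the Value attention parameter.
W_key, b_key = rng.standard_normal((d_model, d_model)), np.zeros(d_model)
W_value, b_value = rng.standard_normal((d_model, d_model)), np.zeros(d_model)

def project_kv(encoded):
    # encoded: (seq_len, d_model) first encoding feature
    key = encoded @ W_key + b_key        # first FC layer -> first attention parameter
    value = encoded @ W_value + b_value  # second FC layer -> second attention parameter
    return key, value
```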
In step 206, a third attention parameter is determined according to the first decoding feature.
In the disclosed embodiment, the third attention parameter may be a Query parameter or a Query feature in the attention mechanism.
In a possible implementation manner of the embodiment of the present disclosure, the first decoding characteristic may be used as a third attention parameter corresponding to the target video frame.
In another possible implementation manner of the embodiment of the present disclosure, a third attention parameter corresponding to a video frame previous to the target video frame may be obtained, and the first decoding feature and the third attention parameter corresponding to the video frame previous to the target video frame are subjected to weighting processing to obtain the third attention parameter corresponding to the target video frame.
As an example, when the target video frame is the first video frame in the video to be detected, i.e., the 1st video frame, the first decoding feature may be a preset feature, and this preset first decoding feature may be used as the third attention parameter (such as the Query parameter) corresponding to the target video frame.
As another example, when the target video frame is not the first video frame in the video to be detected, i.e., it is the nth video frame with n a positive integer greater than 1, the first decoding feature may be the feature obtained when the decoder decoded the (n-1)th video frame. That is, the decoder may determine the first decoding feature corresponding to the (n-1)th video frame according to a second encoding feature, obtained by the encoder encoding the (n-1)th video frame, and the third attention parameter corresponding to the (n-1)th video frame. Specifically, the second encoding feature may be input to the first and second fully connected layers in the decoder, the feature vector output by the first fully connected layer taken as the first attention parameter corresponding to the (n-1)th video frame, and the feature vector output by the second fully connected layer taken as the second attention parameter corresponding to the (n-1)th video frame, so that the first decoding feature corresponding to the (n-1)th video frame may be determined according to the first, second, and third attention parameters corresponding to the (n-1)th video frame.
In the present disclosure, the first decoding feature (obtained from decoding the (n-1)th video frame) and the third attention parameter corresponding to the (n-1)th video frame may be subjected to weighted summation to obtain the third attention parameter corresponding to the nth video frame. Alternatively, that first decoding feature can be used directly as the third attention parameter corresponding to the nth video frame.
Therefore, the third attention parameter can be determined based on different modes, and the flexibility and the applicability of the method can be improved.
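The weighted-combination option can be sketched as follows. The mixing weight `alpha` is an assumption, since the patent does not fix the weighting; `alpha = 1.0` reduces to the alternative of reusing the previous frame's decoding feature directly as the new Query.

```python
import numpy as np

def update_query(prev_decoding, prev_query, alpha=0.5):
    # Weighted sum of the previous frame's decoding feature and its Query
    # parameter to obtain the Query for the current frame. alpha is an
    # assumed mixing weight; alpha = 1.0 means the decoding feature is
    # reused directly as the new Query.
    return alpha * prev_decoding + (1.0 - alpha) * prev_query
```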
And step 207, determining a second decoding characteristic corresponding to the target video frame according to the first attention parameter, the second attention parameter and the third attention parameter.
In an embodiment of the disclosure, the decoder may determine, based on the attention mechanism, a second decoding characteristic corresponding to the target video frame according to the first attention parameter, the second attention parameter, and the third attention parameter.
In a possible implementation manner of the embodiment of the present disclosure, normalization may be performed after inner product of a third attention parameter corresponding to a target video frame and a first attention parameter is performed, so as to obtain an attention weight, and a second attention parameter is weighted according to the attention weight, so as to obtain a second decoding feature corresponding to the target video frame. Therefore, according to the attention mechanism, the second decoding characteristic corresponding to the target video frame is determined, and the reliability of the decoding characteristic determination result can be improved.
That is, the second decoding feature may be determined according to the following formula:

Attention(Q, K, V) = softmax(QK^T / √(d_k)) · V

where Q represents the third attention parameter (i.e., the Query parameter), K represents the first attention parameter (i.e., the Key parameter), V represents the second attention parameter (i.e., the Value parameter), d_k represents the normalization factor, and T represents the matrix transpose operation.
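A minimal NumPy sketch of this scaled dot-product attention, matching the description above: inner product of Query and Key, normalization via softmax to obtain attention weights, then a weighted sum of the Value features.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (num_queries, d_k); K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # inner product of Query and Key, scaled
    # softmax normalization over the keys yields the attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # Value features weighted by attention
```

With all-zero queries the scores are uniform, so each output row is simply the mean of the Value rows, which makes the normalization easy to check by hand.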
And 208, performing object prediction by using a full connection layer in the object detection model according to the second decoding characteristic to obtain an annotation result of the target video frame, wherein the annotation result comprises a detection frame of the target object and the category of the target object in the detection frame.
The execution process of step 208 may refer to the execution process of any of the above embodiments, and is not described herein again.
With the object detection method of the embodiment of the present disclosure, the first encoding feature is input to a first fully connected layer and a second fully connected layer in the decoder, respectively; the feature vector output by the first fully connected layer is taken as the first attention parameter and the feature vector output by the second fully connected layer as the second attention parameter; a third attention parameter is determined according to the first decoding feature; and the second decoding feature corresponding to the target video frame is determined according to the first, second, and third attention parameters. The decoder thus decodes the first encoding feature and the first decoding feature based on the attention mechanism to obtain the second decoding feature, which can improve the accuracy and reliability of the determined decoding feature.
In order to clearly illustrate how the target video frame is encoded in the above embodiments of the present disclosure, the present disclosure further provides an object detection method.
Fig. 3 is a schematic flowchart of an object detection method according to a third embodiment of the present disclosure.
As shown in fig. 3, the object detection method may include the steps of:
The execution process of step 301 may refer to the execution process of the above embodiment, which is not described herein again.
Step 302: feature extraction is performed on the target video frame to obtain a first image feature.

In the embodiment of the disclosure, feature extraction may be performed on the target video frame based on a feature extraction technique to obtain the first image feature.
In a possible implementation manner of the embodiment of the disclosure, in order to improve accuracy and reliability of a feature extraction result, feature extraction may be performed on a target video frame based on a deep learning technique to obtain a first image feature.
As an example, feature extraction may be performed on a target video frame using a mainstream backbone network (backbone) such as a residual network (ResNet) to obtain a first image feature. For example, a CNN (Convolutional Neural Network) shown in fig. 4 may be used to perform feature extraction on the target video frame to obtain the first image feature.
Step 303: the first image feature is subjected to block processing to obtain serialized feature vectors.

In the embodiment of the present disclosure, the first image feature may be divided into blocks to obtain a serialized sequence of feature vectors.
As an example, as shown in fig. 4, the image feature output by the CNN network may be a stereo image feature of size C (channel) × H (height) × W (width), that is, the CNN output has data size (channel, height, width). This image feature may be converted into a serialized sequence of feature vectors, for example into H × W C-dimensional feature vectors, i.e., one C-dimensional feature vector for each small block shown before the encoder in fig. 4.
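The conversion from the C × H × W stereo feature to H × W C-dimensional vectors is a simple reshape and transpose; a sketch:

```python
import numpy as np

def to_sequence(feature_map):
    # feature_map: (C, H, W) stereo image feature from the CNN backbone
    c, h, w = feature_map.shape
    # flatten the spatial grid: one C-dimensional feature vector per
    # small block, giving a (H*W, C) serialized sequence
    return feature_map.reshape(c, h * w).T
```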
And step 304, encoding the serialized feature vectors by using an encoder in the object detection model to obtain first encoding features.
In the embodiment of the present disclosure, the serialized feature vector may be encoded by an encoder in the object detection model to obtain the serialized first encoded feature.
As an example, as shown in fig. 4, the first encoding feature is also a serialized feature vector.
Step 307, performing object prediction by using a fully-connected layer in the object detection model according to the second decoding feature to obtain an annotation result of the target video frame, where the annotation result includes a detection frame of the target object and the category of the target object in the detection frame.
The execution process of steps 304 to 307 may refer to the execution process of any of the above embodiments, which is not described herein again.
As an example, take the object detection model to be a model with a Transformer as its basic structure, as shown in fig. 4. When the target video frame is the first video frame in the video to be detected, that is, when n = 1, feature extraction may be performed on the video frame by the CNN network to obtain a C × H × W stereo image feature, and block processing may then be performed on the stereo image feature to obtain a serialized feature vector sequence, that is, H × W C-dimensional feature vectors. The serialized feature vectors are input to the encoder for attention learning, the obtained feature vector sequence is input to the decoder, and the decoder performs attention learning according to the input feature vector sequence and the set first decoding feature (that is, the initial third attention parameter, such as the Query parameter). The obtained decoding feature is then subjected to final object detection by an FFN (Feed-Forward Network); for example, the decoding feature may be input to the FC (fully-connected) layer, and the FC layer performs object prediction according to the decoding feature to obtain the detection frame of the target object and the category of the target object in the detection frame.
When n is greater than or equal to 2, for the nth video frame, the decoding feature obtained by the decoder decoding the (n-1)th video frame may be used as the third attention parameter corresponding to the nth video frame; alternatively, the third attention parameter corresponding to the (n-1)th video frame and the decoding feature obtained by the decoder decoding the (n-1)th video frame may be subjected to weighted summation to obtain the third attention parameter corresponding to the nth video frame. The decoder may then perform attention learning according to the input feature sequence and the third attention parameter corresponding to the nth video frame; the obtained decoding feature is input to the FC layer, and the FC layer performs object prediction according to the decoding feature to obtain the detection frame of the target object and the category of the target object in the detection frame.
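The two ways of deriving the query for frame n just described can be sketched as follows. The function name and the mixing weight `alpha` are hypothetical, since the disclosure does not fix the weighting coefficients:

```python
def propagate_query(prev_query, prev_decoded, mode="decoded", alpha=0.5):
    """Derive the third attention parameter (Query) for frame n from frame n-1.

    mode="decoded": reuse the decoder output of frame n-1 directly.
    mode="weighted": weighted sum of frame n-1's query and its decoder
    output (alpha is a hypothetical weight, not fixed by the disclosure).
    """
    if mode == "decoded":
        return prev_decoded
    elif mode == "weighted":
        return [alpha * q + (1 - alpha) * d
                for q, d in zip(prev_query, prev_decoded)]
    raise ValueError(f"unknown mode: {mode}")
```

Either option carries information about objects detected in frame n-1 into the decoding of frame n, which is what lets the model track the same targets across frames.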
In conclusion, by enhancing the Query feature of the Transformer in the decoding part, the target object in the video can be detected, so that objects such as vehicles and pedestrians in smart city and intelligent traffic scenes can be accurately detected.
According to the object detection method, the first image feature is obtained by performing feature extraction on the target video frame; the first image feature is subjected to block processing to obtain a serialized feature vector; and the serialized feature vector is encoded by the encoder to obtain the first encoding feature. Block processing the first image feature into a serialized feature vector thus meets the input requirement of the encoder, ensuring that the encoder can effectively encode the image to obtain the first encoding feature.
Corresponding to the object detection method provided in the embodiments of fig. 1 to 3, the present disclosure also provides an object detection apparatus, and since the object detection apparatus provided in the embodiments of the present disclosure corresponds to the object detection method provided in the embodiments of fig. 1 to 3, the implementation manner of the object detection method is also applicable to the object detection apparatus provided in the embodiments of the present disclosure, and is not described in detail in the embodiments of the present disclosure.
Fig. 5 is a schematic structural diagram of an object detection apparatus according to a fourth embodiment of the present disclosure.
As shown in fig. 5, the object detecting apparatus 500 may include: a first acquisition module 510, an encoding module 520, a second acquisition module 530, a decoding module 540, and a prediction module 550.
The first obtaining module 510 is configured to obtain a video to be detected.
The encoding module 520 is configured to encode, by using an encoder in the object detection model, a target video frame in the video to be detected to obtain a first encoding characteristic.
The second obtaining module 530 is configured to obtain a set first decoding feature, or determine the first decoding feature according to a feature obtained by a decoder in the object detection model decoding a video frame before the target video frame.
A decoding module 540, configured to decode the first encoded feature and the first decoded feature with a decoder to obtain a second decoded feature.
And a predicting module 550, configured to perform object prediction according to the second decoding characteristic by using a full link layer in the object detection model, so as to obtain an annotation result of the target video frame, where the annotation result includes a detection frame of the target object and a category of the target object in the detection frame.
In a possible implementation manner of the embodiment of the present disclosure, the decoding module 540 may include:
an input unit for inputting the first coding feature to a first full-link layer and a second full-link layer in a decoder, respectively.
And the processing unit is used for taking the feature vector output by the first full connection layer as a first attention parameter and taking the feature vector output by the second full connection layer as a second attention parameter.
A first determining unit for determining a third attention parameter based on the first decoding feature.
And the second determining unit is used for determining a second decoding characteristic corresponding to the target video frame according to the first attention parameter, the second attention parameter and the third attention parameter.
In a possible implementation manner of the embodiment of the present disclosure, the second determining unit is specifically configured to: perform an inner product of the third attention parameter and the first attention parameter and normalize the result to obtain an attention weight value; and weight the second attention parameter according to the attention weight value to obtain the second decoding feature corresponding to the target video frame.
In a possible implementation manner of the embodiment of the present disclosure, the first determining unit is specifically configured to: and taking the first decoding characteristic as a third attention parameter corresponding to the target video frame.
In a possible implementation manner of the embodiment of the present disclosure, the first determining unit is specifically configured to: acquiring a third attention parameter corresponding to a previous frame video frame of the target video frame; and performing weighting processing on the first decoding characteristic and the third attention parameter corresponding to the previous frame of video frame to obtain the third attention parameter corresponding to the target video frame.
In a possible implementation manner of the embodiment of the present disclosure, the encoding module 520 is specifically configured to: performing feature extraction on a target video frame to obtain a first image feature; the first image features are subjected to blocking processing to obtain serialized feature vectors; and encoding the serialized feature vectors by using an encoder to obtain first encoding features.
The object detection device of the embodiment of the disclosure obtains a first coding feature by coding a target video frame in a to-be-detected video by using a coder in an object detection model, and obtains a set first decoding feature, or determines the first decoding feature according to a feature obtained by decoding a video frame of a previous frame of the target video frame by using a decoder in the object detection model, and then decodes the first coding feature and the first decoding feature by using the decoder to obtain a second decoding feature, and finally performs object prediction by using a full link layer in the object detection model according to the second decoding feature to obtain a labeling result of the target video frame, wherein the labeling result includes a detection frame of the target object and a category of the target object in the detection frame. Therefore, each object in the video to be detected is identified and obtained based on the deep learning technology, and the accuracy of the detection result can be improved.
To implement the above embodiments, the present disclosure also provides an electronic device, which may include at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the object detection method according to any of the embodiments of the disclosure.
In order to achieve the above embodiments, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the object detection method proposed by any one of the above embodiments of the present disclosure.
In order to implement the above embodiments, the present disclosure also provides a computer program product, which includes a computer program that, when executed by a processor, implements the object detection method proposed by any of the above embodiments of the present disclosure.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device that can be used to implement embodiments of the present disclosure. The electronic device may include the server and the client in the above embodiments. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 602 or a computer program loaded from a storage unit 608 into a RAM (Random Access Memory) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An I/O (Input/Output) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, and the like. The computing unit 601 executes the respective methods and processes described above, such as the above-described object detection method. For example, in some embodiments, the object detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the object detection method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the above-described object detection method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be realized in digital electronic circuitry, Integrated circuitry, FPGAs (Field Programmable Gate arrays), ASICs (Application-Specific Integrated circuits), ASSPs (Application Specific Standard products), SOCs (System On Chip, System On a Chip), CPLDs (Complex Programmable Logic devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in a conventional physical host and a VPS (Virtual Private Server). The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that artificial intelligence is the discipline of making computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
According to the technical scheme of the embodiment of the disclosure, a target video frame in a video to be detected is encoded by an encoder in an object detection model to obtain a first encoding characteristic, a set first decoding characteristic is obtained, or the first decoding characteristic is determined according to a characteristic obtained by a decoder in the object detection model decoding a previous frame video frame of the target video frame, then the first encoding characteristic and the first decoding characteristic are decoded by the decoder to obtain a second decoding characteristic, and finally an object prediction is performed by a full connection layer in the object detection model according to the second decoding characteristic to obtain a labeling result of the target video frame, wherein the labeling result comprises a detection frame of the target object and a category of the target object in the detection frame. Therefore, each object in the video to be detected is identified and obtained based on the deep learning technology, and the accuracy of the detection result can be improved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (15)
1. An object detection method, the method comprising the steps of:
acquiring a video to be detected;
aiming at a target video frame in the video to be detected, an encoder in an object detection model is adopted to encode the target video frame to obtain a first encoding characteristic;
acquiring a set first decoding characteristic, or determining the first decoding characteristic according to a characteristic obtained by a decoder in the object detection model decoding a video frame before the target video frame;
decoding the first encoding characteristic and the first decoding characteristic by using the decoder to obtain a second decoding characteristic;
and predicting the object by adopting a full-link layer in the object detection model according to the second decoding characteristic to obtain an annotation result of the target video frame, wherein the annotation result comprises a detection frame of the target object and the category of the target object in the detection frame.
2. The method of claim 1, wherein said decoding, with the decoder, the first encoded feature and the first decoded feature to obtain a second decoded feature comprises:
inputting the first coding features to a first fully-connected layer and a second fully-connected layer in the decoder, respectively;
taking the feature vector output by the first fully-connected layer as a first attention parameter, and taking the feature vector output by the second fully-connected layer as a second attention parameter;
determining a third attention parameter based on the first decoding feature;
and determining a second decoding characteristic corresponding to the target video frame according to the first attention parameter, the second attention parameter and the third attention parameter.
3. The method of claim 2, wherein the determining a second decoded feature corresponding to a target video frame from the first attention parameter, the second attention parameter, and the third attention parameter comprises:
performing an inner product of the third attention parameter and the first attention parameter and normalizing the result to obtain an attention weight value;
and weighting the second attention parameter according to the attention weight value to obtain a second decoding characteristic corresponding to the target video frame.
4. The method of claim 2, wherein the determining a third attention parameter from the first decoded feature comprises:
and taking the first decoding characteristic as a third attention parameter corresponding to the target video frame.
5. The method of claim 2, wherein the determining a third attention parameter from the first decoded feature comprises:
acquiring a third attention parameter corresponding to a previous frame of video frame of the target video frame;
and performing weighting processing on the first decoding characteristic and a third attention parameter corresponding to the previous frame of video frame to obtain a third attention parameter corresponding to the target video frame.
6. The method according to any one of claims 1 to 5, wherein the encoding, with respect to a target video frame in the video to be detected, the target video frame by using an encoder in an object detection model to obtain a first encoding characteristic includes:
extracting the features of the target video frame to obtain first image features;
carrying out block processing on the first image features to obtain serialized feature vectors;
and encoding the serialized feature vector by using the encoder to obtain the first encoding feature.
7. An object detection apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a video to be detected;
the encoding module is used for encoding a target video frame in the video to be detected by adopting an encoder in an object detection model to obtain a first encoding characteristic;
a second obtaining module, configured to obtain a set first decoding feature, or determine the first decoding feature according to a feature obtained by a decoder in the object detection model decoding a video frame that is previous to the target video frame;
a decoding module, configured to decode the first encoding characteristic and the first decoding characteristic by using the decoder to obtain a second decoding characteristic;
and the prediction module is used for performing object prediction according to the second decoding characteristic by adopting a full connection layer in the object detection model so as to obtain an annotation result of the target video frame, wherein the annotation result comprises a detection frame of a target object and the type of the target object in the detection frame.
8. The apparatus of claim 7, wherein the decoding module comprises:
an input unit for inputting the first coding feature to a first fully-connected layer and a second fully-connected layer in the decoder, respectively;
the processing unit is used for taking the feature vector output by the first full connection layer as a first attention parameter and taking the feature vector output by the second full connection layer as a second attention parameter;
a first determining unit, configured to determine a third attention parameter according to the first decoding feature;
a second determining unit, configured to determine a second decoding feature corresponding to the target video frame according to the first attention parameter, the second attention parameter, and the third attention parameter.
9. The apparatus according to claim 8, wherein the second determining unit is specifically configured to:
performing an inner product of the third attention parameter and the first attention parameter and normalizing the result to obtain an attention weight value;
and weighting the second attention parameter according to the attention weight value to obtain a second decoding characteristic corresponding to the target video frame.
10. The apparatus according to claim 8, wherein the first determining unit is specifically configured to:
and taking the first decoding characteristic as a third attention parameter corresponding to the target video frame.
11. The apparatus according to claim 8, wherein the first determining unit is specifically configured to:
acquiring a third attention parameter corresponding to a previous frame of video frame of the target video frame;
and performing weighting processing on the first decoding characteristic and a third attention parameter corresponding to the previous frame of video frame to obtain a third attention parameter corresponding to the target video frame.
12. The apparatus according to any one of claims 7-11, wherein the encoding module is specifically configured to:
extracting the features of the target video frame to obtain first image features;
carrying out block processing on the first image features to obtain serialized feature vectors;
and encoding the serialized feature vector by using the encoder to obtain the first encoding feature.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the object detection method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the object detection method according to any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the object detection method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111138313.XA CN113869205A (en) | 2021-09-27 | 2021-09-27 | Object detection method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111138313.XA CN113869205A (en) | 2021-09-27 | 2021-09-27 | Object detection method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113869205A true CN113869205A (en) | 2021-12-31 |
Family
ID=78991476
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111138313.XA Pending CN113869205A (en) | 2021-09-27 | 2021-09-27 | Object detection method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113869205A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116074517A (en) * | 2023-02-07 | 2023-05-05 | 瀚博创芯科技(深圳)有限公司 | Target detection method and device based on motion vector |
CN116611491A (en) * | 2023-04-23 | 2023-08-18 | 北京百度网讯科技有限公司 | Training method and device of target detection model, electronic equipment and storage medium |
CN117237443A (en) * | 2023-02-20 | 2023-12-15 | 北京中科海芯科技有限公司 | Gesture estimation method, device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076809A (en) * | 2021-03-10 | 2021-07-06 | 青岛海纳云科技控股有限公司 | High-altitude falling object detection method based on visual Transformer |
CN113112525A (en) * | 2021-04-27 | 2021-07-13 | 北京百度网讯科技有限公司 | Target tracking method, network model, and training method, device, and medium thereof |
CN113222916A (en) * | 2021-04-28 | 2021-08-06 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for detecting image using target detection model |
WO2021179852A1 (en) * | 2020-03-13 | 2021-09-16 | Oppo广东移动通信有限公司 | Image detection method, model training method, apparatus, device, and storage medium |
CN113434664A (en) * | 2021-06-30 | 2021-09-24 | 平安科技(深圳)有限公司 | Text abstract generation method, device, medium and electronic equipment |
- 2021-09-27: CN CN202111138313.XA patent/CN113869205A/en, active, Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021179852A1 (en) * | 2020-03-13 | 2021-09-16 | Oppo广东移动通信有限公司 | Image detection method, model training method, apparatus, device, and storage medium |
CN113076809A (en) * | 2021-03-10 | 2021-07-06 | 青岛海纳云科技控股有限公司 | High-altitude falling object detection method based on visual Transformer |
CN113112525A (en) * | 2021-04-27 | 2021-07-13 | 北京百度网讯科技有限公司 | Target tracking method, network model, and training method, device, and medium thereof |
CN113222916A (en) * | 2021-04-28 | 2021-08-06 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for detecting image using target detection model |
CN113434664A (en) * | 2021-06-30 | 2021-09-24 | 平安科技(深圳)有限公司 | Text abstract generation method, device, medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
YUQING WANG ET AL.: "End-to-End Video Instance Segmentation with Transformers", ARXIV:2011.14503V2, 4 December 2020 (2020-12-04), pages 1 - 10 * |
SONG, HAO: "Action Localization and Recognition of Humans in Complex Videos", China Doctoral Dissertations Full-text Database, Information Science and Technology, vol. 2021, no. 07, 15 July 2021 (2021-07-15), pages 138 - 14 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116074517A (en) * | 2023-02-07 | 2023-05-05 | 瀚博创芯科技(深圳)有限公司 | Target detection method and device based on motion vector |
CN116074517B (en) * | 2023-02-07 | 2023-09-22 | 瀚博创芯科技(深圳)有限公司 | Target detection method and device based on motion vector |
CN117237443A (en) * | 2023-02-20 | 2023-12-15 | 北京中科海芯科技有限公司 | Gesture estimation method, device, electronic equipment and storage medium |
CN117237443B (en) * | 2023-02-20 | 2024-04-19 | 北京中科海芯科技有限公司 | Gesture estimation method, device, electronic equipment and storage medium |
CN116611491A (en) * | 2023-04-23 | 2023-08-18 | 北京百度网讯科技有限公司 | Training method and device of target detection model, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113222916B (en) | Method, apparatus, device and medium for detecting image using object detection model | |
CN113657390B (en) | Training method of text detection model and text detection method, device and equipment | |
WO2022105125A1 (en) | Image segmentation method and apparatus, computer device, and storage medium | |
CN113869205A (en) | Object detection method and device, electronic equipment and storage medium | |
CN114549840B (en) | Training method of semantic segmentation model and semantic segmentation method and device | |
CN113901909B (en) | Video-based target detection method and device, electronic equipment and storage medium | |
CN114863437B (en) | Text recognition method and device, electronic equipment and storage medium | |
CN112270246B (en) | Video behavior recognition method and device, storage medium and electronic equipment | |
CN113902007A (en) | Model training method and device, image recognition method and device, equipment and medium | |
CN116152833B (en) | Training method of form restoration model based on image and form restoration method | |
CN113420681A (en) | Behavior recognition and model training method, apparatus, storage medium, and program product | |
CN113887615A (en) | Image processing method, apparatus, device and medium | |
CN114120172B (en) | Video-based target detection method and device, electronic equipment and storage medium | |
CN115546488A (en) | Information segmentation method, information extraction method and training method of information segmentation model | |
CN114863182A (en) | Image classification method, and training method and device of image classification model | |
CN116611491A (en) | Training method and device of target detection model, electronic equipment and storage medium | |
CN114120454A (en) | Training method and device of living body detection model, electronic equipment and storage medium | |
CN114973333B (en) | Character interaction detection method, device, equipment and storage medium | |
CN115097941B (en) | Character interaction detection method, device, equipment and storage medium | |
CN115565186A (en) | Method and device for training character recognition model, electronic equipment and storage medium | |
CN113989569B (en) | Image processing method, device, electronic equipment and storage medium | |
CN114220163B (en) | Human body posture estimation method and device, electronic equipment and storage medium | |
CN115631502A (en) | Character recognition method, character recognition device, model training method, electronic device and medium | |
CN113887414A (en) | Target detection method, target detection device, electronic equipment and storage medium | |
CN115270719A (en) | Text abstract generating method, training method and device based on multi-mode information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||