CN114187540A - Object identification method, device, server and computer readable storage medium - Google Patents

Object identification method, device, server and computer readable storage medium Download PDF

Info

Publication number
CN114187540A
CN114187540A (application number CN202010861852.5A)
Authority
CN
China
Prior art keywords
picture
frame
feature mapping
pictures
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010861852.5A
Other languages
Chinese (zh)
Inventor
舒畅
朴圣浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Finance Ltd
Original Assignee
Digital Finance Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Finance Ltd filed Critical Digital Finance Ltd
Priority to CN202010861852.5A priority Critical patent/CN114187540A/en
Publication of CN114187540A publication Critical patent/CN114187540A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides an object identification method, an object identification device, a server and a computer readable storage medium, wherein the method comprises the following steps: acquiring a feature mapping set of each frame of picture in N frames of pictures included in acquired video data, wherein the feature mapping set is used for indicating features of an object included in each frame of picture; determining a target convolution set of the N frames of pictures and auxiliary data of the object in each frame of picture according to the feature mapping set of each frame of picture and a recorded reference feature mapping set, wherein the auxiliary data is used for indicating the image area corresponding to the object in the picture, the reference feature mapping set is the convolution set of the previously processed batch of N frames of pictures, and the reference feature mapping set is used for indicating morphological features of the object; and carrying out classification regression processing on the auxiliary data of the object and the target convolution set to obtain the category indication data and corresponding position data of each object in the N frames of pictures, so that the efficiency of identifying objects in a video can be improved.

Description

Object identification method, device, server and computer readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an object identification method, an object identification apparatus, a server, and a computer-readable storage medium.
Background
Currently, in order to accurately identify a target object in an image, the instance segmentation technique Mask R-CNN can be adopted. It accurately identifies the target object in a single picture, but it processes only individual pictures and cannot identify objects in a video. The Long Short-Term Memory (LSTM) network can identify a target object in a video, but its implementation relies on the previous frame of picture to process the current frame, so picture frames can only be processed one by one, which greatly limits the efficiency of object identification in a video. How to improve the efficiency of object identification in video is therefore a problem to be solved.
Disclosure of Invention
The embodiment of the invention provides an object identification method, an object identification device, a server and a computer readable storage medium, so that a plurality of frames of pictures included in video data can be processed in parallel, and the efficiency of identifying objects in videos is improved.
A first aspect of an embodiment of the present invention provides an object identification method, including:
acquiring a feature mapping set of each frame of picture in N frames of pictures included in acquired video data, wherein the feature mapping set is used for indicating features of an object included in each frame of picture, and N is an integer greater than or equal to 1;
determining a target convolution set of the N frames of pictures and auxiliary data of an object in each frame of picture according to the feature mapping set of each frame of picture and a recorded reference feature mapping set, wherein the auxiliary data is used for indicating a corresponding image area of the object in the picture, the reference feature mapping set is a convolution set of a previous batch of N frames of pictures, and the reference feature mapping set is used for indicating morphological features of the object;
and carrying out classification regression processing on the auxiliary data of the object and the target convolution set to obtain class indication data and corresponding position data of each object included in the N frames of pictures.
A second aspect of an embodiment of the present invention provides an object recognition apparatus, including:
an acquisition module, used for acquiring a feature mapping set of each frame of picture in N frames of pictures included in acquired video data, wherein the feature mapping set is used for indicating the features of an object included in each frame of picture, and N is an integer greater than or equal to 1;
a determining module, configured to determine, according to the feature mapping set of each frame of picture and a recorded reference feature mapping set, a target convolution set of the N frames of pictures and auxiliary data of an object in each frame of picture, where the auxiliary data is used to indicate the image area corresponding to the object in the picture, the reference feature mapping set is a convolution set of the previously processed batch of N frames of pictures, and the reference feature mapping set is used to indicate morphological features of the object;
and the processing module is used for carrying out classification regression processing on the auxiliary data of the object and the target convolution set to obtain the class indication data and the corresponding position data of each object in the N frames of pictures.
A third aspect of embodiments of the present invention provides a server, including a processor, a network interface, and a storage device, where the processor, the network interface, and the storage device are connected to each other, where the network interface is controlled by the processor to send and receive data, and the storage device is used to store a computer program, where the computer program includes program instructions, and the processor is configured to call the program instructions to execute the object identification method according to the first aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium in which program instructions are stored, and when the program instructions are executed, the program instructions are used to implement the object identification method according to the first aspect.
A fifth aspect of embodiments of the present invention provides a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the object identification method according to the first aspect.
In the embodiment of the invention, the server acquires the feature mapping set of each frame of the N frames of pictures included in the acquired video data, and determines a target convolution set of the N frames of pictures and auxiliary data of the object in each frame of picture according to the feature mapping set of each frame of picture and the recorded reference feature mapping set, wherein the auxiliary data is used for indicating the image area corresponding to the object in the picture, the reference feature mapping set is the convolution set of the previously processed batch of N frames of pictures, and the reference feature mapping set is used for indicating morphological features of the object. The server then carries out classification regression processing on the auxiliary data of the object and the target convolution set to obtain the category indication data and corresponding position data of each object in the N frames of pictures. Objects in multiple frames of pictures in a video can therefore be identified in parallel, which improves the efficiency of identifying objects in the video.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1a is a schematic structural diagram of an instance segmentation Mask R-CNN according to an embodiment of the present invention;
FIG. 1b is a schematic structural diagram of an LSTM according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an object recognition model according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an object identification method according to an embodiment of the present invention;
FIG. 4a is a diagram of an input frame of a picture according to an embodiment of the present invention;
FIG. 4b is a diagram of another frame of picture input according to the embodiment of the present invention;
FIG. 5 is a schematic flow chart of another object recognition method provided by the embodiment of the invention;
fig. 6 is a schematic structural diagram of an object recognition apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, the instance segmentation technique Mask R-CNN and the LSTM technique are adopted for object identification in images and object identification in videos, respectively. The process of identifying an object in an image using the instance segmentation Mask R-CNN is shown in fig. 1a: each frame of picture is processed by a skeleton network, convolution kernels, a candidate set network, classification regression, and so on, so as to realize object identification in the image, where the classification regression may include semantic segmentation, frame prediction, and classification; however, this technique cannot be used for object identification in a video. With the LSTM technique, as shown in fig. 1b, each frame of picture is input, Embedding-layer processing is performed on the frame, and classification is carried out through the hidden layers and the fully connected layer, so that the object in the video can be identified. As can be seen from fig. 1b, however, the processing of each hidden layer relies on the processing result of the previous hidden layer; for example, the processing of hidden layer 12 relies on the processing result of hidden layer 11. When identifying objects in a video, the picture frames therefore have to be identified one by one, and the efficiency of identifying objects in the video is low.
Based on these defects of the instance segmentation Mask R-CNN technique and the LSTM technique, the embodiment of the invention provides an object identification method that can process multiple frames of pictures included in a video in parallel, thereby realizing identification of objects in the video and improving the efficiency of object identification. In a specific implementation, a feature mapping set of each frame of picture in N frames of pictures included in the acquired video data may be obtained first; a target convolution set of the N frames of pictures and auxiliary data of the object in each frame of picture are determined according to the feature mapping set of each frame of picture and a recorded reference feature mapping set; and classification regression processing is performed on the auxiliary data of the object and the target convolution set to obtain category indication data and corresponding position data of each object included in the N frames of pictures. The auxiliary data is used for indicating the image area corresponding to the object in the picture, the reference feature mapping set is the convolution set of the previously processed batch of N frames of pictures, and the reference feature mapping set is used for indicating the morphological features of the object.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an object recognition model according to an embodiment of the present invention. The object recognition model may be a modified Mask R-CNN model, as shown in fig. 2, and includes an input module 201, a skeleton network 202, a processing module 203, and a training module 204, where:
the input module 201 may obtain N frames of pictures included in the acquired video data, and input the N frames of pictures into the skeleton network 202 in parallel, the skeleton network 202 may process the input N frames of pictures to obtain a feature mapping set of each frame of the N frames of pictures, the processing module 203 may calculate the N frames of pictures to obtain weights of each frame of pictures to determine importance of corresponding object features in each frame of pictures, further, the processing module 203 may determine a feature vector set of each frame of pictures and a reference feature mapping set of N frames of pictures of a previous batch to perform similarity calculation, and obtain similarity of each frame of pictures, determine a weight of each frame of pictures according to the similarity of each frame of pictures, perform weighted average on the weight of each frame of pictures and the feature mapping set of each frame of pictures to obtain an average feature mapping set, which may also be referred to as a convolution set, and storing the convolution set as a reference feature mapping set of the next group of N frames of pictures. The training module 204 may put the convolution set into a candidate set network for training and classification regression, so as to obtain class indication data and corresponding position data of the object in the N frames of pictures.
Referring to fig. 3, fig. 3 is a schematic flow chart of an object identification method according to an embodiment of the present invention. The object identification method described in this embodiment may be executed by a server, which may run any of the modules in the object identification model. The method includes:
301. Acquiring a feature mapping set of each frame of picture in N frames of pictures included in the acquired video data.
The feature mapping set is used for indicating features of the objects included in each frame of picture, and N is an integer greater than or equal to 1. The feature mapping set may be a nested multi-dimensional array, such as a nested three-dimensional array, and the values in the nested array may be used for indicating the features of the objects included in each frame of picture. Alternatively, the feature mapping set of each frame of picture includes a feature vector of the object, the feature vector indicating the features of the object included in that frame of picture.
In a possible embodiment, the N-frame picture may be an unprocessed N-frame picture obtained from the captured video data, or the N-frame picture may be obtained by the server first obtaining an unprocessed N-frame picture from the captured video data and preprocessing the unprocessed N-frame picture.
In a feasible embodiment, the server collects video data, divides the video data into original N frames of pictures, and preprocesses each frame of the original N frames of pictures to obtain N frames of pictures included in the video data, wherein the preprocessing includes performing gray processing on each original frame of pictures and performing filling processing on the size of each frame of pictures after the gray processing. The division of the video data into the original N frames of pictures may be set according to user requirements, for example, the video data may be divided into 30 frames of pictures.
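As an illustration of this division step, the following sketch splits collected video data into original frames with OpenCV; the use of OpenCV and the function name split_video are assumptions, not part of the described method.

```python
import cv2

def split_video(path, n=30):
    # Read up to n original frames from the collected video data.
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < n:
        ok, frame = cap.read()
        if not ok:              # end of stream
            break
        frames.append(frame)
    cap.release()
    return frames
```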
In a feasible embodiment, the server may obtain N frames of pictures included in the acquired video data, and input the N frames of pictures into the skeleton network to sequentially perform operations such as convolution, pooling, activation, and the like, thereby obtaining a feature mapping set of each frame of picture in the N frames of pictures.
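A minimal sketch of such a skeleton network is shown below, assuming PyTorch and grayscale input; the layer sizes are illustrative, since the description does not fix an architecture.

```python
import torch
import torch.nn as nn

# A stand-in for the skeleton network: stacked convolution, activation,
# and pooling, applied to all N frames at once.
skeleton = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # grayscale input
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

frames = torch.randn(30, 1, 256, 256)   # a batch of N = 30 gray frames
feature_maps = skeleton(frames)         # (30, 64, 64, 64) feature sets
```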
302. Determining a target convolution set of the N frames of pictures and auxiliary data of an object in each frame of picture according to the feature mapping set of each frame of picture and the recorded reference feature mapping set.
The auxiliary data is used for indicating the corresponding image area of the object in the pictures, the reference feature mapping set is a convolution set of the processed previous N frames of pictures, the reference feature mapping set is used for indicating the morphological feature of the object, for example, the object may be a pig, and the reference feature mapping set is used for indicating the morphological feature of the pig.
Specifically, the server determines a target convolution set according to the feature mapping set of each frame of picture and the convolution set of the processed previous N frames of pictures, trains the target convolution set in a candidate set network, and determines auxiliary data of an object in each frame of picture. It can be understood that training the target convolution set in the candidate set network can remove the background data and the redundant candidate frame data of the object in each frame of picture, thereby determining the auxiliary data of the object in each frame of picture.
In a possible embodiment, if the server acquires the feature mapping set of each frame of the N frames of pictures included in the captured video data for the first time, the recorded reference feature mapping set may be preset to a value, for example, the reference feature mapping set may be set to 1.
In a possible embodiment, the server may determine the target convolution set of the N frames of pictures according to the feature mapping set of each frame of picture and the recorded reference feature mapping set as follows: determine the similarity between the feature mapping set of each frame of picture and the reference feature mapping set; after the similarity of each frame of picture is determined, determine the weight of each frame of picture according to its similarity; and determine the target convolution set according to the weight and the feature mapping set of each frame of picture.
303. Carrying out classification regression processing on the auxiliary data of the object and the target convolution set to obtain class indication data and corresponding position data of each object in the N frames of pictures.
The category indication data of each object may be preset; for example, it may be set when each object is framed in each frame of picture, with a number representing the category assigned at framing time, such as the number 1 representing a pig. Alternatively, the category indication data of each object may be contour information of the object. The position data is the position where the object is located and may be expressed by coordinates.
In a feasible embodiment, the server obtains two frames of pictures, which are respectively shown in fig. 4a and fig. 4b. The pigs in the pictures can be identified through the object identification model and the object identification method described above, which improves the efficiency of identifying objects in the video, and the output result is as follows:
Category indication data: [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ], where 1 indicates that the identified target is a pig; that is, 23 pigs were identified.
Coordinate position information of the pig: [[12334801331724],[312462461626],[144772268935],[68274796207],[528228670321],[380286529407],[749158817326],[338294418462],[274469359648],[438139560206],[798126984201],[104571098117],[99401051107],[7751588690],[11904127176],[8731897198],[357185459276],[5153463783],[54783625150],[63443696123],[9381101077169],[421199527285],[1124101180103]]
Wherein [ 12334801331724 ] represents the two coordinate positions of one pig (the upper-left corner coordinate and the lower-right corner coordinate), namely y1, x1, y2, and x2.
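The output can then be consumed as in the following sketch; the class ids and box values here are made-up illustrations, not the actual values output for fig. 4a and fig. 4b.

```python
# Hypothetical post-processing of the output: class id 1 means "pig",
# and each box holds [y1, x1, y2, x2] pixel coordinates.
class_ids = [1, 1, 1]
boxes = [[50, 80, 210, 260], [30, 300, 190, 470], [220, 120, 380, 290]]

for cid, (y1, x1, y2, x2) in zip(class_ids, boxes):
    label = "pig" if cid == 1 else "unknown"
    print(f"{label}: top-left=({x1}, {y1}), bottom-right=({x2}, {y2})")
```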
In the embodiment of the invention, the server acquires the feature mapping set of each frame of the N frames of pictures included in the acquired video data, and determines a target convolution set of the N frames of pictures and auxiliary data of the object in each frame of picture according to the feature mapping set of each frame of picture and the recorded reference feature mapping set, wherein the auxiliary data is used for indicating the image area corresponding to the object in the picture, the reference feature mapping set is the convolution set of the previously processed batch of N frames of pictures, and the reference feature mapping set is used for indicating morphological features of the object. Classification regression processing is then carried out on the auxiliary data of the object and the target convolution set to obtain the category indication data and corresponding position data of each object in the N frames of pictures. Objects in multiple frames of pictures in the video can therefore be identified in parallel, which improves the efficiency of identifying objects in the video.
Referring to fig. 5, fig. 5 is a schematic flow chart of another object identification method according to an embodiment of the present invention. The object identification method described in this embodiment may be executed by a server, which may run any of the modules in the object identification model. The method includes:
501. Acquiring a feature mapping set of each frame of picture in N frames of pictures included in the acquired video data.
In a feasible embodiment, the server acquires N frames of pictures included in the acquired video data, and inputs each frame of picture in the N frames of pictures into a skeleton network for convolution, pooling and activation processing, so as to obtain a feature mapping set of each frame of picture.
In a feasible embodiment, if each frame of the N frames of pictures is obtained through preprocessing, then before acquiring the N frames of pictures included in the acquired video data, the server divides the acquired video data into original N frames of pictures and preprocesses each frame of the original N frames of pictures. In a specific implementation, gray processing is performed on each original frame of picture, and the gray value of a pixel point in each original frame of picture is calculated according to Formula 1.1:

f(i, j) = (R(i, j) + G(i, j) + B(i, j)) / 3    Formula 1.1
where f(i, j) represents the gray value of the pixel point, and R(i, j), G(i, j), and B(i, j) represent the values of the red, green, and blue color channels, respectively.
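Formula 1.1 transcribes directly into code; the sketch below assumes NumPy and an RGB channel ordering.

```python
import numpy as np

def to_gray(image):
    # Formula 1.1: f(i, j) = (R(i, j) + G(i, j) + B(i, j)) / 3
    r = image[..., 0].astype(np.float32)
    g = image[..., 1].astype(np.float32)
    b = image[..., 2].astype(np.float32)
    return (r + g + b) / 3.0
```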
In a feasible embodiment, after the gray processing is performed on each of the original N frames of pictures, the gray-processed N frames of pictures are obtained, and the server fills the size of the gray image corresponding to each frame of picture so that the object recognition model adapts to pictures of different sizes. In a specific implementation, the server obtains the maximum length and the maximum width of the N frames of pictures (i.e., the gray-processed N frames of pictures) included in the acquired video data, determines the maximum of the maximum length and the maximum width, and uses this maximum as the filling value; the size of each frame of the N frames of pictures is then filled according to the filling value. For example, if the maximum length of the N frames of pictures is a and the maximum width is b, and the maximum length a is greater than the maximum width b, the maximum value is determined to be a, and a is used as the filling value so that the filled N frames of pictures all have the same size.
In a possible embodiment, after obtaining the filling value, the server may further determine the filling rule for each frame of picture according to the filling value. In a specific implementation, for a target picture in the N frames of pictures, the server calculates the picture scaling factor corresponding to the target picture according to the length and width of the target picture, and judges whether the product of the larger of the length and width of the target picture and the picture scaling factor is greater than the filling value. If the product is greater than the filling value, the picture scaling factor is adjusted, and the size of the target picture is filled according to the adjusted picture scaling factor; if the product is less than or equal to the filling value, the size of the target picture is filled according to the picture scaling factor. The target picture is any one of the N frames of pictures.
For example, take the obtained filling value a. For a target picture in the N frames of pictures, let h denote the width of the target picture and w denote its length. The server calculates the scaling factor of the target picture, denoted scale, as scale = max(1, b / min(h, w)), where b is the maximum width. If the server judges that the product of the scaling factor and the larger of the length w and width h of the target picture is greater than the filling value a, that is, max(h, w) × scale > a, the picture scaling factor is adjusted to scale' = a / max(h, w). The processed width of the target picture is then h' = h × scale' and the processed length is w' = w × scale', so the filling of the size of the target picture is carried out according to the adjusted picture scaling factor.
In a feasible embodiment, the server fills the size of the target picture according to the adjusted picture scaling factor as follows: the top, bottom, left, and right of the target picture are filled respectively, where the top filling value is int((a − h') / 2), the bottom filling value is a − h' − top filling value, the left filling value is int((a − w') / 2), and the right filling value is a − w' − left filling value, and int denotes rounding by dropping the decimal part.
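Putting the scaling rule and the four filling values together, a sketch of the whole resize-and-fill step might look as follows; the use of OpenCV for resizing and border filling is an assumption.

```python
import cv2

def resize_and_fill(gray, a, b):
    # a: filling value (max of the batch's maximum length and width);
    # b: the batch's maximum width.
    h, w = gray.shape[:2]
    scale = max(1.0, b / min(h, w))
    if max(h, w) * scale > a:        # product exceeds the filling value
        scale = a / max(h, w)        # adjusted scaling factor scale'
    h2, w2 = int(h * scale), int(w * scale)
    resized = cv2.resize(gray, (w2, h2))
    top = int((a - h2) / 2)          # int() drops the decimal part
    bottom = a - h2 - top
    left = int((a - w2) / 2)
    right = a - w2 - left
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=0)
```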
It should be noted that the features of the objects in the N frames of pictures included in the acquired video data, and in the gray-processed N frames of pictures, are not changed; only the size of the pictures changes after filling, which allows the object recognition model to adapt to pictures of different sizes.
502. A recorded reference feature mapping set is obtained.
The reference feature mapping set is the convolution set of the previously processed batch of N frames of pictures and is used for indicating morphological features of the object. In the present application, a morphological feature refers to a main feature that can represent the object.
503. Determining the weight of each frame of picture according to the feature mapping set and the reference feature mapping set of each frame of picture.
Specifically, the server performs similarity calculation on the feature mapping set of each frame of picture and the reference feature mapping set, and determines the weight of each frame of picture according to the obtained similarity of each frame of picture.
In a feasible embodiment, the server performs similarity calculation between the feature mapping set of each frame of picture and the reference feature mapping set to obtain the similarity corresponding to each frame of picture, and determines the weight of each frame of picture according to that similarity. Let a_i denote the feature mapping set of the i-th frame of picture and b denote the reference feature mapping set. The similarity between each feature mapping set a_i and the reference feature mapping set b is calculated according to Formula 1.2:

similarity_i(a_i, b) = (a_i · b) / (||a_i|| + ||b||)    Formula 1.2
In a possible embodiment, after obtaining the similarity between the feature mapping set a_i of each frame of picture and the reference feature mapping set b, the server normalizes these similarities to obtain the weight of each frame of picture, calculated according to Formula 1.3:

w_i = softmax(similarity_i) = exp(similarity_i) / sum(exp(similarity_j))    Formula 1.3

where softmax() is a normalization function and sum() is a summation function.
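Formulas 1.2 and 1.3 can be sketched together as follows; flattening each feature mapping set into a vector before taking the product and norms is an assumption about how the calculation is carried out.

```python
import numpy as np

def frame_weights(feature_sets, reference):
    # feature_sets: array of shape (N, ...) holding each frame's a_i;
    # reference: the reference feature mapping set b.
    a = np.asarray(feature_sets).reshape(len(feature_sets), -1)
    b = np.asarray(reference).reshape(-1)
    sims = a @ b / (np.linalg.norm(a, axis=1) + np.linalg.norm(b))  # 1.2
    e = np.exp(sims - sims.max())      # softmax of Formula 1.3,
    return e / e.sum()                 # shifted by the max for stability
```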
504. Determining a target convolution set of the N frames of pictures according to the weight and the feature mapping set of each frame of picture.
Specifically, the server performs weighted average according to the weight of each frame of picture and the feature mapping set to obtain a target convolution set of N frames of pictures.
In a feasible embodiment, the server weights the feature mapping set of each frame of picture by its weight to obtain a total feature mapping set of the N frames of pictures, and averages the total feature mapping set to obtain the target convolution set of the N frames of pictures. Let w_i denote the weight of the i-th frame of picture, X_i its feature mapping set, and C the target convolution set. The target convolution set of the N frames of pictures is calculated according to Formula 1.4:

C = sum(w_i * X_i)    Formula 1.4
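Formula 1.4 is then a single weighted sum over the batch, as in this sketch:

```python
import numpy as np

def target_convolution_set(weights, feature_sets):
    # Formula 1.4: C = sum(w_i * X_i), the weighted average of the
    # per-frame feature mapping sets over the batch of N frames.
    return np.tensordot(weights, np.asarray(feature_sets), axes=1)
```

The result C is what is stored as the reference feature mapping set for the next batch of N frames.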
In one possible embodiment, after the server determines the target convolution set of the N frames of pictures, the server updates the reference feature mapping set with the target convolution set. In a specific implementation, the server uses the obtained target convolution set as the reference feature mapping set for determining the convolution set corresponding to the next group of N frames of pictures in the manner described above.
In a feasible embodiment, after the server acquires the feature mapping set of each frame of picture in the N frames of pictures included in the acquired video data for the first time, the server needs to determine a target convolution set according to the feature mapping set of each frame of picture in the N frames of pictures and the recorded reference feature mapping set, and since there is no reference feature mapping set at this time, the reference feature mapping set may be preset during initial data training, and may generally be set to 1.
505. Performing data processing on the target convolution set by using a candidate set network to obtain auxiliary data of the object in each frame of picture.
Specifically, the server performs a first pass of data processing on the target convolution set using the candidate set network, removing the background data and redundant candidate frame data of the object in each frame of picture, so as to obtain the auxiliary data of the object in each frame of picture, where the auxiliary data is used for indicating the image area corresponding to the object in the picture.
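The description does not spell out how the candidate set network removes the redundant candidate frame data; non-maximum suppression (NMS) is the standard mechanism in Mask R-CNN-style pipelines, and the sketch below assumes that choice.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # Non-maximum suppression over boxes in [y1, x1, y2, x2] form:
    # keep the highest-scoring box, drop boxes that overlap it too
    # much, and repeat until no candidates remain.
    boxes = np.asarray(boxes, dtype=np.float64)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        y1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        x1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        y2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        x2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, y2 - y1) * np.maximum(0.0, x2 - x1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0]) *
                 (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]
    return keep
```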
506. Carrying out classification regression processing on the auxiliary data of the object and the target convolution set to obtain class indication data and corresponding position data of each object in the N frames of pictures.
Specifically, the server performs secondary processing after obtaining the auxiliary data of the object in each frame of picture, and performs classification regression processing on the auxiliary data of the object and the target convolution set to obtain the category indication data and the corresponding position data of each object included in the N frames of pictures. Through the secondary processing, the object can be more accurately identified.
In the embodiment of the invention, the server acquires the feature mapping set of each frame of picture in the N frames of pictures included in the acquired video data and the recorded reference feature mapping set, determines the weight of each frame of picture according to the feature mapping set and the reference feature mapping set, determines the target convolution set of the N frames of pictures according to the weight and the feature mapping set of each frame of picture, performs data processing on the target convolution set using the candidate set network to obtain the auxiliary data of the object in each frame of picture, and then performs classification regression processing on the auxiliary data of the object and the target convolution set to obtain the category indication data and corresponding position data of each object included in the N frames of pictures. Objects in multiple frames of pictures in the video can therefore be identified in parallel, which improves the efficiency of identifying objects in the video.
Fig. 6 is a schematic structural diagram of an object recognition apparatus according to an embodiment of the present invention. The object recognition apparatus described in this embodiment may be deployed on a server, and includes:
an obtaining module 601, configured to obtain a feature mapping set of each frame of an N frame of pictures included in acquired video data, where the feature mapping set is used to indicate features of an object included in each frame of the pictures, and N is an integer greater than or equal to 1;
a determining module 602, configured to determine, according to the feature mapping set of each frame of picture and a recorded reference feature mapping set, a target convolution set of the N frames of pictures and auxiliary data of an object in each frame of picture, where the auxiliary data is used to indicate an image area of the object corresponding to the picture in the picture, the reference feature mapping set is a convolution set of processed previous N frames of pictures, and the reference feature mapping set is used to indicate a morphological feature of the object;
the processing module 603 is configured to perform classification regression processing on the auxiliary data of the object and the target convolution set to obtain category indication data and corresponding position data of each object included in the N frames of pictures.
In a possible embodiment, the determining module 602 is specifically configured to:
acquiring a recorded reference feature mapping set;
determining the weight of each frame of picture according to the feature mapping set of each frame of picture and the reference feature mapping set;
determining a target convolution set of the N frames of pictures according to the weight and the feature mapping set of each frame of picture;
and performing data processing on the target convolution set by using a candidate set network to obtain auxiliary data of the object in each frame of picture.
In a possible embodiment, the determining module 602 is specifically configured to:
similarity calculation is carried out on the feature mapping set of each frame of picture and the reference feature mapping set, and the similarity between the feature mapping set of each frame of picture and the reference feature mapping set is obtained;
and determining the weight of each frame of picture according to the corresponding similarity of each frame of picture.
In a possible embodiment, the determining module 602 is specifically configured to:
carrying out weighting processing on the weight and the feature mapping set of each frame of picture to obtain a total feature mapping set of the N frames of pictures;
and carrying out averaging processing on the total feature mapping set of the N frames of pictures to obtain a target convolution set of the N frames of pictures.
In a possible embodiment, the processing module 603 is further configured to:
and updating the reference feature mapping set by using the target convolution set.
In a possible embodiment, before obtaining the feature mapping set of each of the N frames of pictures included in the captured video data, the apparatus further includes: a fill module 604, wherein:
the obtaining module 601 is further configured to obtain a maximum length and a maximum width of an N-frame picture included in the acquired video data;
the determining module 602 is further configured to determine a maximum value of the maximum length and the maximum width, so as to obtain a filling value;
the filling module 604 is configured to perform filling processing on the size of each frame of the N frames of pictures according to the filling value.
In a possible embodiment, the filling module 604 is specifically configured to:
for a target picture in the N frames of pictures, calculating a picture scaling factor corresponding to the target picture according to the length and the width of the target picture, wherein the target picture is any one of the N frames of pictures;
if the product of the larger value of the length and the width of the target picture and the picture scaling factor is larger than the filling value, adjusting the picture scaling factor, and filling the size of the target picture according to the adjusted picture scaling factor;
and if the product of the larger value of the length and the width of the target picture and the picture scaling factor is less than or equal to the filling value, filling the size of the target picture according to the picture scaling factor.
In a possible embodiment, the obtaining module 601 is specifically configured to:
acquiring N frames of pictures included in the acquired video data;
and inputting each frame of the N frames of pictures into a skeleton network for convolution, pooling and activation processing to obtain a feature mapping set of each frame of picture.
It can be understood that the functions of each functional module of the object identification apparatus in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the related description in fig. 3 or fig. 5 of the foregoing method embodiment, which is not described herein again.
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present invention. The server described in this embodiment includes: a processor 701, a network interface 702, and a memory 703. The processor 701, the network interface 702, and the memory 703 may be connected by a bus or other means, and the embodiment of the present invention is exemplified by being connected by a bus.
The processor 701 (or Central Processing Unit, CPU) is the computing core and control core of the server. The network interface 702 may optionally include a standard wired interface or a wireless interface (e.g., WI-FI, a mobile communication interface, etc.), and is controlled by the processor 701 to send and receive data. The memory 703 (Memory) is a memory device of the server for storing programs and data. It is understood that the memory 703 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, at least one storage device may be located remotely from the processor 701. The memory 703 provides storage space that stores the operating system and the executable program code of the server, which may include, but is not limited to: a Windows system (an operating system), a Linux system (an operating system), etc., which are not limited herein.
In the embodiment of the present invention, the processor 701 executes the executable program code in the memory 703 to perform the following operations:
acquiring a feature mapping set of each frame of picture in N frames of pictures included in acquired video data, wherein the feature mapping set is used for indicating features of an object included in each frame of picture, and N is an integer greater than or equal to 1;
determining a target convolution set of the N frames of pictures and auxiliary data of an object in each frame of picture according to the feature mapping set of each frame of picture and a recorded reference feature mapping set, wherein the auxiliary data is used for indicating a corresponding image area of the object in the picture, the reference feature mapping set is a convolution set of a previous batch of N frames of pictures, and the reference feature mapping set is used for indicating morphological features of the object;
and carrying out classification regression processing on the auxiliary data of the object and the target convolution set to obtain class indication data and corresponding position data of each object included in the N frames of pictures.
In a possible embodiment, the processor 701 is specifically configured to:
acquiring a recorded reference feature mapping set;
determining the weight of each frame of picture according to the feature mapping set of each frame of picture and the reference feature mapping set;
determining a target convolution set of the N frames of pictures according to the weight and the feature mapping set of each frame of picture;
and performing data processing on the target convolution set by using a candidate set network to obtain auxiliary data of the object in each frame of picture.
In a possible embodiment, the processor 701 is specifically configured to:
similarity calculation is carried out on the feature mapping set of each frame of picture and the reference feature mapping set, and the similarity between the feature mapping set of each frame of picture and the reference feature mapping set is obtained;
and determining the weight of each frame of picture according to the corresponding similarity of each frame of picture.
In a possible embodiment, the processor 701 is specifically configured to:
carrying out weighting processing on the weight and the feature mapping set of each frame of picture to obtain a total feature mapping set of the N frames of pictures;
and carrying out averaging processing on the total feature mapping set of the N frames of pictures to obtain a target convolution set of the N frames of pictures.
In a possible embodiment, the processor 701 is further configured to:
and updating the reference feature mapping set by using the target convolution set.
In a possible embodiment, before the processor 701 obtains the feature mapping set of each frame of the N frames of pictures included in the captured video data, it is further configured to:
acquiring the maximum length and the maximum width of N frames of pictures included in the acquired video data;
determining the maximum value of the maximum length and the maximum width to obtain a filling value;
and filling the size of each frame of picture in the N frames of pictures according to the filling value.
In a possible embodiment, the processor 701 is specifically configured to:
for a target picture in the N frames of pictures, calculating a picture scaling factor corresponding to the target picture according to the length and the width of the target picture, wherein the target picture is any one of the N frames of pictures;
if the product of the larger value of the length and the width of the target picture and the picture scaling factor is larger than the filling value, adjusting the picture scaling factor, and filling the size of the target picture according to the adjusted picture scaling factor;
and if the product of the larger value of the length and the width of the target picture and the picture scaling factor is less than or equal to the filling value, filling the size of the target picture according to the picture scaling factor.
In a possible embodiment, the processor 701 is specifically configured to:
acquiring N frames of pictures included in the acquired video data;
and inputting each frame of the N frames of pictures into a skeleton network for convolution, pooling and activation processing to obtain a feature mapping set of each frame of picture.
In a specific implementation, the processor 701, the network interface 702, and the memory 703 described in the embodiment of the present invention may execute the implementation described in the flow of the object identification method provided in the embodiment of the present invention, and may also execute the implementation described in the object identification apparatus provided in the embodiment of the present invention, which is not described herein again.
Embodiments of the present invention further provide a computer-readable storage medium, where a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the steps performed in the above-mentioned object recognition embodiments may be performed.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps performed in the above-described object recognition embodiment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An object identification method, characterized in that the method comprises:
acquiring a feature mapping set of each frame of picture in N frames of pictures included in acquired video data, wherein the feature mapping set is used for indicating features of an object included in each frame of picture, and N is an integer greater than or equal to 1;
determining a target convolution set of the N frames of pictures and auxiliary data of an object in each frame of picture according to the feature mapping set of each frame of picture and a recorded reference feature mapping set, wherein the auxiliary data is used for indicating a corresponding image area of the object in the picture, the reference feature mapping set is a convolution set of a previous batch of N frames of pictures, and the reference feature mapping set is used for indicating morphological features of the object;
and carrying out classification regression processing on the auxiliary data of the object and the target convolution set to obtain class indication data and corresponding position data of each object included in the N frames of pictures.
2. The method of claim 1, wherein determining the target convolution set of the N frames of pictures and the auxiliary data of the object in each frame of pictures according to the feature mapping set of each frame of pictures and the recorded reference feature mapping set comprises:
acquiring a recorded reference feature mapping set;
determining the weight of each frame of picture according to the feature mapping set of each frame of picture and the reference feature mapping set;
determining a target convolution set of the N frames of pictures according to the weight and the feature mapping set of each frame of picture;
and performing data processing on the target convolution set by using a candidate set network to obtain auxiliary data of the object in each frame of picture.
3. The method of claim 2, wherein determining the weight of each frame of picture according to the feature mapping set of each frame of picture and the reference feature mapping set comprises:
similarity calculation is carried out on the feature mapping set of each frame of picture and the reference feature mapping set, and the similarity between the feature mapping set of each frame of picture and the reference feature mapping set is obtained;
and determining the weight of each frame of picture according to the corresponding similarity of each frame of picture.
4. The method according to claim 2 or 3, wherein the determining the target convolution set of the N frames of pictures according to the weight and feature mapping set of each frame of pictures comprises:
carrying out weighting processing on the weight and the feature mapping set of each frame of picture to obtain a total feature mapping set of the N frames of pictures;
and carrying out averaging processing on the total feature mapping set of the N frames of pictures to obtain a target convolution set of the N frames of pictures.
5. The method of claim 1, further comprising:
and updating the reference feature mapping set by using the target convolution set.
6. The method of claim 1, wherein prior to obtaining the feature mapping set for each of the N frames of pictures included in the captured video data, the method further comprises:
acquiring the maximum length and the maximum width of N frames of pictures included in the acquired video data;
determining the maximum value of the maximum length and the maximum width to obtain a filling value;
and filling the size of each frame of picture in the N frames of pictures according to the filling value.
7. The method according to claim 6, wherein the padding the size of each frame of the N frames of pictures according to the padding value comprises:
for a target picture in the N frames of pictures, calculating a picture scaling factor corresponding to the target picture according to the length and the width of the target picture, wherein the target picture is any one of the N frames of pictures;
if the product of the larger value of the length and the width of the target picture and the picture scaling factor is larger than the filling value, adjusting the picture scaling factor, and filling the size of the target picture according to the adjusted picture scaling factor;
and if the product of the larger value of the length and the width of the target picture and the picture scaling factor is less than or equal to the filling value, filling the size of the target picture according to the picture scaling factor.
8. The method of claim 1, wherein obtaining the feature mapping set for each of the N frames of pictures included in the captured video data comprises:
acquiring N frames of pictures included in the acquired video data;
and inputting each frame of the N frames of pictures into a skeleton network for convolution, pooling and activation processing to obtain a feature mapping set of each frame of picture.
9. An object recognition apparatus, characterized in that the apparatus comprises:
an acquisition module, used for acquiring a feature mapping set of each frame of picture in N frames of pictures included in acquired video data, wherein the feature mapping set is used for indicating the features of an object included in each frame of picture, and N is an integer greater than or equal to 1;
a determining module, configured to determine, according to the feature mapping set of each frame of picture and a recorded reference feature mapping set, a target convolution set of the N frames of pictures and auxiliary data of an object in each frame of picture, where the auxiliary data is used to indicate the image area corresponding to the object in the picture, the reference feature mapping set is a convolution set of the previously processed batch of N frames of pictures, and the reference feature mapping set is used to indicate morphological features of the object;
and the processing module is used for carrying out classification regression processing on the auxiliary data of the object and the target convolution set to obtain the class indication data and the corresponding position data of each object in the N frames of pictures.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 8.
CN202010861852.5A 2020-08-25 2020-08-25 Object identification method, device, server and computer readable storage medium Pending CN114187540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010861852.5A CN114187540A (en) 2020-08-25 2020-08-25 Object identification method, device, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010861852.5A CN114187540A (en) 2020-08-25 2020-08-25 Object identification method, device, server and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114187540A 2022-03-15

Family

ID=80538935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010861852.5A Pending CN114187540A (en) 2020-08-25 2020-08-25 Object identification method, device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114187540A (en)

Similar Documents

Publication Publication Date Title
KR102319177B1 (en) Method and apparatus, equipment, and storage medium for determining object pose in an image
CN109583483B (en) Target detection method and system based on convolutional neural network
CN111950723B (en) Neural network model training method, image processing method, device and terminal equipment
CN108446694B (en) Target detection method and device
CN109711268B (en) Face image screening method and device
CN114186632B (en) Method, device, equipment and storage medium for training key point detection model
CN110310301B (en) Method and device for detecting target object
CN114969417B (en) Image reordering method, related device and computer readable storage medium
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
CN111062331A (en) Mosaic detection method and device for image, electronic equipment and storage medium
CN110717405B (en) Face feature point positioning method, device, medium and electronic equipment
CN114494775A (en) Video segmentation method, device, equipment and storage medium
CN111260655A (en) Image generation method and device based on deep neural network model
CN115272691A (en) Training method, recognition method and equipment for steel bar binding state detection model
CN109377552B (en) Image occlusion calculating method, device, calculating equipment and storage medium
CN112013820B (en) Real-time target detection method and device for deployment of airborne platform of unmanned aerial vehicle
CN109657083B (en) Method and device for establishing textile picture feature library
CN116188917A (en) Defect data generation model training method, defect data generation method and device
CN114219757B (en) Intelligent damage assessment method for vehicle based on improved Mask R-CNN
CN114187540A (en) Object identification method, device, server and computer readable storage medium
CN115937537A (en) Intelligent identification method, device and equipment for target image and storage medium
CN113378864B (en) Method, device and equipment for determining anchor frame parameters and readable storage medium
CN112749713B (en) Big data image recognition system and method based on artificial intelligence
CN114332915A (en) Human body attribute detection method and device, computer equipment and storage medium
CN110647898B (en) Image processing method, image processing device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination