CN110909651A - Video subject person identification method, device, equipment and readable storage medium

Video subject person identification method, device, equipment and readable storage medium

Info

Publication number
CN110909651A
Authority
CN
China
Prior art keywords
video
frames
person
identity information
face
Prior art date
Legal status
Granted
Application number
CN201911122223.4A
Other languages
Chinese (zh)
Other versions
CN110909651B (en)
Inventor
郑茂
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911122223.4A priority Critical patent/CN110909651B/en
Priority to CN202010194945.7A priority patent/CN111310731B/en
Publication of CN110909651A publication Critical patent/CN110909651A/en
Application granted granted Critical
Publication of CN110909651B publication Critical patent/CN110909651B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, a device and equipment for identifying a video subject person, and a readable storage medium, and relates to the technical field of multimedia. The method comprises the following steps: acquiring n video image frames from a target video, wherein n ≥ 2; performing face recognition on the n video image frames to obtain person identity information; performing pedestrian detection on the n video image frames to obtain person body features, wherein the person body features comprise a first body feature matched with the person identity information and a second body feature not matched with the person identity information; and re-identifying the second body feature according to the first body feature, and determining the video subject person by combining the re-identification result. Because the second body feature that is not matched with person identity information is re-identified through the first body feature that is matched with person identity information, the problem that the video subject person cannot be accurately identified in a video image frame is avoided, and the identification accuracy of the video subject person is improved.

Description

Video subject person identification method, device, equipment and readable storage medium
Technical Field
The embodiment of the application relates to the technical field of multimedia, in particular to a method, a device and equipment for identifying a video subject person and a readable storage medium.
Background
Person recognition is a technique for recognizing persons in images and is commonly applied to recognizing persons in videos. Optionally, the subject person of a video (i.e., the person who appears most frequently in its key frames) can be obtained by performing person recognition on the key frames of the video and combining the recognition results of all key frames.
In the related art, after a key frame is acquired, face detection is performed on the key frame; after the region of a face in the key frame is determined, face features in the region are extracted and recognized, and the identity corresponding to the face is then determined.
However, a subject person often appears in a key frame as a back view or a side face, so the face of the subject person cannot be accurately recognized. As a result, the counted number of appearances of the actual subject person has a large error, and the recognition accuracy of the subject person is low.
Disclosure of Invention
The embodiment of the application provides a method, a device and equipment for identifying a video subject person and a readable storage medium, which can solve the problems that the counted number of appearances of the actual subject person has a large error and the identification accuracy of the subject person is low. The technical scheme is as follows:
in one aspect, a method for identifying a video subject person is provided, where the method includes:
acquiring n frames of video image frames from a target video, wherein the n frames of video image frames are used for determining the video subject person of the target video, and n is more than or equal to 2;
performing face recognition on the n frames of video image frames to obtain person identity information in the n frames of video image frames;
performing pedestrian detection on the n frames of video image frames to obtain person body features in the n frames of video image frames, wherein the person body features comprise a first body feature matched with the person identity information and a second body feature not matched with the person identity information;
and re-identifying the person identity information of the second body feature according to the first body feature, and determining the video subject person of the target video by combining the re-identification result.
In another aspect, an apparatus for identifying a person as a subject of video is provided, the apparatus comprising:
the acquisition module is used for acquiring n frames of video image frames from a target video, wherein the n frames of video image frames are used for determining the video subject person of the target video, and n is more than or equal to 2;
the recognition module is used for carrying out face recognition on the n frames of video image frames to obtain the identity information of people in the n frames of video image frames;
the extraction module is used for carrying out pedestrian detection on the n frames of video image frames to obtain character body characteristics in the n frames of video image frames, wherein the character body characteristics comprise a first body characteristic matched with the character identity information and a second body characteristic not matched with the character identity information;
the identification module is further used for re-identifying the character identity information of the second body characteristic according to the first body characteristic;
and the determining module is used for determining the video subject character of the target video by combining the re-recognition result.
In another aspect, a computer device is provided, which includes a processor and a memory, wherein at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method for identifying a video subject person as described in the embodiments of the present application.
In another aspect, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method for identifying a video subject person as described in the embodiments of the present application.
In another aspect, a computer program product is provided, which, when run on a computer, causes the computer to execute the method for identifying a video subject person as described in the embodiments of the present application.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
after face recognition is performed on the video image frames of a video, pedestrian detection is performed on the video image frames, and the second body feature that is not matched with person identity information is re-identified through the first body feature that is matched with person identity information. This avoids the problem that the video subject person cannot be accurately identified in a video image frame when the body region displayed by the video subject person in the frame is a side view or a back view, which would otherwise lower the identification accuracy, and thus improves the identification accuracy of the video subject person.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of a face keypoint identification result provided by an exemplary embodiment of the present application;
fig. 2 is a flowchart of a method for identifying a person in a video according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a cascading structure of an MTCNN model provided based on the embodiment shown in FIG. 2;
FIG. 4 is a schematic diagram of constructing an image pyramid from an image provided based on the embodiment shown in FIG. 2;
FIG. 5 is a schematic diagram of a correction to a face region provided based on the embodiment shown in FIG. 2;
FIG. 6 is a schematic diagram of a process for training and testing a face recognition model provided based on the embodiment shown in FIG. 2;
fig. 7 is an overall structural schematic diagram of the CSP detector provided based on the embodiment shown in fig. 2;
fig. 8 is a flowchart of a method for identifying a person in a video according to another exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of an HPM model provided based on the embodiment shown in FIG. 8;
fig. 10 is a schematic diagram of a process for determining person identification information corresponding to a person region box through face recognition and pedestrian detection according to the embodiment shown in fig. 8;
fig. 11 is a flowchart of a method for identifying a person of a video subject according to another exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of a video recommendation process provided based on the embodiment shown in FIG. 11;
fig. 13 is an overall architecture diagram of a neural network model applied in the method for identifying a video subject person according to an exemplary embodiment of the present application;
fig. 14 is a block diagram illustrating an apparatus for identifying a person of a video subject according to an exemplary embodiment of the present application;
fig. 15 is a block diagram illustrating an apparatus for identifying a person of a video subject according to another exemplary embodiment of the present application;
fig. 16 is a block diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms referred to in the embodiments of the present application are briefly described:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision technology (CV) is a science that studies how to make machines "see". More specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement on targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques and attempts to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Face detection: a technology for detecting the position of a face in an image. Optionally, in the face detection process, face key point recognition is performed on the image, and the face region is cropped according to the recognized face key points, thereby performing face detection on the image. Face key points are identification points of key facial parts obtained during face detection; optionally, the key parts include the parts where the facial features are located. For example, the face key points may comprise 5 key points: the two eyes, the nose, and the two mouth corners; alternatively, a 68-point or 106-point standard may be used, in which key points are marked around the whole face contour, the eyebrows, the nose, and the lips in the image to be detected. The number of face key points can be set by the designer. Optionally, the face key points may be detected by a key point detection algorithm, such as the Supervised Descent Method (SDM) for training face feature points, or a Convolutional Neural Network (CNN) based key point regression method. Optionally, the face key points may be used in practical applications such as face beautification, face stickers, three-dimensional reconstruction and face region determination. For example, referring to fig. 1, a face image 100 includes a face 110, the face 110 includes eyes 111, a nose 112 and lips 113, and detected key points 120 are correspondingly marked on the eyes 111, the nose 112 and the two mouth corners of the lips 113.
Face recognition: the method includes the steps of identifying identity information of a face in a face area, optionally, extracting features of the face area to be identified in the face identification process, comparing the extracted features with features in a preset face feature library, and determining the identity information of the face in the face area. Optionally, determining a feature of which the similarity between the extracted feature and the face feature library meets the similarity requirement, and taking the identity information corresponding to the feature as the identity information of the face in the face region.
Pedestrian detection: refers to a technology for identifying a person region frame in an image, where optionally, a single person region frame corresponds to one person in the image, and the person region frame includes a complete body part of the person, such as: head, torso, limbs, etc.
Pedestrian re-identification: after face recognition and pedestrian detection are performed, the person region frames are matched with the identity information of the faces, and the identity information of the person region frames that are not matched with identity information is re-identified according to the person region frames that are matched with identity information. Optionally, feature extraction is performed on a first person region frame matched with identity information to obtain a first body feature, feature extraction is performed on a second person region frame not matched with identity information to obtain a second body feature, and the identity information of the second body feature and of the second person region frame is re-identified according to the similarity between the first body feature and the second body feature.
Secondly, the application scenarios related to the embodiment of the present application include the following scenarios:
in a video recommendation scene, after video image frames are extracted from a video, the subject person in the video image frames is identified. Optionally, face recognition is first performed on the video image frames to obtain the person identity information corresponding to the faces in the video image frames; pedestrian detection is performed on the video image frames to obtain person region frames; the person region frames are matched with the person identity information to obtain a first person region frame matched with the person identity information and a second person region frame not matched with the person identity information; feature extraction is performed on the first person region frame to obtain a first body feature, and on the second person region frame to obtain a second body feature; the person identity information corresponding to the second body feature is re-identified according to the similarity between the first body feature and the second body feature; the video subject person is determined according to the number of appearances of the person identity information in the re-identification result and the number of appearances of the person identity information corresponding to the first body feature; and a recommendation message is sent to an account, wherein the recommendation message recommends the video with the subject person as the recommendation focus.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to the technologies such as an artificial intelligence computer vision technology, a machine learning algorithm and the like, and is specifically explained by the following embodiments:
with reference to the above introduction of terms and application scenarios, the method for identifying a video subject person provided in an embodiment of the present application is described below. Fig. 2 is a flowchart of a method for identifying a video subject person provided by an exemplary embodiment of the present application; the method is described by taking its application to a server as an example. As shown in fig. 2, the method includes:
step 201, acquiring n frames of video image frames from a target video, wherein the n frames of video image frames are used for determining a video subject person of the target video, and n is more than or equal to 2.
Optionally, the manner of acquiring n frames of video image frames from the target video includes any one of the following manners:
firstly, acquiring video image frames from the target video at preset time intervals to obtain n video image frames;
illustratively, one video image frame is acquired from the video stream of the target video every 1 second, and n video image frames are finally acquired. The above example takes 1 second as an example; the sampling density of the acquired video image frames may be set by a programmer, which is not limited in this embodiment of the application.
Secondly, acquiring the key frame from the target video to obtain n frames of video image frames.
Optionally, when key frames are acquired, every key frame may be acquired, or the key frames may be acquired at intervals. For example, a video stream sequentially includes key frame 1, key frame 2, key frame 3 and key frame 4; when the key frames are acquired in an alternate-frame manner, key frame 1 and key frame 3 are acquired, and key frame 2 and key frame 4 are discarded.
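By way of illustration only (not part of the disclosed embodiments), the following is a minimal Python sketch of the first acquisition mode, assuming OpenCV is available; the function name, the 1-second interval and the optional frame cap are illustrative assumptions:

```python
import cv2

def sample_frames(video_path, interval_sec=1.0, max_frames=None):
    """Sample one frame every `interval_sec` seconds from a video (first acquisition mode)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to 25 fps if metadata is missing
    step = max(int(round(fps * interval_sec)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
            if max_frames is not None and len(frames) >= max_frames:
                break
        idx += 1
    cap.release()
    return frames   # the n video image frames used for subject-person identification
```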
Step 202, performing face recognition on the n frames of video image frames to obtain the identity information of the person in the n frames of video image frames.
Optionally, when performing face recognition on the n frames of video image frames, firstly, performing face detection on the n frames of video image frames to obtain a face region in the n frames of video image frames, and after performing face recognition on the face region, obtaining person identity information corresponding to the face region.
That is, the process of confirming the person identity information includes two processes of face detection and face recognition, which are described respectively:
a face detection process
Optionally, in the process of face detection, face key point recognition is performed on the image, and face region segmentation is performed according to the face key points obtained through recognition, so that face detection is performed on the image. In the embodiment of the present application, the face key points include 5 key points, and the five face key points are respectively key points of a pair of eyes, a nose and two corners of a mouth.
Optionally, in this embodiment of the present application, face detection and calibration of the five key points are implemented by a Multi-task Cascaded Convolutional Neural Network (MTCNN) model. The MTCNN model is divided into three stages: in the first stage, a series of candidate windows are rapidly generated through a shallow Convolutional Neural Network (CNN) called the Proposal Network (P-Net); in the second stage, a more capable CNN called the Refine Network (R-Net) filters out non-face candidate windows; in the third stage, a more powerful Output Network (O-Net) marks the face with five key points.
Referring to fig. 3, the cascade structure of the MTCNN model includes a P-Net part 310, an R-Net part 320 and an O-Net part 330. As shown in fig. 3, an image to be face-detected is first given; illustratively, referring to fig. 4, an image 400 is given and resized to different scales to construct an image pyramid 410. The image pyramid 410 is the input to the P-Net part 310. The P-Net part 310 generates a series of face candidate windows and their bounding box regression vectors based on 12 × 12 × 3 input patches; candidate windows with low confidence are filtered out first, the coordinates of the candidate windows in the image are calculated through a bounding box regression algorithm, the highly overlapping candidate windows are then merged through a Non-Maximum Suppression (NMS) algorithm, and a series of candidate windows are finally output.
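The NMS merging step mentioned above can be illustrated with the following minimal NumPy sketch (an illustration only; the IoU threshold and the [x1, y1, x2, y2, score] row format are assumptions, not values from the embodiment):

```python
import numpy as np

def nms(boxes, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes is an ndarray of shape (N, 5): [x1, y1, x2, y2, score]."""
    if len(boxes) == 0:
        return boxes
    boxes = boxes[boxes[:, 4].argsort()[::-1]]          # sort by confidence, descending
    keep = []
    while len(boxes) > 0:
        best, rest = boxes[0], boxes[1:]
        keep.append(best)
        # intersection of `best` with every remaining box
        x1 = np.maximum(best[0], rest[:, 0])
        y1 = np.maximum(best[1], rest[:, 1])
        x2 = np.minimum(best[2], rest[:, 2])
        y2 = np.minimum(best[3], rest[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (best[2] - best[0]) * (best[3] - best[1])
        area_rest = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_best + area_rest - inter + 1e-9)
        boxes = rest[iou < iou_threshold]                # drop highly overlapping windows
    return np.stack(keep)
```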
The candidate windows output by the P-Net part 310 are resized to 24 × 24 × 3 and input to the R-Net part 320; the R-Net part 320 further filters out erroneous candidate windows, calibrates the remaining ones using the bounding box regression vectors, and merges duplicate candidate windows using the NMS algorithm.
The candidate windows output by the R-Net part 320 are resized to 48 × 48 × 3 and input to the O-Net part 330; after the candidate windows with low confidence are filtered out, the coordinates of the candidate boxes in the image are calculated using the bounding box regression vectors, duplicate candidate windows are filtered and merged using the NMS algorithm, and the face bounding boxes and the coordinates of the 5 feature points in the image are obtained.
Optionally, the MTCNN model is trained with the goal of converging three tasks: 1. face/non-face binary classification; 2. bounding box regression; 3. face feature point localization. The three tasks are explained below:
1. For the face binary classification task, cross entropy loss is used for convergence. For each sample $x_i$, the loss value is calculated by the following formula one:

Formula one:

$$L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det})\log(1 - p_i) \right)$$

where $p_i$ is the probability, obtained by the MTCNN model, that sample $x_i$ is a face, and $y_i^{det} \in \{0, 1\}$ is the label information of sample $x_i$, indicating whether it is a face: $y_i^{det} = 1$ indicates that sample $x_i$ is a face, and $y_i^{det} = 0$ indicates that sample $x_i$ is not a face. The model parameters of the MTCNN model are adjusted by the calculated loss values, so that the face binary classification converges.
2. For the bounding box regression task, the offset between each candidate window and the annotation is calculated. For each sample $x_i$, the squared error loss value is calculated by the following formula two:

Formula two:

$$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2$$

where $\hat{y}_i^{box}$ denotes the bounding box obtained by the MTCNN model for the sample, $y_i^{box}$ denotes the annotated bounding box corresponding to the sample, and $L_i^{box}$ denotes the squared error loss value. The model parameters of the MTCNN model are adjusted by the squared error loss value, so that the bounding box regression converges.
3. For the face feature point localization task, the distance between each feature point and the annotated point is minimized. For each sample $x_i$, the distance is calculated by the following formula three:

Formula three:

$$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2$$

where $\hat{y}_i^{landmark}$ denotes the feature point coordinates identified by the MTCNN model, $y_i^{landmark}$ denotes the annotated feature point coordinates corresponding to the sample, and $L_i^{landmark}$ denotes the calculated distance. The model parameters of the MTCNN model are adjusted by this distance, so that the face feature point localization converges.
Optionally, since the loss values calculated in the three convergence processes need to be weighted differently, the model parameters of the MTCNN model are trained through an overall training loss expressed as

$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_i^j \, L_i^j$$

where $\alpha_j$ denotes the task weight (illustratively, in P-Net and R-Net, $\alpha_{det}=1.0$, $\alpha_{box}=0.5$, $\alpha_{landmark}=0.5$; in O-Net, $\alpha_{det}=1.0$, $\alpha_{box}=1.5$, $\alpha_{landmark}=1.0$), $\beta_i^j$ denotes the sample weight, and $L_i^j$ denotes the loss value calculated by formula one to formula three above.
Optionally, the face detection box and the 5 face key points are obtained from the MTCNN model, the face is corrected by a reflection transformation, the effective face area is cropped, and the size is normalized. As shown in fig. 5, the detected face region 500, labeled with the relevant key points 501, is corrected to obtain a corrected image 510.
Second, face recognition process
Optionally, in the process of face recognition, feature extraction is performed on a face region to be recognized, and after the extracted features are compared with features in a preset face feature library, the person identity information of a face in the face region is determined.
That is, a first face feature is extracted from the face region, and the first face feature is compared with the second face features in a face feature library to obtain the person identity information corresponding to the face region, wherein the second face features in the face feature library are annotated with person identity information.
Optionally, the feature extraction for the face region is based on a face recognition model, such as the ArcFace model, which uses ResNet-50 as the feature extractor. In the training process of the face recognition model, an Additive Angular Margin Loss is constructed for training; in the testing stage, the Euclidean distance or the cosine similarity is calculated from the features extracted by ResNet-50 for face verification. The training and testing process is illustrated in fig. 6 and includes a training process 610 and a testing process 620.
In the training process 610, feature extraction is performed on training data 611 through a feature extractor 612 to obtain features x. Assuming that the number of sample categories is n and the dimension of the input feature x is d, the dimension of the model weight w is d × n. The sample x and the weight w are normalized, and the normalized sample is output through the face recognition model to obtain a 1 × n-dimensional fully connected layer 613; a loss value 614 is calculated according to the output, and the output is multiplied by a normalization parameter s and then passed through a softmax layer to obtain the classification score 615.
In the testing process 620, feature extraction is performed on the test data 621 to obtain a depth feature 622; the depth feature 622 is normalized, and the similarity 623 is obtained by cosine similarity comparison, or the distance value 624 is calculated by the Euclidean distance algorithm, and face verification matching 625 is then performed.
The Euclidean distance algorithm refers to the following formula four:

Formula four:

$$L_{ij} = \left\| \vec{f}_i - \vec{f}_j \right\|_2 = \sqrt{\sum_{k=1}^{d} \left( f_{ik} - f_{jk} \right)^2}$$

where $\vec{f}_i$ and $\vec{f}_j$ are the feature vectors extracted from the two faces involved in the comparison, and $L_{ij}$ is the Euclidean distance value between the two faces.
The cosine similarity calculation refers to the following formula five:

Formula five:

$$\cos\theta = \frac{\vec{f}_i \cdot \vec{f}_j}{\left\| \vec{f}_i \right\| \left\| \vec{f}_j \right\|}$$

where $\vec{f}_i$ and $\vec{f}_j$ are the feature vectors extracted from the two faces involved in the comparison, and $\cos\theta$ is the cosine similarity between the two faces.
And step 203, carrying out pedestrian detection on the n frames of video image frames to obtain character body characteristics in the n frames of video image frames, wherein the character body characteristics comprise a first body characteristic matched with the character identity information and a second body characteristic not matched with the character identity information.
Optionally, when pedestrian detection is performed on the n frames of video image frames, the person region frames in the n frames of video image frames are first obtained, and body feature extraction is performed on the person region frames to obtain the person body features in the n frames of video image frames. Optionally, the face regions and the person region frames in the n frames of video image frames are matched to obtain a first matching relationship, and a second matching relationship between the person identity information and the person region frames is determined according to the person identity information corresponding to the face regions and the first matching relationship.
Optionally, according to the second matching relationship, for a first person region frame matched with the person identity information, a first body feature is extracted from the first person region frame, and for a second person region frame not matched with the person identity information, a second body feature is extracted from the second person region frame.
Optionally, the pedestrian detection process adopts an anchor-free method based on a Center and Scale Prediction (CSP) detector: the center position and the scale of each pedestrian are predicted through convolution operations. The overall structure of the CSP detector is shown in fig. 7 and is mainly divided into two parts, a feature extraction part 710 and a detection head 720.
The feature extraction part 710 uses ResNet-50 as the backbone network and divides its convolution layers into five stages, taking each downsampling as a boundary; the output feature map of each stage is the input image downsampled by a factor of 2, 4, 8, 16 and 32 respectively, and in the fifth stage, dilated convolution is adopted so that the output keeps 1/16 of the original input image size. The output feature maps of the 2nd, 3rd, 4th and 5th stages are deconvolved to feature maps of the same resolution (1/4 of the size of the original image), and these feature maps are merged in the channel direction to obtain a new feature map with richer semantic information. Since the feature maps of the different stages have different scales, L2 normalization is used to rescale their standard deviation to 10 before deconvolution and merging.
In the detection head 720, a 3 × 3 convolution is first applied to the feature map extracted by the feature extraction part 710 to reduce the number of channels to 256; the result is then passed through parallel 1 × 1 convolution layers to generate a center heatmap and a scale prediction map respectively, and, in order to reduce the error and fine-tune the center position, an extra offset prediction branch is added alongside the two parallel branches.
Optionally, in the training of the detector, loss values are calculated by a loss function and the detector is trained accordingly. The loss function comprises three parts: the first part is the loss of predicting the center point position, the second part is the loss of predicting the scale size, and the third part is the loss of predicting the center offset.
First, for the loss of predicting the center point position, center point prediction is regarded as a classification problem. A two-dimensional Gaussian mask centered at each annotated point is added to form the center prediction heatmap; the specific calculation refers to the following formula six and formula seven:

Formula six:

$$M_{ij} = \max_{k=1,\dots,K} G\!\left(i, j;\, x_k, y_k, \sigma_{w_k}, \sigma_{h_k}\right)$$

Formula seven:

$$G(i, j; x, y, \sigma_w, \sigma_h) = \exp\!\left(-\left(\frac{(i - x)^2}{2\sigma_w^2} + \frac{(j - y)^2}{2\sigma_h^2}\right)\right)$$

where $K$ denotes the number of target objects in the picture, and $(x_k, y_k, w_k, h_k)$ denote the center point coordinates, width and height of the $k$-th object. The variances $(\sigma_{w_k}^2, \sigma_{h_k}^2)$ are proportional to the width and height of the object. If the masks overlap, the higher value is chosen at that location.
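The following NumPy sketch illustrates building the center prediction heatmap of formulas six and seven; the proportionality constant relating the Gaussian spread to the box size is an illustrative assumption:

```python
import numpy as np

def center_heatmap(height, width, objects, sigma_ratio=0.25):
    """objects: list of (x, y, w, h) centers/sizes on the heatmap grid."""
    yy, xx = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (x, y, w, h) in objects:
        sigma_w, sigma_h = sigma_ratio * w, sigma_ratio * h          # spread proportional to box size
        g = np.exp(-(((xx - x) ** 2) / (2 * sigma_w ** 2) +
                     ((yy - y) ** 2) / (2 * sigma_h ** 2)))           # formula seven
        heatmap = np.maximum(heatmap, g)                              # formula six: keep the higher value on overlap
    return heatmap
```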
The center point prediction loss function refers to the following formula eight:

Formula eight:

$$L_{center} = -\frac{1}{K} \sum_{i=1}^{W/r} \sum_{j=1}^{H/r} \alpha_{ij}\, (1 - \hat{p}_{ij})^{\gamma} \log(\hat{p}_{ij})$$

where

$$\hat{p}_{ij} = \begin{cases} p_{ij} & \text{if } y_{ij} = 1 \\ 1 - p_{ij} & \text{otherwise} \end{cases}
\qquad
\alpha_{ij} = \begin{cases} 1 & \text{if } y_{ij} = 1 \\ (1 - M_{ij})^{\beta} & \text{otherwise} \end{cases}$$

where $L_{center}$ is the center point prediction loss value; $p_{ij} \in [0, 1]$ is the predicted probability that the current pixel is a center point; $y_{ij} \in \{0, 1\}$ is the ground-truth label, with $y_{ij} = 1$ indicating that the pixel is annotated as a center and $y_{ij} = 0$ indicating that it is not; $M_{ij}$ is the value calculated by formula six above; $\gamma$ and $\beta$ are power parameters; $W$ and $H$ are the width and height of the input image; and $r$ is the corresponding downsampling ratio.
Secondly, for the loss of predicting the scale size, refer to the following formula nine:

Formula nine:

$$L_{scale} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{SmoothL1}\!\left(s_k, t_k\right)$$

where $L_{scale}$ denotes the scale loss value, and $s_k$ and $t_k$ denote the predicted value and the true value at each point, respectively.
Combining the above loss calculations, the final loss function is the following formula ten:

Formula ten:

$$L = \lambda_c L_{center} + \lambda_s L_{scale} + \lambda_o L_{offset}$$

where $L_{center}$ is the center point prediction loss value, $L_{scale}$ is the scale prediction loss value, and $L_{offset}$ is the center offset prediction loss value. Optionally, $\lambda_c$ is set to 0.1, $\lambda_s$ is set to 1, and $\lambda_o$ is set to 0.1.
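A minimal NumPy sketch of formulas eight and ten, for illustration only; the power parameters gamma and beta are assumed values, and the default lambda weights are those quoted above:

```python
import numpy as np

def center_loss(p, y, M, gamma=2.0, beta=4.0):
    """Formula eight: focal-style loss over the center heatmap (gamma and beta are assumed values)."""
    K = max(int(y.sum()), 1)                          # number of annotated centers
    p_hat = np.where(y == 1, p, 1.0 - p)
    alpha = np.where(y == 1, 1.0, (1.0 - M) ** beta)
    return -np.sum(alpha * (1.0 - p_hat) ** gamma * np.log(p_hat + 1e-12)) / K

def csp_total_loss(l_center, l_scale, l_offset, lambda_c=0.1, lambda_s=1.0, lambda_o=0.1):
    """Formula ten: weighted sum of the center, scale and offset losses."""
    return lambda_c * l_center + lambda_s * l_scale + lambda_o * l_offset
```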
And step 204, re-identifying the character identity information of the second body characteristic according to the first body characteristic, and determining the video subject character of the target video according to the re-identification result.
Optionally, the first body feature and the second body feature are compared; when the similarity between the second body feature and a first body feature is greater than the required similarity, the person identity information corresponding to the first body feature and the second body feature is considered to be the same, and the person identity information corresponding to the first body feature is determined to be the person identity information corresponding to the second body feature (i.e., to the second person region frame).
Illustratively, person 1 is obtained through face recognition in video image frame a, and person region frame 1 corresponding to person 1 is obtained through pedestrian detection; in video image frame b, person 1 is not obtained through face recognition, but person region frame 2 is obtained through pedestrian detection. Feature extraction is performed on person region frame 1 to obtain a first body feature, and on person region frame 2 to obtain a second body feature; when the similarity between the first body feature and the second body feature is greater than the required similarity, it is determined that person region frame 2 in video image frame b corresponds to person 1.
Optionally, the first person region frame and the second person region frame come from two different video image frames; the second person region frame, in a video image frame for which no person identity information was recognized, is re-identified using the first person region frame in a video image frame for which person identity information was recognized.
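A minimal Python sketch of this re-identification step, for illustration; it assumes the body features are vectors compared by cosine similarity, and the similarity threshold is an illustrative assumption:

```python
import numpy as np

def reidentify(unmatched_boxes, matched_boxes, sim_threshold=0.6):
    """Assign identities to person region frames that face recognition could not label.

    matched_boxes:   list of (identity, body_feature) from frames where the face was recognized
    unmatched_boxes: list of body_feature vectors without identity information
    """
    results = []
    for feat in unmatched_boxes:
        best_id, best_sim = None, sim_threshold
        for identity, ref_feat in matched_boxes:
            sim = float(np.dot(feat, ref_feat) /
                        (np.linalg.norm(feat) * np.linalg.norm(ref_feat) + 1e-12))
            if sim > best_sim:
                best_id, best_sim = identity, sim
        results.append(best_id)   # None means the frame still has no identity
    return results
```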
Optionally, the process of re-identifying the second body feature is implemented using a Horizontal Pyramid Matching (HPM) model, which makes full use of different local spatial information of the pedestrian.
In summary, in the method for identifying a video subject person provided by this embodiment, after face recognition is performed on the video image frames of a video, pedestrian detection is performed on the video image frames, and the person region frames not matched with person identity information are re-identified using the person region frames matched with person identity information. This solves the problem that the video subject person cannot be accurately identified when the body region displayed by the subject person in a video image frame is a side view or a back view, which leads to low identification accuracy, and thus improves the identification accuracy of the video subject person.
In an optional embodiment, in the process of matching the person identity information with the person region frames, the person identity information is first matched with the face regions, and the face regions are then matched with the person region frames. Fig. 8 is a flowchart of a method for identifying a video subject person according to another exemplary embodiment of the present application. As shown in fig. 8, the method includes:
step 801, acquiring n frames of video image frames from a target video, wherein the n frames of video image frames are used for determining a video subject person of the target video, and n is larger than or equal to 2.
Optionally, the manner of acquiring n frames of video image frames from the target video includes any one of the following manners:
firstly, acquiring video image frames from the target video at preset time intervals to obtain n video image frames;
secondly, acquiring the key frame from the target video to obtain n frames of video image frames.
Step 802, performing face detection on n frames of video image frames to obtain a face area in the n frames of video image frames.
Optionally, the face detection process is already described in detail in step 202, and is not described herein again.
And 803, performing face recognition on the face area to obtain the person identity information corresponding to the face area.
Optionally, after the n frames of video image frames are subjected to face detection to obtain a face region, extracting a first face feature in the face region, and comparing the first face feature with a second face feature in a face feature library to obtain person identity information corresponding to the face region, where the second face feature in the face feature library is marked with the person identity information.
Optionally, the comparison between the first face feature and the second face feature may be performed by calculating the Euclidean distance or by calculating the cosine similarity: the smaller the Euclidean distance, the more similar the first face feature and the second face feature are; the greater the cosine similarity, the more similar the first face feature and the second face feature are.
Optionally, the face recognition process is already described in detail in step 202, and is not described herein again.
And step 804, carrying out pedestrian detection on the n frames of video image frames to obtain a person region frame in the n frames of video image frames.
Optionally, the process of detecting the pedestrian is described in detail in step 203, and is not described herein again.
Step 805, matching the human face region in the n frames of video image frames with the human figure region frame to obtain a first matching relationship.
Optionally, the first matching relationship is obtained according to an overlapping degree relationship between a face region and a person region frame in the n frames of video image frames.
Optionally, the face region corresponding to a person region frame is determined according to the overlap between the face region and the person region frame. Optionally, the face region corresponding to a person region frame is a face region that is enclosed in the person region frame and located within a preset position range inside the person region frame.
Step 806, determining a second matching relationship between the person identity information and the person region frame according to the person identity information corresponding to the face region and the first matching relationship.
Optionally, the second matching relationship between the person identity information and the person region frames is determined according to the person identity information obtained by recognizing the face regions in step 803 and the first matching relationship between the face regions and the person region frames obtained in step 805. Illustratively, face region A is recognized in step 803 as person B; in step 805, face region A is matched with the person region frames, and the person region frame matched with face region A is determined to be person region frame C; therefore, the person identity corresponding to person region frame C is determined to be person B.
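A minimal Python sketch of the matching in steps 805 and 806, for illustration; it assumes the criterion is that the face box center falls inside the upper part of the person region frame (the exact position range is an assumption; the embodiment only requires the face to lie within a preset range of the frame):

```python
def match_faces_to_person_boxes(face_boxes, person_boxes, top_fraction=0.5):
    """face_boxes: {face_id: (x1, y1, x2, y2)}; person_boxes: {box_id: (x1, y1, x2, y2)}.

    Returns {box_id: face_id} for person region frames whose upper part contains a face center.
    """
    matches = {}
    for box_id, (px1, py1, px2, py2) in person_boxes.items():
        y_limit = py1 + top_fraction * (py2 - py1)       # assumed "preset position range": upper part of the frame
        for face_id, (fx1, fy1, fx2, fy2) in face_boxes.items():
            cx, cy = (fx1 + fx2) / 2, (fy1 + fy2) / 2
            if px1 <= cx <= px2 and py1 <= cy <= y_limit:
                matches[box_id] = face_id
                break
    return matches

def boxes_to_identities(matches, face_identities):
    """Second matching relationship: identity per person region frame, given identity per face region."""
    return {box_id: face_identities[face_id] for box_id, face_id in matches.items()
            if face_id in face_identities}
```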
Step 807, for a first person region frame matched with the person identity information, the first body feature of the first person region frame is extracted according to the second matching relationship.
Step 808, for a second person region frame not matched with the person identity information, the second body feature of the second person region frame is extracted.
Optionally, the re-identification based on the first body feature and the second body feature is implemented using a Horizontal Pyramid Matching (HPM) model, which makes full use of different local spatial information of the pedestrian.
Optionally, the above extraction of the first body feature and the second body feature uses ResNet-50 as the backbone, divides the output features into horizontal blocks of different scales, and applies both an average pooling strategy and a maximum pooling strategy. The average pooling strategy perceives the global information of each spatial strip and takes the background context into account; the goal of the maximum pooling strategy is to extract the most discriminative information and ignore irrelevant information such as the background and clothing. The multi-scale features are combined to obtain the output features, and person matching between the first body feature and the second body feature is performed by calculating the distance between the features.
Optionally, as shown in fig. 9, in the structure of the HPM model, an image 900 is input into a ResNet-50 network to obtain a feature map 910; the feature map 910 is divided horizontally into 4 scales (1, 2, 4 and 8 strips), the resulting horizontal strips are pooled with both the average pooling strategy and the maximum pooling strategy, the local horizontal features are obtained by weighting, and, after a dimensionality reduction operation through a convolution layer, the local features are used for classification. A sketch of this horizontal pyramid pooling is given below.
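The following PyTorch sketch illustrates the horizontal pyramid pooling described above, for illustration only; the 256-dimensional reduction matches the local feature size mentioned in the testing description, while the module name, the sum of average and maximum pooling, and other details are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontalPyramidPooling(nn.Module):
    """Split a backbone feature map into 1+2+4+8 horizontal strips and pool each into a local feature."""
    def __init__(self, in_channels=2048, out_channels=256, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        n_strips = sum(scales)
        # one 1x1 conv per strip for dimensionality reduction
        self.reducers = nn.ModuleList([nn.Conv2d(in_channels, out_channels, 1) for _ in range(n_strips)])

    def forward(self, feat):                              # feat: (N, C, H, W) from ResNet-50
        locals_, idx = [], 0
        for s in self.scales:
            strips = torch.chunk(feat, s, dim=2)          # split along the height into horizontal strips
            for strip in strips:
                pooled = F.adaptive_avg_pool2d(strip, 1) + F.adaptive_max_pool2d(strip, 1)
                locals_.append(self.reducers[idx](pooled).flatten(1))   # (N, out_channels)
                idx += 1
        return locals_                                    # 1+2+4+8 = 15 local feature vectors
```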
Optionally, in the process of training the HPM model, probability prediction is performed with a softmax activation function, and the prediction result is the probability that a sample corresponds to its true label. This probability $\hat{y}_{i,j}$ refers to the following formula eleven:

Formula eleven:

$$\hat{y}_{i,j} = \mathrm{softmax}\!\left( W_{i,j}^{\top} H_{i,j}(I) \right)$$
the HPM model is trained by calculating a cross-entropy loss function as shown in equation twelve below:
equation twelve:
Figure BDA0002275757120000154
wherein P is the total number of the identity information of the person, Wi,jIs Hi,j(I) Y is a true label, N is a batch size, and CE represents a cross entropy loss function.
Optionally, during testing, the 1+2+4+8 local feature vectors of 256 dimensions each are concatenated as the feature; the features of the original image and of the flipped image are added and normalized, and detection prediction is then performed.
And step 809, re-identifying the character identity information of the second body characteristic according to the first body characteristic, and determining the video subject character of the target video according to a re-identification result.
Optionally, the first body feature and the second body feature are compared, and when the similarity between them is greater than the required similarity, the person identity information corresponding to the first body feature is determined to be the person identity information corresponding to the second body feature.
Optionally, when the video subject person is determined, the number of appearances of each piece of person identity information recognized in the n frames of video image frames is counted according to the person identity information corresponding to the first body features and the person identity information in the re-identification result, and the person identity information with the largest number of appearances is taken as the video subject person of the target video. Optionally, when the target video contains m video subject persons, the m pieces of person identity information with the largest numbers of appearances are taken as the m video subject persons of the target video, where m is a positive integer.
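A minimal Python sketch of this counting step, for illustration; identities are whatever labels face recognition and re-identification produce, and None entries stand for frames where no identity could be assigned:

```python
from collections import Counter

def video_subject_persons(identities_per_frame, m=1):
    """identities_per_frame: iterable of identity labels (or None) collected over the n frames."""
    counts = Counter(i for i in identities_per_frame if i is not None)
    return [identity for identity, _ in counts.most_common(m)]   # the m most frequent identities
```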
Illustratively, referring to fig. 10, the person library includes the face features of a person 1010. After face recognition is performed on a video frame 1020, the face region 1030 of the person 1010 is recognized, and the person region frame 1040 corresponding to the face region 1030 is identified. When image frame 1050 and image frame 1060 are processed, person region frame 1051 and person region frame 1061 are identified; after feature extraction is performed on person region frame 1051 and person region frame 1061, the extracted features are compared with the features extracted from person region frame 1040 in video frame 1020, and it is thereby determined that the person corresponding to person region frame 1051 and person region frame 1061 is the person 1010.
In summary, in the method for identifying a video subject person provided by this embodiment, after face recognition is performed on the video image frames of a video, pedestrian detection is performed on the video image frames, and the person region frames not matched with person identity information are re-identified using the person region frames matched with person identity information. This solves the problem that the video subject person cannot be accurately identified when the body region displayed by the subject person in a video image frame is a side view or a back view, which leads to low identification accuracy, and thus improves the identification accuracy of the video subject person.
In the method provided by this embodiment, the matching relationship between the face regions and the person region frames and the person identity information corresponding to the face regions are determined, so that the person identity information corresponding to the person region frames is determined; the person region frames not annotated with person identity information are then re-identified according to the person region frames annotated with person identity information, which improves the identification accuracy of the video subject person.
In an alternative embodiment, the above method for identifying a video subject person is applied to a video recommendation scene. Fig. 11 is a flowchart of a method for identifying a video subject person according to another exemplary embodiment of the present application; the method is described by taking its application to a server as an example. As shown in fig. 11, the method includes:
step 1101, acquiring n frames of video image frames from the target video, wherein the n frames of video image frames are used for determining a video main person of the target video, and n is larger than or equal to 2.
Optionally, the manner of acquiring n frames of video image frames from the target video includes any one of the following manners:
firstly, acquiring video image frames from the target video at preset time intervals to obtain n video image frames;
secondly, acquiring the key frame from the target video to obtain n frames of video image frames.
Step 1102, performing face recognition on n frames of video image frames to obtain the identity information of people in the n frames of video image frames.
Optionally, when performing face recognition on the n frames of video image frames, firstly, performing face detection on the n frames of video image frames to obtain a face region in the n frames of video image frames, and after performing face recognition on the face region, obtaining person identity information corresponding to the face region.
Step 1103, performing pedestrian detection on the n frames of video image frames to obtain character physical characteristics in the n frames of video image frames, where the character physical characteristics include a first physical characteristic matched with the character identity information and a second physical characteristic not matched with the character identity information.
Optionally, the person identity information in the n frames of video image frames is matched with the person region frames in the n frames of video image frames; for a first person region frame matched with the person identity information, a first body feature is extracted from the first person region frame, and for a second person region frame not matched with the person identity information, a second body feature is extracted.
And 1104, re-identifying the character identity information of the second body characteristic according to the first body characteristic, and determining the video subject character of the target video according to the re-identification result.
Optionally, the first body feature and the second body feature are compared; when the similarity between the second body feature and a first body feature is greater than the required similarity, the person identity information corresponding to the first body feature and the second body feature is considered to be the same, and the person identity information corresponding to the first body feature is determined to be the person identity information corresponding to the second body feature (i.e., to the second person region frame).
Step 1105, generating a recommendation message according to the video subject person, wherein the recommendation message is used for recommending the target video.
Step 1106, determining a target account whose interest portrait includes the video subject person, where the interest portrait is generated according to the video viewing record of the target account.
Optionally, the recommendation message recommends the target video to the target account, with the video subject person as the main recommendation focus.
Optionally, the interest portrait of the target account is generated from the video viewing record of the target account. Optionally, each video published on the video publishing platform corresponds to at least one video tag; the video tags of the videos watched by the target account are recorded according to the target account's viewing records, and the interest portrait of the target account is determined according to the number of times each tag is recorded. Optionally, the video tag of a video is recorded when the viewing duration of the video by the target account reaches a preset duration, or when the viewing duration reaches a preset proportion of the total duration of the video. A tag-counting sketch is given below.
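A minimal sketch of building such an interest portrait from viewing records; the record fields, the 30-second duration, the 0.5 proportion, and the top-k cutoff are assumptions for illustration only.

```python
from collections import Counter

def build_interest_portrait(viewing_records, min_seconds=30, min_ratio=0.5, top_k=5):
    """viewing_records: iterable of dicts with keys
    'tags' (list of video tags), 'watched' and 'total' (seconds watched / total duration)."""
    counts = Counter()
    for rec in viewing_records:
        long_enough = rec["watched"] >= min_seconds
        large_share = rec["total"] > 0 and rec["watched"] / rec["total"] >= min_ratio
        if long_enough or large_share:           # only count sufficiently watched videos
            counts.update(rec["tags"])
    return [tag for tag, _ in counts.most_common(top_k)]   # e.g. ["person 1", ...]
```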
Illustratively, most of the videos watched by the target account carry the video tag "person 1", so the interest portrait of the target account includes person 1; when the video subject person of the target video includes person 1, a recommendation message for the target video is sent to the target account.
Step 1107, sending the recommendation message to the target account.
Schematically, referring to fig. 12, the overall process is described by taking the above target video as an example. As shown in fig. 12, after face recognition 1220 and pedestrian detection 1230 are performed on an original short video 1210, pedestrian re-identification 1240 is performed to obtain the video subject person 1250. The original short video 1210 is further labeled with a short video classification 1260, and a short video recommendation result 1290 is obtained through a recommendation system 1280 in combination with the user interest portrait 1270 of the user.
In summary, according to the method for identifying a video subject person provided in this embodiment, after face recognition is performed on the video image frames of a video, pedestrian detection is performed on those frames, and the person region frames not matched with person identity information are re-identified by means of the person region frames matched with person identity information. This avoids the situation in which the video subject person cannot be accurately recognized in a video image frame that shows only a side view or a back view of the person, and thus improves the identification accuracy of the video subject person.
Schematically, please refer to fig. 13, which is an overall architecture diagram of the neural network models applied in the method for identifying a video subject person according to an exemplary embodiment of the present application. As shown in fig. 13, a target video 1301 is first acquired and video frames 1302 are extracted from the target video 1301. In the face detection and recognition system 1310, the MTCNN model 1311 detects the face regions 1312 and the face key points 1313; after correction, a corrected face image 1314 is obtained, and the person identity information 1316 is obtained in combination with the face library 1315. In the pedestrian detection and re-identification system 1320, the person region frames 1322 are detected by the CSP detector 1321 and matched against the corrected face images 1314 to obtain the person region frames 1323 that are matched, or not matched, with the person identity information; the person identity information is then re-identified by the HPM model 1324, and the video subject person 1330 is finally obtained. A compact end-to-end sketch of this pipeline follows.
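The following sketch wires the helper sketches above into one pass over the sampled frames and then takes the identity with the most occurrences as the video subject person. The detector and extractor arguments stand in for the MTCNN, CSP and HPM models of fig. 13 (any models with these interfaces would do); this is an illustrative outline under those assumptions, not the patented implementation.

```python
from collections import Counter

def identify_video_subject(frames, detect_faces, embed_face, face_library,
                           detect_persons, extract_body_feature):
    """Return the identity recognized most often across the n video image frames."""
    occurrences = Counter()
    for frame in frames:
        faces = recognize_faces(frame, detect_faces, embed_face, face_library)
        person_boxes = detect_persons(frame)          # CSP-style pedestrian detection
        matched, unmatched = split_person_boxes(frame, person_boxes,
                                                faces, extract_body_feature)
        # identities obtained directly via face recognition (first body features)
        occurrences.update(identity for _, identity in matched)
        # identities recovered by re-identification (second body features)
        occurrences.update(i for i in reidentify(unmatched, matched) if i is not None)
    return occurrences.most_common(1)[0][0] if occurrences else None
```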
Fig. 14 is a block diagram of a device for identifying a video subject person according to an exemplary embodiment of the present application. As shown in fig. 14, the device includes: an acquisition module 1410, a recognition module 1420, an extraction module 1430, and a determining module 1440;
the acquisition module 1410 is configured to acquire n frames of video image frames from a target video, where the n frames of video image frames are used to determine the video subject person of the target video, and n is greater than or equal to 2;
the recognition module 1420 is configured to perform face recognition on the n frames of video image frames to obtain the person identity information in the n frames of video image frames;
the extraction module 1430 is configured to perform pedestrian detection on the n frames of video image frames to obtain person body features in the n frames of video image frames, where the person body features include a first body feature matched with the person identity information and a second body feature not matched with the person identity information;
the recognition module 1420 is further configured to re-identify the person identity information of the second body feature according to the first body feature;
the determining module 1440 is configured to determine the video subject person of the target video according to the re-identification result.
In an alternative embodiment, the recognition module 1420 is further configured to compare the first body feature with the second body feature, and, when the similarity between the first body feature and the second body feature is greater than the similarity requirement, to determine the person identity information corresponding to the first body feature as the person identity information corresponding to the second body feature.
In an optional embodiment, the determining module 1440 is further configured to determine, according to the person identity information corresponding to the first body feature and the person identity information corresponding to the re-identification result, the number of occurrences with which each piece of person identity information is recognized in the n frames of video image frames;
the determining module 1440 is further configured to take the person identity information with the largest number of occurrences as the video subject person of the target video.
In an optional embodiment, the recognition module 1420 is further configured to perform face detection on the n frames of video image frames to obtain a face region in the n frames of video image frames, and to perform the face recognition on the face region to obtain the person identity information corresponding to the face region.
In an alternative embodiment, as shown in fig. 15, the extraction module 1430 further includes:
an extracting unit 1432, configured to extract a first face feature in the face region;
a matching unit 1431, configured to compare the first face feature with second face features in a face feature library to obtain the person identity information corresponding to the face region, where the person identity information is marked on the second face features in the face feature library.
In an optional embodiment, the extraction module 1430 is further configured to perform the pedestrian detection on the n frames of video image frames to obtain the person region frames in the n frames of video image frames, and to extract the body features of the person region frames to obtain the person body features of the n frames of video image frames.
In an optional embodiment, the matching unit 1431 is further configured to match the face region in the n frames of video image frames with the person region frame to obtain a first matching relationship;
the matching unit 1431 is further configured to determine a second matching relationship between the person identity information and the person region frame according to the person identity information corresponding to the face region and the first matching relationship.
In an optional embodiment, the extracting unit 1432 is configured to extract, according to the second matching relationship, the first body feature from the first person region frame matched with the person identity information, and to extract the second body feature from the second person region frame not matched with the person identity information.
In an optional embodiment, the matching unit 1431 is further configured to obtain the first matching relationship according to an overlap-degree relationship between the face region and the person region frame in the n frames of video image frames; an overlap-degree sketch is given below.
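One way to realise such an overlap-degree relationship is to measure how much of each face region is covered by a person region frame and keep the best-covering frame. The coverage measure and the 0.8 threshold here are illustrative assumptions, not values fixed by the embodiment.

```python
def overlap_degree(face_box, person_box) -> float:
    """Fraction of the face region covered by the person region frame."""
    ix1, iy1 = max(face_box[0], person_box[0]), max(face_box[1], person_box[1])
    ix2, iy2 = min(face_box[2], person_box[2]), min(face_box[3], person_box[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    face_area = (face_box[2] - face_box[0]) * (face_box[3] - face_box[1])
    return inter / face_area if face_area > 0 else 0.0

def first_matching_relationship(face_boxes, person_boxes, threshold=0.8):
    """Map each face region to the person region frame that covers it most, if any."""
    relation = {}
    for fi, fbox in enumerate(face_boxes):
        best_pi, best_deg = None, threshold
        for pi, pbox in enumerate(person_boxes):
            deg = overlap_degree(fbox, pbox)
            if deg > best_deg:
                best_pi, best_deg = pi, deg
        if best_pi is not None:
            relation[fi] = best_pi               # first matching relationship
    return relation
```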
In an optional embodiment, the acquisition module 1410 is further configured to acquire the video image frames from the target video at preset time intervals to obtain the n frames of video image frames.
In an optional embodiment, the determining module 1440 is further configured to generate a recommendation message according to the video subject person, where the recommendation message is used to recommend the target video, and to determine a target account whose interest portrait includes the video subject person, where the interest portrait is generated according to the video viewing record of the target account;
the device, still include:
a sending module 1450, configured to send the recommendation message to the target account.
In summary, the device for identifying a video subject person provided in this embodiment performs face recognition on the video image frames of a video, then performs pedestrian detection on those frames, and re-identifies the person region frames not matched with person identity information by means of the person region frames matched with person identity information. This avoids the situation in which the video subject person cannot be accurately recognized in a video image frame that shows only a side view or a back view of the person, and thus improves the identification accuracy of the video subject person.
It should be noted that: the device for identifying a person as a video subject provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the above described functions. In addition, the video subject person identification device provided in the above embodiment and the video subject person identification method embodiment belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiment and are not described herein again.
Fig. 16 shows a schematic structural diagram of a server provided in an exemplary embodiment of the present application. Specifically:
the server 1600 includes a Central Processing Unit (CPU) 1601, a system Memory 1604 including a Random Access Memory (RAM) 1602 and a Read Only Memory (ROM) 1603, and a system bus 1605 connecting the system Memory 1604 and the Central Processing Unit 1601. The server 1600 also includes a basic Input/Output System (I/O System)1606, which facilitates information transfer between various devices within the computer, and a mass storage device 1607 for storing an operating System 1613, application programs 1614, and other program modules 1615.
The basic input/output system 1606 includes a display 1608 for displaying information and an input device 1609, such as a mouse or keyboard, for user input of information. The display 1608 and the input device 1609 are both connected to the central processing unit 1601 through an input-output controller 1610, which is connected to the system bus 1605. The basic input/output system 1606 may also include the input-output controller 1610 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1610 may also provide output to a display screen, a printer, or other type of output device.
The mass storage device 1607 is connected to the central processing unit 1601 by a mass storage controller (not shown) connected to the system bus 1605. The mass storage device 1607 and its associated computer-readable media provide non-volatile storage for the server 1600. That is, the mass storage device 1607 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1604 and mass storage device 1607 described above may be collectively referred to as memory.
According to various embodiments of the application, the server 1600 may also operate with remote computers connected to a network, such as the Internet. That is, the server 1600 may be connected to the network 1612 through the network interface unit 1611 that is coupled to the system bus 1605, or the network interface unit 1611 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further stores one or more programs, and the one or more programs are configured to be executed by the CPU.
Embodiments of the present application further provide a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method for identifying a video subject person provided in the foregoing method embodiments.
Embodiments of the present application further provide a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored on the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the method for identifying a video subject person provided in the foregoing method embodiments.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, which may be a computer readable storage medium contained in a memory of the above embodiments; or it may be a separate computer-readable storage medium not incorporated in the terminal. The computer readable storage medium has stored therein at least one instruction, at least one program, a set of codes, or a set of instructions that are loaded and executed by the processor to implement the method for identifying a subject person of a video provided in an embodiment of the present application.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method for identifying a video subject person, the method comprising:
acquiring n frames of video image frames from a target video, wherein the n frames of video image frames are used for determining the video subject person of the target video, and n is more than or equal to 2;
carrying out face recognition on the n frames of video image frames to obtain person identity information in the n frames of video image frames;
carrying out pedestrian detection on the n frames of video image frames to obtain person body features in the n frames of video image frames, wherein the person body features comprise a first body feature matched with the person identity information and a second body feature not matched with the person identity information;
and re-identifying the person identity information of the second body feature according to the first body feature, and determining the video subject person of the target video in combination with a re-identification result.
2. The method of claim 1, wherein the re-identifying the person identity information of the second body feature according to the first body feature comprises:
comparing the first body feature with the second body feature;
and when the similarity between the first body feature and the second body feature is greater than a similarity requirement, determining the person identity information corresponding to the first body feature as the person identity information corresponding to the second body feature.
3. The method of claim 2, wherein the determining the video subject person of the target video in combination with the re-identification result comprises:
determining, according to the person identity information corresponding to the first body feature and the person identity information corresponding to the re-identification result, the number of occurrences with which each piece of person identity information is recognized in the n frames of video image frames;
and taking the person identity information with the largest number of occurrences as the video subject person of the target video.
4. The method according to any one of claims 1 to 3, wherein the performing face recognition on the n frames of video image frames to obtain the person identity information in the n frames of video image frames comprises:
carrying out face detection on the n frames of video image frames to obtain a face region in the n frames of video image frames;
and carrying out the face recognition on the face region to obtain the person identity information corresponding to the face region.
5. The method of claim 4, wherein the performing the face recognition on the face region to obtain the person identity information corresponding to the face region comprises:
extracting first face features in the face region;
and comparing the first face features with second face features in a face feature library to obtain the person identity information corresponding to the face region, wherein the person identity information is marked on the second face features in the face feature library.
6. The method of claim 4, wherein the performing pedestrian detection on the n frames of video image frames to obtain the person body features in the n frames of video image frames further comprises:
carrying out the pedestrian detection on the n frames of video image frames to obtain a person region frame in the n frames of video image frames;
and extracting the body feature of the person region frame to obtain the person body features of the n frames of video image frames.
7. The method of claim 6, wherein before the extracting the body feature of the person region frame, the method further comprises:
matching the face region in the n frames of video image frames with the person region frame to obtain a first matching relationship;
and determining a second matching relationship between the person identity information and the person region frame according to the person identity information corresponding to the face region and the first matching relationship.
8. The method of claim 7, wherein the extracting the body feature of the person region frame comprises:
according to the second matching relationship, extracting the first body feature from the first person region frame matched with the person identity information;
and extracting the second body feature from a second person region frame not matched with the person identity information.
9. The method of claim 7, wherein the matching the face region in the n frames of video image frames with the person region frame to obtain the first matching relationship comprises:
and obtaining the first matching relationship according to an overlap-degree relationship between the face region and the person region frame in the n frames of video image frames.
10. The method according to any one of claims 1 to 3, wherein said obtaining n video image frames from the target video comprises:
and acquiring the video image frames from the target video at preset time intervals to obtain the n frames of video image frames.
11. The method according to any one of claims 1 to 3, wherein after the determining the video subject person of the target video in combination with the re-identification result, the method further comprises:
generating a recommendation message according to the video subject person, wherein the recommendation message is used for recommending the target video;
determining a target account whose interest portrait comprises the video subject person, wherein the interest portrait is generated according to a video viewing record of the target account;
and sending the recommendation message to the target account.
12. An apparatus for identifying a person as a subject of a video, the apparatus comprising:
the acquisition module is used for acquiring n frames of video image frames from a target video, wherein the n frames of video image frames are used for determining the video subject person of the target video, and n is more than or equal to 2;
the recognition module is used for carrying out face recognition on the n frames of video image frames to obtain person identity information in the n frames of video image frames;
the extraction module is used for carrying out pedestrian detection on the n frames of video image frames to obtain person body features in the n frames of video image frames, wherein the person body features comprise a first body feature matched with the person identity information and a second body feature not matched with the person identity information;
the recognition module is further used for re-identifying the person identity information of the second body feature according to the first body feature;
and the determining module is used for determining the video subject person of the target video in combination with the re-identification result.
13. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method for identifying a video subject person according to any one of claims 1 to 11.
14. A computer-readable storage medium, having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method for identifying a video subject person according to any one of claims 1 to 11.
CN201911122223.4A 2019-11-15 2019-11-15 Method, device and equipment for identifying video main body characters and readable storage medium Active CN110909651B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911122223.4A CN110909651B (en) 2019-11-15 2019-11-15 Method, device and equipment for identifying video main body characters and readable storage medium
CN202010194945.7A CN111310731B (en) 2019-11-15 2019-11-15 Video recommendation method, device, equipment and storage medium based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911122223.4A CN110909651B (en) 2019-11-15 2019-11-15 Method, device and equipment for identifying video main body characters and readable storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202010194945.7A Division CN111310731B (en) 2019-11-15 2019-11-15 Video recommendation method, device, equipment and storage medium based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN110909651A true CN110909651A (en) 2020-03-24
CN110909651B CN110909651B (en) 2023-12-26

Family

ID=69817791

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010194945.7A Active CN111310731B (en) 2019-11-15 2019-11-15 Video recommendation method, device, equipment and storage medium based on artificial intelligence
CN201911122223.4A Active CN110909651B (en) 2019-11-15 2019-11-15 Method, device and equipment for identifying video main body characters and readable storage medium

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010194945.7A Active CN111310731B (en) 2019-11-15 2019-11-15 Video recommendation method, device, equipment and storage medium based on artificial intelligence

Country Status (1)

Country Link
CN (2) CN111310731B (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11714882B2 (en) * 2020-10-09 2023-08-01 Dragonfruit Ai, Inc. Management of attributes associated with objects in video data
CN112200084A (en) * 2020-10-10 2021-01-08 华航高科(北京)技术有限公司 Face recognition method and device for video stream, electronic equipment and storage medium
CN112632378B (en) * 2020-12-21 2021-08-24 广东省信息网络有限公司 Information processing method based on big data and artificial intelligence and data server
CN112948630B (en) * 2021-02-09 2024-02-06 北京奇艺世纪科技有限公司 List updating method, electronic equipment, storage medium and device
CN113191204B (en) * 2021-04-07 2022-06-17 华中科技大学 Multi-scale blocking pedestrian detection method and system


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040082877A (en) * 2003-03-20 2004-09-30 아프로파이낸셜그룹 주식회사 Online lending service using face image and method for employing thereof
JP2008083877A (en) * 2006-09-27 2008-04-10 Hitachi Ltd Information processing apparatus and information processing method
JP5950296B2 (en) * 2012-01-27 2016-07-13 国立研究開発法人産業技術総合研究所 Person tracking attribute estimation device, person tracking attribute estimation method, program
CN106356757B (en) * 2016-08-11 2018-03-20 河海大学常州校区 A kind of power circuit unmanned plane method for inspecting based on human-eye visual characteristic
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
CN107886074B (en) * 2017-11-13 2020-05-19 苏州科达科技股份有限公司 Face detection method and face detection system
CN109670429B (en) * 2018-12-10 2021-03-19 广东技术师范大学 Method and system for detecting multiple targets of human faces of surveillance videos based on instance segmentation
CN109871464B (en) * 2019-01-17 2020-12-25 东南大学 Video recommendation method and device based on UCL semantic indexing
CN110287829A (en) * 2019-06-12 2019-09-27 河海大学 A kind of video face identification method of combination depth Q study and attention model

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060228005A1 (en) * 2005-04-08 2006-10-12 Canon Kabushiki Kaisha Information processing apparatus and information processing method
CN103442252A (en) * 2013-08-21 2013-12-11 宇龙计算机通信科技(深圳)有限公司 Method and device for processing video
CN104754413A (en) * 2013-12-30 2015-07-01 北京三星通信技术研究有限公司 Image search based television signal identification and information recommendation method and device
CN103942577A (en) * 2014-04-29 2014-07-23 上海复控华龙微***技术有限公司 Identity identification method based on self-established sample library and composite characters in video monitoring
CN105913275A (en) * 2016-03-25 2016-08-31 哈尔滨工业大学深圳研究生院 Clothes advertisement putting method and system based on video leading role identification
CN105847985A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Video recommendation method and device
CN106407418A (en) * 2016-09-23 2017-02-15 Tcl集团股份有限公司 A face identification-based personalized video recommendation method and recommendation system
US20180341803A1 (en) * 2017-05-23 2018-11-29 Canon Kabushiki Kaisha Information processing apparatus, information processing method, and storage medium
CN108446385A (en) * 2018-03-21 2018-08-24 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN108446390A (en) * 2018-03-22 2018-08-24 百度在线网络技术(北京)有限公司 Method and apparatus for pushed information
CN108471551A (en) * 2018-03-23 2018-08-31 上海哔哩哔哩科技有限公司 Video main information display methods, device, system and medium based on main body identification
CN108509611A (en) * 2018-03-30 2018-09-07 百度在线网络技术(北京)有限公司 Method and apparatus for pushed information
CN108647293A (en) * 2018-05-07 2018-10-12 广州虎牙信息科技有限公司 Video recommendation method, device, storage medium and server
CN109829418A (en) * 2019-01-28 2019-05-31 北京影谱科技股份有限公司 A kind of punch card method based on figure viewed from behind feature, device and system
CN109919977A (en) * 2019-02-26 2019-06-21 鹍骐科技(北京)股份有限公司 A kind of video motion personage tracking and personal identification method based on temporal characteristics
CN110119711A (en) * 2019-05-14 2019-08-13 北京奇艺世纪科技有限公司 A kind of method, apparatus and electronic equipment obtaining video data personage segment
CN110266953A (en) * 2019-06-28 2019-09-20 Oppo广东移动通信有限公司 Image processing method, device, server and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
奥马尔法鲁克: "Research on celebrity-based video face verification and clustering algorithms" (基于明星的视频人脸验证和聚类算法研究), China Master's Theses Full-text Database, Information Science and Technology *
奥马尔法鲁克: "Research on celebrity-based video face verification and clustering algorithms" (基于明星的视频人脸验证和聚类算法研究), China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2019 (2019-01-15), pages 138 - 2973 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449560A (en) * 2020-03-26 2021-09-28 广州金越软件技术有限公司 Technology for comparing human faces based on dynamic portrait library
CN111639616A (en) * 2020-06-05 2020-09-08 上海一由科技有限公司 Heavy identity recognition method based on deep learning
CN111641870A (en) * 2020-06-05 2020-09-08 北京爱奇艺科技有限公司 Video playing method and device, electronic equipment and computer storage medium
CN111680622A (en) * 2020-06-05 2020-09-18 上海一由科技有限公司 Identity recognition method based on fostering environment
CN111639616B (en) * 2020-06-05 2023-05-23 上海一由科技有限公司 Heavy identity recognition method based on deep learning
CN111680622B (en) * 2020-06-05 2023-08-01 上海一由科技有限公司 Identity recognition method based on supporting environment
CN111738120B (en) * 2020-06-12 2023-12-05 北京奇艺世纪科技有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
CN111738120A (en) * 2020-06-12 2020-10-02 北京奇艺世纪科技有限公司 Person identification method, person identification device, electronic equipment and storage medium
CN111860407A (en) * 2020-07-29 2020-10-30 华侨大学 Method, device, equipment and storage medium for recognizing expressions of characters in video
CN111860407B (en) * 2020-07-29 2023-04-25 华侨大学 Method, device, equipment and storage medium for identifying expression of character in video
CN112215717A (en) * 2020-10-13 2021-01-12 江西省农业科学院农业经济与信息研究所 Agricultural information management system based on electronic map and aerial photography information
CN112257628A (en) * 2020-10-29 2021-01-22 厦门理工学院 Method, device and equipment for identifying identities of outdoor competition athletes
CN112492383A (en) * 2020-12-03 2021-03-12 珠海格力电器股份有限公司 Video frame generation method and device, storage medium and electronic equipment
CN114648712B (en) * 2020-12-18 2023-07-28 抖音视界有限公司 Video classification method, device, electronic equipment and computer readable storage medium
CN114648712A (en) * 2020-12-18 2022-06-21 北京字节跳动网络技术有限公司 Video classification method and device, electronic equipment and computer-readable storage medium
CN112801008A (en) * 2021-02-05 2021-05-14 电子科技大学中山学院 Pedestrian re-identification method and device, electronic equipment and readable storage medium
CN112801008B (en) * 2021-02-05 2024-05-31 电子科技大学中山学院 Pedestrian re-recognition method and device, electronic equipment and readable storage medium
CN113343850B (en) * 2021-06-07 2022-08-16 广州市奥威亚电子科技有限公司 Method, device, equipment and storage medium for checking video character information
CN113343850A (en) * 2021-06-07 2021-09-03 广州市奥威亚电子科技有限公司 Method, device, equipment and storage medium for checking video character information

Also Published As

Publication number Publication date
CN110909651B (en) 2023-12-26
CN111310731B (en) 2024-04-09
CN111310731A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111310731B (en) Video recommendation method, device, equipment and storage medium based on artificial intelligence
CN111709409B (en) Face living body detection method, device, equipment and medium
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
US9852327B2 (en) Head-pose invariant recognition of facial attributes
CN109960742B (en) Local information searching method and device
CN109657533A (en) Pedestrian recognition methods and Related product again
CN111680678B (en) Target area identification method, device, equipment and readable storage medium
JP6112801B2 (en) Image recognition apparatus and image recognition method
CN110598638A (en) Model training method, face gender prediction method, device and storage medium
CN113705290A (en) Image processing method, image processing device, computer equipment and storage medium
CN112116684A (en) Image processing method, device, equipment and computer readable storage medium
Li et al. Gait recognition via GEI subspace projections and collaborative representation classification
US20190311216A1 (en) Image processing device, image processing method, and image processing program
CN115050064A (en) Face living body detection method, device, equipment and medium
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN112836682B (en) Method, device, computer equipment and storage medium for identifying object in video
JP2013218605A (en) Image recognition device, image recognition method, and program
CN115019396A (en) Learning state monitoring method, device, equipment and medium
CN113361422A (en) Face recognition method based on angle space loss bearing
CN114519729A (en) Image registration quality evaluation model training method and device and computer equipment
CN114639132A (en) Feature extraction model processing method, device and equipment in face recognition scene

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021137

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant