CN112101197A - Face information acquisition method and device - Google Patents

Face information acquisition method and device

Info

Publication number
CN112101197A
CN112101197A (application CN202010963606.0A)
Authority
CN
China
Prior art keywords
target
face
category
scene
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010963606.0A
Other languages
Chinese (zh)
Inventor
王森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202010963606.0A priority Critical patent/CN112101197A/en
Publication of CN112101197A publication Critical patent/CN112101197A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a device for acquiring face information. The method comprises the following steps: extracting facial features from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set; clustering the faces included in the target video according to the video frame set and the target facial features corresponding to each video frame to obtain target face categories and the category facial feature corresponding to each target face category; identifying, in a face information base, target face information corresponding to the category facial features, wherein the face information base records face features and face information having a correspondence relationship; and outputting the target face information corresponding to the target face categories. The method and the device solve the technical problem in the related art that the efficiency of identifying face information from a video is low.

Description

Face information acquisition method and device
Technical Field
The present application relates to the field of computers, and in particular, to a method and an apparatus for acquiring facial information.
Background
Identifying the characters who appear in a video and marking the positions at which they appear is a key technology of intelligent video analysis and is of great importance in multimedia applications (movies, television series, variety shows, and the like). The main identification approach at present is based on artificial intelligence: frames are extracted from the video to obtain picture information for each second of the video, facial features are then extracted with a deep neural network, and finally retrieval and identification are performed against a face base library to determine whose face it is. The disadvantage of this approach is that every face appearing in every second of picture must be retrieved from the face base library, which adds network transmission delay and database retrieval delay and greatly reduces retrieval speed. In addition, when the same person appears at different time points in different periods of the video, the angle, expression, and other attributes of the same face may vary, so the extracted facial features differ and the same person may be identified with different results, which affects the accuracy of retrieval.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The application provides a method and a device for acquiring face information, which aim to at least solve the technical problem of low efficiency of identifying face information from a video in the related art.
According to an aspect of an embodiment of the present application, there is provided a face information acquisition method including:
extracting facial features from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set;
clustering the faces included in the target video according to the video frame set and the target face features corresponding to each video frame to obtain a target face category and category face features corresponding to the target face category;
identifying target face information corresponding to the category face features in a face information base, wherein the face information base records face features and face information with corresponding relations;
and outputting target face information corresponding to the target face category.
Optionally, clustering the faces included in the target video according to the video frame set and the target face feature corresponding to each video frame, and obtaining the target face category and the category face feature corresponding to the target face category includes:
dividing the set of video frames into a plurality of scenes, wherein each scene in the plurality of scenes comprises one video frame or a plurality of continuous video frames in the set of video frames;
clustering target face features included in each scene to obtain a scene face category corresponding to each scene and a scene face feature corresponding to each scene face category;
and clustering the scene face features included in the target video to obtain the target face categories and category face features corresponding to each target face category.
Optionally, dividing the video frame set into a plurality of scenes includes:
determining a first similarity between two consecutive video frames in the set of video frames;
and dividing the two video frames with the first similarity higher than or equal to a first similarity threshold into the same scene, and dividing the two video frames with the first similarity lower than the first similarity threshold into different scenes to obtain the multiple scenes.
Optionally, clustering the target face features included in each scene to obtain a scene face category corresponding to each scene and a scene face feature corresponding to each scene face category includes:
classifying the target facial features included in each scene according to a second similarity between the target facial features included in each scene to obtain scene facial categories in each scene, wherein the second similarity between the target facial features included in each scene facial category is higher than a second similarity threshold;
and fusing the target face features included in each scene face category to obtain the scene face features corresponding to each scene face category.
Optionally, clustering the scene facial features included in the target video to obtain the target face categories and category facial features corresponding to each target face category includes:
classifying the scene facial features included in the target video according to a third similarity between the scene facial features included in each scene facial category to obtain the target facial categories, wherein the third similarity between the scene facial features included in each target facial category is higher than a third similarity threshold;
and fusing the scene facial features included in each target facial category to obtain the category facial features corresponding to each target facial category.
Optionally, outputting the target face information corresponding to each of the target face categories includes:
determining time information of a video frame corresponding to the category facial features included in each target face category in the target video;
outputting the target face category, the target face information, and the time information having the correspondence relationship.
Optionally, extracting facial features from a video frame set of a target video, and obtaining the target facial feature corresponding to each video frame in the video frame set includes:
extracting video frames from the target video according to a target time interval to obtain a video frame set;
performing face recognition on each video frame in the video frame set to obtain a face included in each video frame;
and extracting facial features of the face included in each video frame to obtain a target facial feature corresponding to each video frame in the video frame set.
According to another aspect of the embodiments of the present application, there is also provided an apparatus for acquiring face information, including:
the extraction module is used for extracting facial features from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set;
a clustering module, configured to cluster faces included in the target video according to the video frame set and target face features corresponding to each video frame to obtain a target face category and category face features corresponding to the target face category;
the identification module is used for identifying target face information corresponding to the category face features in a face information base, wherein the face information base records the face features and the face information with corresponding relations;
and the output module is used for outputting the target face information corresponding to the target face category.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program which, when executed, performs the above-described method.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above method through the computer program.
In the embodiment of the application, facial features are extracted from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set; the faces included in the target video are clustered according to the video frame set and the target facial features corresponding to each video frame to obtain target face categories and the category facial feature corresponding to each target face category; target face information corresponding to the category facial features is identified in a face information base, wherein the face information base records face features and face information having a correspondence relationship; and the target face information corresponding to the target face categories is output. In this way, target facial features are extracted for each video frame in the video frame set of the target video, the faces appearing in the target video are clustered using the extracted target facial features to obtain the target face categories and the category facial feature corresponding to each target face category, and the corresponding target face information is identified from the face information base according to the category facial features and output. By clustering the faces appearing in the target video, the target face category of every face appearing in the target video is identified and a unified category facial feature corresponding to each target face category is extracted; the face information base is then searched using these category facial features. Compared with the approach in which the face information base must be searched for every frame, this reduces the number of searches and increases the retrieval speed; because each category facial feature characterizes the face category as a whole, searching the information base with it also reduces misjudgment and improves retrieval accuracy. This achieves the technical effect of improving the efficiency of identifying face information from a video, and thereby solves the technical problem in the related art that the efficiency of identifying face information from a video is low.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below; it is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a hardware environment of an acquisition method of face information according to an embodiment of the present application;
fig. 2 is a flowchart of an alternative face information acquisition method according to an embodiment of the present application;
fig. 3 is a schematic diagram of an alternative face information acquisition apparatus according to an embodiment of the present application;
fig. 4 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of embodiments of the present application, there is provided an embodiment of a method for obtaining face information.
Alternatively, in the present embodiment, the above face information acquisition method may be applied to a hardware environment constituted by the terminal 101 and the server 103 as shown in fig. 1. As shown in fig. 1, the server 103 is connected to the terminal 101 through a network and may be used to provide services (such as game services, application services, and the like) for the terminal or for a client installed on the terminal; a database may be provided on the server, or separately from the server, to provide data storage services for the server 103. The network includes, but is not limited to, a wired network or a wireless network, and the terminal 101 is not limited to a PC, a mobile phone, a tablet computer, and the like. The face information acquisition method according to the embodiment of the present application may be executed by the server 103, by the terminal 101, or by both the server 103 and the terminal 101. When the terminal 101 executes the face information acquisition method according to the embodiment of the present application, it may do so through a client installed on it.
Fig. 2 is a flowchart of an alternative face information obtaining method according to an embodiment of the present application, and as shown in fig. 2, the method may include the following steps:
step S202, extracting facial features from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set;
step S204, clustering the faces included in the target video according to the video frame set and the target face features corresponding to each video frame to obtain a target face category and category face features corresponding to the target face category;
step S206, identifying target face information corresponding to the category face features in a face information base, wherein the face information base records face features and face information with corresponding relations;
step S208, outputting the target face information corresponding to the target face category.
Through the above steps S202 to S208, target facial features are extracted for each video frame in the video frame set of the target video, the faces appearing in the target video are clustered using the extracted target facial features to obtain target face categories and the category facial feature corresponding to each target face category, and the corresponding target face information is identified from the face information base according to the category facial features and output. By clustering the faces appearing in the target video, the target face category of every face appearing in the target video is identified and a unified category facial feature corresponding to each target face category is extracted; the face information base is then searched using these category facial features. Compared with the approach in which the face information base must be searched for every frame, this reduces the number of searches and increases the retrieval speed; because each category facial feature characterizes the face category as a whole, searching the information base with it also reduces misjudgment and improves retrieval accuracy. This achieves the technical effect of improving the efficiency of identifying face information from a video, and thereby solves the technical problem in the related art that the efficiency of identifying face information from a video is low.
Optionally, in this embodiment, the above-mentioned method for acquiring face information may be applied, but not limited, to a scene in which a character appearing in a video is annotated with information. Roles that appear in the video may include, but are not limited to: personas, avatars, animal personas, and the like. Annotated information may include, but is not limited to: character name, character camp, character age, character gender, and the like.
In the technical solution provided in step S202, the target video may include, but is not limited to, a multimedia resource of any format or type, such as video files of various kinds (television series, movies, variety shows, cartoons, and the like), live video streams, relayed video streams, and the like.
Optionally, in this embodiment, the video frame set includes video frames extracted from the target video.
Optionally, in the present embodiment, the manner of extracting the facial features may include, but is not limited to, an extraction algorithm of the facial features, a trained facial feature extraction model, and the like.
In the technical solution provided in step S204, the faces included in the target video may be clustered, but not limited to, by means of a clustering algorithm, a clustering model, and the like. Such as: K-Means clustering, mean shift clustering, and the like. The purpose of this clustering process is to distinguish which face or faces each appear in the target video.
Optionally, in this embodiment, after the faces are clustered, a category facial feature corresponding to each target face category is obtained. The category facial feature is used to represent the overall features of the target face category; it may be one particular target facial feature among those falling into the target face category, or it may be calculated from several or all of the target facial features falling into the target face category.
In the technical solution provided in step S206, the face information base may be, but is not limited to, pre-established, and may be updated, modified, and the like during use as needed. The face information base records face features and face information having a correspondence relationship, and the face information may include, but is not limited to, the name, alias, gender, date of birth, and the like of the person represented by the face features.
In the technical solution provided in step S208, the target face information corresponding to the target face categories may be output in the form of labels on the target video, a file, or a list on an application display interface, where the list may be provided to a face information detection function; this function checks the accuracy of the target face information, corrects any errors, and displays the corrected, accurate face information on the target video.
Optionally, in this embodiment, the detection function may be, but is not limited to, performed manually, with the target face information corresponding to the target face categories output to a manual proofreading module. To prevent cases in which a person appearing in the video does not exist in the face information base, or in which a face is recognized incorrectly, operators manually proofread the obtained results in the manual proofreading module. If a face cannot be identified or is identified incorrectly, the target face information corresponding to the target face category is modified manually, and the face information base is updated with the correct information.
As an optional embodiment, clustering the faces included in the target video according to the video frame set and the target facial features corresponding to each video frame, and obtaining the target facial category and the category facial features corresponding to the target facial category includes:
s11, dividing the video frame set into a plurality of scenes, wherein each scene in the plurality of scenes comprises one video frame or a plurality of continuous video frames in the video frame set;
s12, clustering the target face features included in each scene to obtain a scene face category corresponding to each scene and a scene face feature corresponding to each scene face category;
s13, clustering the scene face features included in the target video to obtain the target face categories and category face features corresponding to each target face category.
Optionally, in this embodiment, the clustering of faces may be, but is not limited to being, performed scene by scene: the target video is first divided into scenes, face clustering is then performed within each scene, and the within-scene clustering results are then used to perform face clustering over the whole video, thereby obtaining the target face categories included in the whole target video.
Optionally, in this embodiment, each scene includes one video frame or a plurality of consecutive video frames. That is, a single video frame may constitute a scene by itself: a scene change occurs at that frame, so the preceding video frame belongs to the previous scene, and a scene change also occurs at the video frame that follows it, which therefore belongs to the next scene.
Optionally, in this embodiment, both the clustering of faces within a scene and the clustering of faces across the whole video may be performed by classifying the faces through clustering of their facial features.
As an alternative embodiment, dividing the set of video frames into a plurality of scenes comprises:
s21, determining a first similarity between two consecutive video frames in the video frame set;
s22, dividing the two video frames with the first similarity higher than or equal to the first similarity threshold into the same scene, and dividing the two video frames with the first similarity lower than the first similarity threshold into different scenes to obtain the multiple scenes.
Optionally, in this embodiment, the target video may be divided into scenes by, but not limited to, comparing the similarity between consecutive video frames.
Alternatively, in the present embodiment, a Color Histogram method may be used, but is not limited to being used, to measure the similarity between video frames. Within a video clip, if two video frames belong to the same scene segment, their colors are generally similar, whereas frames from different scenes generally show a larger color difference.
Optionally, in this embodiment, if the first similarity between two consecutive video frames reaches the requirement (is higher than or equal to the first similarity threshold), the two consecutive video frames are divided into the same scene, and if the first similarity between two consecutive video frames does not reach the requirement (is lower than the first similarity threshold), the two consecutive video frames are divided into different scenes, that is, one video frame is divided into the previous scene and the next video frame is divided into the next scene.
In an optional implementation, frames are extracted from the target video: one video frame is captured every second, denoted imgi, and its timestamp is denoted tsi. A deep neural network model is used for facial feature extraction, and the features of the faces in imgi are denoted featurei. The color histogram method is used to compare the similarity between imgi and the video frame imgi+1 extracted one second later; if the first similarity between imgi+1 and imgi is greater than the first similarity threshold S (for example, set to 0.6), imgi+1 and imgi are divided into the same scene, otherwise they are divided into different scenes. Through the above processing, the target video is divided into different scenes: T = {T1, T2, ..., Ti, ..., Tn}, where each Ti denotes a scene, and Ti = {(img1, feature1, ts1), (img2, feature2, ts2), ..., (imgj, featurej, tsj)}, where img1, img2, ..., imgj are video frames belonging to the same scene and the similarity between consecutive frames among them is greater than the threshold S.
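The scene-division step above can be illustrated with a short sketch. The snippet below is a minimal, hedged example rather than the patented implementation: it assumes frames have already been sampled once per second, computes an HSV color histogram for each frame with OpenCV, and splits scenes where the histogram correlation of consecutive frames falls below the threshold S = 0.6; the helper names and histogram parameters are assumptions made for illustration only.

```python
import cv2

def color_histogram(frame_bgr):
    """Return a normalized HSV color histogram for one frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
    cv2.normalize(hist, hist)
    return hist

def split_into_scenes(frames, similarity_threshold=0.6):
    """Group sampled frames into scenes by color-histogram similarity.

    `frames` is a list of (frame_bgr, timestamp) tuples sampled once per second.
    Returns a list of scenes; each scene is a list of (frame index, timestamp).
    """
    scenes, current, prev_hist = [], [], None
    for i, (frame, ts) in enumerate(frames):
        hist = color_histogram(frame)
        if prev_hist is not None:
            # Histogram correlation in [-1, 1]; higher means more similar frames.
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < similarity_threshold:
                scenes.append(current)  # scene change: close the previous scene
                current = []
        current.append((i, ts))
        prev_hist = hist
    if current:
        scenes.append(current)
    return scenes
```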
As an optional embodiment, clustering the target facial features included in each scene to obtain a scene facial class corresponding to each scene and a scene facial feature corresponding to each scene facial class includes:
s31, classifying the target facial features included in each scene according to a second similarity between the target facial features included in each scene to obtain a scene facial category in each scene, where the second similarity between the target facial features included in each scene facial category is higher than a second similarity threshold;
and S32, fusing the target face features included in each scene face category to obtain the scene face features corresponding to each scene face category.
Alternatively, in the present embodiment, the target facial features included in each scene are classified by comparison of the second similarity between the target facial features included in each scene, so that the face appearing in each scene is classified into different scene face categories.
Optionally, in this embodiment, the scene facial features corresponding to each scene facial class are obtained by fusing target facial features included in the scene facial class, and the fusing manner may be, but is not limited to, calculating an average value of the target facial features as the scene facial features corresponding to the scene facial class.
Alternatively, in the present embodiment, the second similarity between the target facial features included in each scene may be calculated by, but is not limited to, cosine similarity.
In the above alternative embodiment, the faces appearing in each scene may be classified, but are not limited to being classified, by the following process (a code sketch is given after the steps):
Step one: For the data in the above scene Ti, consider each video frame (imgj, featurej, tsj). If imgj is the first video frame in the scene, all faces included in the video frame imgj are used as the initial scene face categories in the scene, and featurej provides the initial scene facial features corresponding to those initial scene face categories.
Step two: Each subsequent video frame (imgj+1, featurej+1, tsj+1) in the scene Ti is compared with all of the scene face categories currently in the scene Ti: the cosine similarity between featurej+1 and each scene facial feature in the scene is calculated, and the category with the largest cosine similarity is selected as the candidate category.
Step three: If the cosine similarity between imgj+1 and the candidate category is greater than the second similarity threshold (for example, set to 0.8), imgj+1 is classified into the candidate category, and the mean of the facial features corresponding to all video frames in the candidate category is recalculated as the new scene facial feature of the candidate category; otherwise, a new scene face category is created for imgj+1, and featurej+1 is the scene facial feature of the new scene face category.
Step four: The above steps are repeated until all video frames of the scene Ti have been traversed, yielding all scene face categories in the scene and the scene facial feature corresponding to each scene face category.
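A compact sketch of the within-scene clustering loop described in steps one to four is given below. It is an assumed illustration, not the exact patented code: faces are represented as feature vectors, cosine similarity is computed with NumPy, the second similarity threshold of 0.8 from the example is used, and the data layout (a list of per-frame feature lists) is an assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_faces_in_scene(frame_features, threshold=0.8):
    """Greedy within-scene face clustering.

    `frame_features` is a list over frames; each element is a list of face
    feature vectors detected in that frame. Returns a list of clusters, each
    holding a running mean feature (the scene facial feature) and its members.
    """
    clusters = []  # each cluster: {"feature": mean vector, "members": [vectors]}
    for faces in frame_features:
        for feat in faces:
            feat = np.asarray(feat, dtype=np.float32)
            if not clusters:
                clusters.append({"feature": feat, "members": [feat]})
                continue
            # Compare the face against every current scene face category.
            sims = [cosine_similarity(feat, c["feature"]) for c in clusters]
            best = int(np.argmax(sims))
            if sims[best] > threshold:
                clusters[best]["members"].append(feat)
                # Recompute the scene facial feature as the mean of all members.
                clusters[best]["feature"] = np.mean(clusters[best]["members"], axis=0)
            else:
                clusters.append({"feature": feat, "members": [feat]})
    return clusters
```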
As an optional embodiment, clustering the scene facial features included in the target video to obtain the target face categories and category facial features corresponding to each target face category includes:
s41, classifying the scene facial features included in the target video according to a third similarity between the scene facial features included in each of the scene facial categories, to obtain the target facial categories, where the third similarity between the scene facial features included in each of the target facial categories is higher than a third similarity threshold;
and S42, fusing the scene facial features included in each target facial category to obtain a category facial feature corresponding to each target facial category.
Optionally, in this embodiment, the scene facial features included in the target video are classified by comparing third similarities between the scene facial features, so that the faces appearing in the respective scenes are classified into different target face classes.
Optionally, in this embodiment, the category facial feature corresponding to each target face category is obtained by fusing the scene facial features included in that target face category; the fusion may be, but is not limited to, calculating the mean of the scene facial features as the category facial feature corresponding to the target face category.
Alternatively, in the present embodiment, the third similarity between the scene facial features may be, but is not limited to, calculated by cosine similarity.
In the above alternative embodiment, the faces appearing in the target video may be classified, but are not limited to being classified, by the following process (a code sketch is given after the steps):
Step one: For the first scene T1 in the target video, the scene face categories and corresponding scene facial features obtained through the within-scene face clustering process serve as the initial target face categories and category facial features.
Step two: The scene face categories in the second scene T2 are compared in turn with the current target face categories; the process is similar to the within-scene face clustering process, specifically as follows:
The cosine similarity between each scene facial feature in the T2 scene and each current category facial feature is calculated in turn. If the similarity is greater than the third similarity threshold (for example, set to 0.6; because people in different scenes are, with high probability, not the same person, the third similarity threshold may be slightly lower than the second similarity threshold), the scene facial feature in the T2 scene is classified into that target face category, and the mean of all scene facial features in the target face category is recalculated as the new category facial feature of the target face category; otherwise, the scene face category and scene facial feature from the T2 scene are used as a new target face category and a new category facial feature.
Step three: The above steps are repeated until all scenes have been traversed, yielding all target face categories and the category facial feature corresponding to each target face category.
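A sketch of the cross-scene merging stage follows, again as an assumed illustration rather than the exact implementation: the per-scene categories produced by the previous step are merged across scenes with the lower threshold of 0.6 mentioned above, reusing the cosine_similarity helper from the previous sketch.

```python
import numpy as np
# cosine_similarity() is the helper defined in the previous sketch.

def cluster_faces_across_scenes(scene_clusters, threshold=0.6):
    """Merge per-scene face categories into video-level target face categories.

    `scene_clusters` is a list over scenes; each element is the list of clusters
    returned by cluster_faces_in_scene(). Returns video-level categories with a
    mean "category facial feature" and the scene features merged into them.
    """
    categories = []  # each: {"feature": mean vector, "members": [scene features]}
    for clusters in scene_clusters:
        for cluster in clusters:
            feat = np.asarray(cluster["feature"], dtype=np.float32)
            if not categories:
                categories.append({"feature": feat, "members": [feat]})
                continue
            sims = [cosine_similarity(feat, c["feature"]) for c in categories]
            best = int(np.argmax(sims))
            if sims[best] > threshold:
                categories[best]["members"].append(feat)
                categories[best]["feature"] = np.mean(categories[best]["members"], axis=0)
            else:
                categories.append({"feature": feat, "members": [feat]})
    return categories
```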
As an alternative embodiment, outputting the target face information corresponding to the target face category includes:
s51, determining the time information of the video frame corresponding to the category facial features included in each target face category in the target video;
s52, the target face category, the target face information, and the time information having the correspondence relationship are output.
Optionally, in this embodiment, the output content may further include, but is not limited to, the time information at which each target face category appears in the target video, which may be, but is not limited to being, determined from the positions of the corresponding video frames in the target video. For example, the output may be, but is not limited to: the faces appearing in the first episode of drama A include category 1 (a certain actor, appearing at 1s, 3s, 15s, 24s), category 2 (a certain actor/singer, appearing at 1s, 5s, 15s, 37s), category 3 (a certain actor, appearing at 2s, 3s, 6s, 24s), and so on.
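One possible output layout for these results is sketched below as a plain dictionary; the structure and field names are illustrative assumptions, not a format defined by the application.

```python
# Assumed output layout; the field names and values are illustrative only.
recognition_result = {
    "video": "drama_A_episode_1",
    "categories": [
        {"category_id": 1, "face_info": "a certain actor", "timestamps_s": [1, 3, 15, 24]},
        {"category_id": 2, "face_info": "a certain actor/singer", "timestamps_s": [1, 5, 15, 37]},
        {"category_id": 3, "face_info": "a certain actor", "timestamps_s": [2, 3, 6, 24]},
    ],
}
```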
As an alternative embodiment, extracting facial features from a video frame set of a target video, and obtaining the target facial feature corresponding to each video frame in the video frame set includes:
s61, extracting video frames from the target video according to the target time interval to obtain the video frame set;
s62, carrying out face recognition on each video frame in the video frame set to obtain a face included in each video frame;
s63, extracting the face feature included in each video frame to obtain the target face feature corresponding to each video frame in the video frame set.
Optionally, in this embodiment, the frequency of video frame extraction (i.e. the target time interval) for the target video may be set according to actual requirements, for example: may be set to 0.1s, 0.5s, 0.73s, 1s, 2s, 5s, etc.
Optionally, in this embodiment, the manner of extracting the video frames may include, but is not limited to, using FFmpeg (Fast Forward MPEG).
Optionally, in the present embodiment, the face recognition may be implemented, but not limited to, using a deep neural network model. Facial feature extraction can be automatically performed through the deep neural network model, so that target facial features in the video frame are obtained.
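As a rough sketch of this frame-extraction step, the snippet below samples one frame per target time interval with FFmpeg via a subprocess call; the output paths and interval handling are assumptions, and the facial-feature extractor is shown only as a placeholder because the application does not name a specific deep neural network model.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path, out_dir, interval_s=1.0):
    """Sample one frame every `interval_s` seconds from the target video using FFmpeg."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", str(video_path),
            "-vf", f"fps={1.0 / interval_s:g}",  # e.g. fps=1 for one frame per second
            str(out / "frame_%05d.jpg"),
        ],
        check=True,
    )

def extract_face_features(frame_path):
    """Placeholder for a deep-neural-network face detector and feature extractor.

    The application only states that a deep neural network model is used; the
    concrete model is an implementation choice and is not specified here.
    """
    raise NotImplementedError("plug in a face detection / face embedding model here")
```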
The present application further provides an alternative embodiment, which provides a system for identifying persons in a video. The system includes: a scene-based video frame extraction module, an in-scene face clustering module, a cross-scene face clustering module, a person recognition and retrieval module, and the like. Through the processing of this system, the face categories in the video are first gathered by clustering the persons appearing in the video, and the faces are then identified, which reduces network transmission during recognition and improves recognition efficiency.
The system identifies the faces in the video through the following process. The scene-based video frame extraction module performs frame extraction on the video, calculates the similarity between video frames according to their Color Histograms, and divides the video frames extracted from the video into different scenes. The in-scene face clustering module performs local face clustering on the video frames of each scene output by the scene-based video frame extraction module; because the persons appearing in the same scene are, with high probability, relatively fixed and generally appear at different angles and with different expressions, local clustering can be performed first, which reduces the amount of data transmitted and retrieved, and this use of prior knowledge helps improve the accuracy of video person recognition. The cross-scene face clustering module aggregates the person classifications in the different scenes output by the in-scene face clustering module, generating all person categories and their features in the video together with the time at which each video frame appears in the video. The person recognition and retrieval module searches the person information base according to all person categories and features of the video output by the cross-scene face clustering module: the cosine similarity between the features of each person category in the video and all person features in the person information base is calculated, and the person with the highest cosine similarity is selected as the label of that person category. Finally, the labels of the persons in the video and the position information of their appearances in the video are output.
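The final retrieval step can be sketched as follows, as an assumed illustration of the matching described above: each video-level category feature is compared by cosine similarity against every feature in the person information base, and the best match is taken as the label of that category; the record layout of the information base is an assumption.

```python
import numpy as np

def identify_categories(categories, info_base):
    """Match each category facial feature against a person information base.

    `categories` is the output of cluster_faces_across_scenes(); `info_base` is
    an assumed list of {"name": str, "feature": vector} records. Returns a list
    of (category index, best-matching name, similarity).
    """
    base_feats = np.stack([np.asarray(p["feature"], dtype=np.float32) for p in info_base])
    base_norms = np.linalg.norm(base_feats, axis=1)
    results = []
    for idx, cat in enumerate(categories):
        feat = np.asarray(cat["feature"], dtype=np.float32)
        sims = base_feats @ feat / (base_norms * np.linalg.norm(feat))
        best = int(np.argmax(sims))
        results.append((idx, info_base[best]["name"], float(sims[best])))
    return results
```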
In this optional embodiment, the system may further include a manual proofreading module, which can perform manual proofreading to prevent cases in which a person appearing in the video does not exist in the person information base or a person is identified incorrectly. The results output by the person recognition and retrieval module are checked by the manual proofreading module; if such a case occurs, the person category information is modified manually, and the label and features of the person are added to the person information base.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
According to another aspect of the embodiments of the present application, there is also provided an acquisition apparatus of face information for implementing the above-described acquisition method of face information. Fig. 3 is a schematic diagram of an alternative facial information acquisition apparatus according to an embodiment of the present application, and as shown in fig. 3, the apparatus may include:
an extraction module 32, configured to extract facial features from a video frame set of a target video, so as to obtain a target facial feature corresponding to each video frame in the video frame set;
a clustering module 34, configured to cluster faces included in the target video according to the video frame set and target face features corresponding to each video frame to obtain a target face category and category face features corresponding to the target face category;
an identifying module 36, configured to identify target face information corresponding to the category of face features in a face information base, where the face information base records face features and face information having correspondence relationships;
and an output module 38, configured to output target face information corresponding to the target face category.
It should be noted that the extracting module 32 in this embodiment may be configured to execute the step S202 in this embodiment, the clustering module 34 in this embodiment may be configured to execute the step S204 in this embodiment, the identifying module 36 in this embodiment may be configured to execute the step S206 in this embodiment, and the outputting module 38 in this embodiment may be configured to execute the step S208 in this embodiment.
It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiments. It should be noted that the modules described above as a part of the apparatus may operate in a hardware environment as shown in fig. 1, and may be implemented by software or hardware.
Through the modules, target face features on each video frame are extracted from a video frame set of a target video, the extracted target face features are utilized to cluster faces appearing in the target video to obtain target face categories and category face features corresponding to each target face category, corresponding target face information is identified from a face information base according to the category face features and is output, the target face categories of all faces appearing in the target video are identified through clustering the faces appearing in the target video, unified category face features corresponding to the target face categories are extracted, the category face features are used for searching the face information base, compared with a mode that the face information base needs to be searched in each frame, the searching times are reduced, the searching speed is improved, and the information base is searched by using the characteristics that the category face features corresponding to each face type represent the whole face type, the method achieves the purposes of reducing misjudgment and improving retrieval accuracy, thereby achieving the technical effect of improving the efficiency of identifying the facial information from the video, and further solving the technical problem of lower efficiency of identifying the facial information from the video in the related technology.
As an alternative embodiment, the clustering module includes:
a dividing unit, configured to divide the video frame set into a plurality of scenes, where each scene in the plurality of scenes includes one video frame or a plurality of consecutive video frames in the video frame set;
the first clustering unit is used for clustering the target face features included in each scene to obtain a scene face category corresponding to each scene and a scene face feature corresponding to each scene face category;
and the second clustering unit is used for clustering the scene facial features included in the target video to obtain the target facial categories and the category facial features corresponding to each target facial category.
As an alternative embodiment, the dividing unit is configured to:
determining a first similarity between two consecutive video frames in the set of video frames;
and dividing the two video frames with the first similarity higher than or equal to a first similarity threshold into the same scene, and dividing the two video frames with the first similarity lower than the first similarity threshold into different scenes to obtain the multiple scenes.
As an alternative embodiment, the first clustering unit is configured to:
classifying the target facial features included in each scene according to a second similarity between the target facial features included in each scene to obtain scene facial categories in each scene, wherein the second similarity between the target facial features included in each scene facial category is higher than a second similarity threshold;
and fusing the target face features included in each scene face category to obtain the scene face features corresponding to each scene face category.
As an alternative embodiment, the second clustering unit is configured to:
classifying the scene facial features included in the target video according to a third similarity between the scene facial features included in each scene facial category to obtain the target facial categories, wherein the third similarity between the scene facial features included in each target facial category is higher than a third similarity threshold;
and fusing the scene facial features included in each target facial category to obtain the category facial features corresponding to each target facial category.
As an alternative embodiment, the output module comprises:
a determining unit, configured to determine time information of a video frame corresponding to a category facial feature included in each target face category in the target video;
an output unit that outputs the target face category, the target face information, and the time information having the correspondence relationship.
As an alternative embodiment, the extraction module includes:
the extraction unit is used for extracting video frames from the target video according to the target time interval to obtain the video frame set;
the identification unit is used for carrying out face identification on each video frame in the video frame set to obtain a face included in each video frame;
and the extracting unit is used for extracting the facial features of the face included in each video frame to obtain the target facial features corresponding to each video frame in the video frame set.
It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiments. It should be noted that the modules described above as a part of the apparatus may be operated in a hardware environment as shown in fig. 1, and may be implemented by software, or may be implemented by hardware, where the hardware environment includes a network environment.
According to another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the above-described face information acquisition method.
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 4, the electronic device may include one or more processors 401 (only one of which is shown), a memory 403, and a transmission device 405; as shown in fig. 4, the electronic device may further include an input/output device 407.
The memory 403 may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for acquiring face information in the embodiment of the present application, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 403, so as to implement the above-mentioned method for acquiring face information. The memory 403 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 403 may further include memory located remotely from processor 401, which may be connected to an electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 405 is used for receiving or sending data via a network, and may also be used for data transmission between the processor and the memory. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 405 includes a network adapter (NIC, Network Interface Controller) that can be connected to a router via a network cable and other network devices so as to communicate with the internet or a local area network. In one example, the transmission device 405 is a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In particular, the memory 403 is used for storing application programs.
The processor 401 may call the application stored in the memory 403 via the transmission means 405 to perform the following steps:
extracting facial features from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set;
clustering the faces included in the target video according to the video frame set and the target face features corresponding to each video frame to obtain a target face category and category face features corresponding to the target face category;
identifying target face information corresponding to the category face features in a face information base, wherein the face information base records face features and face information with corresponding relations;
and outputting target face information corresponding to the target face category.
By adopting the embodiment of the application, a scheme for acquiring face information is provided. Target facial features are extracted for each video frame in the video frame set of the target video, the faces appearing in the target video are clustered using the extracted target facial features to obtain target face categories and the category facial feature corresponding to each target face category, and the corresponding target face information is identified from the face information base according to the category facial features and output. By clustering the faces appearing in the target video, the target face category of every face appearing in the target video is identified and a unified category facial feature corresponding to each target face category is extracted; the face information base is then searched using these category facial features. Compared with the approach in which the face information base must be searched for every frame, this reduces the number of searches and increases the retrieval speed; because each category facial feature characterizes the face category as a whole, searching the information base with it also reduces misjudgment and improves retrieval accuracy. This achieves the technical effect of improving the efficiency of identifying face information from a video, and thereby solves the technical problem in the related art that the efficiency of identifying face information from a video is low.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
It will be understood by those skilled in the art that the structure shown in fig. 4 is merely an illustration, and the electronic device may be a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, or the like. Fig. 4 does not limit the structure of the electronic device; for example, the electronic device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 4, or have a different configuration from that shown in fig. 4.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware of a device, and the program may be stored in a computer-readable storage medium. The storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be used to store program code for executing the method for acquiring face information.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
extracting facial features from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set;
clustering the faces included in the target video according to the video frame set and the target face features corresponding to each video frame to obtain a target face category and category face features corresponding to the target face category;
identifying target face information corresponding to the category face features in a face information base, wherein the face information base records face features and face information with corresponding relations;
and outputting target face information corresponding to the target face category.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
If the integrated unit in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A method of acquiring face information, comprising:
extracting facial features from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set;
clustering the faces included in the target video according to the video frame set and the target face features corresponding to each video frame to obtain a target face category and category face features corresponding to the target face category;
identifying target face information corresponding to the category face features in a face information base, wherein the face information base records face features and face information with corresponding relations;
and outputting target face information corresponding to the target face category.
2. The method of claim 1, wherein clustering the faces included in the target video according to the set of video frames and the target facial features corresponding to each video frame to obtain the target face category and the category facial features corresponding to the target face category comprises:
dividing the set of video frames into a plurality of scenes, wherein each scene in the plurality of scenes comprises one video frame or a plurality of continuous video frames in the set of video frames;
clustering target face features included in each scene to obtain a scene face category corresponding to each scene and a scene face feature corresponding to each scene face category;
and clustering the scene face features included in the target video to obtain the target face categories and category face features corresponding to each target face category.
3. The method of claim 2, wherein dividing the set of video frames into a plurality of scenes comprises:
determining a first similarity between two consecutive video frames in the set of video frames;
and dividing the two video frames with the first similarity higher than or equal to a first similarity threshold into the same scene, and dividing the two video frames with the first similarity lower than the first similarity threshold into different scenes to obtain the multiple scenes.
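As an illustration of the scene division in claim 3, the following self-contained sketch groups consecutive frames by a simple similarity measure; the histogram-based similarity and the threshold value are illustrative choices, not values taken from the application.

```python
import numpy as np

def split_into_scenes(frames, sim_threshold=0.85):
    """Group consecutive frames into scenes by pairwise similarity.

    `frames` is a list of numpy arrays (decoded video frames); the
    similarity measure below is a simple stand-in for the claimed
    "first similarity".
    """
    def frame_similarity(a, b):
        # Cosine similarity between gray-level histograms.
        ha, _ = np.histogram(a, bins=64, range=(0, 255), density=True)
        hb, _ = np.histogram(b, bins=64, range=(0, 255), density=True)
        denom = (np.linalg.norm(ha) * np.linalg.norm(hb)) or 1.0
        return float(np.dot(ha, hb) / denom)

    scenes, current = [], [0]
    for i in range(1, len(frames)):
        if frame_similarity(frames[i - 1], frames[i]) >= sim_threshold:
            current.append(i)          # same scene as the previous frame
        else:
            scenes.append(current)     # similarity dropped: start a new scene
            current = [i]
    if current:
        scenes.append(current)
    return scenes                      # each scene is a list of frame indices
```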
4. The method of claim 2, wherein clustering the target facial features included in each scene to obtain the scene facial class corresponding to each scene and the scene facial features corresponding to each scene facial class comprises:
classifying the target facial features included in each scene according to a second similarity between the target facial features included in each scene to obtain scene facial categories in each scene, wherein the second similarity between the target facial features included in each scene facial category is higher than a second similarity threshold;
and fusing the target face features included in each scene face category to obtain the scene face features corresponding to each scene face category.
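A minimal sketch of the within-scene clustering and fusion in claim 4 is given below. The greedy centroid assignment, the cosine similarity, and mean-vector fusion are one plausible reading of the claim introduced for illustration, not the application's own algorithm.

```python
import numpy as np

def cluster_scene_faces(face_features, sim_threshold=0.8):
    """Greedy threshold clustering of the target facial features in one scene,
    followed by mean fusion into one scene facial feature per cluster."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     ((np.linalg.norm(a) * np.linalg.norm(b)) or 1.0))

    clusters = []                      # each cluster is a list of feature vectors
    for feat in face_features:
        feat = np.asarray(feat, dtype=np.float64)
        best, best_sim = None, sim_threshold
        for cluster in clusters:
            sim = cosine(feat, np.mean(cluster, axis=0))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best.append(feat)          # second similarity above the threshold
        else:
            clusters.append([feat])    # otherwise open a new scene face category

    # Fuse each scene face category into one scene facial feature (mean vector).
    return [np.mean(c, axis=0) for c in clusters]
```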
5. The method of claim 2, wherein clustering the scene facial features included in the target video to obtain the target facial classes and class facial features corresponding to each target facial class comprises:
classifying the scene facial features included in the target video according to a third similarity between the scene facial features included in each scene facial category to obtain the target facial categories, wherein the third similarity between the scene facial features included in each target facial category is higher than a third similarity threshold;
and fusing the scene facial features included in each target facial category to obtain the category facial features corresponding to each target facial category.
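Continuing the sketches above, the second-level clustering of claim 5 can reuse the same helper with a (likewise illustrative) third similarity threshold; `per_frame_embeddings` is an assumed placeholder mapping a frame index to the face embeddings detected in that frame.

```python
# Cluster the scene facial features of the whole video into target face
# categories, and fuse each cluster into a category facial feature.
all_scene_features = []
for scene in scenes:                               # from split_into_scenes(...)
    feats = [e for i in scene for e in per_frame_embeddings[i]]  # placeholder
    all_scene_features.extend(cluster_scene_faces(feats, sim_threshold=0.80))

category_face_features = cluster_scene_faces(all_scene_features,
                                             sim_threshold=0.75)
```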
6. The method of claim 1, wherein outputting the target face information corresponding to the target face class comprises:
determining time information of a video frame corresponding to the category facial features included in each target face category in the target video;
outputting the target face category, the target face information, and the time information having the correspondence relationship.
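As a small illustration of the time information in claim 6: assuming each target face category keeps the indices of the sampled frames its features came from, and each sampled frame carries a timestamp, the appearance times can be collected as follows (both inputs are placeholders from earlier steps, not APIs of this application).

```python
def category_time_info(frame_indices, frame_timestamps):
    """Sorted appearance times (seconds) of one target face category."""
    return sorted(frame_timestamps[i] for i in frame_indices)

# Example output record pairing category, identity and appearance times:
# {"category": 3, "info": "Person A", "times": [12.0, 13.0, 47.0]}
```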
7. The method of claim 1, wherein extracting facial features from a set of video frames of a target video, and obtaining the target facial features corresponding to each video frame in the set of video frames comprises:
extracting video frames from the target video according to a target time interval to obtain a video frame set;
performing face recognition on each video frame in the video frame set to obtain a face included in each video frame;
and extracting facial features of the face included in each video frame to obtain a target facial feature corresponding to each video frame in the video frame set.
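For claim 7, the sketch below shows frame sampling at a fixed time interval and rough face detection with OpenCV; the Haar cascade detector and the interval value are illustrative choices, and the extraction of the actual target facial feature (an embedding vector) would be performed by a separate face-embedding model not shown here.

```python
import cv2

def sample_frames(video_path, interval_s=1.0):
    """Extract one frame every `interval_s` seconds from the target video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps * interval_s)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append((idx / fps, frame))   # (timestamp in seconds, frame)
        idx += 1
    cap.release()
    return frames

def detect_faces(frame):
    """Crop face regions with OpenCV's bundled Haar cascade; a production
    system would typically use a stronger detector."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [frame[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```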
8. An apparatus for acquiring face information, comprising:
the extraction module is used for extracting facial features from a video frame set of a target video to obtain target facial features corresponding to each video frame in the video frame set;
a clustering module, configured to cluster faces included in the target video according to the video frame set and target face features corresponding to each video frame to obtain a target face category and category face features corresponding to the target face category;
the identification module is used for identifying target face information corresponding to the category face features in a face information base, wherein the face information base records the face features and the face information with corresponding relations;
and the output module is used for outputting the target face information corresponding to the target face category.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program when executed performs the method of any of the preceding claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the method of any of the preceding claims 1 to 7 by means of the computer program.
CN202010963606.0A 2020-09-14 2020-09-14 Face information acquisition method and device Pending CN112101197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010963606.0A CN112101197A (en) 2020-09-14 2020-09-14 Face information acquisition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010963606.0A CN112101197A (en) 2020-09-14 2020-09-14 Face information acquisition method and device

Publications (1)

Publication Number Publication Date
CN112101197A true CN112101197A (en) 2020-12-18

Family

ID=73751593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010963606.0A Pending CN112101197A (en) 2020-09-14 2020-09-14 Face information acquisition method and device

Country Status (1)

Country Link
CN (1) CN112101197A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528923A (en) * 2022-01-25 2022-05-24 山东浪潮科学研究院有限公司 Video target detection method, device, equipment and medium based on time domain context
CN114528923B (en) * 2022-01-25 2023-09-26 山东浪潮科学研究院有限公司 Video target detection method, device, equipment and medium based on time domain context

Similar Documents

Publication Publication Date Title
US11132555B2 (en) Video detection method, server and storage medium
CN110166827B (en) Video clip determination method and device, storage medium and electronic device
CN110134829B (en) Video positioning method and device, storage medium and electronic device
US11914639B2 (en) Multimedia resource matching method and apparatus, storage medium, and electronic apparatus
WO2020143156A1 (en) Hotspot video annotation processing method and apparatus, computer device and storage medium
US20220172476A1 (en) Video similarity detection method, apparatus, and device
CN112565825A (en) Video data processing method, device, equipment and medium
CN109871490B (en) Media resource matching method and device, storage medium and computer equipment
US20130148898A1 (en) Clustering objects detected in video
EP3295678A1 (en) Entity based temporal segmentation of video streams
CN112291589B (en) Method and device for detecting structure of video file
CN112818149A (en) Face clustering method and device based on space-time trajectory data and storage medium
CN113627402B (en) Image identification method and related device
CN107547922B (en) Information processing method, device, system and computer readable storage medium
CN110147469A (en) A kind of data processing method, equipment and storage medium
CN113515998A (en) Video data processing method and device and readable storage medium
CN111177436A (en) Face feature retrieval method, device and equipment
CN114139015A (en) Video storage method, device, equipment and medium based on key event identification
CN111046209A (en) Image clustering retrieval system
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN108024148B (en) Behavior feature-based multimedia file identification method, processing method and device
CN105989063B (en) Video retrieval method and device
CN112101197A (en) Face information acquisition method and device
CN114467125B (en) Capturing artist images from video content using face recognition
CN111354013A (en) Target detection method and device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination