CN110532433B - Entity identification method and device for video scene, electronic equipment and medium - Google Patents

Entity identification method and device for video scene, electronic equipment and medium

Info

Publication number
CN110532433B
CN110532433B (application CN201910828965.2A)
Authority
CN
China
Prior art keywords
target
entity
entity recognition
text
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910828965.2A
Other languages
Chinese (zh)
Other versions
CN110532433A (en)
Inventor
王述
任可欣
冯知凡
张扬
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910828965.2A
Publication of CN110532433A
Application granted
Publication of CN110532433B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7837 - Retrieval characterised by using metadata automatically derived from the content, using objects detected or recognised in the video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an entity identification method and device for video scenes, an electronic device and a medium, and relates to the field of artificial intelligence. The specific implementation scheme is as follows: acquire a target video to be processed and at least one target modality; extract at least one target modal feature of the target video; determine, according to the at least one target modality, a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server, the at least two candidate entity recognition algorithms being deployed in the server; and invoke the target entity recognition algorithm to perform entity recognition on the at least one target modal feature to obtain a target entity included in the target video. The method and device identify entities of different modalities of the target video with high recognition accuracy, can meet different service requirements, and have strong universality.

Description

Entity identification method and device for video scene, electronic equipment and medium
Technical Field
The embodiments of the application relate to computer technology, in particular to artificial intelligence technology, and specifically to an entity identification method and apparatus, an electronic device, and a medium for video scenes.
Background
With the development of information technology and the rapid rise of video apps, video has become one of the most important carriers of information and is widely used in interpersonal communication, social life and industrial production. Massive video content cannot be processed manually, so intelligent understanding of video content through computer technology, and in turn automatic and intelligent video classification and labeling, are required.
The conventional approach performs entity recognition on a target video in a single-modal manner, for example through video text features or video image features alone. However, the entities identified this way do not cover the entities of the target video completely, the accuracy is low, and because only a single entity recognition technique is available, different service requirements cannot be met.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, an electronic device and a medium for entity identification in video scenes, which address the problems that existing methods have low entity identification accuracy and cannot meet different service requirements.
In a first aspect, an embodiment of the present application provides a method for identifying an entity of a video scene, where the method includes:
acquiring a target video to be processed and at least one target modality;
extracting at least one target modal feature of the target video;
determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality; wherein the at least two candidate entity recognition algorithms are both deployed in the server;
and invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature so as to obtain a target entity included in the target video.
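Purely as an illustration, the four steps of the first aspect can be sketched as a thin orchestration layer in Python; every name below (server, extract_features, select_algorithm, invoke) is a hypothetical placeholder and not an interface disclosed by this application:

```python
from typing import Dict, List

def recognize_entities(video_path: str, target_modalities: List[str], server) -> List[str]:
    """Sketch of the claimed four-step flow; `server` is a hypothetical client stub."""
    # Step 1: the target video to be processed and the target modalities are acquired.
    # Step 2: extract one set of modal features per requested target modality.
    features: Dict[str, object] = {
        modality: server.extract_features(video_path, modality=modality)
        for modality in target_modalities
    }
    target_entities: List[str] = []
    for modality, modal_features in features.items():
        # Step 3: determine the target algorithm among the candidate entity
        # recognition algorithms deployed on the server for this modality.
        algorithm = server.select_algorithm(modality)
        # Step 4: invoke it to perform entity recognition on the modal features.
        target_entities.extend(server.invoke(algorithm, modal_features))
    return target_entities
```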
One embodiment of the above application has the following advantages or benefits: according to the target modal characteristics of the target video, the corresponding target entity identification algorithm is called from the pre-configured candidate entity identification algorithm, and entity identification is carried out on the target modal characteristics, so that the identification of different modal entities of the target video is realized, the accuracy of the identification result is high, different business requirements can be met, and the universality is strong.
Optionally, determining, according to the at least one target modality, a target entity identification algorithm to be used from at least two candidate entity identification algorithms provided by the server, including:
if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server;
if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from the visual entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server;
if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server.
One embodiment of the above application has the following advantages or benefits: by providing corresponding entity recognition algorithm classes for the text modality, the visual modality and the audio modality respectively, users can invoke the corresponding entity recognition algorithm class as needed, different service requirements can be met, and the universality of entity recognition is improved.
Optionally, if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server includes:
if the target modality comprises a text modality, invoking a video domain classification algorithm provided by the server to determine the target domain to which the target video belongs; the video domain classification algorithm is deployed in the server;
if the target domain belongs to a preset domain set, determining a knowledge-importance entity recognition algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server;
if the target domain does not belong to the preset domain set, determining a relevance-based text entity algorithm or a sequence-based text entity algorithm as the target text entity recognition algorithm to be used from that class.
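A minimal sketch of this selection logic, assuming a hard-coded preset domain set and plain string identifiers for the algorithms (both assumptions made for illustration; the application does not fix these names):

```python
PRESET_DOMAIN_SET = {
    "film_tv", "entertainment", "animation", "game", "music",
    "automobile", "dance", "food", "sports", "nature",
}

def select_text_entity_algorithm(target_domain: str, high_precision: bool = False) -> str:
    """Pick the target text entity recognition algorithm by video domain (sketch)."""
    if target_domain in PRESET_DOMAIN_SET:
        # Domains with a curated schema use the knowledge-importance algorithm.
        return "knowledge_importance"
    # Outside the preset set, trade speed against accuracy
    # (see the precision-based choice described further below).
    return "sequence_bilstm_crf" if high_precision else "relevance_based"
```

The high_precision flag anticipates the precision-based choice between the relevance-based and sequence-based algorithms described further below.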
One embodiment of the above application has the following advantages or benefits: providing the knowledge-importance entity recognition algorithm for target videos belonging to the preset domain set, and a relevance-based or sequence-based text entity algorithm for other videos, further improves the accuracy of entity identification across different domains.
Optionally, if the target text entity recognition algorithm is a knowledge importance entity recognition algorithm, invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, including:
invoking a named entity recognition algorithm provided by the server to recognize candidate entities included in the text modal characteristics; the named entity recognition algorithm is deployed in the server;
determining a target entity category associated with the target domain according to a mapping relation between a preset domain and a preset entity category in the knowledge importance entity recognition algorithm;
and taking the candidate entity belonging to the target entity category as a target entity in the target video.
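The three steps above can be sketched as follows; the shape of the NER output and the domain_schema mapping are illustrative assumptions (a fuller example of the schema appears in the second embodiment below):

```python
from typing import Callable, Dict, List, Set

def knowledge_importance_recognition(
    text_features: str,
    target_domain: str,
    domain_schema: Dict[str, Set[str]],   # preset domain -> preset entity categories
    ner: Callable[[str], List[dict]],     # server-side NER; yields {"text", "category"}
) -> List[str]:
    """Sketch of the three steps: NER, schema lookup, category filtering."""
    # 1) Identify candidate entities in the text modal features.
    candidates = ner(text_features)
    # 2) Look up the target entity categories associated with the target domain.
    target_categories = domain_schema.get(target_domain, set())
    # 3) Keep candidates whose category belongs to the target entity categories.
    return [c["text"] for c in candidates if c["category"] in target_categories]
```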
One embodiment of the above application has the following advantages or benefits: through the mapping relation between the preset domain and the preset entity category configured in the knowledge importance entity identification algorithm, fine-grained entity identification is provided for identifying different categories of entities in the target videos belonging to different preset domains.
Optionally, after taking the candidate entity belonging to the target entity class as the target entity in the target video, the method further includes:
and invoking an entity linking algorithm provided by the server to establish a link between the target entity and an entity in the knowledge graph.
One embodiment of the above application has the following advantages or benefits: by linking the target entity to the knowledge graph, the target entity can be more deeply understood and expanded through the knowledge graph.
Optionally, determining the relevance-based text entity algorithm or the sequence-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server, including:
acquiring the required target entity precision;
if the target entity precision is a first precision, determining the relevance-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server;
if the target entity precision is a second precision, determining the sequence-based text entity algorithm as the target text entity recognition algorithm to be used from that class; wherein the first precision is lower than the second precision.
One embodiment of the above application has the following advantages or benefits: by providing entity recognition algorithms matched to users' different precision requirements, both processing efficiency and entity quality can be taken into account, and resource waste is avoided.
Optionally, if the target modality includes a text modality, before invoking a text entity recognition algorithm to perform entity recognition on the at least one target modality feature, the method further includes:
invoking a text quality model provided by a server to determine the text quality of the text modal feature;
and screening the text modal characteristics according to the determination result.
One embodiment of the above application has the following advantages or benefits: and the entity recognition efficiency is improved by screening the text modal characteristics in advance.
Optionally, if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from the visual entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server includes:
if the target modality comprises a visual modality, further acquiring a target visual category of interest to the user;
if the target visual category is a human face, determining a face entity recognition algorithm as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server;
if the target visual category is an object, determining a visual object recognition algorithm as the target visual entity recognition algorithm to be used from that class;
and if the target visual category is a visual fingerprint, determining a visual fingerprint recognition algorithm as the target visual entity recognition algorithm to be used from that class.
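A sketch of this dispatch, assuming the three recognizers are registered under hypothetical string keys:

```python
VISUAL_ALGORITHM_CLASS = {
    "face": "face_entity_recognition",        # e.g. a DeepID-style recognizer
    "object": "visual_object_recognition",    # e.g. a YOLO-style detector
    "fingerprint": "visual_fingerprint_recognition",
}

def select_visual_entity_algorithm(target_visual_category: str) -> str:
    """Map the user's target visual category to the deployed algorithm (sketch)."""
    try:
        return VISUAL_ALGORITHM_CLASS[target_visual_category]
    except KeyError:
        raise ValueError(f"unsupported visual category: {target_visual_category}")
```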
One embodiment of the above application has the following advantages or benefits: by providing a face entity recognition algorithm, a visual object recognition algorithm or a visual fingerprint recognition algorithm for the visual modality, entities of different visual categories can be recognized, enriching the categories of visual entities.
Optionally, after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, the method further includes:
fusing at least two pieces of identified entity information.
One embodiment of the above application has the following advantages or benefits: the entity information fusion operation improves the degree of entity aggregation.
Optionally, after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, the method further includes:
And adjusting the confidence level of the entity according to the modal source information of the entity and/or the correlation of the entity in the knowledge graph.
One embodiment of the above application has the following advantages or benefits: performing multi-modal fusion and confidence verification with the combined multi-modal information makes the labeled entity results more accurate.
In a second aspect, an embodiment of the present application further discloses an apparatus for entity identification of a video scene, where the apparatus includes:
the target modality acquisition module is used for acquiring a target video to be processed and at least one target modality;
the target modal feature extraction module is used for extracting at least one target modal feature of the target video;
a target entity recognition algorithm determining module, configured to determine a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality; wherein the at least two candidate entity recognition algorithms are both deployed in the server;
and the entity recognition module is used for calling the target entity recognition algorithm to perform entity recognition on the at least one target modal feature so as to obtain a target entity included in the target video.
In a third aspect, an embodiment of the present application further discloses an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods described in any of the embodiments of the present application.
In a fourth aspect, embodiments of the present application also disclose a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of any of the embodiments of the present application.
Other effects of the above alternative will be described below in connection with specific embodiments.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
fig. 1 is a flow chart of a method for entity identification of a video scene according to a first embodiment of the present application;
fig. 2 is a flow chart of a method for entity identification of a video scene according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of an entity recognition apparatus for video scenes according to a third embodiment of the present application;
Fig. 4 is a block diagram of an electronic device for implementing the entity identification method for video scenes of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Example One
Fig. 1 is a flowchart of a method for identifying entities in a video scene according to a first embodiment of the present application. The embodiment is suitable for identifying the entities included in a video to be processed. The method can be executed by the entity recognition apparatus for video scenes provided in the third embodiment of the application, and the apparatus can be implemented in software and/or hardware. As shown in fig. 1, the method may include:
s101, acquiring a target video to be processed and at least one target mode.
The formats of the target video include, but are not limited to, AVI, FLV, RMVB and WMV formats, and the target video is uploaded to the server by the user through the client. The target mode shows which information of the target video the user wants to perform entity recognition, for example, the user wants to perform entity recognition on text information of the target video, and the target mode at the moment is the text mode; for another example, if the user wants to perform entity recognition on the visual information of the target video, the target mode at this time is a visual mode; for another example, if the user wants to identify the entity of the audio information of the target video, the target mode at this time is an audio mode. The user selects a target mode through the client according to different service requirements.
In this embodiment, the target modalities include, but are not limited to, at least one of a text modality, a visual modality, and an audio modality.
And acquiring a target video to be processed and at least one target mode, and calling a corresponding target entity identification algorithm to identify the target mode characteristics according to the target mode, so as to obtain a target entity included in the target video, thereby laying a data foundation.
S102, extracting at least one target modal characteristic of the target video.
Specifically, if the target modality comprises a text modality, the text modal features of the target video are extracted; if the target modality comprises a visual modality, the visual modal features of the target video are extracted; and if the target modality comprises an audio modality, the audio modal features of the target video are extracted.
Optionally, extraction of the text modal features includes: 1) Subtitle information of the target video usually appears in the images of the target video; all frame images or key-frame images of the target video are extracted, the subtitle information in the extracted frames is recognized by OCR (Optical Character Recognition), and the recognized subtitle information is taken as a text modal feature of the target video. 2) The title and description text of the target video are the author's summary of the target video content; since they are already presented in text form, they are directly taken as text modal features of the target video. 3) The audio information of the target video usually appears at multiple time nodes of the target video; all audio information of the target video is extracted, the corresponding text is recognized by ASR (Automatic Speech Recognition), and the ASR recognition result is taken as a text modal feature of the target video. 4) The author tag text of the target video is the tag text with which the author labeled the video, reflecting the author's own understanding and helping users understand the video; since it is presented in text form, it is directly taken as a text modal feature of the target video.
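Gathered together, the four text sources above might be collected as follows; ocr, asr and the metadata fields are hypothetical stand-ins for whatever OCR/ASR services and video metadata store are actually deployed:

```python
from typing import Callable, List

def extract_text_modal_features(
    key_frames: List[bytes],        # key-frame images of the target video
    audio_track: bytes,             # extracted audio of the target video
    title: str,
    description: str,
    author_tags: List[str],
    ocr: Callable[[bytes], str],    # hypothetical OCR service
    asr: Callable[[bytes], str],    # hypothetical ASR service
) -> List[str]:
    """Collect the four text sources described above (sketch)."""
    features = []
    # 1) Subtitles recognized from frame images via OCR.
    features.extend(ocr(frame) for frame in key_frames)
    # 2) Title and description text, already in text form.
    features.extend([title, description])
    # 3) Speech transcribed from the audio track via ASR.
    features.append(asr(audio_track))
    # 4) Author tag text, already in text form.
    features.extend(author_tags)
    return [f for f in features if f]   # drop empty results
```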
Optionally, extraction of the visual modal features includes: extracting each frame image of the target video with a video processing tool as visual modal features; or, according to actual service requirements, extracting key-frame images of the target video as visual modal features.
Optionally, extraction of the audio modal features includes: using an audio processing tool to pick up all audio data of the target video from the start time node to the end time node as audio modal features; or, according to actual service requirements, picking up the audio data within a target time range of the target video as audio modal features.
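One possible realization of the frame sampling with OpenCV (the application does not prescribe a specific video processing tool; OpenCV is an assumption here). Audio modal features can be picked up analogously, e.g. by demuxing the audio track with a tool such as ffmpeg:

```python
import cv2  # assumes OpenCV; the application does not prescribe a specific tool

def extract_visual_modal_features(video_path: str, every_n_seconds: float = 1.0):
    """Sample one frame per interval as visual modal features (sketch)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS metadata is missing
    step = max(1, int(fps * every_n_seconds))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```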
By extracting at least one target modal feature of the target video, a data foundation is laid for carrying out entity recognition on the target modal feature by subsequently calling a target entity recognition algorithm.
S103, determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality; wherein the at least two candidate entity recognition algorithms are both deployed in the server.
The server provides at least two candidate entity recognition algorithms for entity recognition on different target modal features; for example, the server provides a text entity recognition algorithm class and a visual entity recognition algorithm class, or a text entity recognition algorithm class and an audio entity recognition algorithm class. Each algorithm class further comprises at least one candidate entity recognition algorithm.
Specifically, according to the target modality, a target entity recognition algorithm is determined among different types of algorithms in at least two candidate entity recognition algorithms provided by the server.
Optionally, if the target modality comprises a text modality, a target text entity recognition algorithm to be used is determined from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target modality comprises a visual modality, a target visual entity recognition algorithm to be used is determined from the visual entity recognition algorithm class; and if the target modality comprises an audio modality, a target audio entity recognition algorithm to be used is determined from the audio entity recognition algorithm class.
1) If the target modality comprises a text modality, the corresponding target text entity recognition algorithm is determined according to the domain to which the target video belongs.
Optionally, if the target modality comprises a text modality, a video domain classification algorithm provided by the server is invoked to determine the target domain to which the target video belongs; the video domain classification algorithm is deployed in the server. If the target domain belongs to a preset domain set, a knowledge-importance entity recognition algorithm is determined as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target domain does not belong to the preset domain set, a relevance-based text entity algorithm or a sequence-based text entity algorithm is determined as the target text entity recognition algorithm to be used from that class. The relevance-based text entity algorithm comprises at least one of the following: an entity algorithm based on xgboost (eXtreme Gradient Boosting) classification, an entity algorithm based on word2vec, a textrank graph-walk algorithm, a word-ranking algorithm based on term (mention) importance, and a ranking algorithm based on tf-idf (term frequency-inverse document frequency). The network structure of the sequence-based text entity algorithm is BiLSTM-CRF (a bidirectional long short-term memory network with a conditional random field layer).
2) If the target modality comprises a visual modality, the corresponding target visual entity recognition algorithm is determined according to the target visual category of interest to the user.
Optionally, if the target modality comprises a visual modality, the target visual category of interest to the user is further acquired. If the target visual category is a human face, a face entity recognition algorithm is determined as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target visual category is an object, a visual object recognition algorithm is determined as the target visual entity recognition algorithm to be used from that class; and if the target visual category is a visual fingerprint, a visual fingerprint recognition algorithm is determined as the target visual entity recognition algorithm to be used from that class.
3) If the target modality comprises an audio modality, the corresponding target audio entity recognition algorithm is determined according to the target entity precision required by the user.
Optionally, if the target entity precision is a third precision, an audio entity algorithm based on acoustic features is determined as the target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target entity precision is a fourth precision, a voiceprint-based audio entity algorithm is determined as the target audio entity recognition algorithm to be used from that class; wherein the third precision is lower than the fourth precision.
The target entity recognition algorithm to be used is determined from at least two candidate entity recognition algorithms provided by the server according to at least one target mode, so that the multi-mode entity recognition requirement for the target video is met, and a foundation is laid for entity recognition of at least one target mode feature by subsequently calling the target entity recognition algorithm.
S104, invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature so as to obtain a target entity included in the target video.
By calling a target entity recognition algorithm to perform entity recognition on at least one target modal feature, the multi-modal entity recognition of the target video is realized, the recognition result accuracy is high, and different service requirements can be met.
According to the technical solution provided by the embodiments of the present application, the corresponding target entity recognition algorithm is invoked from the pre-configured candidate entity recognition algorithms according to the target modal features of the target video, and entity recognition is performed on the target modal features, thereby realizing multi-modal entity recognition of the target video with high recognition accuracy and meeting different service requirements.
On the basis of the above embodiment, after S104 the method further includes: fusing at least two pieces of identified entity information. For example, if at least two entity words are associated with the same entity, normalization processing is performed on the at least two entity words.
Specifically, the identified entity words are cross-checked against the knowledge graph stored in the knowledge base; if it is determined from the knowledge graph that at least two entity words refer to the same entity, they are normalized, with the normalized entity word based on the hot entity word in the knowledge graph.
For example, assume the three entity words "title A", "title B" and "title C" are three different titles of a certain actor, and all three are determined, according to the knowledge graph, to refer to that actor; the three entity words are therefore normalized. If "title A" is the hot title for the actor, the normalized entity word is set to "title A"; similarly, if "title B" is the hot title, the normalized entity word is set to "title B".
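A sketch of such normalization, assuming an alias table derived in advance from the knowledge graph:

```python
from typing import Dict, List

def normalize_entity_words(entity_words: List[str], alias_to_hot: Dict[str, str]) -> List[str]:
    """Collapse aliases that refer to the same entity onto its hot entity word.

    `alias_to_hot` is assumed to be derived from the knowledge graph, e.g.
    {"title B": "title A", "title C": "title A"} when "title A" is the hot title.
    """
    normalized = [alias_to_hot.get(word, word) for word in entity_words]
    # Deduplicate while preserving first-seen order.
    return list(dict.fromkeys(normalized))
```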
And the entity identification effect is improved by fusing the identified at least two entity information, so that the entity identification result has higher accuracy.
On the basis of the above embodiment, after S104, further includes: and adjusting the confidence level of the entity according to the modal source information of the entity and/or the correlation of the entity in the knowledge graph.
Optionally, adjusting the confidence level of the entity according to the modal source information of the entity includes:
if any entity has at least two modal sources, the confidence of the entity is increased; if an entity has only a single modal source and that source belongs to the third class of sources, the confidence of the entity is reduced.
In this embodiment, the third class of modal sources includes the audio modality and the author tag text among the text modality sources; that is, if the only modal source of an entity is the audio modality or the author tag text, the confidence of that entity is reduced.
Optionally, adjusting the confidence level of the entity according to the correlation of the entity in the knowledge graph includes:
if, based on the knowledge graph in the knowledge base, a correlation exists between at least two entities, the confidence of those entities is increased; if, based on the knowledge graph, an entity word has no correlation with any other entity word, the confidence of that entity is reduced; and if a contradictory relation exists between at least two entities, the confidence of those entities is reduced.
For example, if "actor A" plays opposite "actor B" in a certain movie, there is a correlation between "actor A" and "actor B", and the confidence of the entity "actor A" and the entity "actor B" is correspondingly increased. Conversely, if "actor C" is credited as the lead of a movie and "actor D" is also credited as the lead of the same movie, there is a contradictory relation between "actor C" and "actor D", and the confidence of the entities "actor C" and "actor D" is correspondingly reduced.
By adjusting the confidence coefficient of the entity, the user can determine the video which the user wants to browse according to the confidence coefficient of the entity, and user experience is improved.
Example Two
Fig. 2 is a flowchart of a method for identifying entities in a video scene according to a second embodiment of the present application. The present embodiment provides a specific implementation manner for the first embodiment, as shown in fig. 2, the method may include:
s201, acquiring a target video to be processed and at least one target mode.
S202, extracting at least one target modal characteristic of the target video.
S203, if the target modality comprises a text modality, S204 is executed; if the target modality comprises a visual modality, S208 is executed; and if the target modality comprises an audio modality, S209 is executed.
S204, calling a video domain classification algorithm provided by the server, and determining the target domain to which the target video belongs.
The target domain represents the category of the target video content, such as film and television, games, sports, and the like. The video domain classification algorithm adopts a two-level video classification system; that is, the target domain is expressed in a two-level form, such as "film and television domain - drama film" or "sports domain - basketball".
Specifically: 1) a 3D convolutional neural network is pre-established, and target video frames are extracted from the target video, where the target video frames may be several key frames or frames sampled every second. 2) Multi-channel information is generated from the target video frames using the established 3D convolutional neural network, and convolution and downsampling operations are then performed on each channel separately. 3) The information of all channels is combined to obtain the final feature description. 4) A bidirectional LSTM sequence model with a self-attention mechanism is established to perform sequence modeling on the feature description, and the results are finally fused and classified to obtain the target domain of the target video.
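A compact PyTorch sketch of the described pipeline (3D convolution over sampled frames, a bidirectional LSTM with self-attention, then classification); all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VideoDomainClassifier(nn.Module):
    """Sketch only: 3D CNN features, BiLSTM + self-attention, domain logits."""

    def __init__(self, num_domains: int, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                       # downsample time and space
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((8, 4, 4)),       # fixed-size feature volume
        )
        self.proj = nn.Linear(64 * 4 * 4, feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * feat_dim, 1)     # self-attention scores over time
        self.cls = nn.Linear(2 * feat_dim, num_domains)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:  # clip: (B, 3, T, H, W)
        x = self.conv(clip)                                  # (B, 64, 8, 4, 4)
        x = x.permute(0, 2, 1, 3, 4).flatten(2)              # (B, 8, 64*4*4)
        x = self.proj(x)                                     # (B, 8, feat_dim)
        h, _ = self.lstm(x)                                  # (B, 8, 2*feat_dim)
        w = torch.softmax(self.attn(h), dim=1)               # attention over time steps
        pooled = (w * h).sum(dim=1)                          # (B, 2*feat_dim)
        return self.cls(pooled)                              # domain logits
```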
By determining the target field to which the target video belongs, a foundation is laid for determining a corresponding target entity identification algorithm according to different fields of the target video.
S205, determining whether the target field belongs to a preset field set; if yes, then execute S207; if not, S206 is performed.
The preset domain set is determined according to the degree of user attention to video domains; several domains with relatively high user attention are selected as the preset domain set. To improve the accuracy of target-video entity extraction, in this embodiment different target text entity recognition algorithms are invoked to perform entity recognition on the text modal features according to the target domain of the target video.
Optionally, the preset domain set includes at least one of the following preset domains: the film and television domain, entertainment domain, animation domain, game domain, music domain, automobile domain, dance domain, food domain, sports domain and nature domain.
Specifically, the target domain is compared with the preset domain set; if the preset domain set includes the target domain, the target domain is determined to belong to the preset domain set; otherwise, the target domain is determined not to belong to the preset domain set.
S206, determining a relevance-based text entity algorithm or a sequence-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server.
The relevance-based text entity algorithm comprises at least one of the following: an entity algorithm based on xgboost classification, an entity algorithm based on word2vec, a textrank graph-walk algorithm, a word-ranking algorithm based on term importance, and a ranking algorithm based on tf-idf; the network structure of the sequence-based text entity algorithm is BiLSTM-CRF.
Specifically, the recognition accuracy of the relevance-based text entity algorithm is lower than that of the sequence-based text entity algorithm, but its recognition speed is higher. The user can select the target entity precision or the target entity speed in the client according to actual service requirements.
Optionally, S206 includes:
acquiring the required target entity precision; if the target entity precision is a first precision, determining the relevance-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target entity precision is a second precision, determining the sequence-based text entity algorithm as the target text entity recognition algorithm to be used from that class; wherein the first precision is lower than the second precision.
S207, determining the knowledge-importance entity recognition algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server.
The main flow of the knowledge importance entity recognition algorithm is as follows:
1) A named entity recognition algorithm provided by the server is invoked to identify candidate entities included in the text modality features.
Specifically, the candidate entities included in the text modal features are determined by a named entity recognition tool; the tool may perform named entity recognition based on a neural network or based on feature templates, which this embodiment does not limit. Candidate entities include three broad classes of named entities (entity, time and number) and seven subclasses (at least one of person name, organization name, place name, time, date, currency and percentage).
Illustratively, inputting a text modal feature into the named entity recognition tool outputs the candidate entities of that feature accordingly; for example, if the text modal feature is "a player of a team", the candidate entities of the text modal feature are "a player", "a team" and "a basketball player".
2) And determining the target entity category associated with the target domain according to the mapping relation between the preset domain and the preset entity category in the knowledge importance entity identification algorithm.
Each preset domain corresponds to at least one preset entity category, and different preset domains correspond to different preset entity categories; for the same preset domain, the preset entity categories of interest to users are largely fixed. For example, in the film and television domain, users' attention is concentrated on the title, main characters and starring actors, so "movie/TV play name", "main character" and "starring actor" are taken as the preset entity categories of the "film and television domain"; likewise, in the automobile domain, users' attention is concentrated on car brands, car types and car names, so "car brand", "car type" and "car name" are taken as the preset entity categories of the "automobile domain".
Specifically, a schema system mapping each preset domain to its preset entity categories is constructed for the different preset domains.
The mapping relation between preset domains and preset entity categories in the knowledge-importance entity recognition algorithm optionally includes at least one of the following: the film and television domain is associated with at least one of the entity categories movie/TV play name, main character and starring actor; the entertainment domain with variety show name, guest, host and the entertainment figures involved; the animation domain with animation name and main character; the game domain with game name, main character and player; the music domain with music name and singer; the automobile domain with car brand, car type and car name; the dance domain with dance name and dancer; the food domain with food name; the sports domain with sport, athlete and team name; and the nature domain with animals, plants, mountains and rivers.
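Written out as a table, such a schema might look like the following; the category identifiers are free renderings of the listing above:

```python
DOMAIN_SCHEMA = {
    "film_tv":       {"movie_title", "main_character", "starring_actor"},
    "entertainment": {"show_name", "guest", "host", "entertainment_figure"},
    "animation":     {"animation_name", "main_character"},
    "game":          {"game_name", "main_character", "player"},
    "music":         {"music_name", "singer"},
    "automobile":    {"car_brand", "car_type", "car_name"},
    "dance":         {"dance_name", "dancer"},
    "food":          {"food_name"},
    "sports":        {"sport", "athlete", "team_name"},
    "nature":        {"animal", "plant", "mountain", "river"},
}
```

This is the kind of domain_schema mapping assumed by the knowledge-importance sketch in the first embodiment.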
For example, assume the target domain to which the target video belongs is the "film and television domain", and the preset entity categories for the preset "film and television domain" in the knowledge-importance entity recognition algorithm are "movie/TV play name", "main character" and "starring actor"; then the target entity categories associated with the target domain are "movie/TV play name", "main character" and "starring actor". Similarly, if the target domain is the "game domain", with preset entity categories "game name", "main character" and "player", then the associated target entity categories are "game name", "main character" and "player".
3) And taking the candidate entity belonging to the target entity category as a target entity in the target video.
Specifically, the candidate entity is matched with the target entity category, and the candidate entity belonging to the target entity category is used as the target entity in the target video according to the matching result.
Optionally, based on knowledge graph information of the target entity class in the knowledge base, matching the candidate entity with the target entity class, and determining whether the candidate entity belongs to the target entity class.
For example, assume the candidate entities are "actor A", "actor B", "character A", "theme song A", "episode A" and "movie A", and the target entity categories are "movie/TV play name", "main character" and "starring actor". Based on the knowledge graph information for these categories in the knowledge base, it is determined that "actor A" and "actor B" belong to the target entity category "starring actor", "character A" belongs to "main character", and "movie A" belongs to "movie/TV play name"; therefore "actor A", "actor B", "character A" and "movie A" among the candidate entities are taken as target entities in the target video.
Illustratively, assume the candidate entities are "singer A", "singer B", "theme song A", "theme song B", "episode A", "instrument A" and "instrument B", and the target entity categories are "music name" and "singer". Based on the knowledge graph information for "music name" and "singer" in the knowledge base, it is determined that "singer A" and "singer B" belong to the target entity category "singer", and "theme song A", "theme song B" and "episode A" belong to "music name"; therefore "singer A", "singer B", "theme song A", "theme song B" and "episode A" among the candidate entities are taken as target entities in the target video.
Optionally, after "the candidate entity belonging to the target entity class is the target entity in the target video", the method further includes:
and invoking an entity linking algorithm provided by the server to establish a link between the target entity and an entity in the knowledge graph.
Establishing links between the target entities and entities in the knowledge graph allows the obtained target entities to be corrected and expanded, increasing the richness of the entity recognition result.
S208, determining a target visual entity recognition algorithm to be used from visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server.
Specifically, the server may determine the target visual entity recognition algorithm according to the actual business requirements of the user.
Optionally, S208 includes:
acquiring the target visual category of interest to the user; if the target visual category is a human face, determining a face entity recognition algorithm as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target visual category is an object, determining a visual object recognition algorithm as the target visual entity recognition algorithm to be used from that class; and if the target visual category is a visual fingerprint, determining a visual fingerprint recognition algorithm as the target visual entity recognition algorithm to be used from that class.
The face entity recognition algorithm optionally includes the DeepID algorithm and the like; the visual object recognition algorithm optionally includes the YOLO algorithm and the like.
S209, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
Specifically, according to the target entity precision of the user, a corresponding target audio entity recognition algorithm is determined.
If the target entity precision is a third precision, an audio entity algorithm based on acoustic features is determined as the target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target entity precision is a fourth precision, a voiceprint-based audio entity algorithm is determined as the target audio entity recognition algorithm to be used from that class; wherein the third precision is lower than the fourth precision.
S210, invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature so as to obtain a target entity included in the target video.
For example, assuming the target entity recognition algorithm is the entity algorithm based on xgboost classification, the text modal features are input into a pre-established xgboost text classifier to perform entity recognition and determination on the text modal features.
For another example, assuming the target entity recognition algorithm is the sequence-based text entity algorithm, the text modal features are input into a pre-established entity sequence labeling model for sequence labeling to obtain the target entities, where the entity sequence labeling model combines a bidirectional LSTM with a CRF sequence labeling layer and is trained on a large-scale video text feature corpus.
According to the technical solution provided by this embodiment of the application, if the target modality comprises a text modality, it is determined whether the target domain to which the target video belongs is in the preset domain set; if so, the knowledge-importance entity recognition algorithm is invoked as the target text entity recognition algorithm to be used, and if not, a relevance-based text entity algorithm or a sequence-based text entity algorithm is invoked as the target text entity recognition algorithm to be used. If the target modality comprises a visual modality, the target visual entity recognition algorithm to be used is determined from the visual entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; and if the target modality comprises an audio modality, the target audio entity recognition algorithm to be used is determined from the audio entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server. Multi-modal entity recognition of the target video is thereby realized, the recognition accuracy is high, and different service requirements can be met.
On the basis of the foregoing embodiment, if the target modality includes a text modality, before invoking a text entity recognition algorithm to perform entity recognition on the at least one target modality feature, the method further includes:
invoking a text quality model provided by a server to determine the text quality of the text modal feature; and screening the text modal characteristics according to the determination result.
The text quality model is used for determining the quality of text features, and optionally comprises a fasttext text classification model.
Optionally, the text features are input into a pre-established fasttext text classification model to obtain the text quality of the text features; text features below a preset text-quality threshold are removed, and text features above the threshold are retained.
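A sketch of this screening step with the fasttext Python package; the model path, the "__label__high" label name and the 0.5 threshold are assumptions, and any binary text-quality classifier with the same contract would do:

```python
import fasttext  # assumes the fasttext package and a pre-trained quality model

def screen_text_features(text_features, model_path="quality.bin", threshold=0.5):
    """Keep only text modal features judged high quality (sketch)."""
    model = fasttext.load_model(model_path)
    kept = []
    for text in text_features:
        # fasttext expects single-line input; predict returns labels and probs.
        labels, probs = model.predict(text.replace("\n", " "))
        quality = probs[0] if labels[0] == "__label__high" else 1.0 - probs[0]
        if quality >= threshold:
            kept.append(text)
    return kept
```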
Since knowledge information cannot be obtained from low-quality text features, screening the text features by text quality guarantees the quality of the remaining text features, filters out text noise, and avoids the clickbait-title problem.
On the basis of the above embodiment, after S210, further includes:
and marking a flattened entity label on the target video according to the obtained target entity, or marking a structured entity label on the target video according to the obtained target entity and the knowledge graph.
The flattened entity tag is a label which is formed by paralleling all entities. The structured entity tag then embodies the tag-to-tag relationship.
For example, "movie a" is played by "actor a" and "actor B," actor a "plays" role a, "actor B" plays "role B," role a "is in a friendship relationship with" role B.
By labeling the target video, the user can conveniently inquire the interested video according to the label according to the self requirement.
On the basis of the above embodiment, after S210, further includes:
the entity is subjected to vectorization semantic representation, so that semantic representation of the target video is obtained and is used for downstream application services, such as video semantic search, video recommendation and the like.
Example Three
Fig. 3 is a schematic structural diagram of an entity recognition apparatus 300 for video scenes according to a third embodiment of the present application; the apparatus can execute the entity recognition method for video scenes according to any embodiment of the present application, and has the functional modules and beneficial effects corresponding to that method. As shown in fig. 3, the apparatus may include:
the target modality acquisition module 301 is configured to acquire a target video to be processed and at least one target modality;
A target modality feature extraction module 302, configured to extract at least one target modality feature of the target video;
a target entity recognition algorithm determining module 303, configured to determine a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by the server according to the at least one target modality; wherein the at least two candidate entity recognition algorithms are both deployed in the server;
the entity recognition module 304 is configured to invoke the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, so as to obtain a target entity included in the target video.
On the basis of the above embodiment, the target entity identification algorithm determining module 303 is specifically configured to:
if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server;
if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from the visual entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server;
if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server.
On the basis of the above embodiment, the target entity identification algorithm determining module 303 is specifically further configured to:
if the target modality comprises a text modality, invoking a video domain classification algorithm provided by the server to determine the target domain to which the target video belongs; the video domain classification algorithm is deployed in the server;
if the target domain belongs to a preset domain set, determining a knowledge-importance entity recognition algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server;
if the target domain does not belong to the preset domain set, determining a relevance-based text entity algorithm or a sequence-based text entity algorithm as the target text entity recognition algorithm to be used from that class;
wherein the text entity algorithm based on the relevance comprises at least one of the following: an entity algorithm based on xgboost classification, an entity algorithm based on word2vec, a textrank graph walk algorithm, a word rank algorithm based on term importance and a ranking algorithm based on tf-idf; the network structure of the text entity algorithm based on the sequence is BiLSTM-CRF.
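As one concrete instance of the relevance-based class, a TF-IDF-style ranking of candidate entity words might look as follows; the exact weighting and smoothing are assumptions, since the patent only names the algorithm family:

```python
import math
from collections import Counter

def tfidf_rank(candidates, doc_tokens, corpus):
    """Rank candidate entity words for one video's text modal features:
    term frequency within this video's text, times inverse document
    frequency over a background corpus (a list of token lists)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scored = []
    for word in candidates:
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed)
        scored.append((word, tf[word] / max(len(doc_tokens), 1) * idf))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```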
On the basis of the above embodiment, the target entity identification algorithm determining module 303 is specifically further configured to:
invoking a named entity recognition algorithm provided by the server to recognize candidate entities included in the text modal characteristics; the named entity recognition algorithm is deployed in the server;
determining a target entity category associated with the target domain according to a mapping relation between a preset domain and a preset entity category in the knowledge importance entity recognition algorithm;
and taking the candidate entity belonging to the target entity category as a target entity in the target video.
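A minimal sketch of this filtering step follows; the domain-to-category mapping below is an invented example of the "preset mapping relation", not the patent's actual configuration:

```python
# Assumed preset mapping between domains and entity categories.
DOMAIN_TO_ENTITY_CATEGORIES = {
    "film_and_tv": {"movie", "actor", "role"},
    "music": {"song", "singer", "album"},
}

def knowledge_importance_filter(candidates, target_domain):
    """candidates: (entity, category) pairs produced by the named entity
    recognition algorithm; keep only those whose category is associated
    with the target domain of the video."""
    allowed = DOMAIN_TO_ENTITY_CATEGORIES.get(target_domain, set())
    return [entity for entity, category in candidates if category in allowed]

# Example: knowledge_importance_filter(
#     [("movie A", "movie"), ("Beijing", "location")], "film_and_tv")
# -> ["movie A"]
```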
On the basis of the above embodiment, the apparatus further includes an entity linking module, configured to:
invoke an entity linking algorithm provided by the server to establish a link between the target entity and an entity in the knowledge graph.
On the basis of the above embodiment, the target entity identification algorithm determining module 303 is specifically further configured to:
acquiring the required target entity precision;
if the target entity precision is the first precision, determining a relevance-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target entity precision is the second precision, determining a sequence-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server; wherein the first precision is lower than the second precision.
On the basis of the above embodiment, the apparatus further includes a text modality feature screening module configured to:
invoking a text quality model provided by the server to determine the text quality of the text modal features;
and screening the text modal features according to the determination result.
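By way of illustration, such screening can be as simple as thresholding a quality score; the model interface and the 0.5 threshold below are assumptions, not specified by the patent:

```python
def screen_text_features(text_features, quality_model, threshold=0.5):
    """Score each text modal feature (title, OCR text, author tags, ...)
    with the server-side text quality model and keep only those whose
    quality is judged sufficient for entity recognition."""
    return [t for t in text_features if quality_model.score(t) >= threshold]
```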
On the basis of the above embodiment, the target entity identification algorithm determining module 303 is specifically further configured to:
if the target modality comprises a visual modality, further acquiring a target visual category that the user focuses on;
if the target visual category is a human face, determining a face entity recognition algorithm as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target visual category is an object, determining a visual object recognition algorithm as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
and if the target visual category is a visual fingerprint, determining a visual fingerprint identification algorithm as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
On the basis of the above embodiment, the apparatus further includes an entity information fusion module, configured to:
fusing the information of at least two identified entities.
For example, if at least two entity words are associated with the same entity, the information of the at least two entities is fused into one record, as in the sketch below.
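A minimal fusion sketch under assumed record layouts: mentions that resolve to the same entity are merged, pooling their modal sources and keeping the highest confidence.

```python
def fuse_entities(mentions):
    """mentions: dicts like {"entity": str, "source": str, "confidence": float}
    coming from the text, visual, and audio recognizers. Mentions of the
    same entity are merged into one record; the layout is an assumption."""
    merged = {}
    for m in mentions:
        rec = merged.setdefault(m["entity"], {"entity": m["entity"],
                                              "sources": set(),
                                              "confidence": 0.0})
        rec["sources"].add(m["source"])
        rec["confidence"] = max(rec["confidence"], m["confidence"])
    return list(merged.values())
```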
On the basis of the above embodiment, the apparatus further includes a confidence adjustment module configured to:
adjusting the confidence of the entity according to the modal source information of the entity and/or the correlation of the entity in the knowledge graph, as sketched below.
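The concrete adjustment rules are recited in claim 1 below; a hedged sketch of them, with an assumed additive delta scheme, might read:

```python
# Third class of modal sources, per the claims: the audio modality and
# author-tag text among the text modality sources.
THIRD_CLASS_SOURCES = {"audio", "author_tag_text"}

def adjust_confidence(entity, kg_related, kg_contradictory, delta=0.1):
    """entity: a fused record with a set of modal "sources" and a
    "confidence" value. kg_related / kg_contradictory: whether the knowledge
    graph shows a correlation or a contradiction with other recognized
    entities. The additive delta is an assumption; the patent does not fix
    the adjustment magnitude."""
    sources = entity["sources"]
    if len(sources) >= 2:
        entity["confidence"] += delta          # multiple modal sources agree
    elif len(sources) == 1 and next(iter(sources)) in THIRD_CLASS_SOURCES:
        entity["confidence"] -= delta          # weak, third-class-only source
    if kg_contradictory:
        entity["confidence"] -= delta          # contradicted by the graph
    elif kg_related:
        entity["confidence"] += delta          # corroborated by the graph
    return entity
```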
The entity recognition device 300 for video scenes provided in this embodiment of the present application can execute the entity recognition method for video scenes provided in any embodiment of the present application, and has the corresponding functional modules and beneficial effects. For technical details not described in detail in this embodiment, reference may be made to the entity recognition method for video scenes provided in any embodiment of the present application.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 4, a block diagram of an electronic device for the method of entity recognition of video scenes according to an embodiment of the application is provided. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 4, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 401 is illustrated in fig. 4.
The memory 402 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by at least one processor, so that the at least one processor performs the entity recognition method for video scenes provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the entity recognition method for video scenes provided by the present application.
The memory 402 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to a method for entity identification of a video scene in an embodiment of the present application (e.g., the target modality acquisition module 301, the target modality feature extraction module 302, the target entity identification algorithm determination module 303, and the entity identification module 304 shown in fig. 3). The processor 401 executes various functional applications of the server and data processing, i.e. a method of implementing entity recognition of video scenes in the above-described method embodiments, by running non-transitory software programs, instructions and modules stored in the memory 402.
The memory 402 may include a storage program area and a storage data area, where the storage program area may store an operating system and an application program required by at least one function, and the storage data area may store data created according to the use of the electronic device for entity recognition of video scenes, and the like. In addition, the memory 402 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 402 may optionally include memories remotely located with respect to the processor 401, and these remote memories may be connected via a network to the electronic device for entity recognition of video scenes. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method for entity identification of a video scene may further include: an input device 403 and an output device 404. The processor 401, memory 402, input device 403, and output device 404 may be connected by a bus or otherwise, for example in fig. 4.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for entity recognition of video scenes, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, one or more mouse buttons, a track ball, a joystick, and the like. The output device 404 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present application, since a plurality of entity recognition algorithms are preset to perform entity recognition on videos in different fields, the entities described by the video text can be accurately extracted, and the integrity of the recognition results is ensured.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (14)

1. A method for entity identification of a video scene, comprising:
acquiring a target video to be processed and at least one target mode;
extracting at least one target modal feature of the target video;
determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality; wherein the at least two candidate entity recognition algorithms are both deployed in the server;
invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature to obtain a target entity included in the target video;
wherein the determining, according to the at least one target modality, a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server includes:
if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, the method further includes:
if any entity has at least two modal sources, increasing the confidence of the entity;
if any entity has only a unique modal source and the unique modal source belongs to a third class of modal sources, reducing the confidence of the entity; wherein the third class of modal sources includes an audio modality and author-tag text among the text modality sources;
after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, the method further includes:
if, based on a knowledge graph in a knowledge base, a correlation exists between at least two entities, increasing the confidence of the two entities;
if, based on the knowledge graph in the knowledge base, it is determined that any entity word has no correlation with other entity words, reducing the confidence of that entity; or, if it is determined that at least two entities have a contradictory relation, reducing the confidence of the two entities.
2. The method according to claim 1, wherein determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality comprises:
and if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from the visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server.
3. The method of claim 2, wherein if the target modality includes a text modality, determining a target text entity recognition algorithm to be used from among a class of text entity recognition algorithms of at least two candidate entity recognition algorithms provided by a server, comprises:
if the target modality comprises a text modality, invoking a video domain classification algorithm provided by the server to determine the target domain to which the target video belongs; wherein the video domain classification algorithm is deployed in the server;
if the target domain belongs to a preset domain set, determining a knowledge importance entity recognition algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target domain does not belong to the preset domain set, determining a relevance-based text entity algorithm or a sequence-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
4. A method according to claim 3, wherein if the target text entity recognition algorithm is a knowledge importance entity recognition algorithm, invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature comprises:
invoking a named entity recognition algorithm provided by the server to recognize candidate entities included in the text modal characteristics; the named entity recognition algorithm is deployed in the server;
determining a target entity category associated with the target domain according to a mapping relation between a preset domain and a preset entity category in the knowledge importance entity recognition algorithm;
and taking the candidate entity belonging to the target entity category as a target entity in the target video.
5. The method of claim 4, wherein after taking the candidate entities belonging to the target entity class as target entities in the target video, further comprising:
and invoking an entity linking algorithm provided by the server to establish a link between the target entity and an entity in the knowledge graph.
6. A method according to claim 3, wherein determining a relevance-based text entity algorithm or a sequence-based text entity algorithm from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server as the target text entity recognition algorithm to be used comprises:
acquiring the required target entity precision;
if the target entity precision is the first precision, determining the relevance-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target entity precision is the second precision, determining the sequence-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server; wherein the first precision is lower than the second precision.
7. The method of claim 1, wherein if the target modality includes a text modality, invoking a text entity recognition algorithm to perform entity recognition on the at least one target modality feature, further comprising:
invoking a text quality model provided by a server to determine the text quality of the text modal feature;
and screening the text modal characteristics according to the determination result.
8. The method of claim 2, wherein if the target modality includes a visual modality, determining a target visual entity recognition algorithm to be used from among a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server, comprises:
if the target modality comprises a visual modality, further acquiring a target visual category that the user focuses on;
if the target visual category is a human face, determining a face entity recognition algorithm as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target visual category is an object, determining a visual object recognition algorithm as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
and if the target visual category is a visual fingerprint, determining a visual fingerprint identification algorithm as the target visual entity recognition algorithm to be used from the visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
9. The method of claim 1, wherein after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, the method further comprises:
fusing the information of at least two identified entities.
10. An apparatus for entity identification of a video scene, comprising:
The target mode acquisition module is used for acquiring a target video to be processed and at least one target mode;
the target modal feature extraction module is used for extracting at least one target modal feature of the target video;
a target entity recognition algorithm determining module, configured to determine a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality; wherein the at least two candidate entity recognition algorithms are both deployed in the server;
the entity recognition module is used for calling the target entity recognition algorithm to perform entity recognition on the at least one target modal feature so as to obtain a target entity included in the target video;
the target entity identification algorithm determining module is specifically configured to:
if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
The apparatus further includes a confidence adjustment module for:
if any entity has at least two modal sources, increasing the confidence of the entity;
if any entity has only a unique modal source and the unique modal source belongs to a third class of modal sources, reducing the confidence of the entity; wherein the third class of modal sources includes an audio modality and author-tag text among the text modality sources;
the confidence adjustment module is further configured to:
if, based on a knowledge graph in a knowledge base, a correlation exists between at least two entities, increasing the confidence of the two entities;
if, based on the knowledge graph in the knowledge base, it is determined that any entity word has no correlation with other entity words, reducing the confidence of that entity; or, if it is determined that at least two entities have a contradictory relation, reducing the confidence of the two entities.
11. The apparatus of claim 10, wherein the target entity identification algorithm determination module is specifically configured to:
and if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from the visual entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server.
12. The apparatus of claim 11, wherein the target entity identification algorithm determination module is further specifically configured to:
if the target modality comprises a text modality, invoking a video domain classification algorithm provided by the server to determine the target domain to which the target video belongs; wherein the video domain classification algorithm is deployed in the server;
if the target domain belongs to a preset domain set, determining a knowledge importance entity recognition algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target domain does not belong to the preset domain set, determining a relevance-based text entity algorithm or a sequence-based text entity algorithm as the target text entity recognition algorithm to be used from the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
CN201910828965.2A 2019-09-03 2019-09-03 Entity identification method and device for video scene, electronic equipment and medium Active CN110532433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910828965.2A 2019-09-03 2019-09-03 Entity identification method and device for video scene, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN110532433A (en) 2019-12-03
CN110532433B (en) 2023-07-25

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639234B (en) * 2020-05-29 2023-06-27 北京百度网讯科技有限公司 Method and device for mining core entity attention points
CN111832727B (en) * 2020-07-17 2021-10-08 海南大学 Cross-data, information, knowledge modality and dimension essence identification method and component
CN112100346B (en) * 2020-08-28 2021-07-20 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112183334B (en) * 2020-09-28 2024-03-22 南京大学 Video depth relation analysis method based on multi-mode feature fusion
CN112330519A (en) * 2020-11-17 2021-02-05 珠海大横琴科技发展有限公司 Data processing method and device
CN113297419B (en) * 2021-06-23 2024-04-09 南京谦萃智能科技服务有限公司 Video knowledge point determining method, device, electronic equipment and storage medium
CN113610034B (en) * 2021-08-16 2024-04-30 脸萌有限公司 Method and device for identifying character entities in video, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080052262A1 (en) * 2006-08-22 2008-02-28 Serhiy Kosinov Method for personalized named entity recognition
US9648066B2 (en) * 2014-08-29 2017-05-09 The Boeing Company Peer to peer provisioning of data across networks
US10129573B1 (en) * 2017-09-20 2018-11-13 Microsoft Technology Licensing, Llc Identifying relevance of a video
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016205432A1 (en) * 2015-06-16 2016-12-22 Microsoft Technology Licensing, Llc Automatic recognition of entities in media-captured events
EP3367313A1 (en) * 2017-02-28 2018-08-29 Accenture Global Solutions Limited Content recognition and communication system
CN108377418A (en) * 2018-02-06 2018-08-07 北京奇虎科技有限公司 A kind of video labeling treating method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant