CN110750656B - Multimedia detection method based on knowledge graph - Google Patents


Publication number
CN110750656B
CN110750656B · Application CN201911036867.1A
Authority
CN
China
Prior art keywords
file
multimedia
portrait
user
face
Prior art date
Legal status
Active
Application number
CN201911036867.1A
Other languages
Chinese (zh)
Other versions
CN110750656A (en)
Inventor
袁赛杰
谢赟
韩欣
许青青
Current Assignee
Shanghai Datatom Information Technology Co ltd
Original Assignee
Shanghai Datatom Information Technology Co ltd
Priority date
Application filed by Shanghai Datatom Information Technology Co ltd filed Critical Shanghai Datatom Information Technology Co ltd
Priority to CN201911036867.1A priority Critical patent/CN110750656B/en
Publication of CN110750656A publication Critical patent/CN110750656A/en
Application granted granted Critical
Publication of CN110750656B publication Critical patent/CN110750656B/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — … of unstructured textual data
    • G06F 16/36 — Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 — Ontology
    • G06F 16/40 — … of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/41 — Indexing; Data structures therefor; Storage structures
    • G06F 16/43 — Querying
    • G06F 16/435 — Filtering based on additional data, e.g. user or group profiles
    • G06F 16/436 — … using biological or physiological data of a human being, e.g. blood pressure, facial expression, gestures
    • G06F 16/45 — Clustering; Classification
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a multimedia detection method based on a knowledge graph, comprising the following steps: each user uploads shared multimedia files and custom tags to a multimedia database; picture files and video files containing portraits are processed through face detection, face recognition, and face comparison to construct a portrait library; the remaining picture and video files are processed through image classification and object detection to identify scenes and physical objects; text files are classified by two classifiers, a general classifier and an education classifier, and category labels are attached according to the classification results; a file map is composed for each user; a character relation network is constructed; and the user searches on the basis of the file map, the portrait library, and the character relation network. Associative retrieval over multimedia attributes is thus performed on the basis of a knowledge graph, which is convenient for users.

Description

Multimedia detection method based on knowledge graph
Technical Field
The invention relates to the technical field of information retrieval, in particular to a multimedia detection method based on a knowledge graph.
Background
At present, image and text retrieval mostly focuses on single-modality retrieval, in which the query and the candidate set belong to the same modality. Cross-modal retrieval, by contrast, establishes an information mapping among multiple modalities to express and transform information across different representational spaces, and finally realizes retrieval across differences in the form of information resources. With the development and enrichment of multimedia technology, the demand for retrieval across modalities is growing stronger. The biggest problem faced by cross-modal retrieval today is how to better realize mutual recognition and retrieval among modalities such as text and images, which is the goal and significance of cross-modal retrieval. After a knowledge graph is introduced into a multimedia search system, it becomes easier to obtain contextual data for different searches, users are better supported in expressing search intent in natural language, features of different contexts can be discovered through further reasoning, and more accurate semantic analysis of user queries and more accurate search are realized.
With the development of artificial intelligence and the rapid growth of the demand for knowledge, knowledge graphs have received great attention in industry and academia. The knowledge graph was proposed by Google in 2012, originally referring to a knowledge base used to improve search-engine performance; in the broad sense, a knowledge graph refers to any kind of knowledge base. A knowledge graph aggregates information, data, and link relations into knowledge, and is an effective way of organizing knowledge in a big-data environment. Large-scale knowledge graphs play an important role in intelligent search, intelligent question answering, intelligent recommendation, information analysis, anti-fraud, disambiguation of user input, social networking, finance, medical treatment, e-commerce, education, scientific research, and other fields. As their scale grows rapidly, understanding, analyzing, and utilizing large-scale knowledge graphs remains a challenge. Visualization maps abstract data onto graphic elements and, aided by human-computer interaction, helps users perceive and analyze the data effectively. Introducing the knowledge graph into multimedia retrieval therefore plays an important role in improving retrieval performance.
Disclosure of Invention
The invention aims to provide a knowledge-graph-based multimedia detection method that performs associative retrieval over multimedia attributes on the basis of a knowledge graph and is convenient for users.
The technical scheme for realizing the purpose is as follows:
a knowledge-graph-based multimedia detection method comprises the following steps:
s1, constructing a multimedia database, and uploading a shared multimedia file and a custom tag to the multimedia database by each user;
s2, aiming at picture files and video files containing the portrait in the multimedia files, processing through face detection, face recognition and face comparison to construct a portrait library; processing the picture file and the video file with the portrait removed in the multimedia file through image classification and target detection to identify a scene and a real object; classifying the text files in the multimedia files through two classifiers, namely a general classifier and an education classifier, and marking category labels according to classification results;
s3, analyzing the uploaded multimedia files by each user in the S2 to obtain portrait data, scene data, real object data and text file classification data, associating the portrait data, the scene data, the real object data and the text file classification data with each multimedia file on one hand, and associating the user with the uploaded multimedia files, the custom tags and the portrait data to form a file map on the other hand;
s4, determining the relationship type between two characters according to the number of group photos and the number of people on the group photos existing between every two characters in the constructed portrait library, and constructing a character relationship network;
and S5, searching by the user based on the file map, the character library and the character relation network.
Preferably, the step S2 includes:
when a portrait is detected in a picture file or in a key frame of a video file, side faces and blurred faces are first filtered out by face detection; the portrait information is then expressed as a vector by face recognition, and the vector is compared with the data in the portrait library; the similarities are computed and sorted in descending order; if the highest similarity exceeds a preset value, the match is considered successful and the new portrait information is added under the matched identity, otherwise the match fails and a new identity entry is created in the portrait library; the portrait library is constructed in this way;
and for key frames of picture or video files without portraits, image classification and object detection are carried out with an object detection model to identify related scenes or physical objects, and the objects are then classified with a deep convolutional network.
Preferably, the face detection means: the ratio of the eye-corner distances of the two eyes of the portrait is calculated from facial feature points; when the ratio exceeds a preset multiple, the face is judged to be a side face and is filtered out; edge detection is performed with the Laplacian operator, and when the Laplacian value is smaller than a preset value, the portrait image is regarded as a blurred face and is filtered out.
Preferably, the face detection and the face recognition are performed with an angle-based (angular-margin) face recognition model.
Preferably, the human relationship in step S4 includes:
when two people appear in the same photo in the form of a group photo, the two people are considered to be in the same frame relationship;
when the number of group photos of the two people is greater than a first preset value and not greater than a third preset value, and the number of people in each photo is smaller than a second preset value, the two people are considered to be in an acquaintance relationship;
and when the number of group photos of the two people is greater than the third preset value and the number of people in each photo is smaller than the second preset value, the two people are considered to be in a close relationship.
Preferably, in step S3, the file map records the operations that other users perform on data in the file map, and stores these operations as association relations.
Preferably, the general text categories include sports, finance, real estate, home, education, science and technology, current politics, fashion, games, entertainment, lottery, stock, society, and constellation; the education text categories include instruments and equipment, party and government affairs, capital construction, field affairs, teaching, scientific research, administration, and financial accounting.
Preferably, the step S5 includes:
the user searches with a picture: the portrait information is expressed as a vector through face detection and face recognition, and/or the picture is processed through image classification and object detection to recognize scenes and objects; the vectorized portrait information is then matched by similarity against the portraits in the file map, the character library, and the character relation network, and/or the recognized scenes and objects are matched by similarity against the scenes and objects in the file map; or
the user searches with natural-language keywords: the keywords are corrected with a lexicon and a shortest-edit-distance method, and retrieval matching is performed over the file map, the character library, and the character relation network;
and when the search matches a corresponding result, the related picture, video, text, and homepage links of that result are displayed preferentially, together with an option to run the search intent as a general content search.
Preferably, in the step S5,
when the matched picture or keyword information is a scene name, related pictures of the same scene are displayed preferentially, a user file network map centred on the scene name is formed, and an option entry for searching the scene name as general content is provided;
when the matched picture or keyword information is an object name, related pictures containing the object are displayed preferentially, a user file network map centred on the object is formed, and an option entry for searching the object name as general content is provided;
when the matched keyword information is a file category name, related documents under the category are displayed preferentially, a user file network map centred on the file category is formed, and an option entry for searching the category name as general content is provided;
when the matched picture information is a portrait name, related pictures and videos of that portrait are displayed preferentially, a user file network map centred on the portrait is formed, and an option entry for searching the portrait name as general content is provided;
when the matched keyword information is a user name, the user's homepage link is displayed preferentially, a user file network map centred on the user is formed, and an option entry for searching the user name as general content is provided;
when the matched keyword information is a custom tag, related files containing the custom tag are displayed preferentially, a user file network map centred on those files is formed, and an option entry for searching the custom tag as general content is provided;
and when no match succeeds, the picture or keyword information is treated as general content, a general content search is performed, and the corresponding results are displayed.
The invention has the following beneficial effects. On the basis of the knowledge graph, the invention combines multimedia file analysis techniques, namely picture and image recognition (including face detection, filtering, recognition, and comparison, and object and scene recognition) and text classification (two-layer classification with a general classifier and an education-specific classifier), to obtain retrieval data of multiple dimensions consistent with the current user's query, and visualizes the retrieval data in the form of a knowledge graph, making it easier for the user to understand. This promotes the development of retrieval technology toward multi-modal and intelligent retrieval, and has high practical application value and broad application prospects.
Drawings
FIG. 1 is a schematic flow diagram of the knowledge-graph-based multimedia detection method of the present invention;
FIG. 2 is a schematic diagram of a process for constructing a character library according to the present invention;
FIG. 3 is a schematic diagram of the relationship definition of the characters in the present invention;
fig. 4 is a schematic diagram of the search process in the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
Referring to fig. 1, the method for detecting multimedia based on knowledge-graph of the present invention includes the following steps:
s1, a multimedia database is established, and each user uploads a shared multimedia file and a custom label to the multimedia database. Because the related knowledge of the current dissemination entity is no longer represented by only a single medium, the information is often disseminated in a multimedia multi-channel manner. The entity information of various media forms (pictures, videos, texts, users and the like) is analyzed by utilizing technologies such as image classification, target detection, text classification and the like. The multimedia file comprises natural attributes (such as the uploading place, time, size, file type and the like) and social attributes (such as the classified category of the file, the portrait, the real object, the scene and the like identified in the file).
And S2, aiming at the picture file and the video file containing the portrait in the multimedia file, processing through face detection, face recognition and face comparison to construct a portrait library. As shown in fig. 2, specifically:
First, face detection and face recognition are carried out with an open-source angle-based face recognition model. When a portrait is detected in a picture file or in a key frame of a video file, side faces and blurred faces are filtered out by face detection: the ratio of the eye-corner distances of the two eyes is calculated from facial feature points, and when the ratio exceeds a preset multiple (for example, 3) the face is judged to be a side face and filtered out. Edge detection is then performed with the Laplacian operator from an open-source image-processing library. The Laplacian operator locates slowly varying edges through the zero crossing between the positive and negative peaks of the second derivative, and highlights isolated points, isolated lines, and line end points in the image. When the Laplacian value is smaller than a predetermined value (for example, 12), the image is filtered out as a blurred face; clear images with values greater than 12 are retained. The Laplacian is the simplest isotropic differential operator and is rotation-invariant. The Laplacian of a two-dimensional image function is the isotropic second derivative, defined as:
∇²f = ∂²f/∂x² + ∂²f/∂y²
where f(x, y) is the image function and the two terms are its second partial derivatives in the x and y directions.
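The two filters above can be sketched in a few lines of Python/NumPy. This is a minimal illustration, not the patent's implementation: the eye widths are assumed to come from a landmark detector, and blur is scored by the variance of the Laplacian response (a common proxy), reusing the text's example thresholds (ratio > 3, score < 12) as illustrative defaults.

```python
import numpy as np

# 3x3 discrete Laplacian kernel (isotropic second-derivative approximation)
LAPLACIAN_KERNEL = np.array([[0,  1, 0],
                             [1, -4, 1],
                             [0,  1, 0]], dtype=np.float64)

def is_side_face(left_eye_width, right_eye_width, max_ratio=3.0):
    """Treat the face as a profile when one eye's corner-to-corner
    distance is more than max_ratio times the other's."""
    wide = max(left_eye_width, right_eye_width)
    narrow = min(left_eye_width, right_eye_width)
    return wide / max(narrow, 1e-6) > max_ratio

def laplacian_score(gray):
    """Variance of the 3x3 Laplacian response over a grayscale image."""
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):          # explicit 3x3 convolution (valid region)
        for dx in range(3):
            out += LAPLACIAN_KERNEL[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return float(out.var())

def is_blurred(gray, threshold=12.0):
    """A low Laplacian score means few edges, i.e. a blurred face."""
    return laplacian_score(gray) < threshold
```

A flat image scores zero (filtered as blurred), while a sharp, textured image scores far above the threshold.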
Then the portrait information is expressed as a vector by face recognition, using the output of the second-to-last layer of the face recognition model as the face feature vector. The vectorized portrait information is compared with the data in the portrait library: the similarities are computed and sorted in descending order. If the highest similarity exceeds a preset value (for example, 70%), the match is considered successful and the new portrait information is added under the matched identity; otherwise the match fails and a new identity entry is created in the portrait library. The portrait library is constructed in this way.
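The compare-sort-match-or-enroll loop above can be sketched as follows. Cosine similarity and the in-memory dictionary `library` are assumptions for illustration (the patent does not name a similarity measure); the 70% threshold follows the example in the text.

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_or_enroll(face_vec, library, threshold=0.70):
    """Score the face against every identity, sort descending; above the
    threshold the face is filed under the match, otherwise a new identity
    entry is created — mirroring the portrait-library construction above."""
    scores = sorted(
        ((name, max(_cos(face_vec, v) for v in vecs))
         for name, vecs in library.items()),
        key=lambda t: t[1], reverse=True)
    if scores and scores[0][1] > threshold:
        name, sim = scores[0]
        library[name].append(list(face_vec))   # success: add under the match
        return name, sim
    new_id = f"person_{len(library) + 1}"      # failure: enroll new identity
    library[new_id] = [list(face_vec)]
    return new_id, (scores[0][1] if scores else 0.0)
```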
For key frames of picture or video files without portraits, image classification and object detection are carried out with an object detection model to identify related scenes or physical objects, and the objects are then classified with a deep convolutional network.
The text files in the multimedia files are classified by two classifiers, a general classifier and an education classifier, and category labels are attached according to the classification results. The general classifier is trained on the Chinese text-classification training set (THUCNews) of the Natural Language Processing (NLP) laboratory of Tsinghua University, using a convolutional-neural-network text-classification algorithm (TextCNN) to divide texts into 14 classes: sports, finance, real estate, home, education, science and technology, fashion, current politics, games, entertainment, lottery, stock, society, and constellation, with an accuracy of 98.7%. The education classifier is trained with TextCNN on a test set assembled from articles of various categories collected from college and university education websites, dividing texts into 8 classes: instruments and equipment, party and government affairs, capital construction, field affairs, teaching, scientific research, administration, and financial accounting, with an accuracy of 93%. When a user uploads a text file, its file name and full content are obtained and processed by stop-word removal, word segmentation, and feature vectorization; the text is first classified with the general classifier, and if it falls into one of the 13 non-education classes, the corresponding category label is attached; if it is classified as education, it is further classified with the education-specific classifier and labelled according to that result.
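The two-stage routing between the classifiers can be shown as a small function. The classifier callables here are stand-ins (the patent trains two TextCNN models), and the label names are English translations of the categories in the text.

```python
# Translated category labels; the patent's labels are Chinese.
GENERAL_LABELS = ["sports", "finance", "real estate", "home", "education",
                  "science and technology", "current politics", "fashion",
                  "games", "entertainment", "lottery", "stock", "society",
                  "constellation"]
EDUCATION_LABELS = ["instruments and equipment", "party and government affairs",
                    "capital construction", "field affairs", "teaching",
                    "scientific research", "administration", "financial accounting"]

def classify_document(text, general_clf, education_clf):
    """Run the general classifier first; only documents labelled
    'education' are re-classified by the education-specific model."""
    label = general_clf(text)
    if label == "education":
        return ("education", education_clf(text))
    return (label, None)
```

With stub classifiers, a campus document gets both a general and an education label, while anything else keeps only its general label.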
S3, for each user, the uploaded multimedia files are analysed by step S2 to obtain portrait data, scene data, object data, and text-classification data. These data are associated with each multimedia file on the one hand, and the user is associated with the uploaded multimedia files, the custom tags, and the portrait data on the other hand, preliminarily forming a network structure expressed as a knowledge graph, namely the file map. The operations that other users perform on data in the file map on the platform (such as previewing, downloading, and uploading) can then be recorded as association relations and updated into the graph, so that the graph information is continuously enriched.
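A minimal sketch of the file map as subject-predicate-object triples follows. The entity and predicate names are illustrative, not the patent's schema, and a production system would likely use a graph database rather than an in-memory set.

```python
from collections import defaultdict

class FileGraph:
    """Toy file map: nodes linked by (predicate, object) edges."""
    def __init__(self):
        self.out_edges = defaultdict(set)

    def add(self, subject, predicate, obj):
        self.out_edges[subject].add((predicate, obj))

    def record_operation(self, user, action, file):
        # operations such as preview/download keep enriching the graph
        self.add(user, action, file)

    def neighbours(self, node):
        return {obj for _, obj in self.out_edges[node]}

# Hypothetical entities, for illustration only
graph = FileGraph()
graph.add("user:li", "uploaded", "file:photo_001.jpg")
graph.add("file:photo_001.jpg", "depicts", "person:wang")
graph.add("file:photo_001.jpg", "scene", "scene:classroom")
graph.record_operation("user:zhao", "previewed", "file:photo_001.jpg")
```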
S4, the relationship type between two persons is determined from the number of group photos the two persons share in the constructed portrait library and the number of people in those photos, and a character relation network is constructed. As shown in fig. 3, the relationships are same-frame, acquaintance, and close, defined as follows:
when two people appear in the same photo in the form of a group photo, the two people are considered to be in the same frame relationship;
when the number of group photos of the two people is greater than a first preset value (for example, 3) and not greater than a third preset value (for example, 8), and the number of people in each photo is smaller than a second preset value (for example, 5), the two people are considered to be in an acquaintance relationship;
when the number of group photos of the two people is greater than the third preset value (for example, 8) and the number of people in each photo is smaller than the second preset value (for example, 5), the two people are considered to be in a close relationship.
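The three rules above can be written as one function, using the example thresholds from the text (3, 5, and 8) as defaults; the function signature is an illustration, not the patent's interface.

```python
def relationship(photos_together, people_per_photo, t1=3, t2=5, t3=8):
    """Classify the tie between two people from their co-appearance
    statistics: photos_together is the number of shared group photos,
    people_per_photo lists the head count of each of those photos."""
    if photos_together < 1:
        return "none"
    small_groups = all(n < t2 for n in people_per_photo)
    if photos_together > t3 and small_groups:
        return "close"                       # many shared small-group photos
    if t1 < photos_together <= t3 and small_groups:
        return "acquaintance"
    return "same-frame"                      # at least one shared photo
```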
S5, the user searches on the basis of the file map, the character library, and the character relation network. A query may be issued in any multimedia form (natural language such as a person name, file name, or keyword, or a file such as a picture), and returns the file map of the search term or the relationship map of the person name, together with basic results such as file information, map information, and person information. For the query content entered by the user, text and other multimedia content are analysed separately and then combined, so that the user's query intent can be better analysed and the desired result returned. As shown in fig. 4, specifically:
the user searches with a picture: the portrait information is expressed as a vector through face detection and face recognition, and/or the picture is processed through image classification and object detection to recognize scenes and objects; the vectorized portrait information is then matched by similarity against the portraits in the file map, the character library, and the character relation network, and/or the recognized scenes and objects are matched by similarity against the scenes and objects in the file map; or
the user searches with natural-language keywords: the keywords are corrected with a lexicon and a shortest-edit-distance method, and retrieval matching is performed over the file map, the character library, and the character relation network;
when the search matches a corresponding result, the related picture, video, text, and homepage links of that result are displayed preferentially, together with an option to run the search intent as a general content search. The cases are divided as follows:
when the matched picture or keyword information is a scene name, related pictures of the same scene are displayed preferentially, a user file network map centred on the scene name is formed, and an option entry for searching the scene name as general content is provided;
when the matched picture or keyword information is an object name, related pictures containing the object are displayed preferentially, a user file network map centred on the object is formed, and an option entry for searching the object name as general content is provided;
when the matched keyword information is a file category name, related documents under the category are displayed preferentially, a user file network map centred on the file category is formed, and an option entry for searching the category name as general content is provided;
when the matched picture information is a portrait name, related pictures and videos of that portrait are displayed preferentially, a user file network map centred on the portrait is formed, and an option entry for searching the portrait name as general content is provided;
when the matched keyword information is a user name, the user's homepage link is displayed preferentially, a user file network map centred on the user is formed, and an option entry for searching the user name as general content is provided;
when the matched keyword information is a custom tag, related files containing the custom tag are displayed preferentially, a user file network map centred on those files is formed, and an option entry for searching the custom tag as general content is provided;
and when no match succeeds, the picture or keyword information is treated as general content, a general content search is performed, and the corresponding results are displayed.
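The keyword correction used in the natural-language search path, combining a lexicon with the shortest edit distance, can be sketched as follows. The `max_dist` cut-off is an assumed guard (the text does not specify one) so that a term with no close lexicon entry falls through to the general content search unchanged.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct_keyword(keyword, lexicon, max_dist=2):
    """Snap a query term to the nearest lexicon entry; leave it
    unchanged if nothing is within max_dist edits."""
    best = min(lexicon, key=lambda w: edit_distance(keyword, w))
    return best if edit_distance(keyword, best) <= max_dist else keyword
```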
On the basis of the association relations and standardized information of the file map, such as the social and natural attributes of objects, the corresponding information in the file map is integrated to build applications such as character relationships, portrait retrieval, keyword search, and map search. This realizes an all-round, multi-angle multimedia retrieval system based on the file-map technique, builds a relatively complete knowledge system, improves the breadth and depth of search, and presents search results better.
The above embodiments are provided only to illustrate the invention, not to limit it; those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention, so all equivalent technical solutions also fall within the scope of the invention, which is defined by the claims.

Claims (9)

1. A knowledge graph-based multimedia detection method is characterized by comprising the following steps:
s1, constructing a multimedia database, and uploading a shared multimedia file and a custom tag to the multimedia database by each user;
s2, aiming at picture files and video files containing the portrait in the multimedia files, processing through face detection, face recognition and face comparison to construct a portrait library; processing the picture file and the video file with the portrait removed in the multimedia file through image classification and target detection to identify a scene and a real object; classifying the text files in the multimedia files through two classifiers, namely a general classifier and an education classifier, and marking category labels according to classification results;
s3, analyzing the uploaded multimedia files by the S2 to obtain portrait data, scene data, real object data and text file classification data for each user, wherein on one hand, the portrait data, the scene data, the real object data and the text file classification data are associated with each multimedia file, and on the other hand, the user is respectively associated with the uploaded multimedia files, the custom tags and the portrait data to form a file map;
s4, determining the relationship type between two characters according to the number of group photos and the number of people on the group photos existing between every two characters in the constructed portrait library, and constructing a character relationship network;
and S5, searching by the user based on the file map, the character library and the character relation network.
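The file map of steps S1–S3 can be sketched as a small labelled graph; all class, field and relation names below are illustrative assumptions, not terminology from the patent.

```python
# Minimal sketch of the file map (document graph) of steps S1-S3.
# Node types, relation names and identifiers are hypothetical.

class FileMap:
    def __init__(self):
        self.nodes = {}   # node_id -> {"type": ..., "label": ...}
        self.edges = []   # (src, relation, dst) triples

    def add_node(self, node_id, node_type, label):
        self.nodes[node_id] = {"type": node_type, "label": label}

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation=None):
        return [d for s, r, d in self.edges
                if s == node_id and (relation is None or r == relation)]

# S1: a user uploads a shared file with a custom tag
fm = FileMap()
fm.add_node("user:alice", "user", "alice")
fm.add_node("file:1", "picture", "holiday.jpg")
fm.add_node("tag:beach", "custom_tag", "beach")
fm.relate("user:alice", "uploaded", "file:1")
fm.relate("user:alice", "tagged", "tag:beach")

# S3: analysis results (portrait / scene / real object) are linked to the file
fm.add_node("portrait:p7", "portrait", "person-7")
fm.add_node("scene:sea", "scene", "seaside")
fm.relate("file:1", "contains_portrait", "portrait:p7")
fm.relate("file:1", "depicts_scene", "scene:sea")

print(fm.neighbors("file:1"))  # → ['portrait:p7', 'scene:sea']
```

In this shape, step S5's retrieval reduces to traversing edges from a matched node outwards, which is how the user-file network maps of claim 9 can be assembled.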
2. The knowledge-graph-based multimedia detection method according to claim 1, wherein the step S2 comprises:
when a portrait is detected in a picture file or in a key frame of a video file, first filtering out side faces and blurred faces through face detection, then expressing the portrait information as a vector through face recognition, then comparing the vectorized portrait information against the data in the portrait library through face comparison, computing the similarities and sorting them in descending order; if the highest similarity is greater than a preset value, the match is considered successful and the new portrait information is added to the matched entry of the portrait library; otherwise, when the match fails, a new entry is added to the portrait library; the portrait library is built up in this way;
and for picture files and key frames of video files that contain no portraits, performing image classification and target detection with a target detection model to identify related scenes or real objects, and then classifying the real objects by means of a deep convolutional network.
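The match-or-insert loop of claim 2 can be sketched as follows. The patent only says "similarity"; cosine similarity, the threshold value and the naming scheme for new entries are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors (an assumed metric;
    # the patent does not name the similarity measure).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_portrait(embedding, library, threshold=0.8):
    """Compare a face embedding against the portrait library (claim 2).

    Similarities are sorted in descending order; if the best one exceeds
    the preset threshold the match succeeds, otherwise a new library
    entry is created for the unknown person."""
    scored = sorted(
        ((cosine_similarity(embedding, vec), name)
         for name, vec in library.items()),
        reverse=True)
    if scored and scored[0][0] > threshold:
        return scored[0][1]                # match succeeded
    new_id = f"person-{len(library) + 1}"  # match failed: add new entry
    library[new_id] = embedding
    return new_id
```

A toy run: with `library = {"person-1": [1.0, 0.0]}`, the query `[0.9, 0.1]` matches `person-1`, while the near-orthogonal query `[0.0, 1.0]` falls below the threshold and is registered as a new person.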
3. The knowledge-graph-based multimedia detection method according to claim 2, wherein the face detection means: calculating, from the distances between facial feature points, the ratio of the eye-corner distances of the portrait; when the ratio is greater than a preset multiple, the face is judged to be a side face and is filtered out; and performing edge detection with the Laplace operator, where a Laplacian value smaller than a preset value marks the portrait image as a blurred face, which is filtered out.
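Both filters of claim 3 can be sketched in a few lines. The interpretation that the eye-width ratio compares the two eyes' projected widths, and both threshold values, are assumptions; the blur test uses the common variance-of-Laplacian measure.

```python
import math

def is_side_face(left_inner, left_outer, right_inner, right_outer,
                 max_ratio=1.5):
    """Claim 3 side-face test (sketch): when the head turns, the projected
    width of one eye shrinks, so the left/right eye-width ratio grows.
    max_ratio stands in for the unspecified 'preset multiple'."""
    lw = math.dist(left_inner, left_outer)
    rw = math.dist(right_inner, right_outer)
    ratio = max(lw, rw) / min(lw, rw)
    return ratio > max_ratio

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over a grayscale image
    (list of rows); low values indicate a blurred image."""
    h, w = len(gray), len(gray[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y-1][x] + gray[y+1][x] + gray[y][x-1]
                   + gray[y][x+1] - 4 * gray[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def is_blurred(gray, min_variance=100.0):
    # min_variance stands in for the unspecified preset value.
    return laplacian_variance(gray) < min_variance
```

A perfectly flat image has zero Laplacian variance and is filtered as blurred, while a high-contrast pattern passes; in practice one would tune `min_variance` on real face crops.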
4. The method of claim 2, wherein the face detection and face recognition are performed using an angle-based face recognition model.
5. The method of claim 1, wherein the person relationships in step S4 comprise:
when two persons appear together in the same photo as a group photo, the two are considered to be in a same-frame relationship;
when the number of group photos of the two persons is greater than a first preset value and not greater than a third preset value, and the number of people in each photo is smaller than a second preset value, the two are considered to be in an acquaintance relationship;
and when the number of group photos of the two persons is greater than the third preset value and the number of people in each photo is smaller than the second preset value, the two are considered to be in a close relationship.
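The three rules of claim 5 reduce to a small decision function. The concrete threshold values `t1`, `t2`, `t3` are placeholders for the first, second and third preset values, which the patent leaves unspecified.

```python
def relationship(n_photos_together, max_people_per_photo,
                 t1=3, t2=6, t3=10):
    """Claim 5 relationship rules (sketch).

    n_photos_together: number of group photos shared by the two persons.
    max_people_per_photo: largest head-count among those photos.
    t1/t2/t3: hypothetical first/second/third preset values."""
    if n_photos_together == 0:
        return None                       # no relationship edge at all
    if n_photos_together > t3 and max_people_per_photo < t2:
        return "close"                    # many small group photos
    if t1 < n_photos_together <= t3 and max_people_per_photo < t2:
        return "acquaintance"             # a few small group photos
    return "same-frame"                   # merely appeared together
```

Running this over every pair in the portrait library yields the edge labels of the person-relationship network of step S4.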
6. The knowledge-graph-based multimedia detection method according to claim 1, wherein in the step S3, the file map records the operation behaviors of other users on the data in the file map and associates those operation behaviors with the corresponding data.
7. The method of claim 1, wherein the general text classes comprise sports, finance, real estate, home, education, science and technology, fashion, games, entertainment, lottery, stock, society and constellation; and the education text classes comprise instruments, party and government affairs, capital construction, field work, teaching, scientific research, administration and financial accounting.
8. The knowledge-graph-based multimedia detection method according to claim 1, wherein the step S5 comprises:
the user searching with a picture: expressing the portrait information as a vector through face detection and face recognition, and/or identifying scenes and real objects through image classification and target detection; then matching the vectorized portrait information, by similarity, against the portraits in the file map, the portrait library and the person-relationship network, and/or matching the identified scenes and real objects, by similarity, against the scenes and real objects in the file map; or
the user searching with natural-language keywords: correcting the keywords with a lexicon combined with a minimum edit-distance method, and performing retrieval matching against the file map, the portrait library and the person-relationship network;
and when the user's search matches a corresponding result, displaying preferentially the related pictures, videos, texts and homepage links corresponding to the result, and giving an option to treat the search intention as a general content search.
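Claim 8's keyword correction, a lexicon combined with minimum edit distance, can be sketched with a standard Levenshtein implementation. The maximum accepted distance is an assumed parameter.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete from a
                           cur[j - 1] + 1,            # insert into a
                           prev[j - 1] + (ca != cb))) # substitute
        prev = cur
    return prev[-1]

def correct_keyword(query, lexicon, max_dist=2):
    """Claim 8 keyword correction (sketch): snap the query to the nearest
    lexicon term within a small edit distance, else keep it unchanged.
    max_dist is a hypothetical cutoff, not from the patent."""
    best = min(lexicon, key=lambda term: edit_distance(query, term))
    return best if edit_distance(query, best) <= max_dist else query
```

For example, against a lexicon of category names, the misspelling "teachng" snaps to "teaching", while gibberish far from every term is passed through to the general content search unchanged.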
9. The knowledge-graph-based multimedia detection method according to claim 8, wherein in step S5,
when the matched picture or keyword information is a scene name, related pictures of the same scene are displayed preferentially, forming a user file network map centered on the scene, and an option entry is provided for retrieving the scene name as general content;
when the matched picture or keyword information is an object name, related pictures containing the object are displayed preferentially, forming a user file network map centered on the object, and an option entry is provided for retrieving the object name as general content;
when the matched keyword information is a file category name, related documents under that category are displayed preferentially, forming a user file network map centered on the file category, and an option entry is provided for retrieving the file category name as general content;
when the matched picture information is a portrait, related pictures and videos of the portrait are displayed preferentially, forming a user file network map centered on the portrait, and an option entry is provided for retrieving the portrait name as general content;
when the matched keyword information is a user name, the user's homepage link is displayed preferentially, forming a user file network map centered on the user, and an option entry is provided for retrieving the user name as general content;
when the matched keyword information is a custom tag, related files containing the custom tag are displayed preferentially, forming a user file network map centered on the files related to the custom tag, and an option entry is provided for retrieving the custom tag as general content;
and when no match succeeds, the picture or keyword information is treated as general content, a general content search is performed, and the related results are displayed.
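The case analysis of claim 9 is a dispatch on the type of the matched entity, with a shared fallback to general content search. The rule table and return shape below are illustrative, not from the patent.

```python
# Sketch of claim 9's result dispatch: the type of the matched entity
# decides what is displayed first; general content search is the fallback.
# All names and the return-dict shape are hypothetical.

DISPLAY_RULES = {
    "scene":      "related pictures of the same scene",
    "object":     "related pictures containing the object",
    "file_class": "related documents under the class",
    "portrait":   "related pictures and videos of the person",
    "user":       "the user's homepage link",
    "custom_tag": "related files carrying the tag",
}

def dispatch(match):
    """match: an (entity_type, name) pair, or None when nothing matched."""
    if match is None or match[0] not in DISPLAY_RULES:
        return {"primary": "general content search results",
                "fallback": None}
    entity_type, name = match
    return {"primary": DISPLAY_RULES[entity_type],
            "center_of_map": name,    # centre of the user file network map
            "fallback": f"general content search for {name!r}"}
```

Every matched branch carries the same two outputs, the preferential display and the general-content option entry, which is why a single table-driven dispatch covers all seven cases.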
CN201911036867.1A 2019-10-29 2019-10-29 Multimedia detection method based on knowledge graph Active CN110750656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911036867.1A CN110750656B (en) 2019-10-29 2019-10-29 Multimedia detection method based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911036867.1A CN110750656B (en) 2019-10-29 2019-10-29 Multimedia detection method based on knowledge graph

Publications (2)

Publication Number Publication Date
CN110750656A CN110750656A (en) 2020-02-04
CN110750656B true CN110750656B (en) 2023-03-14

Family

ID=69280772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911036867.1A Active CN110750656B (en) 2019-10-29 2019-10-29 Multimedia detection method based on knowledge graph

Country Status (1)

Country Link
CN (1) CN110750656B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428044B (en) * 2020-03-06 2024-04-05 中国平安人寿保险股份有限公司 Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
CN111324819B (en) * 2020-03-24 2021-07-30 北京字节跳动网络技术有限公司 Method and device for searching media content, computer equipment and storage medium
CN111460119B (en) * 2020-03-27 2024-04-12 海信集团有限公司 Intelligent question-answering method and system for economic knowledge and intelligent equipment
CN111680202B (en) * 2020-04-24 2022-04-26 烽火通信科技股份有限公司 Body-based face image data collection method and device
CN111694965B (en) * 2020-05-29 2023-06-13 中国科学院上海微***与信息技术研究所 Image scene retrieval system and method based on multi-mode knowledge graph
WO2021237731A1 (en) * 2020-05-29 2021-12-02 西门子股份公司 Target detection method and device, and computer readable medium
CN112069331B (en) * 2020-08-31 2024-06-11 深圳市商汤科技有限公司 Data processing and searching method, device, equipment and storage medium
CN112069326A (en) * 2020-09-03 2020-12-11 Oppo广东移动通信有限公司 Knowledge graph construction method and device, electronic equipment and storage medium
CN112905812B (en) * 2021-02-01 2023-07-11 上海德拓信息技术股份有限公司 Media file auditing method and system
CN113157956B (en) * 2021-04-23 2022-08-05 雅马哈发动机(厦门)信息***有限公司 Picture searching method, system, mobile terminal and storage medium
CN113360459A (en) * 2021-07-08 2021-09-07 国网能源研究院有限公司 Method, system and device for semi-automatically marking and storing files
WO2023141900A1 (en) * 2022-01-27 2023-08-03 基建通(三亚)国际科技有限公司 Method and apparatus for establishing news image-text data knowledge graph, and device and medium
CN116881472B (en) * 2023-07-14 2024-04-30 郑州华商科技有限公司 Funds penetration and character relation analysis method based on graph database technology

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070130112A1 (en) * 2005-06-30 2007-06-07 Intelligentek Corp. Multimedia conceptual search system and associated search method
US20090157625A1 (en) * 2007-12-13 2009-06-18 Searete Llc, A Limited Liability Corporation Of The State Of Delaware Methods and systems for identifying an avatar-linked population cohort
CN105550190B (en) * 2015-06-26 2019-03-29 许昌学院 Cross-media retrieval system towards knowledge mapping
WO2018055646A1 (en) * 2016-09-22 2018-03-29 Dogma Srl. Method and system for searching, publishing and managing the life cycle of multimedia contents related to public events and the user experience

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Xiafeng (李夏风). Deep application of face recognition technology in smart cities. 2019, (06), full text. *


Similar Documents

Publication Publication Date Title
CN110750656B (en) Multimedia detection method based on knowledge graph
Strezoski et al. Omniart: a large-scale artistic benchmark
CN104063683B (en) Expression input method and device based on face identification
Stone et al. Toward large-scale face recognition using social network context
US7684651B2 (en) Image-based face search
CN113158023B (en) Public digital life accurate classification service method based on mixed recommendation algorithm
Zhou et al. Conceptlearner: Discovering visual concepts from weakly labeled image collections
CN110750995B (en) File management method based on custom map
CN107562742A (en) A kind of image processing method and device
CN107515934A (en) A kind of film semanteme personalized labels optimization method based on big data
Roy et al. Automated detection of substance use-related social media posts based on image and text analysis
Zhang et al. Image clustering: An unsupervised approach to categorize visual data in social science research
de Ves et al. A novel dynamic multi-model relevance feedback procedure for content-based image retrieval
JP5433396B2 (en) Manga image analysis device, program, search device and method for extracting text from manga image
CN107908749B (en) Character retrieval system and method based on search engine
Poornima et al. Multi-modal features and correlation incorporated Naive Bayes classifier for a semantic-enriched lecture video retrieval system
Fouad et al. Adaptive visual sentiment prediction model based on event concepts and object detection techniques in social media
CN110851629A (en) Image retrieval method
CN110287348A (en) A kind of GIF format picture searching method based on machine learning
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN109062995B (en) Personalized recommendation algorithm for drawing Board (Board) cover on social strategy exhibition network
CN111125387B (en) Multimedia list generation and naming method and device, electronic equipment and storage medium
CN114925198A (en) Knowledge-driven text classification method fusing character information
CN111506754A (en) Picture retrieval method and device, storage medium and processor
Vadivukarassi et al. A framework of keyword based image retrieval using proposed Hog_Sift feature extraction method from Twitter Dataset

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant