CN110879843B - Method for constructing self-adaptive knowledge graph technology based on machine learning

Info

Publication number: CN110879843B
Application number: CN201910722435.XA
Authority: CN
Inventors: 赵继胜; 吴宇
Original assignee: Shanghai Fudian Intelligent Technology Co ltd
Current assignee: Shanghai Fudian Intelligent Technology Co ltd
Priority date: 2019-08-06
Filing date: 2019-08-06
Publication date: 2020-08-04
Anticipated expiration: 2039-08-06
Also published as: CN110879843A

Abstract

The invention provides a knowledge graph implementation technique that uses machine learning to build indexes and associations across various kinds of information. It focuses on identifying the features of information held as unstructured data and on generating a graph database system from the associations between pieces of information. Unlike graph database systems built for structured information, the invention addresses the challenge of extracting and associating the unstructured data (such as images, audio and video) that is now widespread in commercial applications (see the attached figures). Taking machine-learning feature extraction as its technical basis, and using an adaptive feature correction technique that follows changes in the data, it constructs a feature-association-based graph database index system that combines structured and unstructured data, and realizes a knowledge graph on top of that index system, thereby enabling automatic knowledge graph construction for large-scale data. The technique can be widely applied to data analysis and query scenarios in intelligent application environments.

Description

Method for constructing self-adaptive knowledge graph technology based on machine learning

Technical Field

The invention belongs to the field of information technology, and in particular relates to a technique for constructing a knowledge graph through machine learning. The technique uses deep neural networks to extract features from different types of unstructured data and, on that basis, adaptively associates information across continuously updated knowledge base records, so that information acquisition and knowledge graph construction are simplified and knowledge graphs can be constructed automatically for large-scale data. The technique can be widely applied to data analysis and query scenarios in intelligent application environments, such as business intelligence, intelligent information retrieval, and the automatic information association involved in smart cities.

Background

A knowledge graph expresses information as associations between information monomers (individual entities), so it is generally represented as a graph (as shown in fig. 3). For example, the relationship between 'mike' and 'jason' is 'teacher and student': 'mike' and 'jason' are the information monomers, and 'teacher and student' is the information association between them. As the basis of intelligent systems, knowledge graphs are widely applied in scenarios such as business intelligence and intelligent research, which require associated search across different types of knowledge points. As application scenarios and requirements develop, data growth comes mainly from different types of unstructured data (for example, the information association between two pictures). Generating knowledge associations automatically with unstructured data as the main information base (see fig. 2) would provide more convenient technical support for building business intelligence platforms, and it is also an open technical problem in the design of current knowledge graph systems. In a graph database, associations between information monomers are realized by labels between the monomers. For unstructured data, and especially for information monomers with very similar features (for example, different photos of 'mike' all depict the same person), the same labels can be applied to all of them, which avoids the workload of manually labeling large amounts of data. Moreover, when the content or form of the associated information is modified manually, the modification can carry over to the automatic association of information monomers added afterwards.

Deep neural networks are widely applied to identifying and analyzing different types of data in the field of artificial intelligence, and have made good progress in processing unstructured data. In natural language processing in particular, techniques based on recurrent neural networks and their variants have been applied successfully to speech recognition and to feature extraction from speech and text. In the field of images, deep convolutional networks and their variants are widely applied in areas such as intelligent security and medical health, and have made great progress in feature extraction from pictures.

The present invention provides an automated knowledge graph construction technique that extracts features from unstructured data (see fig. 4) and applies the same information associations to information with similar features. The technique gives knowledge-graph-based intelligent application systems automatic management of unstructured data and greatly simplifies data acquisition, processing and classification. It provides effective support for business intelligence (product recommendation) and academic research (retrieval and search of related information).

Disclosure of Invention

The invention designs an automatic association technique for unstructured information. Starting from a limited set of user-labeled associations, it automatically associates subsequently input structured and unstructured information, so that the knowledge graph is constructed adaptively. It specifically provides the following:

1. the ability to automatically generate feature vectors for various types of unstructured information, including audio information, video information, text information, and picture information;

2. the ability to identify similar information by comparing the similarity of feature vectors;

3. the introduction of the same information association labels for similar information.

Adaptive knowledge graph construction oriented to unstructured information association (see figs. 1 and 5) comprises the following steps:

1. constructing the feature extraction models through training (see fig. 4):

a. feature extraction model for the text type: a text vectorization model is built from the collected text materials using the doc2vec technique;

b. feature extraction model for the picture type: pictures and their classification labels are collected as training samples, a deep neural network is trained using the resnet architecture, and the output of the trained network's fully connected layer is used as the feature vector;

c. feature extraction model for the audio and video types, whose feature vectors are generated by a recurrent neural network: the training data set is identified by labels (usually the audio or video title, or the author), a prediction model based on the recurrent neural network is built, and the sequence encoding produced by the trained recurrent neural network is then used as the output, that is, as the feature vector.

2. building the information similarity comparison system:

a. for each type of unstructured data, constructing a feature vector data table (see fig. 7) whose rows are (feature vector, data entity) pairs, and sorting the table by feature vector;

b. recording each newly inserted information monomer in the feature vector data table, at the position given by the ordering of the feature vectors;

c. looking up the associated content of similar information according to feature similarity;

d. establishing, for the new information monomer, the same information associations as those of the similar information;

3. automatically establishing information associations:

a. searching the feature vector data table for information similar to the newly inserted data;

b. extracting the associated content of the similar information;

c. establishing, for the new information monomer, the same information associations as those of the similar information.
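The feature vector data table of steps 2a and 2b can be illustrated with a minimal Python sketch. The class name `FeatureVectorTable`, its methods and the sample file names are hypothetical, and the sketch assumes that "sorted by feature vector" means component-by-component (lexicographic) order; the source does not spell out the ordering at this point, and the actual system stores the table in a graph database rather than in memory.

```python
import bisect


class FeatureVectorTable:
    """Rows of (feature vector, data entity), kept sorted by feature vector."""

    def __init__(self):
        self._rows = []

    def insert(self, vector, entity):
        # Assumption: component-by-component (lexicographic) order, which is
        # how Python compares tuples.
        row = (tuple(vector), entity)
        position = bisect.bisect_left(self._rows, row)
        self._rows.insert(position, row)
        return position

    def rows(self):
        # Return a copy so callers cannot break the sort order.
        return list(self._rows)


# Illustrative data only: two-component vectors stand in for real features.
table = FeatureVectorTable()
table.insert([0.9, 0.1], "mike_photo_1.jpg")
table.insert([0.1, 0.8], "jason_photo_1.jpg")
position = table.insert([0.8, 0.2], "mike_photo_2.jpg")
```

In this toy table the new photo lands next to the other 'mike' photo. Lexicographic order groups vectors with similar leading components, but it does not by itself guarantee that the most similar vectors are adjacent, so a similarity search would still compare vectors explicitly.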

The beneficial effects of the technical scheme of the invention are as follows:

In fields such as business intelligence, financial intelligence development and academic information collection, massive amounts of unstructured information must be associated automatically so that a knowledge graph can be built quickly. The prior art is limited to manually annotated information associations; because unstructured information contains a great deal of similar content, manual operation cannot keep up with continuously growing information or update the knowledge graph in time. The invention generates feature vectors for unstructured information with deep neural networks, compares the similarity of those feature vectors, and applies the same information associations to similar information, so that the knowledge graph is constructed adaptively while massive unstructured information is being acquired. The invention can therefore construct the knowledge graph automatically while collecting data efficiently, gives business intelligence more accurate and convenient knowledge graph support based mainly on unstructured information, and provides an efficient technical platform for large-scale unstructured data retrieval, information recommendation and analysis.

Drawings

FIG. 1 knowledge graph construction: manual tagging vs. automatic tagging based on machine learning

FIG. 2 structured/unstructured information knowledge graph

FIG. 3 structured information knowledge graph

FIG. 4 feature vector generation for various unstructured information

FIG. 5 implementation of a knowledge graph by Neo4J

FIG. 6 adaptive generation of the unstructured information knowledge graph by extending Neo4J with feature extraction and feature comparison

FIG. 7 feature vector data table

Detailed Description

Following the analysis framework for constructing unstructured data information associations set forth in the Disclosure of Invention, the invention is implemented as follows. The knowledge graph system of the present invention is implemented on the graph database Neo4J (see fig. 4), a widely used, stable graph data engine that supports both structured and unstructured information. To construct the adaptive knowledge graph, Neo4J needs to be extended in the following respects (see fig. 6):

a. a feature vector generation system for unstructured information (see FIG. 6);

b. feature vector data tables (see fig. 7) for managing the various kinds of unstructured information, with the correspondence between each kind of unstructured information and its feature vector data table stored in a feature vector management table;

Constructing the feature extraction models:

a. feature vector expression for unstructured information of the audio type: the audio signal is encoded by a recurrent neural network with 1000 input units and 500 hidden neurons;

b. feature extraction and vectorized expression for unstructured information of the text type: the algorithm is based on doc2vec, an extension of the *** word vector technique, and uses a wide sampling window (a sampling width of 200) to capture features of the text accurately and generate its feature vector;

c. feature vectorization for unstructured information of the picture type: the residual network resnet-50 is used as the feature extraction algorithm, and the output of its fully connected layer, with length set to 128, is taken as the feature vector;

d. feature vector expression for unstructured information of the video type: frames sampled periodically from the video are encoded with the picture-based feature vector generation technique of item c (with the feature vector length of each frame set to 32 and the number of sampled frames set to 128), and the resulting set of vectors is then re-encoded by a recurrent neural network with 4096 input units and 800 hidden neurons to produce the feature vector of the video.
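The video encoding in item d can be sketched in plain Python as follows. This is only one reading of the text, under stated assumptions: "periodically" is taken to mean evenly spaced frames, the 128 per-frame vectors of length 32 are assumed to be concatenated into one input of 128 × 32 = 4096 values (which matches the stated input width), and the recurrent network is assumed to be a plain Elman-style RNN whose final hidden state serves as the feature vector. The function names are hypothetical, and the weights would come from training, which is not shown.

```python
import math


def sample_frame_indices(total_frames, n_samples=128):
    """Evenly spaced frame indices (one reading of periodic sampling)."""
    step = total_frames / n_samples
    return [int(i * step) for i in range(n_samples)]


def flatten_frames(frame_vectors):
    """Concatenate per-frame feature vectors into a single input vector."""
    return [value for vector in frame_vectors for value in vector]


def rnn_encode(inputs, w_in, w_rec):
    """Run a plain RNN over `inputs` and return the final hidden state.

    w_in has one row per hidden unit (length = input width);
    w_rec has one row per hidden unit (length = number of hidden units).
    """
    hidden = [0.0] * len(w_rec)
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    return hidden
```

With the sizes given in the text, `w_in` would have 800 rows of 4096 weights for video and 500 rows of 1000 weights for audio; how a long recording is split into successive inputs is not specified in the source.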

Training data:

a. for the text feature extraction model, text materials are collected as the training data set;

b. for the picture feature extraction model, pictures and their classification labels are collected as training samples;

c. for audio and video information, whose feature vectors are generated by a recurrent neural network, the training data set is identified by labels (usually the audio or video title, or the author).

The information similarity comparison system comprises:

a. for each type of unstructured data, a feature vector data table whose rows are (feature vector, data entity) pairs is created in Neo4J, and the table is sorted by feature vector;

Automatically establishing information associations:

c. for a newly inserted information monomer I, the k most similar information monomers [J0, J1, …, Jk-1] are found in the corresponding feature vector data table;

d. for the similar information monomers [J0, J1, …, Jk-1], the set of associated information Rj is collected;

e. all the associated information in Rj is added to the information monomer I;

Feature vectors are ordered according to the standard geometric vector ordering.

Similarity is compared by calculating the KL (Kullback-Leibler) divergence between two feature vectors, and the number k of similar monomers is usually set to 3 or 5.
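Steps c to e, together with the KL divergence comparison, can be sketched as follows. KL divergence is defined between probability distributions, and the source does not say how raw feature vectors are normalized, so this sketch assumes a softmax normalization; that choice, the function names, and the sample entities and relations are all illustrative assumptions rather than the method of the invention.

```python
import math


def to_distribution(vector):
    """Softmax normalization (an assumption; the source names none)."""
    peak = max(vector)
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )


def k_most_similar(new_vector, table, k=3):
    """Step c: the k rows of `table` with the smallest divergence."""
    p = to_distribution(new_vector)
    ranked = sorted(
        table, key=lambda row: kl_divergence(p, to_distribution(row[0]))
    )
    return ranked[:k]


def auto_associate(new_vector, table, associations, k=3):
    """Steps d and e: collect the neighbours' associations (the set Rj)."""
    collected = set()
    for _, entity in k_most_similar(new_vector, table, k):
        collected |= associations.get(entity, set())
    return collected


# Illustrative data: (feature vector, data entity) rows and known labels.
table = [
    ([2.0, 0.1, 0.1], "mike_photo_1.jpg"),
    ([1.9, 0.2, 0.1], "mike_photo_2.jpg"),
    ([0.1, 2.0, 0.1], "jason_photo_1.jpg"),
]
associations = {
    "mike_photo_1.jpg": {("depicts", "mike")},
    "mike_photo_2.jpg": {("depicts", "mike"), ("teacher_of", "jason")},
    "jason_photo_1.jpg": {("depicts", "jason")},
}
labels = auto_associate([2.0, 0.15, 0.1], table, associations, k=2)
```

Here the new photo inherits the union of the labels carried by its two nearest neighbours. KL divergence is asymmetric, so the direction used (new item against stored item) is also a choice made for this sketch.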

Claims

1. A method for constructing an adaptive knowledge graph technology based on machine learning is characterized by comprising the following steps:

a knowledge graph is constructed through machine learning: a deep neural network performs feature extraction on different types of unstructured data and, on that basis, information in continuously updated knowledge base records is associated adaptively, so that information acquisition and knowledge graph construction are simplified and knowledge graphs can be constructed automatically for large-scale data; the technique can be widely applied to data analysis and query scenarios in intelligent application environments;

the method provides an automatic association technique for unstructured information which, starting from a limited set of user-labeled associations, automatically associates subsequently input structured and unstructured information so that the knowledge graph is constructed adaptively, and which specifically comprises the following steps:

step A, providing the ability to automatically generate feature vectors for various types of unstructured information, including audio information, video information, text information and picture information;

step B, providing the ability to identify similar information by comparing the similarity of feature vectors;

step C, introducing the same information association labels for similar information;

steps A-C are implemented by extending an existing open-source or commercial graph database system, the required extension modules including: a feature extraction system and a feature comparison system based on machine learning;

the feature comparison system comprises: maintaining the correspondence between each information monomer and its feature vector by means of a feature vector data table;

providing feature vector data tables for managing the various kinds of unstructured information in the graph database, wherein the correspondence between each kind of unstructured information and its feature vector data table is stored in a feature vector management table;

the feature comparison system provides an adaptive knowledge graph system implementation based on the graph database Neo4J:

for each type of unstructured data, a feature vector data table whose rows are (feature vector, data entity) pairs is created in Neo4J, and the table is sorted by feature vector;

each information monomer newly inserted into the graph database Neo4J is recorded in the feature vector data table, at the position given by the ordering of the feature vectors;

and for an information monomer I newly inserted into the graph database Neo4J, the k most similar information monomers are found in the corresponding feature vector data table, their set of associated information Rj is collected, and all the associated information in Rj is added to the information monomer I, thereby labeling the information monomer I automatically.

2. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein an unstructured information oriented automatic association technology is implemented, and through automatic association, information can automatically associate subsequently input structured/unstructured information on the basis of limited user labeled associated information to form adaptive construction of a knowledge graph.

3. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein similar information is identified by comparing the similarity of feature vectors.

4. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein the same information association labels are introduced to similar information.

5. The method for constructing the adaptive knowledge graph technology based on the machine learning as claimed in claim 1, characterized in that the feature extraction and vectorization expression of the unstructured information of text type are based on doc2vec, the algorithm is an extension of *** word vector technology, and the accurate feature capture and feature vector generation of the text information are realized by adopting a wide sampling window;

the sampling width of the wide sampling window is 200.

6. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein the feature vector expression for unstructured information of the audio type is as follows: the audio signal is encoded by a recurrent neural network with 1000 input units and 500 hidden neurons.

7. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein the feature vectorization for unstructured information of the picture type is as follows: the residual network resnet-50 is used as the feature extraction algorithm, and the output of its fully connected layer, with length set to 128, is taken as the feature vector.

8. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein the feature vector expression capability for unstructured information of video type is as follows: generating a feature vector of video information, namely encoding frames periodically intercepted from a video by adopting a picture-based feature vector generation technology in 3, and then re-encoding a vector set through a recurrent neural network to generate the feature vector corresponding to the video information, wherein the recurrent neural network for encoding is constructed by 4096 input units and 800 hidden neurons;

the feature vector generation technique: a feature vector is generated with a feature vector length set to 32 for each frame and a number of samples of 128.

9. The method for constructing an adaptive knowledge-graph technique based on machine learning of claim 1, wherein the ordering of feature vectors is according to a standard geometric vector ordering.

10. The method for constructing an adaptive knowledge graph technology based on machine learning of claim 1, wherein similarity is compared by calculating the KL (Kullback-Leibler) divergence between two feature vectors, and the number k of similar monomers is usually set to 3 or 5.