CN110879843B - Method for constructing self-adaptive knowledge graph technology based on machine learning - Google Patents

Method for constructing self-adaptive knowledge graph technology based on machine learning

Info

Publication number
CN110879843B
Authority
CN
China
Prior art keywords
information
feature
knowledge graph
machine learning
unstructured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910722435.XA
Other languages
Chinese (zh)
Other versions
CN110879843A (en)
Inventor
赵继胜
吴宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Fudian Intelligent Technology Co ltd
Original Assignee
Shanghai Fudian Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fudian Intelligent Technology Co ltd
Priority to CN201910722435.XA
Publication of CN110879843A
Application granted
Publication of CN110879843B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a technique for implementing a knowledge graph that uses machine learning to build indexes and associations over many kinds of information. It focuses on discriminating the features of unstructured data and on generating a graph database system from the resulting information associations. Unlike graph database systems designed for structured information, the invention addresses the challenge of extracting and associating unstructured data (such as images, audio, and video; see the attached figures), which now appears widely in commercial applications. Taking machine-learning feature extraction as its technical basis, it applies an adaptive data feature correction technique that follows changes in the data to construct a feature-association-based graph database index system combining structured and unstructured data, and it realizes a knowledge graph from that index system, thereby enabling automatic knowledge graph construction for large-scale data. The technique can be widely applied to data analysis and query scenarios in intelligent application environments.

Description

Method for constructing self-adaptive knowledge graph technology based on machine learning
Technical Field
The invention belongs to the field of information technology, and in particular relates to a technique for constructing a knowledge graph by means of machine learning. The technique uses deep neural networks to extract features from different types of unstructured data and, on that basis, adaptively associates information across continuously updated knowledge base records. This simplifies information acquisition and knowledge graph construction and allows knowledge graphs to be built automatically for large-scale data. The technique can be widely applied to data analysis and query scenarios in intelligent application environments, such as business intelligence, intelligent information retrieval, and automatic information association for smart cities.
Background
A knowledge graph expresses information as associations between information monomers, so it is generally drawn as a graph (as shown in figure 3). For example, the relationship between 'mike' and 'jason' is 'teacher and student': 'mike' and 'jason' are the information monomers, and 'teacher and student' is the information association between them. As the basis of intelligent systems, knowledge graphs are widely applied in scenarios such as business intelligence and intelligent research, which need to perform associated searches across different types of knowledge points. As application scenarios and requirements continue to develop, new data comes mainly from different types of unstructured data (for example, the association between two pictures). Generating knowledge associations automatically with unstructured data as the main information base (see fig. 2) would give more convenient technical support for building business intelligence platforms, and it is also an open technical problem in the design of current knowledge graph systems. In a graph database, the association between information monomers is realized by labeling between the monomers. For unstructured data, and especially for information monomers with very similar features (for example, different photos of 'mike' that all show the same person), the same labels can be applied, which avoids the workload of manually labeling large amounts of data. At the same time, manual changes to the content or form of associated information can affect the automatic association of information monomers added later.
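The monomer and association structure described above can be illustrated with a small sketch. The class and method names (`KnowledgeGraph`, `associate`, `associations`) are hypothetical and are not taken from the patent; the sketch only shows monomers as nodes and associations as labeled edges.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy knowledge graph: monomers are nodes, associations are labeled edges."""

    def __init__(self):
        # monomer -> set of (association label, other monomer)
        self._edges = defaultdict(set)

    def associate(self, source, label, target):
        """Record a labeled association from source to target."""
        self._edges[source].add((label, target))

    def associations(self, monomer):
        """Return the associations of one monomer, sorted for stable output."""
        return sorted(self._edges.get(monomer, set()))


graph = KnowledgeGraph()
graph.associate("mike", "teacher and student", "jason")
print(graph.associations("mike"))
```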
Deep neural networks are widely used in artificial intelligence to discriminate and analyze different types of data, and they have made good progress in processing unstructured data. In natural language processing in particular, techniques based on recurrent neural networks and their variants have been applied successfully to speech recognition and to feature extraction from speech and text. For images, deep convolutional networks and their variants are widely used in fields such as intelligent security and medical health, and they have greatly advanced feature extraction from pictures.
The present invention provides an automated knowledge graph construction technique that extracts features from unstructured data (see fig. 4) and applies the same information associations to information with similar features. The technique gives knowledge-graph-based intelligent application systems automatic management of unstructured data and makes data acquisition, processing, and classification considerably more convenient. It provides effective support for business intelligence (product recommendation) and academic research (retrieval and search of related information).
Disclosure of Invention
The invention designs an automatic association technique for unstructured information. Starting from a limited set of user-labeled associations, it automatically associates subsequently input structured and unstructured information, so that the knowledge graph is constructed adaptively. The technique provides the following capabilities:
1. the ability to generate feature vectors automatically for various types of unstructured information, including audio information, video information, text information, and picture information;
2. the ability to identify similar information by comparing the similarity of feature vectors;
3. the ability to apply the same information association labels to similar information.
The adaptive knowledge graph construction oriented toward unstructured information association (see the attached figures 1 and 5) comprises the following steps:
1. constructing a feature extraction training model (see fig. 4):
a. feature extraction model for text type: constructing a text vectorization model on the collected text materials by using a doc2vec technology;
b. feature extraction model for picture type: collecting pictures and classification labeling information as training samples, training a deep neural network with the resnet architecture, and taking the output of the fully connected layer of the trained network as the feature vector;
c. feature extraction model for audio and video types, where feature vectors are generated by a recurrent neural network: the training data set is labeled (usually with the name or author of the audio or video), a prediction model based on the recurrent neural network is built, and the sequence encoding produced by the trained recurrent neural network model is taken as output, that is, as the feature vector.
2. The information similarity comparison system comprises:
a. constructing, for each kind of unstructured data, a feature vector data table (see fig. 7) whose rows are (feature vector, data entity) pairs, and sorting the table by feature vector;
b. newly inserted information monomers need to be recorded in the feature vector data table and are inserted into corresponding positions according to the sequence of the feature vectors;
c. checking the related content of the similar information according to the feature similarity;
d. establishing information association which is the same as the association content of similar information for the new information monomer;
3. automatically establishing information association:
a. searching similar information for the newly inserted data in the feature vector data table;
b. extracting the associated content of the similar information;
c. and establishing information association for the new information monomer, wherein the information association is the same as the association content of similar information.
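Steps 2a and 2b above (a table of (feature vector, data entity) rows kept sorted by feature vector) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `FeatureVectorTable` is hypothetical, and "sorted by feature vector" is assumed here to mean lexicographic order on the vector components, which the text does not specify.

```python
import bisect


class FeatureVectorTable:
    """Feature vector data table kept sorted by feature vector.

    Assumption: the sort order is lexicographic on vector components,
    which is how Python compares tuples.
    """

    def __init__(self):
        self._keys = []      # feature vectors as tuples, kept sorted
        self._entities = []  # data entities, parallel to _keys

    def insert(self, feature_vector, data_entity):
        """Insert a row at the position that keeps the table sorted."""
        key = tuple(feature_vector)
        position = bisect.bisect_right(self._keys, key)
        self._keys.insert(position, key)
        self._entities.insert(position, data_entity)
        return position

    def rows(self):
        """Return the (feature vector, data entity) rows in table order."""
        return list(zip(self._keys, self._entities))


table = FeatureVectorTable()
table.insert((0.9, 0.1), "photo_b.jpg")
table.insert((0.2, 0.7), "photo_a.jpg")
table.insert((0.5, 0.5), "photo_c.jpg")
for row in table.rows():
    print(row)
```

Keeping the keys in a separate list means data entities never need to be comparable with each other, only the feature vectors do.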
The beneficial results of the technical scheme of the invention are as follows:
In fields such as business intelligence, financial intelligence, and academic information collection, massive amounts of unstructured information must be associated automatically so that a knowledge graph can be built quickly. The prior art is limited to manually annotated associations; because unstructured information contains a great deal of similarity, manual operation cannot keep up with continuously growing information or update the knowledge graph in time. The invention generates feature vectors for unstructured information with deep neural networks and compares their similarity, applying the same associations to similar information, so that the knowledge graph is constructed adaptively while massive unstructured information is being acquired. The invention can thus build the knowledge graph automatically while collecting data efficiently, gives business intelligence more accurate and convenient knowledge graph support based mainly on unstructured information, and provides an efficient technical platform for large-scale unstructured data retrieval, information recommendation, and analysis.
Drawings
FIG. 1 knowledge graph construction: manual tagging vs. automatic tagging based on machine learning
FIG. 2 structured/unstructured information knowledge graph
FIG. 3 structured information knowledge graph
FIG. 4 feature vector generation for various unstructured information
FIG. 5 implementation of a knowledge graph by Neo4J
FIG. 6 adaptive generation of the unstructured information knowledge graph by extending Neo4J with feature extraction and comparison
FIG. 7 feature vector data Table
Detailed Description
Following the framework for constructing associations over unstructured data set out in the Disclosure of Invention, the implementation is as follows. The knowledge graph system of the present invention is implemented on the graph database Neo4J (see fig. 4), a widely used, stable graph data engine that supports both structured and unstructured information. Constructing the adaptive knowledge graph requires extending Neo4J in the following respects (see fig. 6):
a. a feature vector generation system for unstructured information (see FIG. 6);
b. a feature vector data table (see fig. 7) for managing each kind of unstructured information, where a feature vector management table stores which feature vector data table corresponds to each kind of unstructured information;
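One way to picture the feature vector management table of item b is as a lookup from each kind of unstructured information to its own feature vector data table. The sketch below is illustrative only; the class name, method names, and the four kind labels are assumptions, and plain Python lists stand in for tables that the patent places inside Neo4J.

```python
class FeatureVectorManagementTable:
    """Maps each kind of unstructured information to its feature vector data table."""

    def __init__(self, kinds=("audio", "video", "text", "picture")):
        # One (initially empty) feature vector data table per kind.
        self._tables = {kind: [] for kind in kinds}

    def table_for(self, kind):
        """Return the feature vector data table registered for this kind."""
        if kind not in self._tables:
            raise ValueError(f"no feature vector data table registered for {kind!r}")
        return self._tables[kind]

    def record(self, kind, feature_vector, data_entity):
        """Append a (feature vector, data entity) row to the table for this kind."""
        self.table_for(kind).append((tuple(feature_vector), data_entity))


management = FeatureVectorManagementTable()
management.record("picture", [0.2, 0.7], "photo_a.jpg")
print(management.table_for("picture"))
```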
Constructing a feature extraction training model:
a. feature vector expression capability for unstructured information of audio type: the audio signal is encoded by a recurrent neural network with 1000 input units and 500 hidden neurons;
b. feature extraction and vectorized expression of unstructured information of text type: the algorithm is based on doc2vec, an extension of the *** word vector technology, and uses a wide sampling window (sampling width 200) to capture the features of the text accurately and generate its feature vector;
c. feature vectorization expression capability for unstructured information of picture type: the residual network resnet-50 is used as the feature extraction algorithm, and the output of its fully connected layer is taken as the feature vector, whose length is set to 128;
d. feature vector expression capability for unstructured information of video type: frames captured periodically from the video are first encoded with the picture-based feature vector generation technique of item c (with the feature vector length of each frame set to 32 and the number of sampled frames set to 128); the resulting set of vectors is then re-encoded by a recurrent neural network with 4096 input units and 800 hidden neurons to generate the feature vector corresponding to the video.
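The recurrent encoders in items a and d can be illustrated with a minimal sketch in which the final hidden state of a simple (Elman-style) recurrent network serves as the feature vector. Everything here is an assumption for illustration: the function names are hypothetical, the weights are random rather than trained, and the demonstration uses tiny sizes instead of the 1000/500 (audio) or 4096/800 (video) units given in the text. Note that 128 sampled frames times 32 values per frame equals 4096, the stated input width for video; the text does not say whether the frame vectors are concatenated or fed one at a time.

```python
import math
import random


def make_rnn(input_units, hidden_units, seed=0):
    """Create randomly initialized (untrained) weights for a simple RNN."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(input_units + hidden_units)
    w_in = [[rng.uniform(-scale, scale) for _ in range(input_units)]
            for _ in range(hidden_units)]
    w_rec = [[rng.uniform(-scale, scale) for _ in range(hidden_units)]
             for _ in range(hidden_units)]
    return w_in, w_rec


def encode_sequence(sequence, w_in, w_rec):
    """Run the RNN over the sequence and return the final hidden state."""
    hidden = [0.0] * len(w_rec)
    for step in sequence:
        # The comprehension reads the previous hidden state throughout,
        # because the name is rebound only after the new list is built.
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(row_in, step))
                + sum(w * h for w, h in zip(row_rec, hidden))
            )
            for row_in, row_rec in zip(w_in, w_rec)
        ]
    return hidden


# Video sizes from the text: 128 sampled frames, 32 values per frame.
FRAMES, FRAME_VECTOR_LENGTH = 128, 32
print(FRAMES * FRAME_VECTOR_LENGTH)  # total values per video, 4096

# Scaled-down demonstration: 8 input units and 4 hidden neurons.
w_in, w_rec = make_rnn(input_units=8, hidden_units=4)
sequence = [[0.1 * i] * 8 for i in range(5)]
feature_vector = encode_sequence(sequence, w_in, w_rec)
print(len(feature_vector))
```

In a real system the weights would come from training the prediction model on labeled audio or video, as the training data section describes.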
Training data:
a. for a feature extraction model of a text type, collecting text materials as a training data set;
b. for the feature extraction model of the picture type, pictures and classification marking information are required to be collected as training samples;
c. for the audio and video feature extraction models, where feature vectors are generated by a recurrent neural network, the training data set is labeled (usually with the name or author of the audio or video).
The information similarity comparison system comprises:
a. for each kind of unstructured data, creating in Neo4J a feature vector data table whose rows are (feature vector, data entity) pairs, the table being sorted by feature vector;
b. newly inserted information monomers need to be recorded in the feature vector data table and are inserted into corresponding positions according to the sequence of the feature vectors;
Automatically establishing information association:
c. for a newly inserted information monomer I, finding the k most similar information monomers [J0, J1, …, Jk-1] in the corresponding feature vector data table;
d. for the similar information monomers [J0, J1, …, Jk-1], collecting their associated information set Rj;
e. adding all the associated information in Rj to the information monomer I;
The feature vectors are ordered according to a standard geometric vector ordering.
Similarity is compared by computing the KL (Kullback-Leibler) divergence between two feature vectors, and the number k of similar monomers is usually set to 3 or 5.
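The association steps above, together with the KL divergence comparison, can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: KL divergence is defined on probability distributions, so the sketch normalizes each feature vector with a softmax (the text does not say how vectors are normalized), it uses a plain linear scan instead of the sorted table, and the function names, sample vectors, and the `depicts` label are hypothetical.

```python
import math


def to_distribution(vector):
    """Softmax-normalize a feature vector so KL divergence is defined (assumption)."""
    peak = max(vector)
    # A small constant keeps every entry strictly positive.
    weights = [math.exp(v - peak) + 1e-12 for v in vector]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def auto_associate(new_vector, table, associations, k=3):
    """Collect the associations of the k monomers most similar to new_vector.

    table: list of (feature_vector, monomer_id) rows
    associations: dict mapping monomer_id to a set of (label, target) pairs
    """
    p = to_distribution(new_vector)
    ranked = sorted(table, key=lambda row: kl_divergence(p, to_distribution(row[0])))
    collected = set()  # the associated information set Rj
    for _, monomer_id in ranked[:k]:
        collected |= associations.get(monomer_id, set())
    return collected


table = [
    ((0.9, 0.1, 0.0), "mike_photo_1"),
    ((0.8, 0.2, 0.0), "mike_photo_2"),
    ((0.0, 0.1, 0.9), "jason_photo_1"),
    ((0.85, 0.15, 0.0), "mike_photo_3"),
]
associations = {
    "mike_photo_1": {("depicts", "mike")},
    "mike_photo_2": {("depicts", "mike")},
    "mike_photo_3": {("depicts", "mike"), ("teacher and student", "jason")},
    "jason_photo_1": {("depicts", "jason")},
}
new_associations = auto_associate((0.88, 0.12, 0.0), table, associations, k=3)
print(sorted(new_associations))
```

Because KL divergence is asymmetric, the direction KL(new || existing) used here is itself a design choice; the text does not state which direction is intended.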

Claims (10)

1. A method for constructing an adaptive knowledge graph technology based on machine learning is characterized by comprising the following steps:
the technology for constructing the knowledge graph through the machine learning technology uses a deep neural network to perform feature extraction on different types of unstructured data, and performs self-adaptive information association on continuously updated knowledge base records in a self-adaptive mode on the basis, so that the processes of information acquisition and knowledge graph construction are simplified, and automatic knowledge graph construction can be performed on large-scale data; the technology can be widely applied to various data analysis and query scenes in an intelligent application environment;
the method designs an automatic association technology for unstructured information, and through automatic association, information can automatically associate subsequently input structured/unstructured information on the basis of limited user labeled associated information to form self-adaptive construction of a knowledge graph, and specifically comprises the following steps:
step A, providing the capability of automatic feature vector generation of various types of unstructured information, including audio information, video information, text information and picture information;
step B, providing the capability of determining similar information by comparing the similarity of the feature vectors;
step C, introducing the same information association label to the similar information;
steps A-C are implemented by extending an existing open source or commercial graph database system, the required extension modules including: a feature extraction system and a feature comparison system based on machine learning;
the feature comparison system comprises: maintaining the correspondence between each information monomer and its feature vector by using a feature vector data table;
providing a feature vector data table for managing various unstructured information in a graph database, wherein the correspondence between each kind of unstructured information and the corresponding feature vector data table is stored by a feature vector management table;
the feature comparison system provides an adaptive knowledge graph system implementation based on the graph database Neo4J:
for each kind of unstructured data, creating a feature vector data table with feature vectors and data entities as units in Neo4J, wherein the table is sorted by the feature vectors;
information monomers newly inserted into the graph database Neo4J need to be recorded in the feature vector data table, and are inserted into the corresponding positions according to the order of the feature vectors;
and for an information monomer I newly inserted into the graph database Neo4J, finding the k most similar information monomers in the corresponding feature vector data table, collecting the associated information set Rj of those information monomers, and adding all the associated information in Rj to the information monomer I, thereby realizing automatic labeling of the information monomer I.
2. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein an unstructured information oriented automatic association technology is implemented, and through automatic association, information can automatically associate subsequently input structured/unstructured information on the basis of limited user labeled associated information to form adaptive construction of a knowledge graph.
3. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein similar information is determined by comparing the similarity of feature vectors.
4. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein the same information association labels are introduced to similar information.
5. The method for constructing the adaptive knowledge graph technology based on the machine learning as claimed in claim 1, characterized in that the feature extraction and vectorization expression of the unstructured information of text type are based on doc2vec, the algorithm is an extension of *** word vector technology, and the accurate feature capture and feature vector generation of the text information are realized by adopting a wide sampling window;
the sampling width of the wide sampling window is 200.
6. The method for constructing an adaptive knowledge-graph technology based on machine learning according to claim 1, wherein the feature vector expression capability for unstructured information of audio type is as follows: the audio signal is encoded through a recurrent neural network, and the structure of the recurrent neural network is 1000 input units and 500 hidden neurons.
7. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein the feature vectorization expression capability of the unstructured information of picture type is as follows: the residual network resnet-50 is used as the feature extraction algorithm, the output of its fully connected layer is taken as the feature vector, and the length of the feature vector is set to 128.
8. The method for constructing an adaptive knowledge graph technology based on machine learning according to claim 1, wherein the feature vector expression capability for unstructured information of video type is as follows: generating a feature vector of video information, namely encoding frames periodically intercepted from a video by adopting a picture-based feature vector generation technology in 3, and then re-encoding a vector set through a recurrent neural network to generate the feature vector corresponding to the video information, wherein the recurrent neural network for encoding is constructed by 4096 input units and 800 hidden neurons;
the feature vector generation technique: a feature vector is generated with a feature vector length set to 32 for each frame and a number of samples of 128.
9. The method for constructing an adaptive knowledge-graph technique based on machine learning of claim 1, wherein the ordering of feature vectors is according to a standard geometric vector ordering.
10. The method for constructing an adaptive knowledge graph technology based on machine learning of claim 1, wherein the similarity comparison method is to calculate the KL divergence between two feature vectors, and the number k of similar monomers is usually set to 3 or 5.
CN201910722435.XA 2019-08-06 2019-08-06 Method for constructing self-adaptive knowledge graph technology based on machine learning Active CN110879843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910722435.XA CN110879843B (en) 2019-08-06 2019-08-06 Method for constructing self-adaptive knowledge graph technology based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910722435.XA CN110879843B (en) 2019-08-06 2019-08-06 Method for constructing self-adaptive knowledge graph technology based on machine learning

Publications (2)

Publication Number Publication Date
CN110879843A CN110879843A (en) 2020-03-13
CN110879843B CN110879843B (en) 2020-08-04

Family

ID=69727426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910722435.XA Active CN110879843B (en) 2019-08-06 2019-08-06 Method for constructing self-adaptive knowledge graph technology based on machine learning

Country Status (1)

Country Link
CN (1) CN110879843B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528477A (en) * 2022-01-10 2022-05-24 华南理工大学 Scientific research application-oriented automatic machine learning implementation method, platform and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699663A (en) * 2013-12-27 2014-04-02 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
CN106844723A (en) * 2017-02-10 2017-06-13 厦门大学 medical knowledge base construction method based on question answering system
CN107944898A (en) * 2016-10-13 2018-04-20 驰众信息技术(上海)有限公司 The automatic discovery of advertisement putting building information and sort method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060047441A1 (en) * 2004-08-31 2006-03-02 Ramin Homayouni Semantic gene organizer
US9606988B2 (en) * 2014-11-04 2017-03-28 Xerox Corporation Predicting the quality of automatic translation of an entire document
CN106886572B (en) * 2017-01-18 2020-06-19 中国人民解放军信息工程大学 Knowledge graph relation type inference method based on Markov logic network and device thereof
CN109697233B (en) * 2018-12-03 2023-06-20 中电科大数据研究院有限公司 Knowledge graph system construction method
CN109918478A (en) * 2019-02-26 2019-06-21 北京悦图遥感科技发展有限公司 The method and apparatus of knowledge based map acquisition geographic products data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699663A (en) * 2013-12-27 2014-04-02 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
CN107944898A (en) * 2016-10-13 2018-04-20 驰众信息技术(上海)有限公司 The automatic discovery of advertisement putting building information and sort method
CN106844723A (en) * 2017-02-10 2017-06-13 厦门大学 medical knowledge base construction method based on question answering system

Also Published As

Publication number Publication date
CN110879843A (en) 2020-03-13

Similar Documents

Publication Publication Date Title
CN106874378B (en) Method for constructing knowledge graph based on entity extraction and relation mining of rule model
CN107766371B (en) Text information classification method and device
CN111538835B (en) Social media emotion classification method and device based on knowledge graph
CN111931061B (en) Label mapping method and device, computer equipment and storage medium
CN113590850A (en) Multimedia data searching method, device, equipment and storage medium
CN115329127A (en) Multi-mode short video tag recommendation method integrating emotional information
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN113157965B (en) Audio visual model training and audio visual method, device and equipment
CN112528010B (en) Knowledge recommendation method and device, computer equipment and readable storage medium
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN111242033A (en) Video feature learning method based on discriminant analysis of video and character pairs
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium
CN102855317A (en) Multimode indexing method and system based on demonstration video
CN111581368A (en) Intelligent expert recommendation-oriented user image drawing method based on convolutional neural network
CN110866129A (en) Cross-media retrieval method based on cross-media uniform characterization model
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN116150404A (en) Educational resource multi-modal knowledge graph construction method based on joint learning
CN117216293A (en) Multi-mode inquiry college archive knowledge graph construction method and management platform
CN114911893A (en) Method and system for automatically constructing knowledge base based on knowledge graph
CN110879843B (en) Method for constructing self-adaptive knowledge graph technology based on machine learning
CN110674265B (en) Unstructured information oriented feature discrimination and information recommendation system
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism
CN112749566B (en) Semantic matching method and device for English writing assistance
CN116977701A (en) Video classification model training method, video classification method and device
CN115599953A (en) Training method and retrieval method of video text retrieval model and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant