CN118070812A - Industry data analysis method and system based on NLP - Google Patents

Industry data analysis method and system based on NLP

Info

Publication number
CN118070812A
Authority
CN
China
Prior art keywords
vector
word
entity
model
query
Prior art date
Legal status
Granted
Application number
CN202410473266.1A
Other languages
Chinese (zh)
Other versions
CN118070812B (en)
Inventor
姚斌
刘杰
曹洪基
张跃
陈柏林
Current Assignee
Shenzhen Zhongren Yinxing Information Technology Co ltd
Original Assignee
Shenzhen Zhongren Yinxing Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhongren Yinxing Information Technology Co ltd filed Critical Shenzhen Zhongren Yinxing Information Technology Co ltd
Priority to CN202410473266.1A
Publication of CN118070812A
Application granted
Publication of CN118070812B
Legal status: Active

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses an industry data analysis method and system based on NLP, relating to the technical field of data processing and comprising the following steps: pre-training a language model and fine-tuning it to obtain a domain language model; obtaining a center word from the query intention input by a user; vectorizing the center word to obtain a query vector; using an attention mechanism to obtain the industry data retrieval result for the user query vector; constructing a domain knowledge graph; mapping entities and relations to a low-dimensional vector space with a graph embedding model; using an attention mechanism to obtain the entities and relations corresponding to the query vector; performing relationship path reasoning in the domain knowledge graph according to the query vector to obtain the relationship paths of the query vector; and calculating, through semantic matching, the correlation between the query vector and the obtained entities, relations and relationship paths to obtain a knowledge-enhanced retrieval result. Aiming at the problem of low industry data analysis efficiency in the prior art, the application improves the efficiency of industry data analysis.

Description

Industry data analysis method and system based on NLP
Technical Field
The application relates to the technical field of data processing, in particular to an industry data analysis method and system based on NLP.
Background
In recent years, artificial intelligence technology has developed rapidly, and Natural Language Processing (NLP), as one of its important branches, provides new ideas and methods for intelligent data analysis. NLP technology can simulate human language understanding and automatically extract the semantic information contained in text data, so that data analysis is no longer limited to literal matching but truly understands the inherent semantics of the data. NLP-based data analysis methods can overcome the shortcomings of traditional methods and achieve intelligent understanding and mining of data.
In the field of industry data analysis in particular, although NLP technology has broad application prospects, combining NLP with industry characteristics to build an effective data analysis method still faces many technical challenges. Industry data is characterized by large data volumes, complex business logic and many technical terms, and a general-purpose NLP model struggles to accurately understand industry semantics, so it must be adapted and optimized. In addition, how to integrate industry knowledge into the data analysis process to improve the accuracy and comprehensiveness of semantic understanding is also a problem that urgently needs to be solved.
In the related art, for example, CN117648926A provides a method and system for automatically creating a data model based on natural language, relating to the technical field of data processing, which includes: storing all table names and field names in a data source and an industry model library into a first branch library and a second branch library respectively, vectorizing them, and storing them into a first vector database and a second vector database respectively; storing each table in the industry model library into a graph database; using the first and second word segmentation libraries to perform word segmentation on the business requirement information, extracting keyword information and forming center words; vectorizing the center words, then searching the first and second vector databases for the data source field information and industry model library field information matching the center words, and locating the corresponding table information in the data source and in the industry model library; and removing the unrelated parts in the graph database based on the field and table information of the industry model library to obtain the industry data model corresponding to the business requirement information. In that scheme, the vectorized center words must be searched in the first and second vector databases to match data source field information and industry model library field information; however, when the data volume is huge, the vectorization and database retrieval suffer from high computational complexity and low query efficiency, so the data analysis efficiency of that scheme still needs to be improved.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problem of low industry data analysis efficiency in the prior art, the application provides an NLP-based industry data analysis method and system that improve industry data analysis efficiency through measures such as pre-training a language model, performing domain fine-tuning, and combining the model with a domain knowledge graph.
2. Technical solution
The aim of the application is achieved by the following technical scheme.
One aspect of the embodiments of the present disclosure provides an industry data analysis method based on NLP, including: pre-training a language model, and fine-tuning the pre-trained language model with labeled data in an industry model library to obtain a domain language model; inputting metadata of the input industry data into the obtained domain language model and extracting semantic features of the metadata to obtain semantic representation vectors of the metadata, wherein the metadata comprises table names and field names; applying a named entity recognition algorithm to the query intention input by the user to obtain the center word of the user's query intention; vectorizing the center word with the obtained domain language model to obtain a query vector; using an attention mechanism, computing the similarity between the query vector and the semantic representation vectors of the metadata of the industry data to obtain matching scores; sorting the semantic representation vectors of the candidate metadata according to the matching scores to obtain the industry data retrieval result for the user query vector; constructing a domain knowledge graph from the entities and relations in the industry model library; mapping the constructed domain knowledge graph to a low-dimensional vector space with a graph embedding model to obtain distributed vector representations of the entities and relations; performing semantic matching between the obtained query vector and the distributed vector representations of the entities and relations using an attention mechanism to obtain the entities and relations corresponding to the query vector; performing relationship path reasoning in the domain knowledge graph according to the query vector to obtain the relationship paths of the query vector; and calculating, through semantic matching, the correlation between the query vector and the obtained entities, relations and relationship paths, and combining this with the industry data retrieval result to obtain a knowledge-enhanced retrieval result as the final industry data analysis result.
The industry model library is a collection of resources, such as pre-trained models, labeled data and knowledge bases, built for a specific industry or field. It is an important piece of infrastructure in industry data analysis and provides the necessary data support for tasks such as fine-tuning the domain language model and constructing the knowledge graph. The main components of the industry model library include: a pre-trained language model, i.e. a language model such as BERT or GPT pre-trained on a large-scale industry corpus for the specific industry or field; these models capture industry-specific language patterns and semantic information and can serve as base models for subsequent tasks. Labeled data, i.e. industry data annotated by industry experts or annotators to form labeled datasets; the labeled data can be used for language model fine-tuning, named entity recognition, relation extraction and other tasks, improving model performance on industry data. A knowledge base, i.e. structured knowledge about industry-related entities, relations and facts, collected and organized into an industry knowledge base; it can be used to construct the domain knowledge graph and to support knowledge representation learning and reasoning. An industry dictionary, built from industry terminology, abbreviations, synonyms and the like; it can be used for vocabulary normalization, entity linking and other tasks, improving the accuracy of data analysis. Industry rules, i.e. empirical rules and business logic in industry data analysis summarized into an interpretable rule base; the rule base can be used to guide the data analysis flow and provide interpretation and decision support.
Relationship path reasoning means automatically searching for and mining entities and relationship paths related to a query in the knowledge graph, given an initial entity and query conditions. It is a common form of knowledge graph reasoning and aims to find implicit semantic associations and expand the result space of the query. It comprises the following steps: expressing the query conditions as a query vector and, through semantic matching, finding entities in the knowledge graph that are semantically related to the query vector as initial nodes; starting from the initial nodes, exploring the knowledge graph with a graph traversal algorithm such as depth-first search or breadth-first search to find the relationship paths related to the query; during traversal, selecting the most relevant and reliable relationship paths as reasoning results by considering factors such as the type, direction and weight of each relation; and sorting and filtering the relationship paths obtained by reasoning to generate the final reasoning result, which serves as a supplement to and expansion of the query.
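For illustration, the graph traversal step described above can be sketched as follows, assuming a toy adjacency-list knowledge graph; the function name find_relation_paths and the sample entities are illustrative only and not part of the claimed method.

```python
from collections import deque

def find_relation_paths(graph, start_entity, max_hops=3):
    """Breadth-first search for relation paths starting from a seed entity.

    graph: dict mapping entity -> list of (relation, neighbor_entity) tuples
    Returns a list of paths, each a list of (head, relation, tail) triples.
    """
    paths = []
    queue = deque([(start_entity, [])])          # (current node, path so far)
    while queue:
        node, path = queue.popleft()
        if path:                                  # every non-empty prefix is a candidate path
            paths.append(path)
        if len(path) >= max_hops:
            continue
        for relation, neighbor in graph.get(node, []):
            # avoid revisiting entities already used as heads on this path
            if all(neighbor != h for h, _, _ in path) and neighbor != start_entity:
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return paths

# toy knowledge graph: entity -> [(relation, neighbor), ...]
kg = {
    "user_account": [("has_field", "account_balance"), ("belongs_to", "user")],
    "user": [("places", "order")],
}
for p in find_relation_paths(kg, "user_account", max_hops=2):
    print(p)
```

In the full method, the paths returned this way would then be ranked and filtered by relation type, direction and weight, as described above.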
Further, a RoBERTa model is adopted as the language model, where RoBERTa (Robustly Optimized BERT Pretraining Approach) is a pre-trained language model based on improvements to BERT (Bidirectional Encoder Representations from Transformers). In the NLP-based industry data analysis method, adopting RoBERTa as the base language model makes full use of its strong language understanding and representation capability. By fine-tuning RoBERTa on an industry corpus, a domain language model for a specific industry can be obtained that captures the industry's specific language patterns and semantic information. The domain language model can be used for various industry data analysis tasks such as entity recognition, relation extraction and text classification, improving the accuracy and efficiency of the analysis. In addition, the language representation vectors generated by RoBERTa can be used as input for knowledge representation learning to construct the industry knowledge graph. By combining the linguistic representations of RoBERTa with the structured representations of the knowledge graph, deeper and more comprehensive industry data analysis and mining can be achieved.
Further, fine-tuning the pre-trained language model includes: constructing a training set from the labeled data in the industry model library; randomly masking part of the tokens in each text item of the training set, predicting the masked tokens from the context, and taking the masked tokens as labels to obtain token prediction data as a first training set; randomly selecting two text fragments from the training set, constructing a positive sample pair if the two fragments are adjacent in the original text and a negative sample pair otherwise, and taking adjacency as the label to obtain judgment data as a second training set; and inputting the constructed first and second training sets into the pre-trained RoBERTa model, and fine-tuning the parameters of the RoBERTa model with the Adam algorithm by minimizing the loss function under a preset batch size and number of iteration rounds.
A negative sample pair refers to two text fragments randomly selected from the training set that are not adjacent in the original text, i.e. that have no direct contextual relationship. Negative sample pairs are constructed to train the language model to judge the relevance and coherence between text fragments. When fine-tuning RoBERTa, the second training set trains the model to judge whether two given text fragments are adjacent in the original text. By constructing positive sample pairs (adjacent text fragments) and negative sample pairs (non-adjacent text fragments), the model can learn the contextual and semantic coherence between text fragments. During training, the model acquires this relevance discrimination ability by minimizing the loss function over the positive and negative sample pairs. This discrimination ability helps the model better understand the global semantics of the text and capture contextual coherence, thereby improving its performance on downstream tasks. Negative sample pairs are an important concept in fine-tuning pre-trained language models: by building pairs of non-adjacent text fragments, they help the model learn relevance discrimination and global semantic understanding between text fragments. Well-constructed negative sample pairs can significantly improve model performance on downstream tasks, especially tasks such as text continuity judgment and discourse relation recognition. In the NLP-based industry data analysis method, introducing negative sample pairs helps the model better understand the semantic structure and associations of industry text, improving the accuracy and comprehensiveness of data analysis.
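As an illustrative sketch only, the masked-token fine-tuning step could look as follows, assuming the Hugging Face transformers library; the adjacency (positive/negative segment pair) objective described above would be trained analogously with a sequence-classification head, and the tiny in-memory corpus and hyperparameters used here are placeholders rather than the patented configuration.

```python
import torch
from torch.utils.data import DataLoader
from transformers import (RobertaTokenizerFast, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# stand-in for the industry corpus built from the labeled data
texts = ["The company's revenue increased last year.",
         "The CEO announced the acquisition."]
encodings = [dict(tokenizer(t, truncation=True, max_length=128)) for t in texts]

# randomly masks 15% of tokens and uses the masked tokens as prediction labels
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
loader = DataLoader(encodings, batch_size=2, collate_fn=collator)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):                       # preset number of iteration rounds
    for batch in loader:
        loss = model(**batch).loss           # masked-token cross-entropy loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```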
Further, extracting the semantic representation vector of the metadata includes: performing named entity recognition on the metadata of the industry data to obtain a metadata text sequence; splitting the text sequence into subword-level tokens with the Byte Pair Encoding algorithm, counting the occurrence frequency of co-occurring byte pairs among the subword-level tokens, and merging byte pairs whose frequency exceeds a threshold to obtain the token vocabulary of the domain language model; mapping each token in the metadata text sequence to a unique numeric ID according to the vocabulary to obtain a numeric token ID sequence; converting the token ID sequence into the corresponding token embedding vector sequence as the input features of the domain language model; generating position vectors of the corresponding dimension from the positions of the tokens in the token ID sequence with the sinusoidal position encoding method; embedding the position information into the input features by adding the position vectors to the token embedding vector sequence; and using the input features with embedded position information as the input vectors of the domain language model.
The Byte Pair Encoding (BPE) algorithm is a data compression algorithm originally used for lossless compression of text data. In natural language processing, the BPE algorithm is used to build subword segmentation methods, which can effectively handle out-of-vocabulary and rare words. The BPE algorithm automatically identifies high-frequency subwords in the text sequence, such as affixes and roots, and treats them as independent tokens. This subword-level tokenization effectively solves the problems of out-of-vocabulary and rare words and improves the generalization ability of the language model. The basic idea of the BPE algorithm is to regard the text sequence as a sequence of byte pairs, count the occurrence frequency of byte pairs, merge high-frequency byte pairs into new units, and repeat this process until a preset vocabulary size or maximum byte-pair length is reached.
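A minimal sketch of the merge loop described above, assuming character-level initial symbols and a toy corpus; this illustrates the BPE idea and is not the exact subword vocabulary construction used for the domain language model.

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges=10):
    # each word starts as a tuple of character-level symbols
    words = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for word, freq in words.items():
            new_word, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])   # merge the pair into one symbol
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] += freq
        words = merged
    return merges

print(bpe_merges(["account", "account_id", "user_account"], num_merges=5))
```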
The sinusoidal position encoding (Sinusoidal Position Encoding) method embeds the position information of a token into its representation and is commonly used in models based on self-attention mechanisms such as the Transformer. In natural language processing tasks, the position information of tokens is important for understanding the semantics and structure of the text. For example, in the sentence "The cat sat on the mat", the positional relationship of "cat" and "sat" determines the subject-predicate relationship. The basic idea of sinusoidal position encoding is to use sine and cosine functions of different frequencies and phases to generate position vectors and encode the position information into different dimensions of the token representation. The sinusoidal position encoding vector is added to the token embedding vector to obtain the final input feature vector. This position encoding method effectively introduces position information into the language model and improves the model's ability to capture sequence structure and ordering relationships.
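The sinusoidal encoding described above can be sketched as follows, assuming NumPy; d_model matches the token embedding dimension, and adding the resulting matrix to the token embeddings yields the position-aware input features.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions: cosine
    return pe

pe = sinusoidal_position_encoding(seq_len=5, d_model=768)
# the position matrix is simply added to a token-embedding matrix of the same shape:
# inputs = token_embeddings + pe
```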
Further, extracting the semantic representation vector of the metadata further includes: inputting the obtained input vectors into the fine-tuned RoBERTa model, where the RoBERTa model comprises a multi-layer Transformer encoder; and, in each Transformer encoder layer of the RoBERTa model, obtaining a contextual semantic representation of the metadata by: computing the attention weights between tokens in the input vectors with a multi-head self-attention mechanism to capture the dependencies between tokens; performing a weighted summation with the attention weight matrix and the input vectors to obtain the output vectors of the multi-head self-attention mechanism; applying a residual connection between the input and output vectors followed by layer normalization to obtain normalized representation vectors; applying a non-linear transformation to the normalized representation vectors with a feed-forward neural network to obtain transformed vectors; applying a residual connection between the transformed vectors and the normalized representation vectors followed by layer normalization to obtain the output vectors of the current Transformer encoder layer; and using the output vectors of the current Transformer encoder layer as the input vectors of the next layer, repeating these steps up to the last Transformer encoder layer of the RoBERTa model to obtain the contextual semantic representation vectors of the metadata as its hidden-layer feature representation.
The Transformer encoder is the core component of the Transformer model and converts an input sequence into a contextual semantic representation. The Transformer encoder is formed by stacking multiple identically structured layers, each containing two main parts: a multi-head self-attention mechanism and a feed-forward neural network.
A hidden-layer feature representation is the feature representation obtained in an intermediate (hidden) layer of a neural network after the input data has undergone a series of non-linear transformations. These hidden-layer feature representations typically contain high-level, abstract semantic information about the input data. In a Transformer encoder, the output vector of each layer can be regarded as a hidden-layer feature representation of the input sequence. As the Transformer encoder layers are stacked, the hidden-layer feature representations become increasingly abstract and high-level, capturing the global semantic information of the input sequence.
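A minimal PyTorch sketch of one such encoder layer, following the order of operations described above (multi-head self-attention, residual connection and layer normalization, feed-forward network, then a second residual connection and layer normalization); the dimensions and layer count are illustrative assumptions, not the actual RoBERTa configuration.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        # multi-head self-attention, then residual connection and layer normalization
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + attn_out)
        # position-wise feed-forward network, then residual connection and layer normalization
        x = self.norm2(x + self.ffn(x))
        return x

# stacking layers yields increasingly abstract hidden-layer feature representations
layers = nn.ModuleList([EncoderLayer() for _ in range(2)])
h = torch.randn(1, 5, 768)           # (batch, sequence length, d_model)
for layer in layers:
    h = layer(h)                     # final h is the contextual semantic representation
```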
Further, extracting the semantic representation vector of the metadata further includes: obtaining a fixed-dimension semantic representation vector of the metadata from the output vector of the last Transformer encoder layer of the RoBERTa model; applying L2 normalization to the obtained fixed-dimension semantic representation vector to obtain a normalized semantic representation vector; randomly generating several hash functions, each corresponding to one binary bit; computing, for the dimensions of the normalized semantic representation vector, the corresponding hash value with each hash function; combining the hash values obtained from the hash functions into a binary hash code as the semantic hash representation of the metadata; storing the semantic hash representation as the value in a hash table keyed by the numeric ID of the metadata; and sorting the key-value pairs in the hash table according to the semantic hash representation to construct a semantic index, which serves as the final semantic representation of the metadata.
Wherein L2 normalization is a method of vector normalization for scaling the length of the vector to 1 while keeping the direction of the vector unchanged. L2 normalization is also known as euclidean normalization or unit vector normalization. In the scheme for extracting the metadata semantic representation vector, the L2 normalization is carried out on the fixed-dimension semantic representation vector output by the RoBERTa model, so that the semantic representation vectors of different metadata can be unified to the same scale, and subsequent semantic hash and index construction are facilitated.
Wherein the hash function is a function that maps an arbitrary length input to a fixed length output. The output of the hash function is referred to as a hash value or hash code, typically a fixed length binary string or integer. In the scheme for extracting the metadata semantic representation vector, a plurality of randomly generated hash functions are adopted to hash-encode the normalized semantic representation vector. Each hash function maps each dimension characteristic of the semantic representation vector into a binary bit, and the results of the hash functions are combined into a binary hash code with a fixed length. Semantic indexing is an index structure that organizes and retrieves data based on semantic similarity. Unlike traditional keyword-based indexing, semantic indexing considers semantic content and similarity of data, and can achieve more intelligent and accurate information retrieval. In the scheme of extracting the metadata semantic representation vector, a hash table is constructed by taking the numerical ID of the metadata as a key and the corresponding semantic hash representation as a value. And then, sorting key value pairs in the hash table according to the semantic hash representation to obtain a semantic index.
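A minimal sketch of the L2 normalization and random hash-function step described above, assuming NumPy and random-hyperplane (sign-test) hash functions; the bit width, IDs and vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_bits = 768, 64
hyperplanes = rng.normal(size=(n_bits, d_model))          # one random hash function per bit

def semantic_hash(vec):
    v = vec / (np.linalg.norm(vec) + 1e-12)               # L2 normalization to unit length
    bits = (hyperplanes @ v) >= 0                          # sign test against each hyperplane
    return "".join("1" if b else "0" for b in bits)        # fixed-length binary hash code

# hash table keyed by the metadata numeric ID, value is the semantic hash representation
index = {}
for meta_id, vec in [(23, rng.normal(size=d_model)), (198, rng.normal(size=d_model))]:
    index[meta_id] = semantic_hash(vec)

# sorting the key-value pairs by hash code yields a simple semantic index
semantic_index = sorted(index.items(), key=lambda kv: kv[1])
print(semantic_index)
```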
Further, obtaining the center word of the query intention includes: inputting the query intention entered by the user into a BiLSTM-CRF model and extracting the contextual features of the query intention through the bidirectional LSTM layer; performing named entity labeling on each word of the query intention with a conditional random field (CRF) layer according to the contextual features to obtain the entity words in the query intention; computing, with an attention mechanism, the relevance weight between each entity word and the query intention representation to obtain the attention weight of each entity word; multiplying each entity word in the query intention by its corresponding attention weight to obtain weighted entity word vectors; and summing the weighted entity word vectors to obtain an aggregated entity representation vector as the center-word representation vector of the query intention.
The BiLSTM-CRF model is a neural network model for sequence labeling tasks, commonly used for named entity recognition, part-of-speech tagging and similar tasks. It combines the advantages of a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), so it can simultaneously consider the contextual information of the sequence and the dependencies between labels. In the scheme for extracting the center word of the query intention, the BiLSTM-CRF model performs named entity recognition on the query intention and identifies the entity words it contains: the BiLSTM layer extracts the contextual features of the query intention, and the CRF layer assigns an entity label to each word, yielding the entity words in the query intention.
Wherein, the center word refers to the word which can express the core semantic meaning or theme in the text or the query intention. In information retrieval and natural language processing tasks, the central word plays a key role, and can summarize the main content of text or query, and embody the core intention of a user. Entity words that best express the core semantics can be extracted from the query intent and represented as a vector. The center word expression vector can be used for subsequent tasks such as query intention understanding, semantic matching and the like, and the accuracy of query understanding and retrieval is improved.
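Assuming the entity words have already been recognized by the BiLSTM-CRF model and vectorized by the domain language model, the attention-weighted aggregation into a center-word representation vector can be sketched as follows; the dot-product scoring used here is one simple choice of attention, for illustration only.

```python
import numpy as np

def center_word_vector(query_vec, entity_vecs):
    """Aggregate entity-word vectors into a single center-word representation."""
    scores = np.array([np.dot(query_vec, e) for e in entity_vecs])   # relevance to the query intent
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                          # softmax attention weights
    weighted = [w * e for w, e in zip(weights, entity_vecs)]          # weight each entity word vector
    return np.sum(weighted, axis=0)                                   # aggregated center-word vector

rng = np.random.default_rng(0)
query_vec = rng.normal(size=768)                   # query intention representation (illustrative)
entity_vecs = [rng.normal(size=768) for _ in range(3)]
center = center_word_vector(query_vec, entity_vecs)
```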
Further, mapping the entities and relations to a low-dimensional vector space includes: for each triple in the domain knowledge graph, randomly initializing the embedding vectors of the head entity, relation and tail entity to obtain initial entity embedding vectors and relation embedding vectors; defining the energy function of the TransE model as the Euclidean distance between the sum of the head entity embedding vector and the relation embedding vector, and the tail entity embedding vector; using negative sampling, generating a corresponding negative sample triple for each positive sample triple in the domain knowledge graph by randomly replacing its head or tail entity; computing, according to the defined energy function, the energy of each positive sample triple and of its corresponding negative sample triple; minimizing the difference between the energies of the positive and negative sample triples with a gradient descent algorithm and updating the entity and relation embedding vectors; and repeating these steps until the TransE model converges, obtaining the entity and relation embedding vectors after all entities and relations in the domain knowledge graph have been mapped to the low-dimensional vector space.
TransE (Translating Embedding) is a model for knowledge graph embedding. The TransE model maps entities and relations to the same low-dimensional vector space and models the relations between entities as translation operations. In the present application, the TransE model is used to map the entities and relations in the domain knowledge graph to a low-dimensional vector space. By minimizing the energy difference between positive and negative sample triples, the embedding vectors of entities and relations are learned so that, in the vector space, the head entity translated by the relation lies as close as possible to the tail entity.
An embedding vector (Embedding Vector) is a representation obtained by mapping discrete variables (e.g. words, entities) to a continuous low-dimensional vector space. In the embedding space, semantically similar entities or relations are mapped to nearby positions, while semantically different ones are mapped far apart. In the application, the embedding vector of each entity and relation in the domain knowledge graph is randomly initialized and then continuously updated and optimized during TransE training, so that the semantic information of the entities and relations can be accurately represented in the low-dimensional vector space. Negative sampling is a technique for accelerating training and improving model generalization, commonly used for large-scale and imbalanced datasets. In the application, negative sampling generates negative sample triples that are used together with the positive sample triples to train the TransE model. By minimizing the energy difference between positive and negative samples, the model learns more accurate and better-generalizing entity and relation embedding vectors.
The energy function (Energy Function) measures the discrepancy between the model prediction and the true result. In the knowledge graph embedding task, the energy function evaluates the plausibility of a triple. In the TransE model, the energy function is defined as the Euclidean distance between the sum of the head entity embedding vector and the relation embedding vector, and the tail entity embedding vector. By minimizing the loss function, the TransE model learns reasonable entity and relation embedding vectors such that, in the vector space, the energy of positive sample triples is as small as possible and the energy of negative sample triples is as large as possible, thereby achieving an effective representation of the entities and relations in the knowledge graph.
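A minimal sketch of TransE training with negative sampling, assuming NumPy and a toy set of (head, relation, tail) index triples; for a simpler gradient the sketch uses the squared Euclidean distance as the energy function and a margin-based ranking loss, which is an illustrative simplification of the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, dim = 100, 10, 50
margin, lr = 1.0, 0.01
E = rng.normal(scale=0.1, size=(n_entities, dim))      # entity embedding vectors
R = rng.normal(scale=0.1, size=(n_relations, dim))     # relation embedding vectors

triples = [(0, 1, 2), (3, 4, 5)]                        # toy positive sample triples (h, r, t)

for epoch in range(200):                                # repeat until convergence
    for h, r, t in triples:
        # negative sampling: corrupt the head or the tail entity at random
        h_neg, t_neg = h, t
        if rng.random() < 0.5:
            h_neg = int(rng.integers(n_entities))
        else:
            t_neg = int(rng.integers(n_entities))
        pos = E[h] + R[r] - E[t]                        # translation residual of the positive triple
        neg = E[h_neg] + R[r] - E[t_neg]                # translation residual of the negative triple
        loss = margin + pos @ pos - neg @ neg           # margin-based ranking loss (squared distances)
        if loss > 0:                                    # update only when the margin is violated
            E[h]     -= lr * 2 * pos
            E[t]     += lr * 2 * pos
            E[h_neg] += lr * 2 * neg
            E[t_neg] -= lr * 2 * neg
            R[r]     -= lr * 2 * (pos - neg)
        # entity vectors are often re-normalized to unit length here; omitted for brevity
```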
Further, obtaining the relationship paths of the query vector includes: taking the obtained center-word representation vector as the starting node, performing random walks in the domain knowledge graph to generate several random walk sequences, each consisting of nodes and edges; for each random walk sequence, encoding the node sequence with a bidirectional LSTM to obtain the node embedding vectors of the sequence; performing a weighted aggregation of the obtained node embedding vectors with an attention mechanism to obtain the aggregated random-walk sequence embedding vector as the relationship path representation vector; concatenating the obtained relationship path representation vector, the entity embedding vectors, the relation embedding vectors and the query vector to obtain a concatenated vector; inputting the concatenated vector into a DeepPath model, computing the DeepPath model output through forward propagation, and optimizing the DeepPath model parameters through the backpropagation algorithm; repeating these steps until the loss of the DeepPath model falls below a threshold, obtaining an optimized DeepPath model; and inputting the query vector into the optimized DeepPath model to obtain the embedding vector of the relationship path of the query vector through forward propagation.
The random walk sequence refers to a node sequence generated by starting from a starting node and randomly accessing adjacent nodes according to a certain transition probability in a graph structure (such as a knowledge graph). In a knowledge graph, the random walk sequence may capture semantic relationships and structural information between entities. By multiple random walks, a plurality of different random walk sequences may be generated. These sequences contain multi-hop relationships and contextual information between entities in the knowledge-graph, helping to capture semantic associations between queries and entities. In the application, a central word representation vector of a query is used as an initial node, and random walk is performed in a domain knowledge graph to generate a plurality of random walk sequences. Each sequence is composed of nodes and edges, representing the relationship paths between the query and the entities in the knowledge-graph.
Wherein DeepPath is a deep learning-based relationship path inference model. The DeepPath model utilizes a deep neural network to learn a semantic matching mode between the query and the relation path, so that the query answer prediction based on the path is realized. The DeepPath model may be used to predict the probability of a match between a query and a relationship path. Given a query vector, a relationship path matched with the query vector can be found through DeepPath model, so that path-based query answer reasoning is realized.
The relationship path refers to a series of sequences of relationships connecting two entities in a knowledge graph. The relationship path represents a semantic association between entities and a chain of reasoning that can be used to answer complex logical queries. In the application, a plurality of relation paths are generated in a domain knowledge graph through random walk, and each path consists of nodes and edges. The node sequence is then encoded using the bi-directional LSTM, resulting in an embedded representation of the relationship path. And aggregating node embedding through an attention mechanism to obtain a final relationship path expression vector. The representation vector of the relationship path is input into the DeepPath model along with the query vector, the entity-embedded vector, and the relationship-embedded vector for learning a semantic matching pattern between the query and the relationship path. By optimizing DeepPath models, the matching probability of the query and the relation path can be obtained, and the query answer reasoning based on the path is realized.
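A minimal sketch of the random-walk generation, bidirectional-LSTM path encoding and attention aggregation steps, assuming PyTorch; the simple feed-forward scorer over the concatenated path and query vectors stands in for the DeepPath model and is an illustrative assumption, not its actual architecture (the entity and relation embedding vectors would be concatenated in the same way).

```python
import random
import torch
import torch.nn as nn

def random_walk(graph, start, walk_len=3):
    """graph: entity -> list of (relation, neighbor); returns an alternating node/edge sequence."""
    seq, node = [start], start
    for _ in range(walk_len):
        edges = graph.get(node, [])
        if not edges:
            break
        rel, nxt = random.choice(edges)
        seq += [rel, nxt]
        node = nxt
    return seq

embed = nn.Embedding(100, 64)                 # toy node/edge vocabulary of IDs
encoder = nn.LSTM(64, 32, bidirectional=True, batch_first=True)
scorer = nn.Linear(64 + 64, 1)                # path vector + query vector -> match score

def path_vector(id_seq):
    x = embed(torch.tensor(id_seq)).unsqueeze(0)          # (1, len, 64)
    h, _ = encoder(x)                                      # bidirectional LSTM node embeddings
    attn = torch.softmax(h.mean(dim=-1), dim=1)            # simple attention weights over steps
    return (attn.unsqueeze(-1) * h).sum(dim=1)             # weighted aggregation -> (1, 64)

query_vec = torch.randn(1, 64)
score = scorer(torch.cat([path_vector([1, 7, 3, 9, 4]), query_vec], dim=-1))
```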
Another aspect of the embodiments of the present disclosure further provides an NLP-based industry data analysis system for performing the NLP-based industry data analysis method of the present application, comprising: a data preprocessing module, which performs preprocessing operations such as cleaning, labeling and feature extraction on the industry data; a domain knowledge graph construction module, which constructs the domain knowledge graph from the preprocessed industry data; a query intention understanding module, which performs intention understanding on the user query with the BiLSTM-CRF model and extracts the center word of the query; an entity and relation embedding module, which maps the entities and relations in the knowledge graph to a low-dimensional vector space with the TransE model; a relationship path reasoning module, which learns the semantic matching patterns between queries and relationship paths through random walks and the DeepPath model and generates the relationship path embeddings of the query; and a data analysis and visualization module, which analyzes and visually presents the industry data using the query intention understanding results and the relationship path reasoning results.
3. Advantageous effects
Compared with the prior art, the application has the advantages that:
Domain fine-tuning of the pre-trained language model, by training the language model on labeled data from the industry field, enables it to better understand and represent the semantic information in industry data, improving the accuracy and depth of semantic understanding of industry data;
The fine-tuned domain language model is used to extract semantic representation vectors of the industry data metadata, and feature extraction techniques such as Byte Pair Encoding and position encoding are combined to comprehensively describe the semantic features of the metadata, providing high-quality semantic representations for subsequent semantic matching and retrieval;
A named entity recognition algorithm accurately obtains the center word of the user's query intention, and the domain language model vectorizes the center word, so that the system precisely understands the user's query requirements, effectively improving retrieval precision and recall;
An attention mechanism computes the semantic relevance between the query vector and the metadata semantic representation vectors, overcoming the limitations of traditional keyword matching and achieving semantics-based intelligent matching, so that retrieval results better match the user's real intention;
Knowledge representation learning and reasoning are performed by integrating the domain knowledge graph: a graph embedding model learns low-dimensional semantic representations of entities and relations, and relationship path reasoning surfaces implicit semantic information related to the query, greatly improving the breadth and depth of the data analysis;
The semantic information obtained through knowledge graph reasoning is fused with the data retrieval results to obtain knowledge-enhanced semantic retrieval results, which both preserves retrieval accuracy and expands retrieval coverage, making industry data analysis more comprehensive and accurate.
Drawings
FIG. 1 is an exemplary flow chart of a method of NLP-based industry data analysis according to some embodiments of the present description;
FIG. 2 is an exemplary flow chart for fine tuning RoBERTa model parameters shown in accordance with some embodiments of the present description;
FIG. 3 is an exemplary flow chart for generating RoBERTa model input vectors, shown in accordance with some embodiments of the present description;
FIG. 4 is an exemplary flow chart for obtaining a hidden layer feature representation according to some embodiments of the present description;
FIG. 5 is an exemplary flow diagram of obtaining semantic representation vectors according to some embodiments of the present description;
FIG. 6 is an exemplary flow chart for capturing a center word according to some embodiments of the present description;
FIG. 7 is an exemplary flow chart of mapping entities and relationships to a low-dimensional vector space, according to some embodiments of the present description;
FIG. 8 is an exemplary flow chart for obtaining embedded vectors of a relationship path, shown in accordance with some embodiments of the present description.
Detailed Description
The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
FIG. 1 is an exemplary flow chart of an NLP-based industry data analysis method according to some embodiments of the present description, comprising: pre-training a language model, and fine-tuning the pre-trained language model with labeled data in an industry model library to obtain a domain language model; inputting metadata of the input industry data into the obtained domain language model and extracting semantic features of the metadata to obtain semantic representation vectors of the metadata, wherein the metadata comprises table names and field names; applying a named entity recognition algorithm to the query intention input by the user to obtain the center word of the user's query intention; vectorizing the center word with the obtained domain language model to obtain a query vector; using an attention mechanism, computing the similarity between the query vector and the semantic representation vectors of the metadata of the industry data to obtain matching scores; sorting the semantic representation vectors of the candidate metadata according to the matching scores to obtain the industry data retrieval result for the user query vector; constructing a domain knowledge graph from the entities and relations in the industry model library; mapping the constructed domain knowledge graph to a low-dimensional vector space with a graph embedding model to obtain distributed vector representations of the entities and relations; performing semantic matching between the obtained query vector and the distributed vector representations of the entities and relations using an attention mechanism to obtain the entities and relations corresponding to the query vector; performing relationship path reasoning in the domain knowledge graph according to the query vector to obtain the relationship paths of the query vector; and calculating, through semantic matching, the correlation between the query vector and the obtained entities, relations and relationship paths, and combining this with the industry data retrieval result to obtain a knowledge-enhanced retrieval result as the final industry data analysis result.
FIG. 2 is an exemplary flow chart for fine-tuning the RoBERTa model parameters, according to some embodiments of the present description. Step S11: construct a training set from labeled data in the industry field. The labeled data is drawn from corpora in target industries such as finance, healthcare and law, and is cleaned, denoised and formatted to obtain a high-quality training corpus. Step S12: construct a first training set for the token prediction task. For each text item in the training set, 15% of the tokens are randomly masked and replaced with the special symbol [MASK]. The original tokens serve as labels and token prediction as the training objective, generating the first training set. Specifically, let the original text be "The company's revenue increased by % last year"; it becomes "The company's [MASK] increased by % [MASK] year", with "revenue" and "last" as labels, forming one training sample. Step S13: construct a second training set for the text matching task. Two text fragments are randomly selected from the training corpus; if they are adjacent in the original corpus they are labeled as a positive sample, otherwise as a negative sample. The positive-to-negative sample ratio is 1:1, generating the second training set. For example, the two fragments "The company's revenue increased by %" and "last year" are adjacent and therefore form a positive sample, while "The company's revenue increased by %" and "The CEO announced the acquisition" are not adjacent and form a negative sample.
Step S14: the RoBERTa model is fine-tuned using the first training set and the second training set. The batch size was set to 32 and the number of training rounds was 5. Inputting the training set into RoBERTa model, forward propagating to calculate output, then calculating loss function, and adopting Adam optimization algorithm to update model parameters. The loss function of the word element prediction task is a cross entropy function, and the loss function of the matching task is a contrast loss function. And continuously iterating until the stopping condition is met, and obtaining the fine-tuned RoBERTa model. Step S15: and extracting the characteristics of the industry text by using the trimmed RoBERTa model. And inputting the industry data into the model, and extracting the output of the last hidden layer as a semantic feature vector of the text for subsequent tasks such as semantic matching, text classification and the like.
FIG. 3 is an exemplary flow chart for generating RoBERTa model input vectors according to some embodiments of the present description. Step S21: extract metadata from the structured data in the industry database. Metadata such as the table names and field names of the data tables is obtained through SQL statements or a graphical interface tool, forming a metadata text sequence. For example, the "user account table" and its fields "account ID", "username" and "account balance" are extracted from a financial database, resulting in the metadata text sequence "user account table account ID username account balance".
The input of the BiLSTM-CRF model is the metadata text sequence, and the output is the entity label of each token in the text. The model consists of three main parts: a word embedding layer, a BiLSTM encoding layer and a CRF decoding layer. Word embedding layer: for a metadata text sequence X = {x_1, x_2, ..., x_n}, each token x_i is first mapped to a fixed-dimension word vector e_i through a pre-trained word embedding matrix. The word vector expresses the semantic information of the token. Letting the word vector dimension be d_e, a word embedding matrix E = {e_1, e_2, ..., e_n} of size n × d_e is obtained.
BiLSTM encoding layer: the word embedding matrix E is fed into a bidirectional LSTM network to extract the contextual features of the tokens. The BiLSTM consists of a forward LSTM and a backward LSTM, which capture the left and right context of each token respectively. For the i-th token, its hidden state is expressed as h_i = [h_fi; h_bi], where h_fi is the hidden state of the forward LSTM at position i and h_bi is the hidden state of the backward LSTM at position i. Concatenating the forward and backward hidden states yields the contextual representation h_i of token x_i, with dimension 2d_h.
CRF decoding layer: to account for the constraints and transition rules between entity labels, a conditional random field (CRF) decoding layer is placed on top of the BiLSTM layer. The CRF layer models the global probability of the label sequence and finds the globally optimal label path. Let the entity label set be L = {l_1, l_2, ..., l_m}. The parameters of the CRF layer are a transition matrix A and an emission matrix F, where A_{ij} is the transition score from label i to label j and F_{i,j} is the emission score for labeling the i-th token with the j-th label. For a label sequence y = {y_1, y_2, ..., y_n}, the global score is: score(X, y) = \sum_{i=1}^{n} F_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}. Decoding with the Viterbi algorithm finds the highest-scoring label sequence as the optimal entity labeling result.
In the model training stage, the BiLSTM-CRF model is trained on the labeled metadata text sequences, learning the word embedding matrix, the BiLSTM parameters, and the CRF transition and emission matrices. The model parameters are optimized with the negative log-likelihood loss: Loss = -\sum_{i=1}^{N} log p(y_i | X_i), where N is the number of training samples and (X_i, y_i) is the i-th training sample and its label sequence. The model parameters are updated by minimizing the loss function through backpropagation and gradient descent. In the inference stage, the metadata text sequence to be recognized is fed into the trained BiLSTM-CRF model, and the optimal label sequence is obtained by Viterbi decoding, thereby identifying the key entities. Taking "user account table account ID username account balance" as an example, the model outputs "B-Table I-Table B-Field I-Field B-Field I-Field", indicating that "user account" is a table-name entity and "account ID", "username" and "account balance" are field entities. Through BiLSTM-CRF named entity recognition, the key entities representing table names and fields in the metadata text are extracted accurately and redundant information is removed, further highlighting the semantic focus of the metadata, providing high-quality input for subsequent metadata semantic representation, and improving the accuracy of semantic modeling.
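A minimal sketch of the Viterbi decoding used in the inference stage, assuming NumPy; emissions plays the role of the emission score matrix F (one row per token) and transitions the role of the transition matrix A, with random toy values for illustration.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    n, num_tags = emissions.shape
    score = emissions[0].copy()                       # best score ending at each tag for the first token
    backpointers = []
    for i in range(1, n):
        # score[j] + A[j, k] + F[i, k] for every previous tag j and current tag k
        total = score[:, None] + transitions + emissions[i][None, :]
        backpointers.append(total.argmax(axis=0))     # best previous tag for each current tag
        score = total.max(axis=0)
    best_last = int(score.argmax())
    path = [best_last]
    for bp in reversed(backpointers):                 # trace back the globally optimal label path
        path.append(int(bp[path[-1]]))
    return list(reversed(path))                       # highest-scoring tag index sequence

rng = np.random.default_rng(0)
emissions = rng.normal(size=(5, 7))                   # 5 tokens, 7 entity tags (e.g. B-Table, I-Field, O, ...)
transitions = rng.normal(size=(7, 7))
print(viterbi_decode(emissions, transitions))
```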
The metadata text is segmented using the Byte Pair Encoding (BPE) algorithm. The BPE algorithm is a statistics-based subword segmentation method that builds a subword vocabulary through byte-pair merging, keeping the vocabulary size controllable while retaining token information to the greatest extent. First, all text in the metadata corpus is converted into a byte-level representation. Then the frequency of each consecutive byte pair in the corpus is counted; for example, for the text "user account table", its byte pairs include "user", "account", "user table" and the like. The byte pair with the highest frequency is selected and merged into a new subword; for example, if "account" is the most frequent byte pair, all occurrences of "account" in the corpus are merged into the new subword "account". After merging, the text representations in the corpus are updated and the byte-pair frequencies are re-counted. The highest-frequency byte pair is repeatedly selected and merged until a preset subword vocabulary size or frequency threshold is reached; after each merge, the new subword is added to the vocabulary and the text representation is updated accordingly. The constructed subword vocabulary is then used to segment the metadata text: with a greedy matching strategy, the text is scanned from left to right and the longest matching subword is taken as a token, repeating until the text is fully segmented into a sequence of subwords. Consider the following metadata text corpus: "user account table account ID username account balance", "order record table order ID user ID order amount order status", "commodity information table commodity ID commodity name commodity price commodity inventory".
First, the text is converted into a byte-level representation: "user account table account ID username account balance", "order record table order ID user ID order amount order status", "commodity information table commodity ID commodity name commodity price commodity inventory". The byte-pair frequencies are counted, and pairs such as "account", "order" and "commodity" are found to occur frequently. Assuming "account" is the most frequent, it is merged into a new subword: "user account table account ID username account balance", "order record table order ID user ID order amount order status", "commodity information table commodity ID commodity name commodity price commodity inventory". The above process is repeated, merging high-frequency byte pairs until a preset subword vocabulary size, e.g. 5000, is reached. The resulting subwords may contain common business entity words such as "user", "account", "order", "commodity", "information", "name" and "ID".
And in the word segmentation stage, matching and segmenting the metadata text by utilizing the subword table. For example, the "user account table" is divided into "user account table" and the "account balance" is divided into "account balance" to obtain the sub word sequence. Through the BPE algorithm, under the condition of ensuring that the vocabulary size is controllable, a sub-vocabulary in the industry field is constructed, and most common business entities and semantic components are covered. Compared with word segmentation at the character level or word level, the segmentation mode based on the sub words achieves better balance between precision and recall rate, and provides optimal word element granularity for subsequent metadata semantic representation. Meanwhile, the size of the sub word list is also obviously smaller than that of the full word list, so that the dimension of the word embedding matrix is reduced, and the model training and reasoning efficiency is improved.
The tokens in the metadata text are mapped to unique numeric IDs according to the subword vocabulary generated by BPE. For example, "user account" maps to 23 and "account ID" maps to 198. This converts the text into a sequence of numeric token IDs. The token ID sequence is then converted into a word embedding vector matrix using a pre-trained word embedding matrix. Word embedding is a technique that maps tokens to low-dimensional dense vectors, through which the semantic relationships between tokens can be characterized. Common word embedding models include Word2Vec and GloVe, which learn distributed representations of tokens through unsupervised training on large corpora. In this embodiment, the GloVe (Global Vectors for Word Representation) pre-trained word embedding matrix is used. GloVe learns word embeddings from global word co-occurrence statistics; by considering both the local context window and global corpus statistics, it captures the semantic features of tokens well. Using GloVe word embedding vectors of dimension 768, the word embedding matrix is denoted E ∈ R^{|V| × 768}, where |V| is the size of the GloVe vocabulary, typically on the order of hundreds of thousands to millions. For the token ID sequence {w_1, w_2, ..., w_n} obtained after the metadata text sequence is processed in step S24, each token w_i corresponds to a unique ID. Each token ID is mapped to a 768-dimensional word embedding vector using the word embedding matrix E: e_i = E[w_i], i = 1, 2, ..., n, where e_i is the word embedding vector of token w_i and E[w_i] is the 768-dimensional vector taken from row w_i of the word embedding matrix. Through this mapping, the token ID sequence of length n is converted into a word embedding vector matrix of size n × 768: E_seq = [e_1, e_2, ..., e_n]. The word embedding vector matrix E_seq not only retains the token information of the metadata text sequence but also introduces rich semantic information.
In E_seq, each row is the semantic vector representation of one token, and semantically similar tokens lie closer in the embedding space. For the metadata token ID sequence [1023, 758, 3197, 981, 1054], 1023 corresponds to "user", 758 to "account", 3197 to "table", 981 to "ID", and 1054 to "balance". Each token ID is converted into a 768-dimensional word embedding vector via the GloVe embedding matrix, e.g., e_1023 = E[1023] = [0.23, -0.11, ..., 0.09], e_758 = E[758] = [0.35, 0.27, ..., -0.18], ..., e_1054 = E[1054] = [0.19, 0.08, ..., -0.31]. Stacking these rows yields the 5×768 word embedding vector matrix E_seq = [[0.23, -0.11, ..., 0.09]; [0.35, 0.27, ..., -0.18]; ...; [0.42, 0.13, ..., -0.23]; [0.19, 0.08, ..., -0.31]]. Through the GloVe pre-trained word embeddings, the discrete token ID sequence is mapped into a continuous vector space, so that semantically similar tokens lie closer together in the embedding space and semantically unrelated tokens lie farther apart.
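A minimal sketch of the ID-to-vector row lookup follows. The 768-dimensional GloVe matrix is simulated with random numbers here, since loading the real pre-trained vectors is outside the scope of this illustration; the token IDs are the illustrative ones from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained GloVe embedding matrix E ∈ R^{|V|×768};
# in practice the rows would be loaded from released GloVe vectors.
vocab_size, d_model = 5000, 768
E = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

# Token ID sequence for "user account table ID balance" (IDs are illustrative).
token_ids = np.array([1023, 758, 3197, 981, 1054])

# Row lookup: each token ID selects one 768-dimensional row of E.
E_seq = E[token_ids]          # shape (5, 768)
print(E_seq.shape)            # (5, 768)
```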
A position vector is generated for each token using a sinusoidal position encoding method. In the metadata semantic representation task, the order and position information of the tokens plays an important role in understanding the semantics of the whole text sequence. For example, "user account table" and "account user table" contain the same tokens, but because the token order differs, the expressed semantics are completely different. To introduce position information into the semantic representation, this embodiment adopts a sinusoidal position encoding method and generates a corresponding position vector according to the position of each token in the sequence. Specifically, for a position pos, the elements of the position encoding vector PE(pos) are computed as: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where d_model is the dimension of the word embedding vector, 768 in this embodiment. The value of i ranges from 0 to d_model/2 - 1, so that the dimension of the position encoding vector PE(pos) is consistent with the word embedding vector, i.e., 768. Specifically, the metadata text sequence is "user account table account ID", the corresponding token ID sequence is [1023, 758, 3197, 758, 981], and the sequence length n is 5. For the first token "user", pos = 0, and the position encoding formula gives: PE(0,0) = sin(0/10000^(0/768)) = 0, PE(0,1) = cos(0/10000^(0/768)) = 1, PE(0,2) = sin(0/10000^(2/768)) = 0, PE(0,3) = cos(0/10000^(2/768)) = 1, ..., PE(0,766) = sin(0/10000^(766/768)) ≈ 0, PE(0,767) = cos(0/10000^(766/768)) ≈ 1.
Thus, the position encoding vector of the token "user" is PE(0) = [0, 1, 0, 1, ..., 0, 1]. Similarly, the position encoding vectors of the other tokens can be computed: PE(1) = [0.84, 0.54, 0.91, -0.42, ..., 0.96, -0.28], PE(2) = [0.91, -0.42, 0.54, -0.84, ..., 0.28, -0.96], PE(3) = [0.14, -0.99, -0.76, -0.65, ..., -0.53, 0.85], PE(4) = [-0.65, -0.76, -0.99, 0.14, ..., 0.85, 0.53]. Stacking them yields a position encoding matrix PE = [PE(0); PE(1); PE(2); PE(3); PE(4)] of dimension 5×768, consistent with the dimensions of the word embedding vector matrix E_seq generated above. In a subsequent step, the position encoding matrix PE is added to the word embedding vector matrix E_seq to obtain an input feature representation that fuses the position information.
In step S25 we obtained the word embedding vector matrix E_seq corresponding to the metadata token ID sequence, with dimensions n×768, where n is the sequence length. In step S26 we generated the position encoding matrix PE, also of dimension n×768, representing the position of each token in the sequence. The position encoding matrix PE is now added to the word embedding vector matrix E_seq to obtain the final input feature matrix X: X = E_seq + PE. Specifically, for the i-th token, its input feature vector x_i is x_i = e_i + pe_i, i = 1, 2, ..., n, where e_i is the token's word embedding vector and pe_i is its position encoding vector. By adding position encodings to word embeddings, the input feature matrix X contains both the semantic information of each token and its position in the sequence. This fusion helps the model better understand the semantic structure of the metadata text and capture the ordering and dependency relationships between tokens. Taking the results of steps S25 and S26 as an example, adding E_seq and PE row by row gives a 5×768 input feature matrix X whose first row is [0.23+0.00, -0.11+1.00, ..., 0.09+1.00] = [0.23, 0.89, ..., 1.09] and whose second row is [1.19, 0.81, ..., -0.46]; each row of X is the input feature of one token, fusing its semantic and position information. This fused feature serves as the input to the RoBERTa model for further extraction of the deep semantic representation of the metadata.
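A minimal sketch of the sinusoidal position encoding and the fusion X = E_seq + PE is given below; the random E_seq is a stand-in for the GloVe embedding rows of the example sequence.

```python
import numpy as np

def sinusoidal_position_encoding(n, d_model=768):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(n)[:, None]                       # (n, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

n, d_model = 5, 768
E_seq = np.random.default_rng(0).standard_normal((n, d_model))  # stand-in for the GloVe rows
PE = sinusoidal_position_encoding(n, d_model)
X = E_seq + PE   # fused input feature matrix, one row per token
```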
The implementation of step S28 is as follows: the generated metadata input feature matrix X is input into the fine-tuned RoBERTa model. RoBERTa is a Transformer-based pre-trained language model; through self-supervised pre-training on large-scale unlabeled corpora it learns general language representation capabilities. In the metadata semantic representation task, we fine-tune RoBERTa to adapt it to the semantic features of the specific domain. The RoBERTa model consists essentially of a multi-layer Transformer encoder, each layer including two sublayers: a self-attention mechanism and a feed-forward neural network. The self-attention mechanism is used to capture dependency and interaction information between tokens. For the input matrix H^(l-1) of the l-th Transformer layer, the query matrix Q^(l), key matrix K^(l) and value matrix V^(l) are first computed: Q^(l) = H^(l-1) × W_Q^(l), K^(l) = H^(l-1) × W_K^(l), V^(l) = H^(l-1) × W_V^(l), where W_Q^(l), W_K^(l), W_V^(l) are learnable parameter matrices. Then the self-attention weight matrix A^(l) is computed: A^(l) = softmax(Q^(l) × K^(l)^T / sqrt(d_k)), where d_k is the dimension of the query/key vectors. The self-attention weight matrix A^(l) characterizes the correlation between each token and the other tokens. Finally, the value matrix V^(l) is weighted and summed according to A^(l) to obtain the self-attention output matrix Z^(l): Z^(l) = A^(l) × V^(l). The self-attention mechanism allows the representation of each token to fuse information from the other tokens, capturing long-distance dependencies between tokens.
The feed-forward neural network performs a nonlinear transformation on the self-attention output to extract higher-level feature representations. For the self-attention output matrix Z^(l), the output matrix H^(l) is obtained through a two-layer feed-forward network: H^(l) = ReLU(Z^(l) × W_1^(l) + b_1^(l)) × W_2^(l) + b_2^(l), where W_1^(l), b_1^(l), W_2^(l), b_2^(l) are learnable parameter matrices and bias terms, and ReLU is the activation function. Through the self-attention mechanism and feed-forward network of the multi-layer Transformer, the RoBERTa model extracts deep semantic representations of the metadata text and captures the complex interactions and dependencies between tokens.
In addition to the metadata input feature matrix X, a special [CLS] symbol is included in the RoBERTa model input. The [CLS] symbol represents the comprehensive information of the whole input sequence; during Transformer encoding it continuously integrates information from the other tokens and finally represents the semantics of the entire metadata. Therefore, in the output of the last Transformer encoder layer of the RoBERTa model, the vector corresponding to the [CLS] symbol is taken as the semantic representation vector of the whole metadata, denoted v_meta: v_meta = h_CLS^(L), where h_CLS^(L) is the vector corresponding to the [CLS] symbol output by the L-th (last) Transformer encoder layer. In this way, the metadata text sequence is mapped to a semantic representation vector of a fixed dimension (typically 768) that integrates the semantic information of each token in the metadata and also encodes the ordering and dependency relationships between tokens. This semantic representation can be used for subsequent tasks such as metadata semantic matching and knowledge fusion, helping to achieve more intelligent data management and analysis.
Fig. 4 is an exemplary flowchart of acquiring hidden layer feature representations according to some embodiments of the present disclosure. The input feature matrix X obtained in step S27 is input into the fine-tuned RoBERTa model. The RoBERTa model consists of an L-layer Transformer encoder, each layer containing two sublayers: a multi-head self-attention mechanism and a feed-forward neural network. For the l-th Transformer encoder layer of the RoBERTa model, let its input be H^(l-1) (with H^(0) = X); the contextual semantic representation of the metadata is obtained as follows. The multi-head self-attention mechanism computes attention weights among the tokens in the input and captures the dependency relationships between them. The input H^(l-1) is linearly transformed to obtain the query matrix Q^(l), key matrix K^(l) and value matrix V^(l): Q^(l) = H^(l-1) × W_Q^(l), K^(l) = H^(l-1) × W_K^(l), V^(l) = H^(l-1) × W_V^(l), where W_Q^(l), W_K^(l), W_V^(l) are learnable parameter matrices. The query matrix Q^(l) is multiplied by the transpose of the key matrix K^(l) and divided by sqrt(d_k) to obtain the attention score matrix S^(l): S^(l) = Q^(l) × K^(l)^T / sqrt(d_k), where d_k is the dimension of the query/key vectors. The attention score matrix S^(l) is softmax-normalized to obtain the attention weight matrix A^(l): A^(l) = softmax(S^(l)). The attention weight matrix A^(l) and the value matrix V^(l) are used for a weighted summation to obtain the output Z^(l) of the multi-head self-attention mechanism: Z^(l) = A^(l) × V^(l). A residual connection is applied between the input H^(l-1) and the multi-head self-attention output Z^(l), and layer normalization yields the normalized representation N^(l): N^(l) = LayerNorm(H^(l-1) + Z^(l)).
The normalized representation N^(l) is then nonlinearly transformed by the feed-forward neural network to obtain the transformed vector F^(l): F^(l) = ReLU(N^(l) × W_1^(l) + b_1^(l)) × W_2^(l) + b_2^(l), where W_1^(l), b_1^(l), W_2^(l), b_2^(l) are learnable parameter matrices and bias terms, and ReLU is the activation function. A residual connection is applied between the transformed vector F^(l) and the normalized representation N^(l), and layer normalization yields the output H^(l) of the current Transformer encoder layer: H^(l) = LayerNorm(N^(l) + F^(l)). The output H^(l) of the current layer is taken as the input of the next Transformer encoder layer, and this is repeated up to the last layer (layer L) of the RoBERTa model. In the output H^(L) of the last Transformer encoder layer, the vector corresponding to the [CLS] symbol is extracted as the contextual semantic representation vector v_meta of the metadata: v_meta = h_CLS^(L), where h_CLS^(L) is the vector corresponding to the [CLS] symbol in H^(L). The contextual semantic representation vector v_meta is used as the hidden layer feature representation of the metadata for subsequent semantic matching and knowledge fusion tasks. Through the above steps, the RoBERTa model is used to extract the hidden layer feature representation of the metadata.
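The per-layer computation described above can be sketched as follows. This is a single-head, two-layer numpy illustration with shared, randomly initialized parameters; the real RoBERTa model uses multi-head attention, learned parameters per layer, and many more layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_layer(H, p):
    """One (single-head) Transformer encoder layer: attention + FFN, each with residual + LayerNorm."""
    Q, K, V = H @ p["W_Q"], H @ p["W_K"], H @ p["W_V"]
    S = Q @ K.T / np.sqrt(Q.shape[-1])              # attention scores S^(l)
    A = softmax(S)                                  # attention weights A^(l)
    Z = A @ V                                       # self-attention output Z^(l)
    N = layer_norm(H + Z)                           # N^(l) = LayerNorm(H^(l-1) + Z^(l))
    F = np.maximum(0, N @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]
    return layer_norm(N + F)                        # H^(l) = LayerNorm(N^(l) + F^(l))

rng = np.random.default_rng(0)
d = 768
params = {name: rng.standard_normal(shape) * 0.02 for name, shape in
          [("W_Q", (d, d)), ("W_K", (d, d)), ("W_V", (d, d)),
           ("W_1", (d, 4 * d)), ("b_1", (4 * d,)), ("W_2", (4 * d, d)), ("b_2", (d,))]}

X = rng.standard_normal((5, d)) * 0.02              # stand-in for the fused input features from step S27
cls = rng.standard_normal((1, d)) * 0.02            # [CLS] embedding (random stand-in)
H = np.vstack([cls, X])
for _ in range(2):                                  # 2 layers for illustration only
    H = encoder_layer(H, params)
v_meta = H[0]                                       # [CLS] vector as the metadata semantic representation
```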
FIG. 5 is an exemplary flowchart of obtaining a semantic representation vector according to some embodiments of the present description. From the output of the last Transformer encoder layer of the RoBERTa model, a fixed-dimension semantic representation vector v_meta of the metadata is obtained. In the flow shown in FIG. 4, we obtained the contextual semantic representation vector v_meta of the metadata, which corresponds to the [CLS] vector output by the last Transformer encoder layer of the RoBERTa model. v_meta is a dense vector of fixed dimension, typically 768. It fuses the semantic information of each token in the metadata and the structural information between tokens, and represents the comprehensive semantics of the whole metadata. The fixed-dimension semantic representation vector v_meta is then L2-normalized to obtain the normalized semantic representation vector v_norm of unit length: v_norm = v_meta / ||v_meta||_2, where ||v_meta||_2 is the L2 norm of v_meta, i.e., the square root of the sum of the squares of its elements. L2 normalization gives the semantic representation vector unit length, eliminating the influence of vector length and making subsequent hash coding and similarity computation more stable and reliable. A plurality of hash functions are then generated randomly, each corresponding to one binary bit; for each feature dimension of the normalized semantic representation vector v_norm, the corresponding hash value is computed with these hash functions.
Suppose a hash code of k binary bits is needed; then k hash functions h_1, h_2, ..., h_k are generated randomly, each hash function h_i corresponding to one binary bit b_i. The hash functions can be constructed with the random hyperplane method: h_i(v) = sign(v · r_i), where r_i is a randomly generated hyperplane normal vector with the same dimension as v_norm, and sign is the sign function, taking 1 when x > 0 and 0 when x <= 0. For each dimension v_j of v_norm, its hash value is computed: b_ij = h_i(v_j), i = 1, 2, ..., k. The hash values obtained from each hash function are combined into a binary hash code that serves as the semantic hash representation of the metadata. For each hash function h_i, its hash values b_ij over the dimensions of v_norm are combined into one binary bit string b_i: b_i = [b_i1, b_i2, ..., b_id]. Finally, the bits obtained from the k hash functions are combined into a k-bit binary hash code b: b = [b_1, b_2, ..., b_k]. This k-bit binary hash code b is the semantic hash representation of the metadata, characterizing its semantic information in compact form. The numeric ID of the metadata is used as the key, and the corresponding semantic hash representation as the value, stored in a hash table.
Assuming the numeric ID of the metadata is meta_id, the key-value pair (meta_id, b) is stored in a hash table. The hash table establishes a mapping between the metadata ID and its semantic hash representation, facilitating subsequent fast retrieval and matching. The key-value pairs in the hash table are then sorted according to the semantic hash representation to construct a semantic index, and the resulting semantic index is taken as the final semantic representation of the metadata. The key-value pairs (meta_id, b) are ordered by the binary value of the semantic hash representation b, producing an ordered index structure, i.e., the semantic index. The semantic index gathers semantically similar metadata together, so that they lie closer in the index, which facilitates fast lookup and matching. The semantic index is taken as the final semantic representation vector v_index of the metadata: v_index = [meta_id_1, meta_id_2, ..., meta_id_n], where meta_id_i is the numeric ID of the i-th metadata item in the semantic index and n is the total number of metadata items. The semantic representation vector v_index orders and organizes the metadata in the form of metadata IDs and reflects the semantic similarity relationships between them.
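A minimal sketch of the L2 normalization, random hyperplane hashing, hash table and sorted semantic index is given below. For simplicity it hashes the whole normalized vector against each hyperplane (the standard random-hyperplane scheme) rather than per dimension; the number of bits k = 16 and the random metadata vectors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_hash(v_meta, hyperplanes):
    """L2-normalize v_meta and hash it with random hyperplanes: bit_i = 1 if v_norm · r_i > 0 else 0."""
    v_norm = v_meta / np.linalg.norm(v_meta)          # v_norm = v_meta / ||v_meta||_2
    return tuple((hyperplanes @ v_norm > 0).astype(int))

k, d = 16, 768                                         # k hash bits (illustrative)
hyperplanes = rng.standard_normal((k, d))              # normal vectors r_1 ... r_k

# Hash every candidate metadata vector and store (meta_id -> hash code) in a hash table.
metadata_vectors = {101: rng.standard_normal(d),       # meta_id -> v_meta (random stand-ins)
                    102: rng.standard_normal(d),
                    103: rng.standard_normal(d)}
hash_table = {meta_id: semantic_hash(v, hyperplanes) for meta_id, v in metadata_vectors.items()}

# Sort the key-value pairs by the binary value of the hash code to build the semantic index.
v_index = [meta_id for meta_id, code in sorted(hash_table.items(), key=lambda kv: kv[1])]
```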
Fig. 6 is an exemplary flowchart of acquiring the center word, step S3, according to some embodiments of the present description. For understanding of the query intent, the user enters the data analysis requirement in natural language, e.g., "look up the data tables related to 'sales'". The query intent entered by the user is fed into a BiLSTM-CRF model, and the contextual features of the query intent are extracted through the bidirectional LSTM layer. The query intent is represented as a word sequence [w_1, w_2, ..., w_n], where n is the length of the query intent. Each word w_i is embedded to obtain a word embedding vector x_i. The word embedding vector sequence [x_1, x_2, ..., x_n] is input into the bidirectional LSTM layer, and the forward and backward LSTMs process the sequence separately: forward LSTM: h_fi = LSTM_f(x_i, h_f(i-1)), i = 1, 2, ..., n; backward LSTM: h_bi = LSTM_b(x_i, h_b(i+1)), i = n, n-1, ..., 1, where h_fi and h_bi are the hidden states of the i-th word in the forward and backward LSTMs, respectively. The forward and backward hidden states are concatenated to obtain the contextual feature representation c_i of the i-th word: c_i = [h_fi; h_bi]. According to the contextual features, a conditional random field (CRF) layer labels each word in the query intent to obtain the entity words in the query intent. A named entity tag set L = {B-Entity, I-Entity, O} is defined, representing the beginning of an entity word, the inside of an entity word, and non-entity words, respectively. For each word w_i, the emission probabilities under each tag are computed from its contextual features c_i: emission_i = softmax(W_e · c_i + b_e), where W_e and b_e are the emission probability matrix and bias term. Decoding with the Viterbi algorithm yields the optimal named entity tag sequence y_1, y_2, ..., y_n. The entity words in the query intent are extracted according to the tag sequence; for example, "sales" labeled "B-Entity I-Entity" is identified as one entity word. The relevance weight of each entity word with respect to the query intent representation is then computed using an attention mechanism to obtain the attention weight of each entity word. The entity words in the query intent are represented as [e_1, e_2, ..., e_m], where m is the number of entity words. The last hidden state h_n of the BiLSTM-CRF model is used as the representation q of the query intent. The attention weight a_i of each entity word e_i with respect to the query intent representation q is computed: a_i = softmax(e_i · W_a · q), where W_a is the attention weight matrix. Each entity word in the query intent is multiplied by its attention weight to obtain a weighted entity word vector; for each entity word e_i, the weighted vector v_i is v_i = a_i · e_i. The weighted entity word vectors are summed to obtain the aggregated entity representation vector e_c: e_c = sum(v_1, v_2, ..., v_m), and e_c is used as the center word representation vector of the query intent.
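Only the attention-weighted aggregation step is sketched below; the BiLSTM-CRF tagging itself would typically be implemented with a sequence-labeling framework and is replaced here by random stand-in vectors for the entity words and the query representation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def center_word_vector(entity_vecs, q, W_a):
    """Aggregate entity word vectors e_1..e_m into e_c using attention against the query representation q."""
    scores = np.array([e @ W_a @ q for e in entity_vecs])   # e_i · W_a · q
    a = softmax(scores)                                      # attention weights a_i
    weighted = a[:, None] * entity_vecs                      # v_i = a_i * e_i
    return weighted.sum(axis=0)                              # e_c = sum(v_1, ..., v_m)

rng = np.random.default_rng(0)
d = 256                                                      # BiLSTM hidden size (illustrative)
entity_vecs = rng.standard_normal((2, d))                    # stand-ins for the extracted entity words
q = rng.standard_normal(d)                                   # last BiLSTM hidden state as query representation
W_a = rng.standard_normal((d, d)) * 0.01
e_c = center_word_vector(entity_vecs, q, W_a)                # center word representation vector
```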
Step S4: semantic matching and retrieval in step S3 we have obtained the central word representation vector q_c of the query intent. The semantic representation vector for the candidate metadata (data table and field) can be obtained by the flow shown in fig. 5. For each candidate metadata m_i, its semantic representation vector v_i is extracted using RoBERTa model. And obtaining the final semantic representation vector v_mi of the metadata m_i through L2 normalization, random hash, semantic index and other operations. And calculating the correlation between the query vector q_c and the candidate metadata semantic vector v_mi by using an attention mechanism to obtain a semantic matching score. The attention weights a_i of the query vector q_c and each candidate metadata semantic vector v_mi are calculated: a_i=softmax (q_c_w_m_v_mi), where w_m is the attention weight matrix. The attention weight a_i is taken as the semantic matching score s_i of the query vector q_c and the metadata m_i: s_i=a_i, the candidate metadata are ordered according to the matching score, and a data table and a field which are most relevant to the query intention are output. The candidate metadata is ordered from high to low according to the semantic matching score s_i. The top k data tables and fields with the highest matching scores are output as the most relevant results to the query intent.
Step S5: and constructing a domain knowledge graph, and extracting entities, relations and attributes from the industry knowledge base by using an ontology construction tool. Entities and their attributes are identified from structured data (e.g., relational database) or semi-structured data (e.g., XML, JSON). Entities and their relationships are extracted from unstructured text (e.g., documents, web pages) using techniques such as named entity recognition, relationship extraction, etc. The entities and relationships are typed and their categories in the ontology are determined. Constructing a domain knowledge graph, wherein nodes in the graph represent entities, and edges represent semantic relations among the entities. And taking the extracted entities as nodes in the knowledge graph, wherein each entity corresponds to a unique node. And connecting corresponding nodes according to the relation between the entities, wherein the relation type is used as a connected edge to represent semantic association between the entities. Modeling the attribute of the entity node, and attaching the attribute value to the corresponding entity node. Knowledge maps are formally represented using an ontology language (e.g., OWL, RDF) to support reasoning and querying. Entity types, relationship types, and constraints and axioms between them are defined using an ontology language. The entities, relationships, and attributes are converted into triplet (subject, predicate, object) form in the ontology language.
FIG. 7 is an exemplary flowchart of mapping entities and relationships to a low-dimensional vector space according to some embodiments of the present description. In step S5 we constructed the domain knowledge graph, in which nodes represent entities and edges represent semantic relationships between entities. The entities and relationships in the knowledge graph are represented and learned with the graph embedding model TransE, which characterizes the semantic information of entities and relationships with low-dimensional dense vectors. For each triple (h, r, t) in the domain knowledge graph, where h is the head entity, r the relationship and t the tail entity, their embedding vectors are randomly initialized. Let d be the dimension of the entity embedding vectors and k the dimension of the relationship embedding vectors. For each head entity h, its embedding vector h_embed ∈ R^d is randomly initialized; for each relationship r, its embedding vector r_embed ∈ R^k is randomly initialized; for each tail entity t, its embedding vector t_embed ∈ R^d is randomly initialized.
The energy function f(h, r, t) of the TransE model is defined as the Euclidean distance between the sum of the head entity embedding vector and the relationship embedding vector and the tail entity embedding vector: f(h, r, t) = ||h_embed + r_embed - t_embed||_2, where ||x||_2 denotes the L2 norm. A negative sampling method is adopted: for each positive sample triple (h, r, t) in the domain knowledge graph, the head entity or tail entity is randomly replaced to generate a corresponding negative sample triple (h', r, t) or (h, r, t'). For each positive sample triple (h, r, t), it is randomly decided with a certain probability (e.g., 0.5) whether to replace the head entity or the tail entity. If the head entity is replaced, an entity h' ≠ h is randomly selected from the entity set, generating the negative sample triple (h', r, t); if the tail entity is replaced, an entity t' ≠ t is randomly selected, generating (h, r, t'). According to the defined energy function, the energy function value f(h, r, t) of the positive sample triple and the energy function value f(h', r, t) or f(h, r, t') of the corresponding negative sample triple are computed. A gradient descent algorithm is used to minimize the margin-based difference between the energy of the positive sample triple and that of the negative sample triple, updating the entity embedding vectors and the relationship embedding vectors. The loss function L is defined as: L = max(0, f(h, r, t) + γ - f(h', r, t)) or L = max(0, f(h, r, t) + γ - f(h, r, t')), where γ is a hyperparameter representing the margin between positive and negative samples. The gradients of the loss L with respect to the entity and relationship embedding vectors are computed, and the embeddings are updated by gradient descent, e.g.: h_embed := h_embed - η ∂L/∂h_embed, r_embed := r_embed - η ∂L/∂r_embed, t_embed := t_embed - η ∂L/∂t_embed, where η is the learning rate. The above steps are repeated over all triples in the knowledge graph for multiple training rounds until the TransE model converges. After all entities and relationships in the domain knowledge graph are mapped to the low-dimensional vector space, the entity embedding vectors and relationship embedding vectors are obtained: for each entity e its final embedding vector e_embed ∈ R^d, and for each relationship r its final embedding vector r_embed ∈ R^k.
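A minimal sketch of one TransE training step (energy function, negative sampling and margin-based gradient update) is given below; the entity/relation counts, embedding dimension and random triples are illustrative assumptions, and the negative sample is not explicitly filtered to differ from the positive one.

```python
import numpy as np

rng = np.random.default_rng(0)

def transe_step(ent, rel, triple, n_entities, gamma=1.0, eta=0.01):
    """One margin-based TransE update for a positive triple (h, r, t) with a random negative sample."""
    h, r, t = triple
    if rng.random() < 0.5:                 # corrupt the head or the tail with probability 0.5
        h_neg, t_neg = rng.integers(n_entities), t
    else:
        h_neg, t_neg = h, rng.integers(n_entities)

    def f(hi, ri, ti):                     # energy f(h, r, t) = ||h + r - t||_2
        return np.linalg.norm(ent[hi] + rel[ri] - ent[ti])

    loss = max(0.0, f(h, r, t) + gamma - f(h_neg, r, t_neg))
    if loss > 0:                           # subgradient step on the margin loss
        d_pos = (ent[h] + rel[r] - ent[t]) / (f(h, r, t) + 1e-9)
        d_neg = (ent[h_neg] + rel[r] - ent[t_neg]) / (f(h_neg, r, t_neg) + 1e-9)
        ent[h]     -= eta * d_pos
        rel[r]     -= eta * (d_pos - d_neg)
        ent[t]     -= eta * (-d_pos)
        ent[h_neg] -= eta * (-d_neg)
        ent[t_neg] -= eta * d_neg
    return loss

# Illustrative sizes: 100 entities, 10 relations, 50-dimensional embeddings.
n_entities, n_relations, dim = 100, 10, 50
ent = rng.standard_normal((n_entities, dim)) * 0.1
rel = rng.standard_normal((n_relations, dim)) * 0.1
triples = [(rng.integers(n_entities), rng.integers(n_relations), rng.integers(n_entities))
           for _ in range(200)]

for epoch in range(10):
    for triple in triples:
        transe_step(ent, rel, triple, n_entities)
```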
FIG. 8 is an exemplary flowchart of obtaining an embedding vector of a relationship path according to some embodiments of the present disclosure. The query intent vector is semantically matched against the entity and relationship vectors in the knowledge graph to find the entity nodes related to the query. Using the entity and relationship embedding vectors obtained in step S6, the semantic similarity between the query intent vector and the entity and relationship vectors is computed, and the entity nodes whose semantic similarity is above a threshold are selected as the entity nodes related to the query. Starting from these entity nodes, implicit multi-hop relationship paths are mined in the graph using the relationship path reasoning algorithm DeepPath. Taking the center word representation vector obtained in step S3 as the initial node, random walks are performed in the domain knowledge graph to generate a number of random walk sequences. The number of random walk steps is set to T, and K sequences are generated. Starting from the entity node corresponding to the center word representation vector, the next node is chosen at random according to the edge weights, generating a random walk sequence; repeating this process produces K random walk sequences, each containing T+1 nodes and T edges. For each random walk sequence, the node sequence is encoded with a bidirectional LSTM to obtain the node embedding vectors of the sequence. The nodes in the random walk sequence are represented by their corresponding entity embedding vectors, and the edges by their corresponding relationship embedding vectors. The forward and backward node sequences are encoded with the bidirectional LSTM: forward LSTM: h_fi = LSTM_f(v_i, h_f(i-1)), i = 1, 2, ..., T+1; backward LSTM: h_bi = LSTM_b(v_i, h_b(i+1)), i = T+1, T, ..., 1, where v_i is the entity embedding vector of the i-th node, and h_fi and h_bi are the hidden states of the i-th node in the forward and backward LSTMs, respectively. The forward and backward hidden states are concatenated to obtain the embedding vector p_i of the i-th node: p_i = [h_fi; h_bi]. The node embedding vectors are then aggregated with attention weights to obtain the aggregated random walk sequence embedding vector as the relationship path representation vector.
The attention weight a_i of the query intent vector q and each node embedding vector p_i is computed: a_i = softmax(q · W_a · p_i), where W_a is the attention weight matrix. The node embedding vectors are summed with these weights to obtain the embedding vector s of the random walk sequence: s = sum(a_i × p_i), i = 1, 2, ..., T+1. The relationship path representation vector s, the entity embedding vector h_embed and relationship embedding vector r_embed obtained in step S6, and the query vector q are concatenated to obtain the concatenated vector x: x = [s; h_embed; r_embed; q]. The concatenated vector x is input into the DeepPath model; the model output is computed by forward propagation, and the DeepPath model parameters are optimized by the back-propagation algorithm. The DeepPath model adopts a multi-layer perceptron (MLP) structure comprising an input layer, hidden layers and an output layer. The vector x is fed into the input layer of the DeepPath model, and the prediction y of the output layer is obtained through the nonlinear transformations of several hidden layers. The loss between the prediction and the true label is computed using the negative log-likelihood loss: L = -log p(y_true | x), where y_true is the true label (indicating whether the relationship path exists) and p(y_true | x) is the output probability of the DeepPath model. The gradients of the loss with respect to the model parameters are computed by back-propagation, and the parameters are updated with an optimization algorithm (e.g., Adam).
The above steps are repeated, training on all random walk sequences until the loss value of the DeepPath model is below a threshold, yielding the optimized DeepPath model. The query vector q is then input into the optimized DeepPath model, and the relationship path embedding vector q_path of the query vector is obtained by forward propagation. Through these steps, knowledge graph reasoning is realized with the relationship path reasoning algorithm DeepPath: DeepPath generates multiple relationship path sequences by random walks and encodes the paths with a bidirectional LSTM, producing embedded representations of the paths. On this basis, DeepPath uses an attention mechanism to aggregate the path embeddings, concatenates them with the query vector and the entity and relationship embedding vectors, and feeds the result into the MLP model for path existence prediction. By optimizing the MLP model, DeepPath learns the implicit patterns and rules of the relationship paths.
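A heavily simplified sketch of the path sampling, attention aggregation and MLP scoring is given below. The bidirectional LSTM encoding and the relation embedding part of the concatenation are omitted for brevity (only [s; q] is scored), and the graph, embeddings and MLP weights are random stand-ins; this is an illustration of the data flow, not the trained DeepPath model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def random_walk(adj, start, T):
    """Sample a T-step walk; adj[node] is a list of (relation_id, next_node) pairs."""
    nodes, edges = [start], []
    for _ in range(T):
        rel, nxt = adj[nodes[-1]][rng.integers(len(adj[nodes[-1]]))]
        edges.append(rel)
        nodes.append(nxt)
    return nodes, edges

def path_score(nodes, ent, q, W_a, mlp):
    """Attention-aggregate node embeddings into s, concatenate with q, and score with an MLP."""
    P = ent[nodes]                                    # node embedding vectors p_i (BiLSTM encoding omitted)
    a = softmax(np.array([p @ W_a @ q for p in P]))   # attention weights a_i
    s = (a[:, None] * P).sum(axis=0)                  # path embedding s
    x = np.concatenate([s, q])                        # concatenated vector x = [s; q]
    h = np.maximum(0, x @ mlp["W1"] + mlp["b1"])      # hidden layer
    logit = h @ mlp["W2"] + mlp["b2"]
    return 1.0 / (1.0 + np.exp(-logit))               # probability that the path exists

dim, n_entities, T = 50, 20, 3
ent = rng.standard_normal((n_entities, dim)) * 0.1
adj = {i: [(0, rng.integers(n_entities)), (1, rng.integers(n_entities))] for i in range(n_entities)}
q = rng.standard_normal(dim)
W_a = rng.standard_normal((dim, dim)) * 0.01
mlp = {"W1": rng.standard_normal((2 * dim, 64)) * 0.1, "b1": np.zeros(64),
       "W2": rng.standard_normal(64) * 0.1, "b2": 0.0}

nodes, edges = random_walk(adj, start=0, T=T)
p_exists = path_score(nodes, ent, q, W_a, mlp)
```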
In step S4 we obtained the data semantic matching results, i.e., the data tables and fields related to the query intent. In step S7 we obtained the relationship path embedding vector related to the query intent through knowledge graph reasoning. The data semantic matching results and the graph reasoning results are now considered together, and their relevance is computed. For each data semantic matching result (data table or field), its semantic similarity sim_data to the query intent is computed; the similarity between the embedding vector of the data table or field and the query intent vector can be measured with, e.g., cosine similarity. For each graph reasoning result (relationship path), its semantic similarity sim_path to the query intent is computed; the similarity between the relationship path embedding vector and the query intent vector can likewise be measured with cosine similarity.
A weighted relevance score of the data semantic matching result and the graph reasoning result is computed: score = w_data × sim_data + w_path × sim_path, where w_data and w_path are the weight coefficients for data matching and path reasoning, which can be adjusted according to actual requirements. The original data retrieval results are then knowledge-enhanced and optimized: the retrieval results are re-ranked according to the computed relevance score, with higher-scoring data tables or fields ranked ahead of lower-scoring ones. The retrieval results are further expanded and enriched using the relationship paths obtained from graph reasoning: for each relationship path, the data tables or fields corresponding to the entity nodes it connects are found, taken as additional retrieval results, and merged with the original retrieval results; the merged results are ranked by relevance score. Finally, the data analysis result is generated and returned to the user.
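A minimal sketch of the weighted fusion and re-ranking follows; the weights w_data = 0.7 and w_path = 0.3, the candidate names, and the random vectors are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rerank(query_vec, data_results, path_results, w_data=0.7, w_path=0.3):
    """Fuse data-matching and path-reasoning similarities into one relevance score and sort."""
    scored = []
    for name, vec in data_results.items():
        sim_data = cosine(query_vec, vec)
        # Best path similarity among relationship paths that reach this table/field (0 if none).
        sim_path = max((cosine(query_vec, p) for p in path_results.get(name, [])), default=0.0)
        scored.append((name, w_data * sim_data + w_path * sim_path))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

rng = np.random.default_rng(0)
d = 768
query_vec = rng.standard_normal(d)
data_results = {"sales_order_table": rng.standard_normal(d),
                "user_account_table": rng.standard_normal(d)}
path_results = {"sales_order_table": [rng.standard_normal(d)]}   # paths linked to each result (stand-ins)
final_ranking = rerank(query_vec, data_results, path_results)
```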

Claims (10)

1. An industry data analysis method based on NLP, comprising:
Step one, pre-training a language model, and fine-tuning the pre-trained language model by adopting marking data in an industry model library to obtain a field language model;
inputting the metadata of the input industry data into the domain language model obtained in the first step, extracting semantic features of the metadata to obtain semantic representation vectors of the metadata, wherein the metadata comprises table names and field names;
Thirdly, acquiring a central word of the query intention of the user by using a named entity recognition algorithm for the query intention input by the user; carrying out vectorization processing on the central word by adopting the domain language model obtained in the first step to obtain a query vector;
Step four, adopting an attention mechanism, and obtaining a matching score by calculating the similarity between the query vector and the semantic representation vector of the metadata of the industrial data in the step two; sorting the semantic representation vectors of the candidate metadata according to the matching score to obtain an industry data retrieval result of the user query vector;
Fifthly, constructing a domain knowledge graph by utilizing entities and relations in an industry model library;
Step six, mapping the domain knowledge graph constructed in step five to a low-dimensional vector space by adopting a graph embedding model to obtain a distributed vector representation of the entities and relationships;
Seventhly, carrying out semantic matching on the query vector obtained in the step three and the distributed vector representation of the entity and the relation obtained in the step six by adopting an attention mechanism to obtain the entity and the relation corresponding to the query vector; carrying out relationship path reasoning in the domain knowledge graph according to the query vector to obtain a relationship path of the query vector;
And step eight, calculating the relativity between the query vector and the entity, the relationship and the relationship path obtained in the step seven through semantic matching, and combining the industry data retrieval result to obtain a knowledge enhanced retrieval result as a final industry data analysis result.
2. The NLP-based industry data analysis method of claim 1, wherein:
the RoBERTa model is used as the language model.
3. The NLP-based industry data analysis method of claim 2, wherein:
fine tuning a pre-trained language model, comprising:
Constructing a training set by using the marking data in the industry model library;
Randomly shielding part of the word elements in each text data in the training set, and predicting the shielded word elements by using the context information to obtain word element prediction data taking the shielded word elements as labels as a first training set;
Randomly selecting two text fragments from the training set, if the two text fragments are adjacent in the original text, constructing a positive sample pair, otherwise, constructing a negative sample pair to obtain judgment data by taking whether the text fragments are adjacent as labels as a second training set;
And inputting the first training set and the second training set which are constructed into a pre-trained RoBERTa model, and performing fine adjustment on parameters of the RoBERTa model by adopting an Adam algorithm and minimizing a loss function according to the preset batch size and iteration round number.
4. The NLP-based industry data analysis method of claim 3, wherein:
extracting semantic representation vectors of metadata, comprising:
Carrying out named entity recognition on metadata of industry data to obtain a text sequence of the metadata;
Dividing a text sequence into word elements with sub word levels by adopting Byte Pair Encoding algorithm, counting the occurrence frequency of co-occurrence byte pairs in the word elements with the sub word levels, and merging the byte pairs with the occurrence frequency larger than a threshold value to obtain a word element table of the domain language model;
According to the word list, mapping each word in the text sequence of the metadata into a unique numerical ID to obtain a numerical word ID sequence;
Converting the word element ID sequence into a corresponding word element embedded vector sequence serving as an input characteristic of the domain language model;
generating a position vector of a corresponding dimension according to the position information of the word element in the word element ID sequence by adopting a sinusoidal position encoding method;
embedding the position information into the input features through the addition operation of the position vector and the word element embedding vector sequence;
The input features of the embedded position information are used as input vectors of the domain language model.
5. The NLP-based industry data analysis method of claim 4, wherein:
extracting semantic representation vectors of metadata, further comprising:
Inputting the obtained input vector into a fine-tuned RoBERTa model, wherein the RoBERTa model comprises a multi-layer Transformer encoder;
In each layer of the Transformer encoder of the RoBERTa model, a contextual semantic representation of the metadata is obtained by:
calculating the attention weight among the words in the input vector by utilizing a multi-head self-attention mechanism, and acquiring the dependency relationship among the words;
The attention weight matrix and the input vector are used for weighted summation to obtain an output vector of the multi-head self-attention mechanism;
carrying out residual connection on an input vector and an output vector, and obtaining a normalized representation vector through layer normalization;
nonlinear transformation is carried out on the normalized representation vector by utilizing a feedforward neural network, so as to obtain a transformed vector;
residual connection is carried out on the transformed vector and the normalized representation vector, and the output vector of the current Transformer encoder is obtained through layer normalization;
And taking the output vector of the current Transformer encoder as the input vector of the next layer of Transformer encoder, repeating the steps until the last layer of Transformer encoder of the RoBERTa model, and obtaining the context semantic representation vector of the metadata as the hidden layer feature representation of the metadata.
6. The NLP-based industry data analysis method of claim 5, wherein:
extracting semantic representation vectors of metadata, further comprising:
Obtaining a fixed-dimension semantic representation vector of the metadata through the output vector of the last layer of the Transformer encoder of the RoBERTa model;
carrying out L2 normalization on the obtained fixed-dimension semantic representation vector to obtain a normalized semantic representation vector;
Randomly generating a plurality of hash functions, wherein each hash function corresponds to one binary bit; for each dimension characteristic of the normalized semantic representation vector, calculating a corresponding hash value by adopting a hash function;
combining the hash value obtained by each hash function into a binary hash code as a semantic hash representation of metadata;
taking the numerical ID of the metadata as a key, and storing the corresponding semantic hash representation as a value in a hash table;
And ordering key value pairs in the hash table according to the semantic hash representation, constructing a semantic index, and taking the obtained semantic index as a final semantic representation vector of the metadata.
7. The NLP-based industry data analysis method of any one of claims 1 to 6, wherein:
in the third step, obtaining the center word of the query intention, which comprises the following steps:
Inputting the query intention input by the user into BiLSTM-CRF model, extracting the context characteristics of the query intention of the user through the bidirectional LSTM layer;
according to the context characteristics, a conditional random field (CRF) layer is adopted to perform named entity labeling on each word in the query intention, so that entity words in the query intention are obtained;
calculating the relevance weight of each entity word in the query intention and the query intention representation by using an attention mechanism to obtain the attention weight of each entity word;
Multiplying each entity word in the query intention by the corresponding attention weight to obtain a weighted entity word vector;
And weighting the entity word vectors to obtain aggregated entity expression vectors serving as central word expression vectors of the query intention.
8. The NLP-based industry data analysis method of claim 7, wherein:
in step six, mapping the entity and the relation to the low-dimensional vector space, including:
For each triplet in the domain knowledge graph, randomly initializing embedding vectors of the head entity, the relation and the tail entity to obtain an initial entity embedding vector and a relation embedding vector;
defining TransE energy functions of the model as Euclidean distance between the sum of the head entity embedded vector and the relation embedded vector and the tail entity embedded vector;
Adopting a negative sampling method, and randomly replacing a head entity or a tail entity for each positive sample triplet in the domain knowledge graph to generate a corresponding negative sample triplet;
According to the defined energy function, calculating an energy function value of the positive sample triplet and an energy function value of the corresponding negative sample triplet;
using a gradient descent algorithm to minimize the difference between the energy function of the positive sample triplet and the energy function value of the negative sample triplet, and updating the entity embedding vector and the relation embedding vector;
Repeating the steps until the TransE model converges, and obtaining entity embedded vectors and relation embedded vectors after all entities and relations in the domain knowledge graph are mapped to the low-dimensional vector space.
9. The NLP-based industry data analysis method of claim 8, wherein:
in step seven, a relation path of the query vector is obtained, which comprises the following steps:
taking the central word representation vector obtained in the step three as an initial node, carrying out random walk in a domain knowledge graph, and generating a plurality of random walk sequences, wherein each random walk sequence consists of nodes and edges;
For each random walk sequence, encoding the node sequence by adopting a bidirectional LSTM to obtain node embedded vectors of the random walk sequence;
Carrying out weighted aggregation on the obtained node embedded vectors by adopting an attention mechanism to obtain the aggregated random walk sequence embedded vectors as relationship path expression vectors;
splicing the obtained relation path expression vector, the entity embedded vector and the relation embedded vector obtained in the step six, and the query vector to obtain a spliced vector;
inputting the spliced vector into DeepPath model, calculating DeepPath model output through forward propagation, and optimizing DeepPath model parameters through backward propagation algorithm;
Repeating the steps until the loss function value of the DeepPath model is smaller than a threshold value, and obtaining an optimized DeepPath model;
and inputting the query vector into the optimized DeepPath model, and obtaining an embedded vector of the relation path of the query vector through forward propagation.
10. A system based on the NLP-based industry data analysis method of any one of claims 1 to 9.
CN202410473266.1A 2024-04-19 Industry data analysis method based on NLP Active CN118070812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410473266.1A CN118070812B (en) 2024-04-19 Industry data analysis method based on NLP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410473266.1A CN118070812B (en) 2024-04-19 Industry data analysis method based on NLP

Publications (2)

Publication Number Publication Date
CN118070812A true CN118070812A (en) 2024-05-24
CN118070812B CN118070812B (en) 2024-07-05


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896407A (en) * 2022-03-21 2022-08-12 武汉理工大学 Question-answering method based on combination of semantic analysis and vector modeling
CN115688776A (en) * 2022-09-27 2023-02-03 北京邮电大学 Relation extraction method for Chinese financial text
CN115982338A (en) * 2023-02-24 2023-04-18 中国测绘科学研究院 Query path ordering-based domain knowledge graph question-answering method and system
CN116662582A (en) * 2023-08-01 2023-08-29 成都信通信息技术有限公司 Specific domain business knowledge retrieval method and retrieval device based on natural language
CN116824000A (en) * 2023-07-22 2023-09-29 深圳市中壬银兴信息技术有限公司 Text data chart style-based one-key global adaptation processing method
CN117648926A (en) * 2024-01-30 2024-03-05 北京数语科技有限公司 Method and system for automatically creating data model based on natural language

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张健东: "Intelligent *** based on legal big data", Industrial Control Computer, no. 05, 25 May 2020 (2020-05-25), pages 72-74 *
徐晓雅: "Design and implementation of an emergency resource recommendation *** based on knowledge graphs", China Master's Theses Full-text Database, Information Science and Technology, 15 April 2024 (2024-04-15) *

Similar Documents

Publication Publication Date Title
Jung Semantic vector learning for natural language understanding
CN111737496A (en) Power equipment fault knowledge map construction method
Qin et al. A survey on text-to-sql parsing: Concepts, methods, and future directions
CN112989834A (en) Named entity identification method and system based on flat grid enhanced linear converter
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
CN115269865A (en) Knowledge graph construction method for auxiliary diagnosis
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
Başarslan et al. Sentiment analysis on social media reviews datasets with deep learning approach
Soyalp et al. Improving text classification with transformer
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system
Shan et al. Geographical address representation learning for address matching
Zhou et al. Learning transferable node representations for attribute extraction from web documents
CN117390131A (en) Text emotion classification method for multiple fields
CN116629361A (en) Knowledge reasoning method based on ontology learning and attention mechanism
CN118070812B (en) Industry data analysis method based on NLP
Priyadarshi et al. The first named entity recognizer in Maithili: Resource creation and system development
Bohra et al. Performance Evaluation of Word Representation Techniques using Deep Learning Methods
CN118070812A (en) Industry data analysis method and system based on NLP
Zou et al. A hybrid model for text classification using part-of-speech features
CN113111288A (en) Web service classification method fusing unstructured and structured information
Abdolahi et al. A new method for sentence vector normalization using word2vec
Akter et al. Sentiment forecasting method on approach of supervised learning by news comments
Mohammadkhani A Comparative Evaluation of Deep Learning based Transformers for Entity Resolution
Peng et al. A Collaborative Optimization-Guided Entity Extraction Scheme
Yang et al. Construction and analysis of scientific and technological personnel relational graph for group recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant