CN113987197A - Dynamic fusion and growth method for product node system in whole field - Google Patents


Info

Publication number
CN113987197A
CN113987197A
Authority
CN
China
Prior art keywords
product
node
concept
model
training
Prior art date
Legal status
Granted
Application number
CN202111166990.2A
Other languages
Chinese (zh)
Other versions
CN113987197B (en)
Inventor
张啸天
宗畅
杨彦飞
许源泓
Current Assignee
Hangzhou Liangzhi Data Technology Co ltd
Original Assignee
Hangzhou Liangzhi Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Liangzhi Data Technology Co ltd filed Critical Hangzhou Liangzhi Data Technology Co ltd
Priority to CN202111166990.2A priority Critical patent/CN113987197B/en
Publication of CN113987197A publication Critical patent/CN113987197A/en
Application granted granted Critical
Publication of CN113987197B publication Critical patent/CN113987197B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a dynamic fusion and growth method for a whole-field product node system. To meet the cognitive decision-making needs of regional industries for fine-grained emerging fields during economic development, the method builds on an existing authoritative product classification system: product concept nodes are continuously mined from massive semi-structured and unstructured heterogeneous internet data sources using natural language processing and knowledge graph techniques such as concept acquisition, relation discrimination, and attribute fusion; the product concepts are represented with text-embedding techniques; and the relations between the product concepts and the nodes of the original product system are judged, fused, and attached, so that the content of the node system continuously expands into a whole-field product node system capable of dynamic fusion and growth. In addition, a human-machine collaborative interaction workflow during system construction and updating ensures the authority and accuracy of the whole-field product node system.

Description

Dynamic fusion and growth method for product node system in whole field
Technical Field
The invention relates to the fields of computer technology, artificial intelligence and natural language processing, in particular to a dynamic fusion and growth method of a product node system in the whole field.
Background
With the development of computer science and artificial intelligence, automation and intelligence have become key elements in the digital innovation and upgrading of many industries. In the digital reform of the economy in particular, traditional cognitive analysis and decision-making for regional industrial economic development rely on expert experience: large labor costs are sunk, the product system framework used in industrial analysis cannot be effectively abstracted and accumulated, and rapid, standardized construction of a fine-grained product framework system cannot be implemented.
Given the scenario requirements of product system construction, the whole-field product system must be built on an internationally accepted standard industry classification and then extended downward with finer-grained product category nodes. The problems to be solved include: how to select the base standard industry classification; how to identify product concepts from semi-structured annual-report data and unstructured paper and patent texts; how to compute embedded representations of system nodes; how to judge synonymy and superior-inferior (hypernym-hyponym) relations between products; and how to guarantee accurate, continuous, dynamic fusion and growth of the product system, ensuring high-quality construction of the whole-field product system.
Therefore, a dynamic fusion and growth method for a whole-field product node system is urgently needed, so that product systems can be constructed automatically and rapidly, regional industry cognitive decision-making can be assisted, and the innovation demands of the digital industrial economy can be met.
Disclosure of Invention
In view of the above, the invention aims to solve the problems that, in industry analysis research, the industry knowledge in experts' minds cannot be accumulated and an industry product knowledge system cannot be quickly established, and provides a dynamic fusion and growth method for a whole-field product node system. The method is a cross-language, whole-industry-chain product system construction and self-growing dynamic updating method for the system framework of industrial analysis decisions. It can mine product concepts from heterogeneous data sources, dynamically fuse and grow the system for continuously emerging new products, and provides a new approach for the exploration, classification, and analysis of new products.
In order to achieve the purpose of the invention, the technical scheme adopted by the invention specifically comprises the following steps:
a dynamic fusion and growth method of a product node system in the whole field comprises the following steps:
S1, taking a general product classification system required for constructing the whole-field product system as the upper-layer framework of the product node system, and fine-tuning a pre-training language model on the data set of the general product classification system to obtain a domain language model used to produce the word-embedding representation of each node in the product node system;
S2, extracting product concepts from unstructured text data containing product concepts with a pre-trained product concept extraction model, and extracting product concepts from semi-structured text data containing product concepts with rules; the unstructured and semi-structured text data are continuously and dynamically updated, so that product-concept words and phrases are continuously extracted from them and merged to form a candidate product concept set;
S3, training a synonymy concept discrimination model on a product concept alias library and judging the synonymy relation between the candidate product concepts in the candidate product concept set and the nodes of the existing product node system; product concepts and nodes in the synonymy relation are fused as concept-node pairs to obtain a node system with expanded aliases, while product concepts synonymous with no node are taken as new product concepts;
S4, based on the domain language model obtained in S1, constructing a training set of node-node pairs conforming to the superior-inferior relation from the existing product node system, and training a superior-inferior relation classification model that judges the direct parent node of a node concept; predicting with the trained model the parent node of each new product concept obtained in S3, and attaching and expanding the new product concepts into the product node system according to the prediction results;
and S5, sending the candidate product concept set obtained in S2 and the node systems expanded in S3 and S4 to a manual auditing end for verification, updating the product node system according to the verification results, and updating the training samples of the models used in S2-S4 to improve model performance, thereby realizing continuous, dynamic construction of the whole-field product node system.
Preferably, the S1 specifically includes the following steps:
S11, according to the construction requirements of the whole-field product system, using the universal product classification (HS code) as the seed node system that forms the upper-layer framework of the product node system, and deriving the superior-inferior relation data set of the product node system;
S12, fine-tuning the Bert pre-training language model on the description texts of the seed node system to learn the semantic features of domain text expression, obtaining a domain language model with which the feature vector of each node concept in the product node system can be computed.
Preferably, the S2 specifically includes the following steps:
S21, for continuously collected semi-structured text data containing product concepts, performing rule-based structured parsing and extraction of the product concepts in the text to generate a first candidate product concept set;
S22, for continuously collected unstructured text data containing product concepts, obtaining a training sample set of product-concept sequences through manual annotation, training a product concept extraction model on this sample set with an NLP sequence labeling model, and using the model to extract product-concept sequences from newly collected unstructured text, generating a second candidate product concept set;
and S23, merging the first and second candidate product concept sets into the candidate product concept set, which serves as the basis for expanding the existing product node system.
Further, the semi-structured text data containing product concepts is an enterprise yearbook.
Further, the unstructured text data containing product concepts are patent text data and/or thesis text data.
Preferably, the S3 specifically includes the following steps:
S31, constructing a synonymous-concept sample set conforming to the product synonymy relation from the alias information of product concepts, and training a synonymy concept discrimination model on it using the sequence classification task of the Bert pre-training language model; then, for the candidate product concept set, predicting with this model the synonymy relation between each candidate product concept and each node of the existing product node system. If a candidate product concept is synonymous with a node of the existing product node system, the concept and the node are taken as a concept-node pair conforming to the synonymy relation; if a candidate product concept is synonymous with no node of the existing product node system, it is taken as a new product concept and added to the new product concept candidate set;
and S32, for the concept-node pairs conforming to the synonymy relation obtained in S31, fusing the candidate product concept nouns into the attributes of the corresponding nodes of the existing product node system and storing them in the alias attribute field of the node instance in the product library, realizing node alias attribute fusion.
Preferably, the S4 specifically includes the following steps:
S41, constructing a training set of query-node concept pairs from the existing superior-inferior relations in the product node system; in each query-node concept pair, the query is a product concept to be attached and the node is a product node of the product node system, the product node information formed by all nodes is represented as a node graph structure, and the training label is set to 1 if the node is the direct parent node of the query and to 0 otherwise;
S42, initializing the feature vectors of the query product concept and of each product node in the node graph structure with the domain language model obtained in S1; propagating, fusing, and iteratively updating the features of each node in the graph with a GNN graph neural network model to obtain the word-embedding representation of each query and node; feeding these representations into a binary classification model and training it to recognize whether the node of a query-node concept pair is the direct parent node of the query, thereby obtaining the superior-inferior relation classification model;
and S43, for the new product concepts obtained in S3, judging with the superior-inferior relation classification model the relation between each new product concept and each existing node of the product node system one by one, and selecting the existing node with the highest matching degree as the direct parent node, so as to attach the concept and expand the product node system.
Preferably, in S5, a human-computer interaction verification tool is used for manual review of the product concept extraction results, product alias fusion results, and product system growth results of steps S2, S3, and S4; meanwhile, the sample data sets used to train the models in these steps are continuously updated according to the verification results, so that model performance improves iteratively and a high-quality whole-field product node system is continuously constructed.
The dynamic fusion and growth method for the whole-field product node system of the invention can quickly construct a standardized product knowledge system for the application scenario of regional industry cognitive decision-making. In addition, it outputs a set of semi-automatic workflows for constructing the whole-field product node system, so the method can be applied in industrial-chain analysis and decision systems and helps raise the degree of automation and intelligence of the industry-development cognitive decision-making process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic full-flow diagram of a dynamic fusion and growth method of a product node system in the whole field according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a concept of mining and identifying products from heterogeneous data sources according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a modeling method of a query-node concept pair context decision model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of iterative update of a product knowledge system through man-machine cooperative interaction according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, steps, operations, or elements, but do not preclude the presence or addition of one or more other features, steps, operations, or groups of elements.
It should also be understood that when directional indications are referred to in embodiments of the present invention, the directional indications are merely used to explain the relative relationship of the components built up at a particular location, and if the particular location changes, the directional indications change accordingly.
In addition, although the construction of the whole-field product node system in the invention is guided by general cross-industry product knowledge, the construction method is not limited to any industry; a product node system for any specific industry can be built with the same construction and dynamic-updating method.
In the invention, the dynamic fusion and growth method of the product node system in the whole field comprises the following steps:
S1, an authoritative general product classification system required for constructing the whole-field product system is taken as the upper-layer framework of the product node system, and a pre-training language model is fine-tuned (Fine-tune) on the data set of the general product classification system to obtain a domain language model used to produce the word-embedding representation of each node in the product node system.
S2, product concepts are extracted from unstructured text data containing product concepts with the pre-trained product concept extraction model, and from semi-structured text data containing product concepts with rules; the unstructured and semi-structured text data are continuously and dynamically updated, so that product-concept words and phrases are continuously extracted from them and merged to form a candidate product concept set.
S3, a synonymy concept discrimination model is trained on a product concept alias library, and the synonymy relation between the candidate product concepts in the candidate product concept set and the nodes of the existing product node system is judged; product concepts and nodes in the synonymy relation are fused as concept-node pairs to obtain the node system with expanded aliases, while product concepts synonymous with no node are taken as new product concepts.
S4, based on the domain language model obtained in S1, a training set of node-node pairs conforming to the superior-inferior relation is constructed from the existing product node system, and a superior-inferior relation classification model is trained so that the direct parent node of a node concept (query) can be judged; the trained model predicts the parent node of each new product concept obtained in S3, and the new product concepts are attached and expanded into the product node system according to the prediction results.
And S5, the candidate product concept set obtained in S2 and the node systems expanded in S3 and S4 are sent to a manual auditing end for verification; the product node system is updated according to the verification results, and the training samples of the models used in S2-S4 are updated to improve model performance, thereby realizing continuous, dynamic construction of the whole-field product node system.
The following describes a specific implementation manner of the above steps S1 to S5 in this embodiment.
In the embodiment of the present invention, for the step S1 shown in fig. 1, the specific implementation process is as follows:
S11, according to the construction requirements of the whole-field product system, a general product classification system that meets the conditions is selected as the upper-layer framework of the product node system, and the superior-inferior relation data set of its nodes is constructed. It should be noted that the selected general product classification system needs to be authoritative. In this embodiment, based on professional background knowledge and considering node scale, node granularity, and node expression form, the HS code (customs code) is selected as the seed node system forming the upper-layer architecture of the authoritative product node system, and the superior-inferior relation data set of the node system is then constructed from this standard system. The seed node system serves as the basic structure of the whole product node system; its nodes represent product concepts at different levels and stand in superior-inferior relations to one another. Attribute expansion of nodes and attachment of new product-concept nodes can subsequently be performed on this basis, so the product node system keeps expanding and the whole-field product node system is constructed continuously and dynamically.
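As a minimal sketch of step S11, the seed hierarchy and the superior-inferior relation data set derived from it could look as follows. The codes and labels here are invented illustrations in the style of HS codes, not the real nomenclature or the patent's actual data format:

```python
# Hypothetical sketch: store a tiny HS-code-style seed hierarchy and derive
# the superior-inferior (parent-child) relation data set from it.
SEED_NODES = {
    "84":     {"label": "machinery and mechanical appliances", "parent": None},
    "8471":   {"label": "automatic data processing machines", "parent": "84"},
    "847130": {"label": "portable computers", "parent": "8471"},
}

def build_relation_dataset(nodes):
    """Return (child_label, parent_label) pairs for every non-root node."""
    pairs = []
    for code, info in nodes.items():
        if info["parent"] is not None:
            pairs.append((info["label"], nodes[info["parent"]]["label"]))
    return pairs

pairs = build_relation_dataset(SEED_NODES)
```

Each pair is one positive training example of a direct superior-inferior relation; the same traversal also yields the node description texts on which the domain language model of S12 is fine-tuned.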
S12, Fine-tune training is performed with the Bert pre-training language model on the description texts of the seed node system to learn the semantic features of domain text expression. The domain language model obtained after training can produce the feature vector of each node concept in the product node system, i.e., the word-embedding representation of the node.
Fig. 2 is a schematic diagram of extracting product concept words from heterogeneous data sources according to an embodiment of the present invention. Product concepts need to be extracted from two types of heterogeneous data, semi-structured text data and unstructured text data, to further extend the existing product node system. This process corresponds to step S2 of the method shown in fig. 1 and is implemented as follows:
and S21, carrying out rule-based structured analysis and extraction on the product concepts in the text for the semi-structured text data containing the product concepts obtained by continuous collection to generate a first candidate product concept set.
And S22, obtaining a training sample set containing a product concept sequence for continuously acquired unstructured text data containing product concepts through manual labeling, further training a product concept extraction model on the basis of the training sample set by using an NLP sequence labeling model, extracting the product concept sequence on continuously acquired new unstructured text data through the product concept extraction model, and generating a second candidate product concept set.
And S23, merging the first candidate product concept set and the second candidate product concept set into a candidate product concept set which is used as a basis for expanding the existing product node system.
It should be noted that, since product concepts in an industry are continuously updated over time, the extraction of candidate product concepts in this step runs continuously, enabling dynamic updating of the product node system. In practice, new semi-structured and unstructured text data can be continuously collected and accumulated, and candidate product concepts are periodically extracted from the accumulated data to update the product node system.
The specific forms of the semi-structured and unstructured text data may vary, as long as they contain product concepts. In this embodiment, the semi-structured text data are the semi-structured product information tables disclosed in enterprise yearbooks; the relevant product information can be extracted with rule-based methods, and the specific extraction rules can be determined according to the linguistic characteristics of product concepts in the text.
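The patent does not disclose its concrete extraction rules, so the following is only a hedged illustration of what a rule for a yearbook product field might look like: a "main products" column is split on common list delimiters. The field name, delimiters, and row format are all assumptions:

```python
import re

# Hypothetical rule sketch: a yearbook row often lists main products in one
# field, separated by Chinese or Western delimiters. The field name and the
# delimiter set below are assumptions, not the patent's actual rules.
DELIMS = r"[、,;，；/]"

def extract_products(row):
    """Split the product field of one semi-structured table row into candidates."""
    field = row.get("main_products", "")
    return [p.strip() for p in re.split(DELIMS, field) if p.strip()]

row = {"company": "ExampleCo", "main_products": "lithium battery、charger; BMS module"}
candidates = extract_products(row)
```

Real rules would likely also normalize units, strip qualifiers, and filter non-product phrases before adding results to the first candidate product concept set.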
For unstructured text data, the embodiment of the present invention employs patent text data and paper text data. Product concepts in unstructured text cannot be extracted by rules, so accumulated paper and patent texts are used as annotation corpora to form labeled data; a product concept extraction model is trained on these labels with a Bert + LSTM + CRF framework, product concept entities are mined from the accumulated patent and paper data sets, and candidate product concept words are generated. The concept extraction process of the product concept extraction model is as follows:
First, each token in the text data is labeled with one of three tags: "B-Product" marks the beginning token of a product-concept sequence, "I-Product" marks the middle and ending tokens of a product-concept sequence, and "O" marks tokens outside any product-concept sequence.
Then, the labeled training data are processed and fed into the Bert pre-training model to obtain pre-trained vector features; these features are fed into a classical LSTM + CRF model to obtain the final features and to judge whether each token belongs to a product-concept sequence. Once model training is finished, new product concepts can be extracted.
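Independently of which sequence model produces the tags, the B-Product / I-Product / O sequence must be decoded into product-concept spans. A minimal decoder for that tag scheme (tokens and tags below are invented examples) could be:

```python
# Decode a B-Product / I-Product / O tag sequence into product-concept spans.
def decode_bio(tokens, tags):
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-Product":
            if current:                  # close any span already open
                spans.append(" ".join(current))
            current = [tok]              # start a new span
        elif tag == "I-Product" and current:
            current.append(tok)          # continue the open span
        else:                            # "O" or a stray I-Product ends the span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["the", "lithium", "battery", "and", "solar", "panel"]
tags   = ["O", "B-Product", "I-Product", "O", "B-Product", "I-Product"]
concepts = decode_bio(tokens, tags)
```

The decoded spans are the candidate product concept words that feed the second candidate product concept set.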
In addition, the step S3 shown in fig. 1 is implemented as follows:
S31, a synonymous-concept sample set conforming to the product synonymy relation is constructed from pre-accumulated product concept alias information, where each synonymous-concept sample corresponds to a group of differently expressed words denoting the same product concept. A synonymy concept discrimination model is trained on this sample set using the classic sequence classification task of the Bert pre-training language model. Then, for the candidate product concept set, this model predicts the synonymy relation between each candidate product concept and each node of the existing product node system. If a candidate product concept is synonymous with a node of the existing product node system, the concept and the node are taken as a concept-node pair conforming to the synonymy relation; if a candidate product concept is synonymous with no node of the existing product node system, it is taken as a new product concept and added to the new product concept candidate set.
And S32, for the concept-node pairs conforming to the synonymy relation obtained in S31, the candidate product concept nouns are fused into the attributes of the corresponding nodes of the existing product node system and stored in the alias attribute field of the node instance in the product library, realizing node alias attribute fusion.
In this embodiment, for step S31, Bert is also used as the pre-training model to construct the semantic feature vectors of the concepts when training the inter-concept synonymy discrimination model. A typical sequence classification task in the Bert fine-tune scenario is then applied: concept-node pairs are the input and the synonymy label is the prediction target for model training, after which the model predicts whether new concept-node pairs satisfy the synonymy relation.
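The construction of the labeled pairs from the alias library can be sketched without any ML machinery. The alias groups below are invented examples; positives come from within a group, negatives from across groups (the patent does not specify its negative-sampling strategy, so cross-group pairing is an assumption):

```python
import itertools

# Hypothetical sketch: turn an alias library (groups of names for the same
# product) into labeled pairs for the synonymy discrimination model.
# Label 1 = synonymous, 0 = not synonymous.
ALIAS_GROUPS = [
    ["laptop", "notebook computer", "portable computer"],
    ["cell phone", "mobile phone"],
]

def build_pairs(groups):
    positives = [(a, b, 1) for g in groups
                 for a, b in itertools.combinations(g, 2)]
    negatives = [(a, b, 0) for g1, g2 in itertools.combinations(groups, 2)
                 for a in g1 for b in g2]
    return positives + negatives

pairs = build_pairs(ALIAS_GROUPS)
```

Each (a, b, label) triple would then be encoded as a Bert sequence-pair input with the label as the classification target.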
Fig. 3 shows the modeling method of the superior-inferior relation decision model over node-node pairs (i.e., query-node concept pairs) involved in step S4. The superior-inferior relation model between new nodes and existing nodes is trained and applied as follows:
s41, constructing a query-node concept pair training set by utilizing the existing node up-down relation in a product node system, wherein in each query-node concept pair, a query represents a product concept to be hooked, a node represents a product node in the product node system, product node information formed by all nodes is represented by a node graph structure, the label of the training set is set to be 1 or 0, wherein 1 represents that the node is a direct parent node of the query, and 0 is opposite;
S42, initializing the feature vectors of the query product concept and of each product node in the node graph structure by using the domain language model obtained in S1, carrying out propagation fusion and iterative update of each node's features in the node graph structure by adopting a GNN graph neural network model to obtain word-embedding representations of each query and node, inputting these into a binary classification model, and training the binary classification model so that it can identify whether the node in a query-node concept pair is a direct parent node of the query, thereby obtaining the superior-inferior relation classification judgment model;
S43, for the new product concepts obtained in S3, judging the superior-inferior relation between each new product concept and each existing node in the product node system one by one by utilizing the superior-inferior relation classification judgment model, and taking the existing node with the highest computed matching degree as the direct parent node, thereby carrying out attachment expansion of the product node system.
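The training-set construction of S41 can be sketched as follows — a minimal illustration under the assumption that the existing hierarchy is given as a child-to-direct-parent mapping (function and variable names are hypothetical):

```python
import random

def build_query_node_pairs(parent_of, negatives_per_query=2, seed=0):
    """Build (query, node, label) triples from an existing hierarchy given
    as a child -> direct-parent mapping; label 1 iff node is the parent."""
    rng = random.Random(seed)
    nodes = sorted(set(parent_of) | set(parent_of.values()))
    triples = []
    for query, parent in sorted(parent_of.items()):
        triples.append((query, parent, 1))               # positive sample
        # any node other than the true parent (or the query itself)
        # can serve as a label-0 negative
        candidates = [n for n in nodes if n not in (parent, query)]
        k = min(negatives_per_query, len(candidates))
        for node in rng.sample(candidates, k):
            triples.append((query, node, 0))
    return triples

# toy hierarchy (hypothetical data)
hierarchy = {"laptop": "computer", "desktop": "computer",
             "computer": "electronics"}
triples = build_query_node_pairs(hierarchy)
```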
For ease of understanding, the graph-structure data set constructed in step S41 and the feature-vector generation method on the graph structure of step S42 are explained in detail below.
Part 1, as shown in Fig. 3, is the graph structure corresponding to a node $n_i$. The graph structure jointly considers the parent node and the sibling nodes of the node. For a given node $n_i$, let it be node $b$ in the left graph and let its set of adjacent nodes be denoted $N(b)$; for the purpose of expressing the GNN computation formulas, define

$$\tilde{N}(b) = N(b) \cup \{b\}$$
The vector of node $b$ at the $k$-th iteration is updated over the union of node $b$ and its adjacent node set $N(b)$:

$$h_b^{(k)} = \mathrm{Agg}^{(k)}\left(\left\{\, h_u^{(k-1)} \mid u \in \tilde{N}(b) \,\right\}\right) \qquad (1)$$

where $h_b^{(0)}$ is the word-embedding representation of the initial state, $h_b^{(k)}$ is the word-embedding representation of node $b$ at the $k$-th iteration, and $\mathrm{Agg}^{(k)}$ performs a propagation-aggregation computation at the $k$-th iteration over all nodes $u \in \tilde{N}(b)$ on the basis of the previous iteration's result. Commonly used Agg functions on graph structures are the graph convolutional neural network (GCN) and the graph attention neural network (GAT); the GCN defines the Agg function as follows:

$$h_b^{(k)} = \rho\left(\sum_{u \in \tilde{N}(b)} c_{ub}\, W^{(k)}\, h_u^{(k-1)}\right) \qquad (2)$$

where $\rho$ is a nonlinear activation function, such as ReLU, $W^{(k)}$ is a parameter learned by the model, and $c_{ub} = 1\big/\sqrt{|\tilde{N}(u)|\,|\tilde{N}(b)|}$ is the normalization coefficient, whose value is the same at every iteration in which it participates.
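A minimal numpy sketch of one step of the GCN Agg function above, assuming ReLU for the activation and symmetric degree normalization over the self-loop-augmented neighborhood (the toy graph and identity weight matrix are illustrative assumptions):

```python
import numpy as np

def gcn_step(A, H, W):
    """One GCN aggregation step: add self-loops (the union of each node
    with its neighbors), apply symmetric degree normalization, transform
    by W, and pass through ReLU."""
    A_tilde = A + np.eye(A.shape[0])           # adjacency with self-loops
    d = A_tilde.sum(axis=1)                    # |N~(u)| for each node
    C = 1.0 / np.sqrt(np.outer(d, d))          # normalization coefficients
    return np.maximum((A_tilde * C) @ H @ W, 0.0)

# toy graph: node 0 adjacent to nodes 1 and 2
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
H0 = np.eye(3)     # toy initial word embeddings
W = np.eye(3)      # identity transform, for illustration only
H1 = gcn_step(A, H0, W)
```

Because the normalization depends only on node degrees, the coefficient matrix `C` is identical at every iteration, matching the remark above.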
If the normalization coefficient is regarded as an importance weight between two nodes, then during iterative computation the GCN treats all adjacent nodes equally: it considers only the structural characteristics of the node and ignores the weight information between the node and its neighbors. The embodiment of the invention therefore adopts the optimized GAT graph attention model, which redefines $c_{ub}$ so that node information is propagated taking into account not only the structure information but also the nodes' own information:

$$\alpha_{ub} = \operatorname{softmax}_{u \in \tilde{N}(b)}\Big(\gamma\big(z^{\top}\left[\, W h_u \,\|\, W h_b \,\right]\big)\Big) \qquad (3)$$

In the above formula, $z$ and $W$ are both parameters to be learned, $\gamma$ is a nonlinear activation function, and $\|$ represents the concatenation operation. Substituting formula (3) for $c_{ub}$ in formula (2) yields the single-head GAT model; the embodiment adopts a multi-head attention mechanism and concatenates the results to obtain the final feature vector of each node.
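The attention coefficients can likewise be sketched in numpy for a single head (LeakyReLU is assumed for $\gamma$, and the random shapes are illustrative):

```python
import numpy as np

def gat_attention(h_b, neighbor_feats, W, z, slope=0.2):
    """Attention coefficients alpha_ub: score each neighbor u by
    z^T [W h_u || W h_b], pass through LeakyReLU (gamma), then
    softmax-normalize over the neighborhood."""
    scores = np.array([
        z @ np.concatenate([W @ h_u, W @ h_b]) for h_u in neighbor_feats
    ])
    scores = np.where(scores > 0, scores, slope * scores)  # LeakyReLU
    e = np.exp(scores - scores.max())                      # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
dim = 4
W = rng.normal(size=(dim, dim))    # learned transform (random here)
z = rng.normal(size=2 * dim)       # learned scoring vector (random here)
h_b = rng.normal(size=dim)
alphas = gat_attention(h_b, [rng.normal(size=dim) for _ in range(3)], W, z)
```

A multi-head variant would run this with several independent `(W, z)` pairs and concatenate the resulting aggregations.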
As shown in Part 2 of Fig. 3, considering that the same node may be located at different positions in different graph structures, the embodiment also embeds a representation of position information in the iterative process, trained jointly with the GAT propagation. The position feature $o_b^{(k)}$ of node $b$ at the $k$-th iteration is incorporated by replacing $h_u^{(k-1)}$ in formula (2) with a matrix computation after position-embedding concatenation:

$$h_u^{(k-1)} \leftarrow W_o\left[\, h_u^{(k-1)} \,\|\, o_u^{(k-1)} \,\right] \qquad (4)$$

where $\|$ denotes the concatenation operation and $o_u^{(k-1)}$ is the position embedding of node $u$; after feature concatenation, the trained matrix $W_o$ needs additional parameters to match the enlarged input dimension.
The feature vector $\mathrm{Embed}$ finally output for the node feature graph is the weighted mean of the position-augmented vectors of all nodes in the graph, computed as:

$$\mathrm{Embed} = \frac{\sum_{v \in V} w_v\, h_v^{(K)}}{\sum_{v \in V} w_v} \qquad (5)$$

where $V$ is the node set of the graph, $K$ is the final iteration, and $w_v$ is the weight of node $v$.
As shown in Part 3 of Fig. 3, the feature vectors of the query and the node are concatenated and input into the MLP model, and binary classification prediction is performed to determine whether the product node is a direct parent of the query product concept word. The embodiment of the invention adopts an InfoNCE-style loss function for optimization, defined as follows:

$$L = -\sum_{X_i \in X} \log \frac{\exp\big(s(n_P, n_C)\big)}{\sum_{j=1}^{N+1} \exp\big(s(n_P^{\,j}, n_C)\big)} \qquad (6)$$

In the above formula, $X_i$ denotes the set of samples generated from one superior-inferior edge $\langle n_P, n_C \rangle$ of the hierarchy ($n_P$ is the direct parent node of $n_C$), consisting of 1 positive example and $N$ negative examples; $X$ is the sample set generated from all edges; $s(\cdot,\cdot)$ is the matching score output by the MLP; and $j$ traverses, from 1 to $N+1$, all samples generated from $X_i$.
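A minimal numpy version of this loss, under the assumptions that the positive sample sits at index 0 of each edge's score vector and that the sum over edges is averaged (both conventions are choices of this sketch):

```python
import numpy as np

def info_nce(scores_per_edge):
    """InfoNCE-style loss: for each edge, index 0 of the score vector is
    the positive sample and indices 1..N are negatives; the loss is the
    negative log-softmax of the positive, averaged over edges."""
    total = 0.0
    for s in scores_per_edge:
        s = np.asarray(s, dtype=float)
        # numerically stable log of the softmax denominator
        log_z = s.max() + np.log(np.exp(s - s.max()).sum())
        total -= s[0] - log_z
    return total / len(scores_per_edge)

# two toy edges, each with 1 positive and 2 negative scores
loss = info_nce([[2.0, 0.1, -1.0], [3.0, 0.0, 0.5]])
```

Raising a positive score relative to its negatives drives the loss toward zero, which is the behavior the training objective relies on.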
Fig. 4 is a schematic diagram of the iterative update of the product knowledge system in step S5. The update may be implemented with a human-computer interaction verification tool: the product concept extraction results, product alias fusion results, and product system growth results of steps S2, S3, and S4 are manually reviewed and verified, and the update is applied only if verification passes. Meanwhile, the sample data sets used for training the models in steps S2, S3, and S4 are continuously accumulated and updated according to the verification results, iteratively improving model performance and continuously building a high-quality whole-field product node system.
For the following example, the method for judging the check and iterative update of the node system connection based on the upper-lower relationship comprises the following steps:
In Part 1, for the newly added product nodes generated in step S42, a corresponding knowledge verification interface tool can be designed and developed to facilitate manually checking whether the superior-inferior relation of each newly added product node in the system is correct; if it is correct, the self-growth update of the product node is confirmed, and if adjustment is needed, the hierarchical position of the product node is adjusted manually and the update of the new product node system is then confirmed;
In Part 2, the database of product superior-inferior relation node pairs is synchronously updated according to the manual verification results, expanding the training sample set;
and in Part 3, according to the new model training samples and their accumulated magnitude, the superior-inferior relation judgment model is retrained after a period of time and the model version is updated, iterating to continuously improve model performance.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A dynamic fusion and growth method of a product node system in the whole field is characterized by comprising the following steps:
S1, taking a general product classification system required by the construction of a whole-field product system as an upper-layer framework of the product node system, and further performing fine tuning on a data set of the general product classification system by using a pre-training language model to obtain a domain language model for obtaining the word-embedding representation of each node in the product node system;
S2, extracting product concepts from unstructured text data containing product concepts by using a pre-trained product concept extraction model, simultaneously extracting product concepts from semi-structured text data containing product concepts based on rules, continuously and dynamically updating the unstructured text data and the semi-structured text data so as to continuously extract product-concept vocabularies and phrases from them, and combining the vocabularies and phrases to form a candidate product concept set;
S3, training a synonymy concept judgment model by utilizing a product concept alias library, judging the synonymy relation between candidate product concepts in the candidate product concept set and nodes in the existing product node system, fusing the product concepts and nodes conforming to the synonymy relation as concept-node pairs to obtain a node system with expanded aliases, and meanwhile taking the product concepts that do not conform to the synonymy relation with any node as new product concepts;
S4, based on the domain language model obtained in S1, constructing a node-node pair training set conforming to the superior-inferior relation according to the existing product node system, and training a superior-inferior relation classification judgment model that can judge the direct parent node of a node concept; predicting the parent node of each new product concept obtained in S3 by using the trained superior-inferior relation classification judgment model, and attaching and expanding the new product concepts into the product node system according to the prediction results;
and S5, respectively sending the candidate product concept set obtained in S2 and the node systems expanded in S3 and S4 to a manual auditing end for verification, finally updating the product node system according to a verification result, and simultaneously updating the training samples of the models used in S2-S4 to improve the performance of the models, thereby realizing the continuous and dynamic construction of the whole-field product node system.
2. The dynamic fusion and growth method of a domain-wide product node system according to claim 1, wherein the S1 specifically comprises the following steps:
S11, according to the construction requirements of the whole-field product system, forming the upper-layer framework of the product node system by taking the universal product classification system (HS) code as the seed node system, so as to obtain a superior-inferior relation data set in the product node system;
S12, carrying out fine-tuning training on the description text of the seed node system by utilizing the Bert pre-training language model to learn semantic features of the domain text expression, thereby obtaining a domain language model; the feature vector of each node concept in the product node system can be obtained by utilizing the domain language model.
3. The dynamic fusion and growth method of a domain-wide product node system according to claim 1, wherein the S2 specifically comprises the following steps:
S21, for the continuously collected semi-structured text data containing product concepts, carrying out rule-based structured analysis and extraction of the product concepts in the text to generate a first candidate product concept set;
S22, for continuously acquired unstructured text data containing product concepts, obtaining a training sample set containing product concept sequences through manual labeling; further training a product concept extraction model on the basis of this training sample set by using an NLP sequence labeling model; and extracting product concept sequences from continuously acquired new unstructured text data with the product concept extraction model to generate a second candidate product concept set;
and S23, merging the first candidate product concept set and the second candidate product concept set into a candidate product concept set which is used as a basis for expanding the existing product node system.
4. The domain-wide product node hierarchy dynamic fusion and growth method of claim 3, wherein the semi-structured text data containing product concepts are enterprise annual reports.
5. The method for dynamic fusion and growth of a domain-wide product node hierarchy as claimed in claim 3, wherein the unstructured text data containing product concepts are patent text data and/or thesis text data.
6. The dynamic fusion and growth method of a domain-wide product node system according to claim 1, wherein the S3 specifically comprises the following steps:
S31, constructing a synonymy concept sample set conforming to the product synonymy relation according to the alias information of product concepts; training a synonymy concept discrimination model on the synonymy concept sample set by utilizing a sequence classification task in the Bert pre-training language model application scenario; then, for the candidate product concept set, using the synonymy concept discrimination model to predict the synonymy concept relation between each candidate product concept in the candidate product concept set and each node in the existing product node system; if one candidate product concept in the candidate product concept set conforms to the synonymy concept relation with one node in the existing product node system, taking the candidate product concept and the node as a concept-node pair conforming to the synonymy relation; if one candidate product concept in the candidate product concept set does not conform to the synonymy concept relation with any node in the existing product node system, taking the candidate product concept as a new product concept and adding it to the new product concept candidate set;
and S32, aiming at the concept-node pairs which are obtained in the S31 and accord with the synonymy relationship, fusing the candidate product concept nouns into corresponding node attributes in the existing product node system, and storing the node concept nouns into alias attribute fields of the node instances in the product library to realize node alias attribute fusion.
7. The dynamic fusion and growth method of a domain-wide product node system according to claim 1, wherein the S4 specifically comprises the following steps:
S41, constructing a query-node concept-pair training set by utilizing the existing superior-inferior node relations in the product node system, wherein in each query-node concept pair the query represents a product concept to be attached and the node represents a product node in the product node system; the product node information formed by all nodes is represented by a node graph structure, and the label of the training set is set to 1 or 0, wherein 1 represents that the node is a direct parent node of the query, and 0 the opposite;
S42, initializing the feature vectors of the query product concept and of each product node in the node graph structure by using the domain language model obtained in S1, carrying out propagation fusion and iterative update of each node's features in the node graph structure by adopting a GNN graph neural network model to obtain word-embedding representations of each query and node, inputting these into a binary classification model, and training the binary classification model so that it can identify whether the node in a query-node concept pair is a direct parent node of the query, thereby obtaining the superior-inferior relation classification judgment model;
and S43, for the new product concepts obtained in S3, judging the superior-inferior relation of each new product concept and each existing node in the product node system one by utilizing the superior-inferior relation classification judgment model, and calculating the existing node with the highest matching degree as a direct parent node so as to carry out hanging expansion of the product node system.
8. The method for dynamically fusing and growing a whole-domain product node system according to claim 1, wherein in the step S5, a human-computer interaction verification tool is used for performing manual review and verification on the product concept extraction result, the product alias fusion result and the product system growth result in the steps S2, S3 and S4, and meanwhile, the sample data set used for training each model in the steps S2, S3 and S4 is continuously deposited and updated according to the verification result, so that the performance of the model is iteratively improved, and a high-quality whole-domain product node system is continuously constructed.
CN202111166990.2A 2021-10-01 2021-10-01 Dynamic fusion and growth method for product node system in all fields Active CN113987197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111166990.2A CN113987197B (en) 2021-10-01 2021-10-01 Dynamic fusion and growth method for product node system in all fields


Publications (2)

Publication Number Publication Date
CN113987197A true CN113987197A (en) 2022-01-28
CN113987197B CN113987197B (en) 2024-04-23

Family

ID=79737609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111166990.2A Active CN113987197B (en) 2021-10-01 2021-10-01 Dynamic fusion and growth method for product node system in all fields

Country Status (1)

Country Link
CN (1) CN113987197B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880551A (en) * 2022-04-12 2022-08-09 北京三快在线科技有限公司 Method and device for acquiring upper-lower relation, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491469A (en) * 2018-03-07 2018-09-04 浙江大学 Introduce the neural collaborative filtering conceptual description word proposed algorithm of concepts tab
KR20200064880A (en) * 2018-11-29 2020-06-08 부산대학교 산학협력단 System and Method for Word Embedding using Knowledge Powered Deep Learning based on Korean WordNet
CN113191152A (en) * 2021-06-30 2021-07-30 杭州费尔斯通科技有限公司 Entity identification method and system based on entity extension



Also Published As

Publication number Publication date
CN113987197B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
CN111291185B (en) Information extraction method, device, electronic equipment and storage medium
CN110889556B (en) Enterprise operation risk characteristic data information extraction method and extraction system
CN112487143B (en) Public opinion big data analysis-based multi-label text classification method
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
CN111597347B (en) Knowledge embedding defect report reconstruction method and device
CN112307153B (en) Automatic construction method and device of industrial knowledge base and storage medium
CN111460833A (en) Text generation method, device and equipment
CN110580287A (en) Emotion classification method based ON transfer learning and ON-LSTM
CN111008530A (en) Complex semantic recognition method based on document word segmentation
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
CN112115264B (en) Text classification model adjustment method for data distribution change
CN113868432A (en) Automatic knowledge graph construction method and system for iron and steel manufacturing enterprises
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN115270797A (en) Text entity extraction method and system based on self-training semi-supervised learning
CN113779264A (en) Trade recommendation method based on patent supply and demand knowledge graph
CN116521882A (en) Domain length text classification method and system based on knowledge graph
CN114781651A (en) Small sample learning robustness improving method based on contrast learning
CN115329101A (en) Electric power Internet of things standard knowledge graph construction method and device
CN113987197B (en) Dynamic fusion and growth method for product node system in all fields
CN114880307A (en) Structured modeling method for knowledge in open education field
CN113722439A (en) Cross-domain emotion classification method and system based on antagonism type alignment network
CN116502648A (en) Machine reading understanding semantic reasoning method based on multi-hop reasoning
CN116187328A (en) Method and device for generating counterfactual sample for entity and relation extraction task
CN114610871A (en) Information system modeling analysis method based on artificial intelligence algorithm
CN111813837B (en) Method for intelligently detecting data quality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant