CN117474013B - Knowledge enhancement method and system for large language model - Google Patents

Knowledge enhancement method and system for large language model

Info

Publication number
CN117474013B
CN117474013B
Authority
CN
China
Prior art keywords
item set
frequent
transaction
taking
subtree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311818163.6A
Other languages
Chinese (zh)
Other versions
CN117474013A (en)
Inventor
王亚
赵策
屠静
苏岳
万晶晶
李伟伟
颉彬
周勤民
张玥
雷媛媛
孙岩
潘亮亮
刘岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuoshi Future Beijing technology Co ltd
Original Assignee
Zhuoshi Future Beijing technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuoshi Future Beijing technology Co ltd filed Critical Zhuoshi Future Beijing technology Co ltd
Priority to CN202311818163.6A priority Critical patent/CN117474013B/en
Publication of CN117474013A publication Critical patent/CN117474013A/en
Application granted granted Critical
Publication of CN117474013B publication Critical patent/CN117474013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of data processing, and provides a knowledge enhancement method and system for large language models. The method comprises the following steps: acquiring an original transaction library; determining a transaction submatrix according to the distribution of transaction items in the different transactions where each frequent 1-item set is located; determining the item set semantic similarity according to the similarity of element distributions in different transaction submatrices; determining rule potential coefficients according to the degree of similarity between nodes on each link path in each FP subtree; determining a text information divergence index according to the degree of similarity of the semantic information carried by the item sets in the nodes on each link path in each FP subtree; obtaining redundancy probabilities from the rule potential coefficients and text information divergence indexes; obtaining emotion association rules based on the redundancy probabilities with a data mining algorithm; and realizing knowledge enhancement of large language models for emotion analysis based on the emotion association rules. By mining the emotion association rules in comment texts, the invention improves the emotion analysis reasoning capability of the model's knowledge.

Description

Knowledge enhancement method and system for large language model
Technical Field
The invention relates to the technical field of data processing, in particular to a large language model knowledge enhancement method and system.
Background
A large language model (Large Language Model, LLM) is a general-purpose language model with large-scale parameters, obtained by training on a large-scale corpus with deep learning techniques. It has strong context understanding and language generation capabilities and is applied in many fields, including information retrieval and question answering, automatic summarization and article generation, and emotion analysis and public opinion monitoring. However, large language models lack deep understanding of domain-specific knowledge and the ability to understand and generate personalized emotion expressions. A data mining algorithm based on association rules can help a large language model mine association rules and semantic relations among words from large-scale text data, thereby improving its understanding of domain-specific knowledge and the personalization of its emotion expression, and providing important support for performance improvement and application scenario expansion of large language models.
The FP-growth (Frequent Pattern Growth) algorithm is a data mining algorithm for finding association rules between data: it constructs an FP tree to find the frequently occurring item sets in a data set and generates association rules based on those frequent item sets. Such a data mining algorithm can therefore mine hidden information and knowledge from large-scale text data and apply them to the training of large language models and the construction of knowledge bases, so as to enhance the expression capability and application range of large language models. However, when the conventional FP-growth algorithm processes large-scale data, redundant frequent item sets exist in the generated FP tree, which not only occupies a large amount of memory space but also mines a large number of low-confidence association rules, thereby reducing the operating efficiency of the algorithm.
Disclosure of Invention
The invention provides a knowledge enhancement method and system for large language models, aiming to solve the problem of association rule redundancy in knowledge enhancement of large language models for emotion analysis. The adopted technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a large language model knowledge enhancement method, the method including the steps of:
acquiring an original transaction library based on word segmentation processing results of existing comment text data;
determining a transaction submatrix of each frequent 1-item set according to the distribution of transaction items in different transactions where each frequent 1-item set is located; determining the item set semantic similarity of each FP subtree according to the similarity of element distribution in the transaction submatrices of different frequent 1-item sets;
determining rule potential coefficients of the FP subtrees according to the similarity degree between nodes on each link path in the FP subtrees corresponding to each frequent 1-item set;
determining a text information divergence index of each FP subtree according to the similarity degree of semantic information carried by item sets in nodes on each link path in the corresponding FP subtree of each frequent 1-item set;
obtaining redundancy probability of each node on the FP subtree according to rule potential coefficients and text information divergence indexes of the corresponding FP subtree in each frequent 1-item set; adopting a data mining algorithm to obtain emotion association rules based on redundancy probabilities of all nodes on corresponding FP subtrees in each frequent 1-item set; knowledge enhancement of large language models for emotion analysis is achieved based on emotion association rules.
Preferably, the method for obtaining the original transaction library based on the word segmentation processing result of the existing comment text data comprises the following steps:
obtaining a preset number of pieces of comment text data from scoring software by utilizing a crawler technology, and numbering all pieces of comment text data from small to large according to a time sequence;
the LTP tool package is adopted to respectively process word segmentation, part of speech tagging and stop word removal of each piece of comment text data, and the coding technology is adopted to code the processing results of all pieces of comment text data;
taking the number of each comment text data as the number of a transaction, and taking the coding result of each comment text data as a transaction item set of the transaction corresponding to each comment text data; and taking the database formed by all the transactions as an original transaction library.
Preferably, the method for determining the transaction submatrix of each frequent 1-item set according to the distribution of the transaction items in different transactions where each frequent 1-item set is located is as follows:
obtaining all frequent 1-item sets in the original transaction library based on the original transaction library by adopting a data mining algorithm;
taking a vector formed by assignment results of all frequent 1-item sets in the transaction item set of each transaction as a row vector, and taking a matrix formed by row vectors corresponding to all the transactions as a transaction library matrix;
And taking any one frequent 1-item set as a target item set, setting element values in all rows which do not contain the target item set in the transaction library matrix as 0, keeping the element values in all rows which contain the target item set in the transaction library matrix unchanged, and taking the result after the element values in all rows in the transaction library matrix are reset as a transaction submatrix of the target item set.
Preferably, the method for determining the item set semantic similarity of each FP sub-tree according to the similarity of element distribution in the transaction submatrices of different frequent 1-item sets comprises the following steps:
taking a set formed by all row vector corresponding transactions with the existence value of 1 of the transaction submatrix of each frequent 1-item set as a matching transaction library of each frequent 1-item set, and taking an FP tree constructed based on the matching transaction library of each transaction by using an FP-growth algorithm as an FP subtree of each frequent 1-item set;
taking a vector formed by coding results corresponding to transaction items in any transaction in a matching transaction library of each frequent 1-item set as a semantic coding vector; using the set of semantic coding vectors corresponding to all transactions in the matching transaction library of each frequent 1-item set as the semantic coding set of each frequent 1-item set;
determining semantic similarity based on a metric distance between the semantic coding set of each frequent 1-item set and the semantic coding set of any one of the remaining frequent 1-item sets;
Taking a data pair consisting of the number of rows and the number of columns of each element with the value of 1 in the transaction submatrix of each frequent 1-item set as an ordinal number pair, and taking a set consisting of ordinal number pairs corresponding to all elements with the value of 1 in the transaction submatrix of each frequent 1-item set as a data position set of each frequent 1-item set;
determining a semantic distribution proximity coefficient based on a measured distance between the data location set of each frequent 1-item set and element distribution in the data location set of any one of the rest frequent 1-item sets;
the item set semantic similarity of each FP subtree consists of two parts of semantic similarity and semantic distribution proximity coefficient, wherein the item set semantic similarity forms positive correlation with the semantic similarity and the semantic distribution proximity coefficient respectively.
Preferably, the method for determining the rule potential coefficient of the FP subtree according to the similarity degree between nodes on each link path in the FP subtree corresponding to each frequent 1-item set comprises the following steps:
marking each frequent 1-item set as a target item set, marking a node where the target item set is located on an FP subtree corresponding to each frequent 1-item set as a target node, taking each link passing through the target item set on the FP subtree of each frequent 1-item set as a main link, and taking the average value of the support degree of the target item set on each main link and the support degree of the item set in any node on the link as a first average value; taking the mean value of the part-of-speech weight corresponding to the target item set on each main link and the part-of-speech weight corresponding to the item set in any node on the link as a second mean value;
Taking the product of the semantic distribution approach coefficient between the target item set on each main link and the item set in any node on the link and the first mean value and the second mean value as the data association coefficient of any node on each main link;
taking the average value of the data association coefficients of all nodes on each main link as a first calculation factor; taking the average value of the accumulation results of the first calculation factors on all main links on each FP subtree as a rule association coefficient of the target node;
the rule potential coefficients of each FP subtree consist of two parts of item set semantic similarity and rule association coefficients, wherein the rule potential coefficients are respectively in direct proportion to the item set semantic similarity and the rule association coefficients.
Preferably, the method for determining the text information divergence index of the FP subtree according to the similarity degree of semantic information carried by the item sets in the nodes on each link path in the FP subtree corresponding to each frequent 1-item set comprises the following steps:
taking an undirected graph determined by taking each node on each main link as one node in the undirected graph as a node distribution graph, and obtaining the node distribution vector of each node in each node distribution graph by adopting the deep walk algorithm;
Taking the sum of the number of descendant nodes and ancestor nodes of the node where the target item set is located on each main link as the numerator;
taking the average value of the accumulated results of the mapping results of the measurement results between the node where the target item set is located and the distribution vectors of the nodes corresponding to the rest nodes on each main link as a denominator; taking the ratio of the numerator to the denominator as the node divergence index of each main link;
and taking the average value of node divergence indexes of all main links on each frequent 1-item set corresponding FP subtree as the text information divergence index of the FP subtree.
Preferably, the method for obtaining redundancy probability of each node on the FP subtree according to the rule potential coefficient and the text information divergence index of the corresponding FP subtree in each frequent 1-item set comprises the following steps:
taking the product of the rule potential coefficient and the text information divergence index of each FP subtree as the rule mining capability index of each FP subtree;
taking the natural constant as the base and the difference between the rule mining capability index after deleting each node on each FP subtree and the rule mining capability index of the FP subtree as the exponent, and taking the normalization result of this calculation as the redundancy probability of each node on each FP subtree.
Preferably, the method for obtaining the emotion association rule by adopting the data mining algorithm based on the redundancy probability of all nodes on the corresponding FP subtree in each frequent 1-item set comprises the following steps:
taking redundancy probability of all nodes on corresponding FP subtrees in each frequent 1-item set as input, and obtaining a segmentation threshold of the redundancy probability by adopting a threshold segmentation algorithm;
deleting all nodes with redundancy probability larger than the segmentation threshold value on the corresponding FP subtrees in each frequent 1-item set, updating the count value of the item set in the deleted node in the original transaction library, and obtaining the emotion association rule based on the updating result of the original transaction library by adopting an FP-growth algorithm.
Preferably, the method for implementing knowledge enhancement of the large language model for emotion analysis based on emotion association rules comprises the following steps:
and acquiring extended background knowledge related to comment text data by using emotion association rules, and taking the extended background knowledge as priori input to realize knowledge enhancement of a large language model for emotion analysis.
In a second aspect, an embodiment of the present invention further provides a large language model knowledge enhancement system, including a memory, a processor, and a computer program stored in the memory and running on the processor, where the processor implements the steps of any one of the large language model knowledge enhancement methods described above when the processor executes the computer program.
The beneficial effects of the invention are as follows: the transaction submatrix is obtained from the emotion semantic information expressed by each word and the sentence structure in different comment texts, and the item set semantic similarity of each FP subtree is determined based on the similarity of element distributions in different transaction submatrices; secondly, the rule potential coefficient of each FP subtree is determined from its item set semantic similarity and the degree of similarity between nodes on each link path in the subtree; next, the text information divergence index is determined according to the degree of similarity of the semantic information carried by the item sets in the nodes on each link path in the FP subtree corresponding to each frequent 1-item set, and this index accurately reflects the capability of the word corresponding to each frequent 1-item set on each subtree to accept context information in the comment texts; then, the redundancy probability is determined from the rule potential coefficient and text information divergence index of each subtree, and node pruning is performed according to the redundancy probability to improve the effectiveness of the obtained emotion analysis association rules; finally, word replacement, word combination and topic expansion are performed on the comment texts through their emotion analysis association rules to obtain extended background knowledge, and this extended background knowledge is taken as prior input of the large language model, realizing knowledge enhancement for emotion analysis and enabling the large language model to learn more emotion knowledge from the comment texts.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flow chart of a knowledge enhancement method for a large language model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a node distribution diagram according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flowchart of a large language model knowledge enhancement method according to an embodiment of the invention is shown, the method includes the following steps:
And S001, acquiring an original transaction library based on the word segmentation processing result of the existing comment text data.
The large language model is applied to a plurality of fields such as information retrieval and question answering, automatic abstracting, emotion analysis, public opinion monitoring and the like.
N food-review comment texts are obtained from rating software by using crawler technology, and the N comment texts are numbered from small to large in chronological order, i.e., the first acquired comment text is numbered 1 and the n-th acquired comment text is numbered n; the empirical value of N is 10000. Crawler technology is a known technique and its details are not repeated. For each acquired comment text, the LTP toolkit is used for word segmentation, part-of-speech tagging and stop-word removal. Specifically, the LTP toolkit is first used to segment the N comment texts to obtain the text word set of each comment text; the part of speech of each word in each text word set is then tagged; finally, each text word set is filtered with the HIT stop-word list, deleting stop words such as particles and adverbs that have no influence on the word semantics.
Further, each comment text is taken as a transaction, the number of each comment text is taken as the number of the corresponding transaction, and the processing result of each comment text obtained through the above steps is encoded with the UTF-8 (Unicode Transformation Format) coding technique; the database formed by all N transactions is taken as the original transaction library. UTF-8 coding is a known technique and its details are not repeated.
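The construction of the original transaction library can be sketched as follows in Python; the segment() tokenizer, the stop-word list and the integer item ids are illustrative stand-ins (assumptions) for the LTP pipeline, the HIT stop-word list and the UTF-8 coding results described above:

```python
from typing import Dict, List, Tuple

STOP_WORDS = {"the", "a", "is"}  # stand-in for the HIT stop-word list

def segment(text: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for LTP word segmentation + POS tagging."""
    return [(w, "n") for w in text.split()]

def build_transaction_library(comments: List[str]) -> Dict[int, List[int]]:
    """Number comments chronologically; encode each kept word as an item id."""
    vocab: Dict[str, int] = {}
    library: Dict[int, List[int]] = {}
    for tid, text in enumerate(comments, start=1):  # transaction numbers 1..N
        items: List[int] = []
        for word, _pos in segment(text):
            if word in STOP_WORDS:
                continue  # drop words with no semantic contribution
            items.append(vocab.setdefault(word, len(vocab)))
        library[tid] = items
    return library

comments = ["the noodles are delicious", "the soup is salty", "noodles and soup"]
print(build_transaction_library(comments))
```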
Thus, an original transaction library is obtained for subsequent mining of association rules from transaction items of each transaction.
Step S002, determining a transaction sub-matrix according to the distribution of transaction items in different transactions, and determining the item set semantic similarity of each FP sub-tree according to the similarity of element distribution in different transaction sub-matrices.
In order to achieve knowledge enhancement of the emotion analysis large language model, strong association rules related to emotion need to be mined from the comment texts, and the training set for training the emotion analysis large language model is acquired according to the obtained strong association rules. Specifically, the original transaction library is taken as input, the item set support threshold in the algorithm is set to 0.2, and all frequent 1-item sets in the original transaction library are acquired with the FP-growth algorithm; FP-growth is a known technique and its details are not repeated.
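As one possible off-the-shelf illustration of this step (not the patent's own implementation), mlxtend's fpgrowth can mine the frequent 1-item sets from a one-hot transaction matrix with the same 0.2 support threshold; the transactions below are toy data:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["noodles", "delicious"], ["soup", "salty"],
                ["noodles", "soup"], ["noodles", "delicious", "cheap"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
freq = fpgrowth(onehot, min_support=0.2, use_colnames=True)
freq1 = freq[freq["itemsets"].map(len) == 1]  # keep only the 1-item sets
print(freq1)
```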
All frequent 1-item sets in each transaction in the original transaction library are assigned the value 1 and all non-frequent 1-item sets the value 0; the vector formed by the assignment results of all item sets in each transaction is taken as a row vector, and the matrix formed by the row vectors of all transactions in the original transaction library is taken as the transaction library matrix. Further, for any one frequent 1-item set, taking the a-th frequent 1-item set $I_a$ as an example, the element values in every row of the transaction library matrix that does not contain $I_a$ are set to 0, the element values in every row that contains $I_a$ are kept unchanged, and the matrix obtained after resetting the element values of all rows is taken as the transaction submatrix of $I_a$. Next, the data pair formed by the row number and column number of each element with value 1 in the transaction submatrix of $I_a$ is taken as an ordinal pair, and the set of the ordinal pairs of all value-1 elements in the submatrix is taken as the data position set $P_a$ of $I_a$.
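A minimal numpy sketch of the transaction library matrix, the transaction submatrix of one frequent 1-item set, and its data position set of ordinal pairs, on toy data:

```python
import numpy as np

transactions = [[0, 1], [2, 3], [0, 2], [0, 1, 4]]  # encoded item ids per transaction
freq1 = [0, 1, 2]                                   # frequent 1-item sets (ids)

# Row r, column j: 1 iff frequent 1-item set freq1[j] occurs in transaction r.
M = np.array([[1 if item in t else 0 for item in freq1] for t in transactions])

def transaction_submatrix(M: np.ndarray, j: int) -> np.ndarray:
    """Zero every row that does not contain the j-th frequent 1-item set."""
    S = M.copy()
    S[M[:, j] == 0] = 0
    return S

def position_set(S: np.ndarray) -> set:
    """Ordinal pairs (row, column) of every value-1 element in the submatrix."""
    return set(zip(*np.nonzero(S)))

S0 = transaction_submatrix(M, 0)
print(S0, position_set(S0))
```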
Next, the set of transactions corresponding to all row vectors of the transaction submatrix of $I_a$ in which a value of 1 still exists is taken as the matching transaction library of $I_a$; the vector formed by the coding results of the transaction items of each transaction in this matching transaction library is taken as a semantic coding vector, and the set of the semantic coding vectors of all its transactions is taken as the semantic coding set $U_a$ of $I_a$. The more similar the emotion polarity and semantic information expressed by the comment texts of two transactions, the more similar the positions and numbers of frequent 1-item sets in their corresponding row vectors.
Specifically, the matching transaction library of $I_a$ is taken as input, the item set support threshold in the algorithm is set to 0.2, and the FP tree obtained with the FP-growth algorithm is taken as the FP subtree $T_a$ of $I_a$; FP-growth is a known technique and its details are not repeated. The more consistent the semantic information of the words corresponding to two frequent 1-item sets and the context information they associate with in the comment texts, the more similar the emotion information those words express in the comment texts, and the more similar the tree structures of the FP subtrees corresponding to the two frequent 1-item sets.
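The matching transaction library and semantic coding set can be sketched as follows; raw item ids again stand in for the UTF-8 coding results, and the FP subtree itself would then be grown from the matching library with any FP-growth implementation:

```python
import numpy as np

transactions = [[0, 1], [2, 3], [0, 2], [0, 1, 4]]  # encoded transaction item sets
freq1 = [0, 1, 2]                                   # frequent 1-item set ids
M = np.array([[1 if item in t else 0 for item in freq1] for t in transactions])

def matching_library(j: int):
    """Transactions whose row in the submatrix of item set j still holds a 1."""
    rows = np.nonzero(M[:, j])[0]
    return [transactions[r] for r in rows]

# One semantic coding vector per matching transaction.
U_0 = [np.array(t) for t in matching_library(0)]
print(U_0)
```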
Based on the above analysis, the item set semantic similarity is constructed here to represent the degree to which the words corresponding to different frequent 1-item sets carry similar emotion information and semantic information. The item set semantic similarity $Y_a$ of FP subtree $T_a$ is computed as:

$$Y_a = \frac{1}{K-1}\sum_{b=1, b \neq a}^{K} sim_{ab} \cdot g_{ab}, \qquad sim_{ab} = \frac{1}{N_a N_b}\sum_{m=1}^{N_a}\sum_{n=1}^{N_b}\cos\left(u_m^{a}, u_n^{b}\right), \qquad g_{ab} = \exp\left(J\left(P_a, P_b\right) - 1\right)$$

where $sim_{ab}$ is the semantic similarity between the a-th and b-th frequent 1-item sets; $N_a$ and $N_b$ are the numbers of semantic coding vectors in their semantic coding sets; $u_m^{a}$ is the m-th semantic coding vector of the a-th frequent 1-item set and $u_n^{b}$ the n-th of the b-th; $\cos(u_m^{a}, u_n^{b})$ is the cosine similarity between the two vectors, a known technique whose details are not repeated; $g_{ab}$ is the semantic distribution proximity coefficient between the a-th and b-th frequent 1-item sets; $\exp$ is the exponential function with the natural constant as base; $P_a$ and $P_b$ are the data position sets of the a-th and b-th frequent 1-item sets; $J(P_a, P_b)$ is the Jaccard coefficient between the distributions of their elements, likewise a known technique; and $K$ is the number of frequent 1-item sets in the original transaction library.
The more similar the semantic information of the comment texts of the transactions in the matching transaction libraries of two frequent 1-item sets, and the more similar the sentence structures of those comment texts, the higher the similarity between the corresponding semantic coding vectors and the larger the value of $sim_{ab}$; the more similar the distributions of ordinal pairs in the data position sets of the a-th and b-th frequent 1-item sets, the larger $J(P_a, P_b)$ and hence $g_{ab}$, whereas the larger the difference between the parts of speech and emotion polarities of the words corresponding to the two frequent 1-item sets in the comment texts, the larger the difference of the position distributions of the value-1 elements in their transaction submatrices, and the smaller $g_{ab}$. Correspondingly, the larger $sim_{ab}$ and $g_{ab}$, the larger the value of $Y_a$.
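A sketch of one term of this similarity on toy data; the closed forms of sim_ab and g_ab follow the reconstruction above and should be read as assumptions, and the toy coding vectors are given equal lengths so that cosine similarity is defined:

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_similarity(U_a, U_b):
    """Mean cosine similarity over all vector pairs of two coding sets."""
    return np.mean([cos(u, v) for u in U_a for v in U_b])

def proximity(P_a, P_b):
    """exp(J - 1): grows with the Jaccard overlap of the two position sets."""
    jac = len(P_a & P_b) / len(P_a | P_b)
    return np.exp(jac - 1.0)

U_a = [np.array([1.0, 2.0]), np.array([2.0, 1.0])]
U_b = [np.array([1.0, 1.5])]
P_a, P_b = {(0, 0), (0, 1)}, {(0, 0), (2, 1)}
term_ab = semantic_similarity(U_a, U_b) * proximity(P_a, P_b)
print(term_ab)  # one term of Y_a; sum over b != a and divide by K - 1
```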
So far, the item set semantic similarity of each FP subtree is obtained and is used for subsequently determining the rule potential coefficient of each FP subtree.
And step S003, determining rule potential coefficients of the FP subtrees according to the similarity degree between nodes on each link path in the FP subtrees corresponding to each frequent 1-item set.
In comment text data, the parts of speech of nouns and adjectives are most important: nouns reflect the entities mentioned by customers, such as food, dishes and environment, while adjectives reflect the customers' evaluation of aspects such as food, environment and service; by mining the association rules between nouns and adjectives in the comments, the habitual collocations and behavior patterns of customers can be discovered. For example, in the comment "the noodles in this shop are delicious", the part of speech of "noodles" is a noun and it is the comment object, while "delicious" is an adjective that directly reflects the reviewer's emotion, so such words are given a larger part-of-speech weight; words whose influence on the emotion analysis result is weaker should be given a smaller part-of-speech weight. Therefore, according to the type of the collected text data and the parts of speech of the frequent 1-item sets, the invention gives word items whose parts of speech are nouns or adjectives a larger part-of-speech weight $q_1$, with empirical value 2, and gives word items of the remaining parts of speech a smaller part-of-speech weight $q_2$, with empirical value 1.
Further, on FP subtree $T_a$, not every link passes through frequent 1-item set $I_a$ when rule mining is performed, and the semantic and emotion information contained in the word corresponding to a frequent 1-item set can change across the different comment texts in which it occurs. When knowledge enhancement is performed on the emotion analysis large language model, the words that are closest to the background knowledge of each word in the word segmentation result of each comment text should be found for enhancement. For example, consider three comment texts about noodle dishes whose emotion is all positive: the sentence structures of the first two are very similar, but because their comment objects differ, the background knowledge of the comment objects can be distinguished, and the association rules corresponding to the third comment text are used to perform knowledge enhancement on the second.
Specifically, each link on FP subtree $T_a$ that passes through frequent 1-item set $I_a$ is taken as a main link, and the degree of influence of $I_a$ on the association rules it participates in is evaluated according to the differences in part of speech and support between the item set in each node and those in the remaining nodes on each main link.
Based on the above analysis, a rule potential coefficient is constructed here to represent the degree of influence of each frequent 1-item set on rule mining in its corresponding FP subtree. The rule potential coefficient of FP subtree $T_a$ is computed as:

$$w_{cf} = g_{af} \cdot \bar{s}_{af} \cdot \bar{q}_{af}, \qquad R_a = \frac{1}{C_a}\sum_{c=1}^{C_a}\left(\frac{1}{F_c}\sum_{f=1}^{F_c} w_{cf}\right), \qquad H_a = Y_a \cdot R_a$$

where $w_{cf}$ is the data association coefficient between the node where frequent 1-item set $I_a$ is located and the f-th node on the c-th main link; $g_{af}$ is the semantic distribution proximity coefficient between $I_a$ and the frequent 1-item set within the f-th node; $\bar{s}_{af}$ is the mean of the support of $I_a$ and the support of the frequent 1-item set in the f-th node (the first mean); $\bar{q}_{af}$ is the mean of their part-of-speech weights (the second mean); $R_a$ is the rule association coefficient of the node where $I_a$ is located; $C_a$ is the number of main links on FP subtree $T_a$; $F_c$ is the number of nodes on the c-th main link, so the inner average is the first calculation factor of that link; $H_a$ is the rule potential coefficient of FP subtree $T_a$; and $Y_a$ is its item set semantic similarity.
The more comment texts contain both the word corresponding to $I_a$ and the word corresponding to the frequent 1-item set in the f-th node on the c-th main link, and the larger the proportion they occupy, the more easily the two item sets co-occur in multiple association rules; the larger the part-of-speech weight and the support of the frequent 1-item set in the f-th node, the larger the first mean $\bar{s}_{af}$ and the second mean $\bar{q}_{af}$; the closer the element distributions of the data position sets of the two item sets, the larger the semantic distribution proximity coefficient $g_{af}$; and the more main links pass through the node where $I_a$ is located, the more easily strong association rules related to $I_a$ are mined from subtree $T_a$, so the larger the first calculation factors and the larger the value of $R_a$. Correspondingly, the larger the value of $H_a$.
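The rule potential coefficient can be sketched as follows with illustrative numbers; the per-node data association coefficient follows the product form reconstructed above:

```python
import numpy as np

def data_assoc(g, s_target, s_node, q_target, q_node):
    """Proximity coefficient times the support mean and the POS-weight mean."""
    first_mean = (s_target + s_node) / 2.0   # mean of the two supports
    second_mean = (q_target + q_node) / 2.0  # mean of the two POS weights
    return g * first_mean * second_mean

# One list of (g, s_node, q_node) per main link through the target node.
links = [
    [(0.8, 0.30, 2), (0.5, 0.25, 1)],
    [(0.9, 0.40, 2)],
]
s_target, q_target = 0.35, 2  # support and POS weight of the target item set

link_means = [np.mean([data_assoc(g, s_target, s, q_target, q)
                       for g, s, q in link]) for link in links]
R_a = np.mean(link_means)  # rule association coefficient of the target node
Y_a = 0.6                  # item set semantic similarity from the previous step
H_a = Y_a * R_a            # rule potential coefficient of the FP subtree
print(H_a)
```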
So far, the rule potential coefficient of each subtree is obtained and is used for subsequently determining the redundancy probability of each node on each subtree.
Step S004, determining a text information divergence index of each FP subtree according to the similarity degree of semantic information carried by an item set in a node on each link path in the FP subtree; and obtaining redundancy probability of each node on each FP subtree according to the rule potential coefficient and the text information divergence index of each FP subtree.
In order to achieve knowledge enhancement of the emotion analysis large language model, strong association rules related to words of different emotion polarities should exist among the association rules mined from the comment texts: the more strong association rules obtained, the more background knowledge can be learned and the better the knowledge enhancement effect. In the mining process, the degree of influence of each frequent 1-item set on its association rules differs, and words that adapt to more context information have stronger semantic combination capability and more easily generate comment texts with obvious emotion polarity. For example, "golden yellow" can describe the good color of a cake as well as a well-controlled frying temperature, so it can form comment texts of positive emotion polarity with many comment objects. Correspondingly, the more context information a word adapts to, the stronger the rule association capability of its corresponding frequent 1-item set.
Further, for FP subtree $T_a$, taking the c-th main link as an example, each node on the c-th main link is taken as one node of an undirected graph, the Jaccard coefficient between the element distributions of the data position sets of the item sets in two nodes on the link is taken as the weight of the connecting edge between the corresponding two nodes, and the undirected graph determined by all nodes on the c-th main link is taken as its node distribution graph $G_c$, constructed as shown in Fig. 2. Next, $G_c$ is taken as input and the node distribution vector of each node in it is obtained with the deep walk (DeepWalk) algorithm, where the number of random walks of each node is set to 10, the dimension of the feature vector is set to 128, and the step length of the random walks is set to 20; the deep walk algorithm is a known technique and its details are not repeated.
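A hedged sketch of the node-distribution-vector step, using the usual deep-walk recipe of random walks fed to gensim's Word2Vec; the graph and parameters are illustrative, and the walks below are uniform rather than weighted by the Jaccard edge weights (a simplification):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.8), (1, 2, 0.5), (0, 2, 0.3)])  # Jaccard weights

def random_walk(G, start, length):
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [str(n) for n in walk]  # Word2Vec consumes token strings

walks = [random_walk(G, n, 20) for n in G.nodes() for _ in range(10)]  # 10 walks/node
model = Word2Vec(walks, vector_size=128, window=5, min_count=0, sg=1)
v0 = model.wv["0"]  # node distribution vector of node 0
print(v0.shape)
```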
Based on the above analysis, a text information divergence index is constructed here to characterize the similarity of the semantic information carried by the item sets in the nodes on each link path in each subtree. The text information divergence index of FP subtree $T_a$ is computed as:

$$E_c = \frac{num}{\frac{1}{F_c - 1}\sum_{f=1}^{F_c - 1}\exp\left(\cos\left(v_a, v_f\right)\right)}, \qquad D_a = \frac{1}{C_a}\sum_{c=1}^{C_a} E_c$$

where $E_c$ is the node divergence index of the c-th main link; $num$ is the sum of the numbers of descendant nodes and ancestor nodes of the node where frequent 1-item set $I_a$ is located on the c-th main link; $F_c$ is the number of nodes on the c-th main link; $v_a$ and $v_f$ are, in node distribution graph $G_c$, the node distribution vectors of the node where $I_a$ is located and of the f-th remaining node; $\cos$ is the cosine similarity between node distribution vectors; $\exp$ is the exponential function with the natural constant as base; $D_a$ is the text information divergence index of FP subtree $T_a$; and $C_a$ is the number of its main links.
The farther apart two nodes lie in node distribution graph $G_c$, the larger the difference of the element distributions in their data position sets, the smaller the data association coefficient between the node where $I_a$ is located and the f-th node, the lower the similarity of their node distribution vectors, and the smaller the value of $\exp(\cos(v_a, v_f))$; the more comment texts contain the word corresponding to $I_a$, the more main links pass through its node, and the larger the value of $num$. That is, the larger the values of $E_c$ and $D_a$, the stronger the semantic diffusion capability of the word corresponding to $I_a$, and the more easily it forms comment texts with obvious emotion polarity together with the remaining words.
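A sketch of the node divergence index and subtree-level index as reconstructed above, on toy vectors:

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def node_divergence(num, v_target, others):
    """num over the mean of exp(cosine) to the remaining nodes on the link."""
    denom = np.mean([np.exp(cos(v_target, v)) for v in others])
    return num / denom

rng = np.random.default_rng(0)
v_target = rng.normal(size=128)                    # vector of the target node
others = [rng.normal(size=128) for _ in range(4)]  # vectors of remaining nodes
E_c = node_divergence(num=6, v_target=v_target, others=others)
D_a = np.mean([E_c])  # average over all main links of the FP subtree
print(E_c, D_a)
```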
According to the above steps, the rule potential coefficient and the text information divergence index of the FP subtree corresponding to each frequent 1-item set are obtained, with which stronger association rules can be generated for words of different emotion polarities to realize knowledge enhancement. For any FP subtree, if the rule potential coefficient and the text information divergence index of the subtree do not change obviously before and after a node is deleted, the emotion semantics expressed by the word corresponding to the item set in that node have only a weak influence on the emotion semantics of the comment texts in which it occurs, and it is difficult to generate strong association rules from it.
Based on the above analysis, the redundancy probability of each node on each FP subtree is determined from the obtained rule potential coefficient and text information divergence index of the subtree, and node pruning is performed according to the redundancy probability of each node. The redundancy probability of the k-th node on FP subtree $T_a$ is computed as:

$$Q_a = H_a \cdot D_a, \qquad p_k = norm\left(\exp\left(Q_a^{(k)} - Q_a\right)\right)$$

where $Q_a$ is the rule mining capability index of FP subtree $T_a$; $D_a$ and $H_a$ are its text information divergence index and rule potential coefficient; $p_k$ is the redundancy probability of the k-th node on $T_a$; $norm$ is a normalization function; $\exp$ is the exponential function with the natural constant as base; and $Q_a^{(k)}$ is the rule mining capability index of $T_a$ after the k-th node is deleted.
The more the rule mining capability index of $T_a$ decreases after the k-th node is deleted, the smaller the values of $Q_a^{(k)} - Q_a$ and $\exp(Q_a^{(k)} - Q_a)$, the smaller the redundancy probability of the k-th node, and the lower the probability that the k-th node is pruned.
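A sketch of the redundancy probability with illustrative numbers; min-max normalization is an assumption, since the text only specifies a normalization result:

```python
import numpy as np

H_a, D_a = 0.42, 1.7
Q_a = H_a * D_a  # rule mining capability index of the subtree
Q_deleted = np.array([0.60, 0.71, 0.55, 0.70])  # index after deleting each node

raw = np.exp(Q_deleted - Q_a)  # small when deleting the node hurts Q_a
p = (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)  # redundancy probabilities
print(p)  # nodes with high p are candidates for pruning
```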
So far, the redundancy probability of each node on each subtree is obtained, and the association rule in comment text data can be acquired conveniently.
Step S005, obtaining emotion association rules by adopting a data mining algorithm based on redundancy probabilities of all nodes on each FP subtree; knowledge enhancement of large language models for emotion analysis is achieved based on emotion association rules.
According to the above steps, the redundancy probability of each node on the FP subtree corresponding to each frequent 1-item set in the original transaction library is obtained. Next, the redundancy probabilities of all nodes on the FP subtree corresponding to each frequent 1-item set are taken as input, and the segmentation threshold of the redundancy probabilities of all nodes on each FP subtree is obtained with the Otsu threshold algorithm; the Otsu threshold algorithm is a known technique and its details are not repeated.
Further, all nodes whose redundancy probability is greater than the segmentation threshold on the FP subtree corresponding to each frequent 1-item set are deleted, and the count value of the item set in each deleted node is updated in the original transaction library, i.e., the count value of the frequent 1-item set in each deleted node on each FP subtree is reduced by 1. For example, if the redundancy probability $p_k$ of the k-th node on the FP subtree $T_a$ corresponding to frequent 1-item set $I_a$ is greater than the segmentation threshold, the count value of the frequent 1-item set in the k-th node on $T_a$ is reduced by 1. All nodes on all FP subtrees are traversed and the count values of the item sets of all transactions in the original transaction library are updated; the updated original transaction library is taken as input, the item set support threshold in the FP-growth algorithm is set to 0.2 and the association rule confidence threshold to 0.5, and the association rules of the original transaction library obtained with the FP-growth algorithm are taken as the emotion association rules in the comment texts; FP-growth is a known technique and its details are not repeated.
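The pruning step can be sketched as a one-dimensional Otsu threshold over the redundancy probabilities followed by count decrements; the data are toy values and the re-mining with FP-growth is left to any standard implementation:

```python
import numpy as np

def otsu_threshold(values, bins=64):
    """Classic Otsu on a 1-D sample: maximize between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    prob = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = prob[:i].sum(), prob[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (prob[:i] * centers[:i]).sum() / w0
        mu1 = (prob[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, centers[i]
    return best_t

p = np.array([0.05, 0.10, 0.85, 0.90, 0.12, 0.80])  # redundancy probabilities
t = otsu_threshold(p)
counts = {("item", k): 10 for k in range(len(p))}  # toy per-node item counts
for k, pk in enumerate(p):
    if pk > t:
        counts[("item", k)] -= 1  # update the original transaction library count
print(t, counts)
```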
Further, knowledge enhancement is performed on the emotion analysis large language model according to the emotion association rules in the comment texts. Specifically, the joint occurrence frequency p of any two frequent item sets in the emotion association rules is counted, i.e., the ratio of the number of emotion association rules containing both frequent item sets to the number of all emotion association rules. For example, if 100 out of 1000 emotion association rules contain both frequent item sets $I_1$ and $I_2$, their joint occurrence frequency p is 0.1. The mean $\bar{p}$ of all joint occurrence frequencies is then obtained; for any joint occurrence frequency greater than $\bar{p}$, the words corresponding to its two frequent item sets are combined into a phrase, and the phrases of all joint occurrence frequencies greater than $\bar{p}$ are taken as the extended background knowledge of the N comment texts.
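A sketch of the joint occurrence statistic and phrase construction on toy rules:

```python
from itertools import combinations

rules = [  # each rule as the set of words it involves (illustrative)
    {"noodles", "delicious"}, {"noodles", "delicious", "cheap"},
    {"soup", "salty"}, {"noodles", "cheap"},
]
words = sorted(set().union(*rules))
p = {(a, b): sum(1 for r in rules if a in r and b in r) / len(rules)
     for a, b in combinations(words, 2)}
mean_p = sum(p.values()) / len(p)
phrases = [f"{a} {b}" for (a, b), v in p.items() if v > mean_p]
print(phrases)  # extended background knowledge phrases
```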
In another embodiment, for any joint occurrence frequency greater than the mean $\bar{p}$, the words corresponding to its two frequent item sets are replaced with each other to obtain new comment texts, so that new data samples are constructed by combining the words in the frequent item sets and knowledge enhancement is realized; all new comment texts formed by replacing the frequent item sets of joint occurrence frequencies greater than $\bar{p}$ are taken as the extended background knowledge of the N comment texts.
Besides the above two embodiments, the words corresponding to all frequent item sets in the emotion association rules of the comment texts can be taken as a strong association word set. The strong association word set is taken as input, the number of recognized topics is set to 100, and the words contained in each topic of the strong association word set are obtained with the LDA (Latent Dirichlet Allocation) topic recognition model; LDA is a known technique and its details are not repeated. Next, all topics are sorted in ascending order of the number of words they contain, and the first 20 topics in the sorting result are marked as topics to be expanded. For each word contained in a topic to be expanded, the sentence composed of the words corresponding to all frequent item sets in any emotion association rule in which the word occurs is taken as an expanded topic sentence, and all expanded topic sentences corresponding to all topics to be expanded are added to the acquired N comment texts as the extended background knowledge of the N emotion analysis texts; in this way, new topic comment texts are constructed by combining the topics of the frequent item sets to realize knowledge enhancement.
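A hedged sketch of this topic-expansion route with gensim's LdaModel; the corpus is illustrative and the topic count is shrunk from 100 to fit the toy data:

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["noodles", "delicious", "cheap"], ["soup", "salty"],
        ["noodles", "soup", "delicious"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Words per topic; in the patent the topics with the fewest words are marked
# as topics to be expanded and their rule words are joined into sentences.
for tid in range(lda.num_topics):
    print(tid, lda.show_topic(tid, topn=3))
```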
Further, the obtained extended background knowledge is used as prior input, and the data set formed by all comment texts is used as a training set and input into the emotion analysis large language model, thereby realizing knowledge enhancement of the large language model.
Based on the same inventive concept as the above method, the embodiment of the present invention further provides a large language model knowledge enhancement system, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, where the processor implements the steps of any one of the above large language model knowledge enhancement methods when executing the computer program.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. The foregoing description of the preferred embodiments of the present invention is not intended to be limiting, but rather, any modifications, equivalents, improvements, etc. that fall within the principles of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method for large language model knowledge enhancement, the method comprising the steps of:
Acquiring an original transaction library based on word segmentation processing results of existing comment text data;
determining a transaction submatrix of each frequent 1-item set according to the distribution of transaction items in different transactions where each frequent 1-item set is located; determining the item set semantic similarity of each FP subtree according to the similarity of element distribution in the transaction submatrices of different frequent 1-item sets;
determining rule potential coefficients of the FP subtrees according to the similarity degree between nodes on each link path in the FP subtrees corresponding to each frequent 1-item set;
determining a text information divergence index of each FP subtree according to the similarity degree of semantic information carried by item sets in nodes on each link path in the corresponding FP subtree of each frequent 1-item set;
obtaining redundancy probability of each node on the FP subtree according to rule potential coefficients and text information divergence indexes of the corresponding FP subtree in each frequent 1-item set; adopting a data mining algorithm to obtain emotion association rules based on redundancy probabilities of all nodes on corresponding FP subtrees in each frequent 1-item set; knowledge enhancement of a large language model for emotion analysis is realized based on emotion association rules;
the method for determining the rule potential coefficient of the FP subtree according to the similarity degree between nodes on each link path in the FP subtree corresponding to each frequent 1-item set comprises the following steps:
Marking each frequent 1-item set as a target item set, marking a node where the target item set is located on an FP subtree corresponding to each frequent 1-item set as a target node, taking each link passing through the target item set on the FP subtree of each frequent 1-item set as a main link, and taking the average value of the support degree of the target item set on each main link and the support degree of the item set in any node on the link as a first average value; taking the mean value of the part-of-speech weight corresponding to the target item set on each main link and the part-of-speech weight corresponding to the item set in any node on the link as a second mean value;
taking the product of the semantic distribution approach coefficient between the target item set on each main link and the item set in any node on the link and the first mean value and the second mean value as the data association coefficient of any node on each main link;
taking the average value of the data association coefficients of all nodes on each main link as a first calculation factor; taking the average value of the accumulation results of the first calculation factors on all main links on each FP subtree as a rule association coefficient of the target node;
the rule potential coefficients of each FP subtree consist of two parts of item set semantic similarity and rule association coefficients, wherein the rule potential coefficients are respectively in direct proportion to the item set semantic similarity and the rule association coefficients;
The method for determining the text information divergence index of the FP subtree according to the similarity degree of semantic information carried by the item sets in the nodes on each link path in the FP subtree corresponding to each frequent 1-item set comprises the following steps:
taking an undirected graph determined by taking each node on each main link as one node in the undirected graph as a node distribution graph, and obtaining the node distribution vector of each node in each node distribution graph by adopting the deep walk algorithm;
taking the sum of the number of descendant nodes and ancestor nodes of the node where the target item set is located on each main link as the numerator;
taking the average value of the accumulated results of the mapping results of the measurement results between the node where the target item set is located and the distribution vectors of the nodes corresponding to the rest nodes on each main link as a denominator; taking the ratio of the numerator to the denominator as the node divergence index of each main link;
and taking the average value of node divergence indexes of all main links on each frequent 1-item set corresponding FP subtree as the text information divergence index of the FP subtree.
2. The method for enhancing knowledge of a large language model according to claim 1, wherein the method for obtaining an original transaction library based on the word segmentation result of the existing comment text data comprises the following steps:
Obtaining a preset number of pieces of comment text data from scoring software by utilizing a crawler technology, and numbering all pieces of comment text data from small to large according to a time sequence;
the LTP tool package is adopted to respectively process word segmentation, part of speech tagging and stop word removal of each piece of comment text data, and the coding technology is adopted to code the processing results of all pieces of comment text data;
taking the number of each comment text data as the number of a transaction, and taking the coding result of each comment text data as a transaction item set of the transaction corresponding to each comment text data; and taking the database formed by all the transactions as an original transaction library.
3. The method for enhancing knowledge in a large language model according to claim 1, wherein the method for determining the transaction submatrix of each frequent 1-item set according to the distribution of transaction items in different transactions in which each frequent 1-item set is located is as follows:
obtaining all frequent 1-item sets in the original transaction library based on the original transaction library by adopting a data mining algorithm;
taking a vector formed by assignment results of all frequent 1-item sets in the transaction item set of each transaction as a row vector, and taking a matrix formed by row vectors corresponding to all the transactions as a transaction library matrix;
And taking any one frequent 1-item set as a target item set, setting element values in all rows which do not contain the target item set in the transaction library matrix as 0, keeping the element values in all rows which contain the target item set in the transaction library matrix unchanged, and taking the result after the element values in all rows in the transaction library matrix are reset as a transaction submatrix of the target item set.
4. The method for enhancing knowledge of a large language model according to claim 1, wherein the method for determining the item set semantic similarity of each FP subtree according to the similarity of the element distributions in the transaction submatrices of different frequent 1-item sets is as follows:
taking the set of transactions whose row vectors in the transaction submatrix of each frequent 1-item set contain a value of 1 as the matching transaction library of that frequent 1-item set, and taking the FP tree constructed from the matching transaction library of each frequent 1-item set by using the FP-growth algorithm as the FP subtree of that frequent 1-item set;
taking the vector formed by the encoding results of the transaction items of any transaction in the matching transaction library of each frequent 1-item set as a semantic coding vector, and taking the set of semantic coding vectors corresponding to all transactions in the matching transaction library of each frequent 1-item set as the semantic coding set of that frequent 1-item set;
determining the semantic similarity based on a metric distance between the semantic coding set of each frequent 1-item set and the semantic coding set of any one of the remaining frequent 1-item sets;
taking the data pair consisting of the row number and the column number of each element with value 1 in the transaction submatrix of each frequent 1-item set as an ordinal pair, and taking the set of ordinal pairs corresponding to all elements with value 1 in the transaction submatrix of each frequent 1-item set as the data position set of that frequent 1-item set;
determining the semantic distribution proximity coefficient based on a metric distance between the element distributions of the data position set of each frequent 1-item set and the data position set of any one of the remaining frequent 1-item sets;
the item set semantic similarity of each FP subtree is composed of two parts, the semantic similarity and the semantic distribution proximity coefficient, and is positively correlated with each of the two (a sketch of both components follows below);
and taking the average of the node divergence indexes of all main links on the FP subtree corresponding to each frequent 1-item set as the text information divergence index of that FP subtree.
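The two components of the item set semantic similarity can be sketched as follows. The claim requires only "a metric distance"; the centroid distance, the average nearest-neighbour distance, the `1/(1 + d)` mapping and the product combination used here are illustrative assumptions rather than the patent's fixed choices, and the semantic coding vectors are assumed to have been padded to a common length.

```python
import numpy as np


def position_set(sub: np.ndarray) -> np.ndarray:
    """Ordinal (row, column) pairs of every 1-element in a transaction submatrix."""
    return np.argwhere(sub == 1)


def semantic_similarity(codes_a: np.ndarray, codes_b: np.ndarray) -> float:
    """Euclidean distance between the centroids of two semantic coding sets,
    mapped to (0, 1]; one illustrative choice of metric distance."""
    d = np.linalg.norm(codes_a.mean(axis=0) - codes_b.mean(axis=0))
    return 1.0 / (1.0 + d)


def proximity_coefficient(pos_a: np.ndarray, pos_b: np.ndarray) -> float:
    """Symmetric average nearest-neighbour distance between two data position
    sets, mapped to (0, 1]."""
    if len(pos_a) == 0 or len(pos_b) == 0:
        return 0.0
    d = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=2)
    avg = (d.min(axis=1).mean() + d.min(axis=0).mean()) / 2.0
    return 1.0 / (1.0 + avg)


def itemset_semantic_similarity(sem: float, prox: float) -> float:
    """Positively correlated with both components; a product is one simple
    monotone combination."""
    return sem * prox
```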
5. The method for enhancing knowledge of a large language model according to claim 1, wherein the method for obtaining the redundancy probability of each node on an FP subtree according to the rule potential coefficient and the text information divergence index of the FP subtree corresponding to each frequent 1-item set is as follows:
taking the product of the rule potential coefficient and the text information divergence index of each FP subtree as the rule mining capability index of that FP subtree;
taking the natural constant e as the base, and taking as the exponent the difference between the rule mining capability index of each FP subtree after each node is deleted and the rule mining capability index of the original FP subtree; the normalized result of this exponential calculation is taken as the redundancy probability of the corresponding node on each FP subtree.
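A minimal sketch of this computation. The exponent follows the claim (capability index after node deletion minus the subtree's original index, base e); the sum normalization is an assumption, since the claim does not fix a normalization method.

```python
import numpy as np


def redundancy_probabilities(r_tree: float, r_without_node: list[float]) -> np.ndarray:
    """r_tree: rule mining capability index (rule potential coefficient times
    text information divergence index) of the full FP subtree.
    r_without_node[i]: the same index recomputed after deleting node i."""
    raw = np.exp(np.asarray(r_without_node) - r_tree)   # base e, claimed exponent
    return raw / raw.sum()                              # assumed normalization
```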
6. The method for enhancing knowledge of a large language model according to claim 1, wherein the method for obtaining emotion association rules by using a data mining algorithm based on the redundancy probabilities of all nodes on the FP subtree corresponding to each frequent 1-item set comprises:
taking the redundancy probabilities of all nodes on the FP subtree corresponding to each frequent 1-item set as input, and obtaining a segmentation threshold of the redundancy probability by adopting a threshold segmentation algorithm;
deleting, from the FP subtree corresponding to each frequent 1-item set, all nodes whose redundancy probability is greater than the segmentation threshold, updating in the original transaction library the count values of the item sets contained in the deleted nodes, and obtaining the emotion association rules by applying the FP-growth algorithm to the updated original transaction library.
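The claim names only "a threshold segmentation algorithm". A one-dimensional Otsu threshold is one common instance; the sketch below implements it under that assumption.

```python
import numpy as np


def otsu_threshold(values, bins: int = 64) -> float:
    """1-D Otsu threshold: pick the cut that maximizes between-class variance.
    One concrete instance of a threshold segmentation algorithm."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = hist[:i].sum(), hist[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:i] * centers[:i]).sum() / w0
        m1 = (hist[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, centers[i]
    return float(best_t)


# Nodes with redundancy probability above the threshold would then be pruned
# and FP-growth rerun on the updated transaction library.
```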
7. The method for enhancing knowledge of a large language model according to claim 1, wherein the method for enhancing knowledge of a large language model for emotion analysis based on emotion association rules comprises the steps of:
acquiring extended background knowledge related to the comment text data by using the emotion association rules, and taking the extended background knowledge as a priori input, thereby realizing knowledge enhancement of the large language model for emotion analysis.
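As an illustration of how the mined rules could serve as a priori input, the following sketch prepends matched rule consequents to the model prompt. The rule representation, the substring matching and the `llm_query` callable are hypothetical; the patent does not specify a prompt format.

```python
def knowledge_enhanced_prompt(comment: str, rules, llm_query):
    """rules: assumed list of (antecedent_tokens, consequent_text) pairs.
    llm_query: hypothetical callable wrapping the large language model."""
    matched = [cons for ante, cons in rules if all(tok in comment for tok in ante)]
    background = "Background knowledge: " + "; ".join(matched) if matched else ""
    prompt = f"{background}\nAnalyze the sentiment of the following review:\n{comment}"
    return llm_query(prompt.strip())
```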
8. A large language model knowledge enhancement system comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor, when executing said computer program, implements the steps of the large language model knowledge enhancement method according to any one of claims 1-7.
CN202311818163.6A 2023-12-27 2023-12-27 Knowledge enhancement method and system for large language model Active CN117474013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311818163.6A CN117474013B (en) 2023-12-27 2023-12-27 Knowledge enhancement method and system for large language model

Publications (2)

Publication Number Publication Date
CN117474013A CN117474013A (en) 2024-01-30
CN117474013B 2024-03-22

Family

ID=89629716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311818163.6A Active CN117474013B (en) 2023-12-27 2023-12-27 Knowledge enhancement method and system for large language model

Country Status (1)

Country Link
CN (1) CN117474013B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6385608B1 (en) * 1997-11-11 2002-05-07 Mitsubishi Denki Kabushiki Kaisha Method and apparatus for discovering association rules
CN105589908A (en) * 2014-12-31 2016-05-18 ***股份有限公司 Association rule computing method for transaction set
CN107301217A (en) * 2017-06-15 2017-10-27 东南大学 Improved FP GROWTH methods
CN110297853A (en) * 2019-07-01 2019-10-01 阿里巴巴集团控股有限公司 Frequent Set method for digging and device
CN110362670A (en) * 2019-07-19 2019-10-22 中国联合网络通信集团有限公司 Item property abstracting method and system
CN112016602A (en) * 2020-08-18 2020-12-01 广东电网有限责任公司韶关供电局 Method, equipment and storage medium for analyzing correlation between power grid fault cause and state quantity
CN113095080A (en) * 2021-06-08 2021-07-09 腾讯科技(深圳)有限公司 Theme-based semantic recognition method and device, electronic equipment and storage medium
CN116340617A (en) * 2023-02-15 2023-06-27 荣耀终端有限公司 Search recommendation method and device
CN116721001A (en) * 2023-08-10 2023-09-08 江苏网进科技股份有限公司 Smart city resource management method based on digital twinning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Recognition of sentiment elements in Chinese microblogs based on convolution tree kernels. Computer Science. 2014, paragraphs 133-142. *
Research on text-based association rule extraction methods; Huang Jiaman et al.; Computer Simulation; 2008-01-31 (No. 01); pp. 107-110 *

Similar Documents

Publication Publication Date Title
US10706084B2 (en) Method and device for parsing question in knowledge base
Jain Question answering over knowledge base using factual memory networks
CN107688870B (en) Text stream input-based hierarchical factor visualization analysis method and device for deep neural network
CN101295294A (en) Improved Bayes acceptation disambiguation method based on information gain
US20220318317A1 (en) Method for disambiguating between authors with same name on basis of network representation and semantic representation
CN109934251B (en) Method, system and storage medium for recognizing text in Chinese language
Nieto-Piña et al. Embedding senses for efficient graph-based word sense disambiguation
CN110569503A (en) Semantic item representation and disambiguation method based on word statistics and WordNet
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
JP2006065387A (en) Text sentence search device, method, and program
CN111444713B (en) Method and device for extracting entity relationship in news event
CN104881400A (en) Semantic dependency calculating method based on associative network
CN109033084B (en) Semantic hierarchical tree construction method and device
CN104572633A (en) Method for determining meanings of polysemous word
CN111581365B (en) Predicate extraction method
Ekinci et al. An aspect-sentiment pair extraction approach based on latent Dirichlet allocation for Turkish
CN117474013B (en) Knowledge enhancement method and system for large language model
CN117009213A (en) Metamorphic testing method and system for logic reasoning function of intelligent question-answering system
Klubicka et al. Synthetic, yet natural: Properties of wordnet random walk corpora and the impact of rare words on embedding performance
JP5542732B2 (en) Data extraction apparatus, data extraction method, and program thereof
CN111767388B (en) Candidate pool generation method
CN111159366A (en) Question-answer optimization method based on orthogonal theme representation
CN117688354B (en) Text feature selection method and system based on evolutionary algorithm
CN110598209A (en) Method, system and storage medium for extracting keywords
CN112256970B (en) News text pushing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant