CN114881017A - Self-adaptive dynamic word segmentation method

Info

Publication number: CN114881017A
Application number: CN202210441833.6A
Authority: CN
Inventors: 王峥; 杨梦玲; 武志彦; 董文君; 臧高峰; 陈虎
Original assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Current assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date: 2022-04-25
Filing date: 2022-04-25
Publication date: 2022-08-09

Abstract

The invention discloses a self-adaptive dynamic word segmentation method, which comprises the following steps: S1, inputting the original text directly into a domain-specific word matching module, and, if the text matches the specific dictionary of a certain domain, proceeding directly to word segmentation for that domain; S2, if domain-specific word matching fails, entering a domain pattern matching module, matching the text against all preset domain patterns in the module, evaluating the matching effect, and, if the matching succeeds, proceeding directly to a word segmentation module to complete word segmentation; and S3, if domain pattern matching fails, entering a domain classification module, classifying the domain by using a deep learning model together with the pattern matching result, and completing word segmentation according to the classification result. The method automatically mines more domain-specific words through multi-model fusion to enrich the domain dictionary, dynamically calculates a pattern matching score, and combines the features extracted by pattern matching with text semantics, which improves domain classification precision and the word segmentation effect in different domains.

Description

Self-adaptive dynamic word segmentation method

Technical Field

The invention relates to the technical field of word segmentation systems, in particular to a self-adaptive dynamic word segmentation method.

Background

With the popularization of computers, information grows more important by the day. Faced with the vast and varied data on the network, how to mine the information implicit in the data so that the data yields its maximum value has become a research hotspot. Typical applications include search engines, intelligent question answering and knowledge graphs, all of which build on word segmentation technology, so good word segmentation is the essential first step. Currently known word segmentation systems include jieba, HIT LTP, HanLP, Stanford CoreNLP and the like, and the word segmentation technologies adopted include methods based on dictionary matching, statistics, deep learning and the like.

However, in practical applications, data in different domains should be segmented at different granularities. For example, "K powder" would be segmented into the two words "K/powder" in the general domain, but should be kept as a single word in the chemical domain, where it denotes ketamine, an anesthetic that is often abused as a drug; recognizing it correctly is important for drug-related law enforcement and the like. In the dictionary-based word segmentation methods used in industry, however, the words are mostly placed in one file and text is segmented directly against that single dictionary, or several dictionaries are configured but searched in a fixed order during segmentation. As a result, data from different domains is segmented at the same granularity, and word segmentation generalizes poorly to specialized domains.

To improve the generalization of word segmentation and segment data from different domains at different granularities, a domain classification technique that combines domain-specific vocabulary, pattern matching and deep learning is needed to dynamically adjust the dictionary on which word segmentation depends, so that a suitable segmentation granularity is selected adaptively according to the domain of the data. An adaptive dynamic word segmentation method is therefore proposed.

Disclosure of Invention

In order to solve the technical problems, the invention provides the following technical scheme:

the invention discloses a self-adaptive dynamic word segmentation method, which comprises the following steps:

S1, inputting the original text directly into a domain-specific word matching module, and, if the text matches the specific dictionary of a certain domain, proceeding directly to word segmentation for that domain;

S2, if domain-specific word matching fails, entering a domain pattern matching module, matching the text against all preset domain patterns in the module, evaluating the matching effect, and, if the matching succeeds, proceeding directly to a word segmentation module to complete word segmentation;

S3, if domain pattern matching fails, entering a domain classification module, classifying the domain by using a deep learning model together with the pattern matching result, and completing word segmentation according to the classification result.
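The three steps form a cascade in which each later stage runs only when the earlier one fails. The following is a minimal sketch of that control flow, not the patented implementation: the function names, the stage signature (a callable returning a domain name or None) and the toy data are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical stage signature: return a domain name, or None on failure.
Stage = Callable[[str], Optional[str]]


def choose_domain(text: str, match_words: Stage, match_patterns: Stage,
                  classify: Callable[[str], str]) -> str:
    """Try S1, then S2, and fall back to S3 only when both fail."""
    for stage in (match_words, match_patterns):
        domain = stage(text)
        if domain is not None:
            return domain
    return classify(text)


# Toy stand-ins for the three modules (illustrative data only).
DOMAIN_WORDS = {"chemistry": {"ketamine"}}


def by_words(text: str) -> Optional[str]:
    for domain, words in DOMAIN_WORDS.items():
        if any(word in text for word in words):
            return domain
    return None


def by_patterns(text: str) -> Optional[str]:
    return "traffic" if "collided" in text else None


def by_classifier(text: str) -> str:
    return "general"
```

Once a domain has been chosen, the segmenter would load that domain's dictionary together with the common dictionary; that step is left out of the sketch.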

As a preferred technical solution of the present invention, the domain-specific word matching module in step S1 includes two processes of domain-specific word generation and domain-specific word matching, and the specific steps of the domain-specific word generation are as follows:

S1.1, preparing a domain corpus and a non-domain corpus;

S1.2, performing preliminary word segmentation on the domain corpus and the non-domain corpus respectively, directly using jieba, to obtain a domain word set and a non-domain dictionary;

S1.3, filtering stop words out of the domain word set;

S1.4, because the granularity of ordinary jieba segmentation is usually very fine, combining adjacent words into candidate new words by the following three methods:

Method one: t-test model:

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$$

where $\bar{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the sample size and $\mu$ is the mean of the distribution. The null hypothesis is that the words of an n-gram occur independently of one another. The t statistic is calculated by traversal for every n-gram from unigrams to 4-grams; at the level α = 0.005, a statistic t > 2.576 allows the null hypothesis to be rejected with 99.5% confidence, that is, the n-gram can be regarded as a genuine word with 99.5% confidence;

Method two: solidification degree (cohesion) model:

the probability that the words X and Y appear together is divided by the product of the probabilities that each appears on its own, i.e. $P(XY) / (P(X)\,P(Y))$; the higher this value, the more likely it is that X and Y together form the word XY;

Method three: degree of freedom model: $H(u) = -\sum_i p_i \log p_i$

If many different words appear on both sides of the combination XY, that is, the degree of freedom of the neighboring words on both sides is high, XY is more independent as a word; if the degree of freedom on one side is very low, XY does not occur independently and may be part of a longer word XYZ;
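The three statistics can be illustrated on a toy token list. The sketch below restricts itself to two-word combinations and makes several assumptions that the text does not state: the sample variance is approximated by the Bernoulli variance x̄(1 − x̄), the token count is used as the sample size for both words and pairs, and the degree of freedom of a pair is taken as the smaller of its left and right neighbor entropies. All function names are hypothetical.

```python
import math
from collections import Counter


def counts(tokens):
    """Unigram and adjacent-pair counts of a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def t_score(pair, unigrams, bigrams, n):
    """t = (x_bar - mu) / sqrt(s2 / N) under the independence hypothesis."""
    x_bar = bigrams[pair] / n                                # observed P(XY)
    mu = (unigrams[pair[0]] / n) * (unigrams[pair[1]] / n)   # expected P(X)P(Y)
    s2 = x_bar * (1 - x_bar)                                 # assumed Bernoulli variance
    return (x_bar - mu) / math.sqrt(s2 / n) if s2 > 0 else 0.0


def cohesion(pair, unigrams, bigrams, n):
    """P(XY) / (P(X) P(Y)); larger values suggest a tighter combination."""
    p_xy = bigrams[pair] / n
    return p_xy / ((unigrams[pair[0]] / n) * (unigrams[pair[1]] / n))


def branch_entropy(neighbours):
    """H = -sum(p_i * log p_i) over the neighbor distribution."""
    total = sum(neighbours.values())
    return -sum((c / total) * math.log(c / total) for c in neighbours.values())


def freedom(pair, tokens):
    """Smaller of the left and right neighbor entropies of a pair."""
    left, right = Counter(), Counter()
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) == pair:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[tokens[i + 2]] += 1
    return min(branch_entropy(left), branch_entropy(right))


tokens = ["seized", "K", "powder", "today", "found", "K", "powder",
          "there", "white", "powder", "here"]
unigrams, bigrams = counts(tokens)
pair = ("K", "powder")
stats = (t_score(pair, unigrams, bigrams, len(tokens)),
         cohesion(pair, unigrams, bigrams, len(tokens)),
         freedom(pair, tokens))
```

On a corpus this small the t statistic stays far below the 2.576 threshold, which is expected; the threshold only becomes meaningful with realistic corpus sizes.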

S1.5, using the rankings produced by the three methods in S1.4, selecting a certain number of words in order from high confidence to low, adding them to the jieba user-defined dictionary, segmenting the corpus again, and computing a vector for each word with a word2vec model;

S1.6, taking the intersection of the results of the three methods in S1.4 as seed words, and taking the remaining words as candidate new words;

S1.7, for each seed word, ranking the candidate new words by similarity to it and selecting those with higher similarity, then voting and ranking by the voting results to obtain domain keywords;

S1.8, taking the difference set of the domain keywords and the non-domain dictionary to finally obtain the domain dictionary.
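Steps S1.6 to S1.8 can be sketched with plain set operations and cosine similarity. The sketch assumes the word vectors are already available (in the method they come from word2vec; here they are hand-written toy vectors), and the parameters top_k and min_votes, as well as the decision to keep the seed words among the keywords, are illustrative assumptions rather than values given in the text.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_domain_words(method_results, vectors, non_domain, top_k=1, min_votes=2):
    """method_results: one word list per method of S1.4."""
    sets = [set(words) for words in method_results]
    seeds = set.intersection(*sets)              # S1.6: words all methods agree on
    candidates = set.union(*sets) - seeds        # S1.6: the remaining words
    votes = Counter()
    for seed in seeds:                           # S1.7: each seed votes for its
        ranked = sorted(candidates,              # most similar candidates
                        key=lambda w: cosine(vectors[seed], vectors[w]),
                        reverse=True)
        votes.update(ranked[:top_k])
    keywords = seeds | {w for w, n in votes.items() if n >= min_votes}
    return keywords - set(non_domain)            # S1.8: difference set


method_results = [
    ["ketamine", "K powder", "weather"],
    ["ketamine", "K powder", "anesthetic"],
    ["ketamine", "K powder", "anesthetic", "weather"],
]
vectors = {
    "ketamine": [1.0, 0.0],
    "K powder": [0.9, 0.1],
    "anesthetic": [0.8, 0.2],
    "weather": [0.0, 1.0],
}
domain_dictionary = mine_domain_words(method_results, vectors, {"weather"})
```

In this toy run both seed words vote for "anesthetic", so it joins the domain dictionary, while "weather" receives no votes.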

As a preferred technical solution of the present invention, the specific process of the domain-specific vocabulary matching is as follows:

if the specific words of only a single domain are matched, directly loading that domain dictionary together with the common dictionary to complete word segmentation;

if the specific dictionaries of several domains are matched at the same time, dividing the text into segments, and, if each segment matches the dictionary of only one domain, segmenting each segment with its own domain dictionary;

if an individual segment still matches several domain dictionaries, entering the domain pattern matching module.
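The three cases above amount to a small routing decision. The following sketch assumes that segments are separated by line breaks and that a segment matching no domain dictionary is handled with the common dictionary (labelled "general" here); neither detail is specified in the text, and the function names are hypothetical.

```python
def matched_domains(text, dictionaries):
    """Domains whose specific dictionary has at least one word in the text."""
    return {domain for domain, words in dictionaries.items()
            if any(word in text for word in words)}


def route(text, dictionaries):
    """Return [(segment, domain), ...], or None to fall through to pattern matching."""
    domains = matched_domains(text, dictionaries)
    if not domains:
        return None                               # S1 failed: go to S2
    if len(domains) == 1:
        return [(text, next(iter(domains)))]      # single domain: segment directly
    routed = []
    for segment in filter(None, text.split("\n")):
        hits = matched_domains(segment, dictionaries)
        if len(hits) > 1:
            return None                           # still ambiguous: go to S2
        routed.append((segment, next(iter(hits), "general")))
    return routed


dictionaries = {"chemistry": {"ketamine"}, "traffic": {"rear-end"}}
```

For example, `route("ketamine seized\nrear-end crash", dictionaries)` assigns each line its own domain, whereas a single line that mentions both words yields None.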

As a preferred technical solution of the present invention, the domain pattern matching module in step S2 includes domain matching patterns and pattern matching, where a number of matching patterns are preset for each domain and the distance between the preceding word and the following word of a pattern is limited to less than 15.
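A two-word pattern with a bounded gap maps naturally onto a regular expression. The sketch below assumes the distance is counted in characters, which the text does not specify, and the pattern words themselves are invented examples rather than patterns taken from the method.

```python
import re


def compile_pattern(first, second, max_distance=15):
    """Match `first`, then fewer than `max_distance` characters, then `second`."""
    gap = r".{0,%d}?" % (max_distance - 1)
    return re.compile(re.escape(first) + gap + re.escape(second), re.DOTALL)


# Hypothetical patterns for a traffic accident domain.
TRAFFIC_PATTERNS = [
    compile_pattern("occurred", "collision"),
    compile_pattern("caused", "injured"),
]


def count_matches(text, patterns):
    """Number of patterns that match somewhere in the text."""
    return sum(1 for pattern in patterns if pattern.search(text))
```

The per-domain counts returned by a function like `count_matches` are the quantities that the later scoring step compares across domains.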

As a preferred technical solution of the present invention, the specific process of pattern matching is as follows:

S2.1, loading the preset domain matching patterns, and performing pattern matching on the input text;

S2.2, as with domain-specific word matching, if only the patterns of a single domain are matched, directly judging that the input text belongs to that domain;

S2.3, if the patterns of several domains are matched, dividing the text into segments and performing pattern matching on each segment separately;

S2.4, if a domain conflict still exists within a segment, applying the domain scoring function designed in this scheme, specifically: among the numbers of successfully matched patterns per domain, the largest is denoted match_max, the second largest match_sec and the smallest match_min, and the score function score is computed from these three values. score is adjusted dynamically with the pattern matching result: the larger the gap between match_max and match_sec, match_min, the larger score is, i.e. the more prominent the pattern features of one domain; conversely, the smaller score is, the less distinct the pattern features;

S2.5, if score is greater than 0.85, judging that the input text belongs to the corresponding domain; otherwise, recording the successfully matched patterns of each domain, adding them to a pattern feature list, and entering the domain classification module.
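The decision logic of S2.2, S2.4 and S2.5 can be sketched as follows. Because the scoring formula itself is not reproduced in this text, the sketch takes it as a caller-supplied function of match_max, match_sec and match_min; the function name, the return convention and the omission of the segmentation step S2.3 are all simplifying assumptions.

```python
def decide_domain(match_counts, score_fn, threshold=0.85):
    """match_counts maps each domain to its number of matched patterns.

    Returns (domain, []) when the domain is decided, or (None, features)
    when the text must go on to the domain classification module.
    """
    hits = {domain: n for domain, n in match_counts.items() if n > 0}
    if not hits:
        return None, []                       # no pattern matched at all
    if len(hits) == 1:
        return next(iter(hits)), []           # S2.2: a single domain matched
    ranked = sorted(hits.values(), reverse=True)
    score = score_fn(ranked[0], ranked[1], ranked[-1])   # max, sec, min
    if score > threshold:
        return max(hits, key=hits.get), []    # S2.5: clear winner
    return None, sorted(hits.items())         # S2.5: pattern feature list
```

Any function of the three counts that grows as match_max pulls away from the other two can be passed as `score_fn`; the second element of the result plays the role of the pattern feature list handed to the classifier.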

As a preferred technical solution of the present invention, the domain classification module in step S3 is used both to classify text that contains no domain-specific words or patterns and to correct and supplement the pattern matching result. Given the characteristics of the input text, a HAN (hierarchical attention network) is adopted, whose multi-layer attention mechanism attends to "words" to find the important words in a sentence and also attends to "sentences" to find the important sentences in the text.

As a preferred technical solution of the present invention, the word segmentation module in step S2 adopts a word segmentation method based on a double-array Trie, which specifically includes:

a) the construction process of the double-array Trie is as follows:

i. for a transition from state s to state t on receiving character c, the improved storage conditions in the double array are: base[s] + c = t, check[t] = base[s];

ii. establishing a root node root, and setting base[root] = 1;

iii. finding the child node set of root, {root.children_i} (i = 1...n), such that check[root.children_i] = base[root] = 1;

iv. performing the following operations on each element in root.children:

finding the child node set {element.children_i} (i = 1...n) of the element; if a character is at the end of a sequence, its child nodes include a null node whose code value is set to 0; then finding a value begin_i such that every check[begin_i + element.children_i.code] = 0;

setting base[element_i] = begin_i;

executing step iv recursively for each element.children_i; if an element has no child nodes (it is a leaf node), setting base[element] to a negative value;

b) word segmentation:

reading the text to be segmented and traversing it in sequence from front to back, calculating according to the conditions in step i of the double-array construction process; when base[s] = t, which indicates that c = 0, recording the position index, and Dic[index] is then the matched word in the domain dictionary.
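Construction and segmentation can be sketched together in a compact form. The code below follows the conditions base[s] + c = t and check[t] = base[s], uses code 0 for the null node and stores the dictionary index as a negative base value. Several details are assumptions not given in the text: characters are mapped to small integer codes, each begin value is used only once so that check[t] = base[s] stays unambiguous, the longest dictionary match is preferred, and unmatched characters are emitted one by one.

```python
def build(words):
    """Return (base, check, codes) for a non-empty list of dictionary words."""
    codes = {ch: i + 1 for i, ch in enumerate(sorted(set("".join(words))))}
    trie = {}                                   # nested dicts; key 0 = null node
    for index, word in enumerate(words):
        node = trie
        for ch in word:
            node = node.setdefault(codes[ch], {})
        node[0] = index                         # end of word: remember Dic index
    base, check, used = [0], [0], set()

    def ensure(n):                              # grow both arrays up to index n
        while len(check) <= n:
            base.append(0)
            check.append(0)

    def place(children):
        begin = 1                               # find begin with all slots free
        while begin in used or any(check[begin + c] for c in children
                                   if begin + c < len(check)):
            begin += 1
        used.add(begin)
        ensure(begin + max(children))
        for c in children:
            check[begin + c] = begin            # check[t] = base[s]
        for c, child in children.items():
            if c == 0:
                base[begin] = -child - 1        # leaf: negative base
            else:
                base[begin + c] = place(child)  # recurse into the child set
        return begin

    base[0] = place(trie)                       # base[root] = 1
    return base, check, codes


def segment(text, words, base, check, codes):
    """Forward scan that emits the longest dictionary word at each position."""
    out, i = [], 0
    while i < len(text):
        s, end, j = 0, None, i
        while j < len(text) and text[j] in codes:
            t = base[s] + codes[text[j]]
            if t >= len(check) or check[t] != base[s]:
                break
            s, j = t, j + 1
            e = base[s]                         # transition on c = 0
            if e < len(check) and check[e] == base[s] and base[e] < 0:
                end, index = j, -base[e] - 1
        if end is None:
            out.append(text[i])
            i += 1
        else:
            out.append(words[index])            # Dic[index]
            i = end
    return out


words = ["ab", "abc", "bc", "d"]
base, check, codes = build(words)
tokens = segment("abcdxbc", words, base, check, codes)
```

The linear search for begin keeps the sketch short; a production implementation would normally track the next free position instead of restarting from 1.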

The invention has the beneficial effects that:

according to the self-adaptive dynamic word segmentation method, more domain-specific words are automatically mined through multi-model fusion to enrich the domain dictionary, a pattern matching score is dynamically calculated, and the features extracted by pattern matching are combined with text semantics, so that the domain classification effect is improved; finally, the dictionary on which segmentation depends is adjusted adaptively according to the domain, so that the model is no longer limited to a single segmentation mode and the word segmentation granularity can be selected intelligently according to the domain of the data.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a general flow diagram of an adaptive dynamic word segmentation method of the present invention;

FIG. 2 is a flow chart of domain specific vocabulary generation for an adaptive dynamic word segmentation method of the present invention;

FIG. 3 is a diagram illustrating a new word combination in accordance with an adaptive dynamic word segmentation method of the present invention;

FIG. 4 is a schematic diagram of an algorithm network structure of an adaptive dynamic word segmentation method according to the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Example: as shown in fig. 1, the invention relates to a self-adaptive dynamic word segmentation method, which comprises the following steps:

The domain specific word matching module in step S1 includes two processes of domain specific word generation and domain specific word matching, the flow chart of the domain specific word generation is shown in fig. 2, and the specific steps are as follows:

S1.1, preparing a domain corpus and a non-domain corpus;

S1.3, filtering stop words, such as "used" and the like, out of the domain word set;

S1.4, because the granularity of ordinary jieba segmentation is usually very fine, combining several adjacent words into candidate new words, as shown in FIG. 3, by the following three methods:

Method one: t-test model:

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$$

where $\bar{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the sample size and $\mu$ is the mean of the distribution. The null hypothesis is that the words of an n-gram occur independently of one another. The t statistic is calculated by traversal for every n-gram from unigrams to 4-grams; at the level α = 0.005, a statistic t > 2.576 allows the null hypothesis to be rejected with 99.5% confidence, that is, the n-gram can be regarded as a genuine word with 99.5% confidence;

Method two: solidification degree (cohesion) model:

The specific process of the domain-specific vocabulary matching is as follows:

The domain pattern matching module in step S2 includes domain matching patterns and pattern matching, where a number of matching patterns are preset for each domain and the distance between the preceding word and the following word of a pattern is limited to less than 15; for example, the matching patterns of the traffic accident domain may be "occurrence of", "cause of", etc.

The specific flow of pattern matching is as follows:

S2.2, as with domain-specific word matching, if only the patterns of a single domain are matched, directly judging that the input text belongs to that domain;

The domain classification module in step S3 is used both to classify text that contains no domain-specific words or patterns and to correct and supplement the pattern matching result. Given the characteristics of the input text, a HAN (hierarchical attention network) is adopted, whose multi-layer attention mechanism attends to "words" to find the important words in a sentence and also attends to "sentences" to find the important sentences in the text; the network can classify short texts and also addresses the loss of precision that general classification methods suffer on long texts. The network structure of the algorithm is shown in fig. 4: the scheme concatenates the pattern features obtained by the pattern matching module with the representation of the input text to form the embedding that is input to the HAN network, so that the feature information captured by pattern matching enhances the classification effect of the model.
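The two attention levels and the feature concatenation can be illustrated without a deep learning framework. The sketch below is a heavily simplified stand-in, not a HAN implementation: it omits the recurrent encoders and learned projections of a real hierarchical attention network, uses fixed context vectors instead of trained ones, and assumes that the pattern features are appended to every word embedding. All names are hypothetical.

```python
import math


def attention_pool(vectors, context):
    """Softmax-weighted average of vectors, scored against a context vector."""
    scores = [sum(a * b for a, b in zip(v, context)) for v in vectors]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]      # numerically stable softmax
    total = sum(weights)
    dim = len(vectors[0])
    return [sum((w / total) * v[k] for w, v in zip(weights, vectors))
            for k in range(dim)]


def encode_document(doc, pattern_features, word_ctx, sent_ctx):
    """doc is a list of sentences; each sentence is a list of word vectors."""
    sentence_vectors = []
    for sentence in doc:
        # Concatenate the pattern features onto each word embedding (assumption).
        words = [word + pattern_features for word in sentence]
        sentence_vectors.append(attention_pool(words, word_ctx))   # word level
    return attention_pool(sentence_vectors, sent_ctx)              # sentence level


doc = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
pattern_features = [1.0, 0.0, 0.0]        # e.g. one flag per domain with matches
word_ctx = [0.5, -0.5, 0.1, 0.1, 0.1]
sent_ctx = [0.2, 0.2, 0.0, 0.0, 0.0]
document_vector = encode_document(doc, pattern_features, word_ctx, sent_ctx)
```

The resulting document vector would then be passed to a classification layer over the domains; that layer is not shown.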

The word segmentation module in step S2 adopts a word segmentation method based on a double-array Trie. The double-array Trie is a compressed form of the Trie that retains all the advantages of the Trie, so it offers high query efficiency, saves storage space and has a wide range of application.

a) Constructing the double-array Trie:

1) several important concepts in the Trie:

state: a state;

code: a state transition value;

base: an array holding the base address of the successor nodes; a leaf node has no successor, and its base value marks the end of a character sequence;

check: an array identifying the address of the predecessor node.
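The interplay of these four concepts is easiest to see on a tiny hand-built example. The arrays below encode the two words "ab" and "ac" under the conditions base[s] + c = t and check[t] = base[s], with code 0 reserved for the end-of-word null node; the character codes and the array layout are illustrative assumptions, not values from the method.

```python
CODES = {"a": 1, "b": 2, "c": 3}          # code: state transition values
DIC = ["ab", "ac"]                         # dictionary entries
#        index: 0  1  2   3  4  5   6
BASE = [1, 0, 2, -1, 3, 6, -2]             # negative base marks a leaf
CHECK = [0, 0, 1, 3, 2, 2, 6]              # check[t] == base[s] validates s -> t


def lookup(word):
    """Return the dictionary entry for `word`, or None if it is not stored."""
    s = 0                                  # state: start at the root
    for ch in word:
        c = CODES.get(ch)
        if c is None:
            return None
        t = BASE[s] + c                    # candidate successor state
        if t >= len(CHECK) or CHECK[t] != BASE[s]:
            return None                    # no such transition
        s = t
    end = BASE[s]                          # transition on the null code 0
    if end < len(CHECK) and CHECK[end] == BASE[s] and BASE[end] < 0:
        return DIC[-BASE[end] - 1]
    return None
```

Looking up "ab" walks root (0) to state 2 to state 4, then reaches the null node at index 3, whose base of -1 decodes to dictionary index 0.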

2) The construction process is as follows:

ii. establishing a root node root, and setting base[root] = 1;

iv. performing the following operations on each element in root.children:

finding the child node set {element.children_i} (i = 1...n) of the element; if a character is at the end of a sequence, its child nodes include a null node whose code value is set to 0; then finding a value begin_i such that every check[begin_i + element.children_i.code] = 0;

setting base[element_i] = begin_i;

executing step iv recursively for each element.children_i; if an element has no child nodes (it is a leaf node), setting base[element] to a negative value;

b) word segmentation:

reading the text to be segmented and traversing it in sequence from front to back, calculating according to the conditions in step i of the double-array construction process; when base[s] = t, which indicates that c = 0 (a leaf node has been reached), recording the position index, and Dic[index] is then the matched word in the domain dictionary.

In conclusion, the technical scheme provided by the invention automatically mines more domain-specific words through multi-model fusion to enrich the domain dictionary, dynamically calculates a pattern matching score, and combines the features extracted by pattern matching with text semantics, which improves the domain classification effect; finally, the dictionary on which segmentation depends is adjusted adaptively according to the domain, so that the model is no longer limited to a single segmentation mode and the word segmentation granularity can be selected intelligently according to the domain of the data.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A self-adaptive dynamic word segmentation method is characterized by comprising the following steps:

2. The adaptive dynamic word segmentation method according to claim 1, wherein the domain-specific word matching module in step S1 includes two processes of domain-specific word generation and domain-specific word matching, and the specific steps of the domain-specific word generation are as follows:

S1.1, preparing a domain corpus and a non-domain corpus;

S1.3, filtering stop words out of the domain word set;

method one: t-test model:

method two: solidification degree (cohesion) model:

3. The adaptive dynamic word segmentation method according to claim 2, wherein the specific process of the domain-specific vocabulary matching is as follows:

4. The adaptive dynamic word segmentation method according to claim 1, wherein the domain pattern matching module in step S2 includes domain matching patterns and pattern matching, and the domain matching patterns are used to preset some matching patterns for each domain and limit the distance between the preceding word and the following word to be less than 15.

5. The adaptive dynamic word segmentation method according to claim 4, wherein the specific process of pattern matching is as follows:

6. The adaptive dynamic word segmentation method as claimed in claim 1, wherein the domain classification module in step S3 is used both to classify text that contains no domain-specific words or patterns and to correct and supplement the pattern matching result, and, given the characteristics of the input text, a HAN network is adopted, whose multi-layer attention mechanism attends to "words" to find the important words in a sentence and also attends to "sentences" to find the important sentences in the text.

7. The adaptive dynamic word segmentation method according to claim 1, wherein the word segmentation module in step S2 adopts a word segmentation method based on a double-array Trie, and specifically includes:

a) the construction process of the double-array Trie is as follows:

ii. establishing a root node root, and setting base[root] = 1;

iv. performing the following operations on each element in root.children:

setting base[element_i] = begin_i;

b) word segmentation: