CN114881017A - Self-adaptive dynamic word segmentation method - Google Patents
Self-adaptive dynamic word segmentation method
- Publication number
- CN114881017A CN202210441833.6A
- Authority
- CN
- China
- Prior art keywords
- domain
- matching
- word segmentation
- word
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
Abstract
The invention discloses a self-adaptive dynamic word segmentation method, which comprises the following steps: S1, inputting the original text directly into a domain-specific vocabulary matching module, and if the text matches the specific dictionary of a certain domain, proceeding directly to word segmentation for that domain; S2, if domain-specific vocabulary matching fails, entering a domain pattern matching module, matching the text against all domain patterns preset in the module, and then evaluating the matching effect, and if the matching succeeds, proceeding directly to a word segmentation module to complete word segmentation; and S3, if domain pattern matching fails, entering a domain classification module, classifying the domain using a deep learning model together with the pattern matching results, and completing word segmentation according to the classification result. The method automatically mines more domain-specific vocabulary through multi-model fusion to enrich the domain dictionary, dynamically calculates a pattern matching score, and combines the features extracted by pattern matching with the text semantics, which improves domain classification precision and, in turn, word segmentation in different domains.
Description
Technical Field
The invention relates to the technical field of word segmentation systems, in particular to a self-adaptive dynamic word segmentation method.
Background
With the popularization of computers, the importance of information grows by the day. Faced with the vast and varied data on the network, how to mine the information implicit in that data so that it delivers its maximum value is a widely explored topic. Typical applications include search engines, intelligent question answering, knowledge graphs and the like, all of which are built on word segmentation technology, so doing word segmentation well amounts to getting the first step right. Currently known word segmentation systems include jieba, HIT (Harbin Institute of Technology) LTP, HanLP, Stanford CoreNLP and the like, and the word segmentation techniques they adopt include methods based on dictionary matching, statistics, deep learning and the like.
However, in practical applications, data from different domains should be segmented at different granularities. For example, "K powder" should be split into two words, "K / powder", in the general domain, but should be kept as a single word in the chemical domain, where it denotes ketamine, an anesthetic that is often abused as a drug, so recognizing it correctly matters for anti-drug work, law enforcement and the like. Dictionary-based word segmentation methods in the industry, however, mostly put all words in one file and segment directly against that single dictionary, or set up several dictionaries but search them in a fixed order during segmentation. As a result, data from different domains is segmented at the same granularity, and segmentation generalizes poorly to specialized domains.
To improve the generalization of word segmentation and segment data from different domains at different granularities, there is an urgent need for a domain classification technique that combines domain-specific vocabulary, pattern matching and deep learning to dynamically adjust the dictionary on which word segmentation depends, so that a suitable segmentation granularity is selected adaptively for data from different domains. We therefore propose a self-adaptive dynamic word segmentation method.
Disclosure of Invention
In order to solve the technical problems, the invention provides the following technical scheme:
The invention discloses a self-adaptive dynamic word segmentation method, which comprises the following steps:
S1, inputting the original text directly into a domain-specific vocabulary matching module, and if the text matches the specific dictionary of a certain domain, proceeding directly to word segmentation for that domain;
S2, if domain-specific vocabulary matching fails, entering a domain pattern matching module, matching the text against all domain patterns preset in the module, and then evaluating the matching effect, and if the matching succeeds, proceeding directly to a word segmentation module to complete word segmentation;
S3, if domain pattern matching fails, entering a domain classification module, classifying the domain using a deep learning model together with the pattern matching results, and completing word segmentation according to the classification result.
As a preferred technical solution of the present invention, the domain-specific vocabulary matching module in step S1 comprises two processes, domain-specific vocabulary generation and domain-specific vocabulary matching, and the specific steps of domain-specific vocabulary generation are as follows:
S1.1, preparing a domain corpus and a non-domain corpus;
S1.2, performing preliminary word segmentation on the domain corpus and the non-domain corpus respectively, directly using jieba word segmentation, to obtain a domain word set and a non-domain dictionary;
S1.3, filtering stop words out of the domain word set;
S1.4, since the granularity of ordinary jieba word segmentation is usually very fine, several adjacent words may be combined into a new word, using the following three methods:
The first method: t-test: $t = \dfrac{\bar{x} - \mu}{\sqrt{s^2 / N}}$
wherein $\bar{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the sample size, and $\mu$ is the mean of the distribution; the null hypothesis is that the words of an n-element phrase occur independently of one another; the t statistic is computed by traversal for all one-element to four-element phrases, and at the level α = 0.005, when the statistic t > 2.576, the null hypothesis can be rejected with 99.5% confidence, that is, the phrase is considered a true word with 99.5% confidence;
The second method: the probability that words X and Y appear together is divided by the probabilities of their respective individual appearances, $\dfrac{P(XY)}{P(X)\,P(Y)}$; the larger this value, the more likely it is that XY forms a word;
The third method: degree-of-freedom model: $H(u) = -\sum_i p_i \log p_i$
The more varied the words that appear on the two sides of XY, that is, the higher the degree of freedom on both sides, the more independent the word XY is; if the degree of freedom on one side is very low, XY does not occur independently and may be part of a longer word XYZ;
S1.5, using the ranking results of the three methods in S1.4, selecting a certain number of words in order from high to low confidence, adding them to the jieba user-defined dictionary, segmenting the corpus again, and computing the space vector of each word with a word2vec model;
S1.6, taking the intersection of the results of the three methods in S1.4 as seed words, and taking the remaining words as candidate new words;
S1.7, for each seed word, ranking the candidate new words by similarity to it and selecting those with higher similarity, then voting and ranking according to the voting results to obtain the domain keywords;
S1.8, taking the difference set of the domain keywords and the non-domain dictionary to finally obtain the domain dictionary.
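Steps S1.6 to S1.8 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `top_k` and `min_votes` parameters, the reading of "remaining words" as the union of the three result sets minus their intersection, and the decision to keep the seed words among the keywords are all assumptions made for illustration. Toy vectors stand in for the word2vec vectors of S1.5.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_domain_dictionary(method_results, vectors, non_domain_words,
                            top_k=2, min_votes=1):
    """Hypothetical helper covering S1.6 to S1.8."""
    result_sets = [set(r) for r in method_results]
    # S1.6: words found by all three methods become seeds (assumed reading);
    # every other proposed word is a candidate new word.
    seeds = set.intersection(*result_sets)
    candidates = set.union(*result_sets) - seeds
    # S1.7: each seed votes for its top_k most similar candidates.
    votes = Counter()
    for seed in seeds:
        ranked = sorted(candidates,
                        key=lambda w: (-cosine(vectors[seed], vectors[w]), w))
        votes.update(ranked[:top_k])
    keywords = seeds | {w for w, n in votes.items() if n >= min_votes}
    # S1.8: drop anything that also appears in the non-domain dictionary.
    return keywords - set(non_domain_words)

# Toy data: three method outputs and 2-dimensional stand-in vectors.
results = [
    {"ketamine", "K powder", "dose"},
    {"ketamine", "K powder", "the car"},
    {"ketamine", "dose", "the car"},
]
toy_vectors = {
    "ketamine": [1.0, 0.0],
    "K powder": [0.9, 0.1],
    "dose": [0.7, 0.7],
    "the car": [0.0, 1.0],
}
domain_dictionary = build_domain_dictionary(results, toy_vectors, {"dose"})
```

In this toy run the only seed is "ketamine", it votes for "K powder" and "dose", and "dose" is then removed because it also occurs in the non-domain dictionary.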
As a preferred technical solution of the present invention, the specific process of domain-specific vocabulary matching is as follows:
if the specific vocabulary of only a single domain is matched, the dictionary of that domain is loaded directly and combined with the common dictionary to complete word segmentation;
if the specific dictionaries of several domains are matched at the same time, the text is split into paragraphs, and if each paragraph matches the dictionary of only one domain, the paragraphs are segmented separately to complete word segmentation;
if an individual paragraph still matches the dictionaries of several domains, domain pattern matching is entered.
As a preferred technical solution of the present invention, the domain pattern matching module in step S2 comprises domain matching patterns and pattern matching, wherein the domain matching patterns are a set of matching patterns preset for each domain, in which the distance between the preceding word and the following word is limited to less than 15.
As a preferred technical solution of the present invention, the specific process of pattern matching is as follows:
S2.1, loading the preset domain matching patterns and performing pattern matching on the input text;
S2.2, as with domain-specific vocabulary matching, if only the patterns of a single domain are matched, directly judging that the input text belongs to that domain;
S2.3, if the patterns of several domains are matched, splitting the text into paragraphs and performing pattern matching on each paragraph separately;
S2.4, if a domain conflict still remains within a paragraph, applying the domain scoring function designed in this scheme, specifically: let the number of successfully matched patterns of the domain with the most matches be denoted match_max, that of the domain with the second most be denoted match_sec, and that of the domain with the fewest be denoted match_min; the scoring function score is then computed from these three values;
score is adjusted dynamically with the pattern matching results: the larger the gap between match_max and match_sec and match_min, the larger score is, that is, the more prominent the pattern features of one domain are; conversely, the smaller score is, the less distinct the pattern features are;
S2.5, if score is greater than 0.85, judging that the input text belongs to the corresponding domain; otherwise, recording the successfully matched patterns of each domain, adding them to a pattern feature list, and entering the domain classification module.
As a preferred technical solution of the present invention, the domain classification module in step S3 is used to classify text that contains no relevant domain-specific words or patterns, and to correct and supplement the pattern matching results. Given the characteristics of the input text, a HAN (hierarchical attention network) is adopted, whose multi-layer attention mechanism can attend both to "words", finding the important words within a sentence, and to "sentences", finding the important sentences within the text.
As a preferred technical solution of the present invention, the word segmentation module in step S2 adopts a word segmentation method based on a double-array Trie, which specifically includes:
a) the construction process of the double-array Trie is as follows:
i. for a received character c that causes a transition from state s to state t, the storage conditions in the improved double array are: base[s] + c = t, check[t] = base[s];
ii. establishing a root node root, and setting base[root] = 1;
iii. finding the child node set {root.children_i} (i = 1...n) of root, so that check[root.children_i] = base[root] = 1;
iv. performing the following operations on each element in root.children:
v. finding the set {element.children_i} (i = 1...n); if a character is at the end of a sequence, its child nodes include a null node whose code value is set to 0; then finding a value begin_i such that every check[begin_i + element.children_i.code] = 0;
vi. setting base[element_i] = begin_i;
vii. recursively executing step iv for each element.children_i; if an element has no child nodes, that is, it is a leaf node, setting base[element] to a negative value;
b) word segmentation:
reading the text to be segmented and traversing it in order from front to back, computing according to condition i of the double-array construction process; when t = base[s], which indicates c = 0, the position index is recorded, and Dic[index] is the matched word in the domain dictionary.
The invention has the beneficial effects that:
The self-adaptive dynamic word segmentation method automatically mines more domain-specific vocabulary through multi-model fusion to enrich the domain dictionary, dynamically calculates a pattern matching score, and combines the features extracted by pattern matching with the text semantics, which improves domain classification; finally, the dictionary on which segmentation depends is adjusted adaptively according to the domain, so that the model is no longer limited to a single segmentation mode and can intelligently select the segmentation granularity according to the domain of the data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a general flow diagram of an adaptive dynamic word segmentation method of the present invention;
FIG. 2 is a flow chart of domain specific vocabulary generation for an adaptive dynamic word segmentation method of the present invention;
FIG. 3 is a diagram illustrating a new word combination in accordance with an adaptive dynamic word segmentation method of the present invention;
FIG. 4 is a schematic diagram of an algorithm network structure of an adaptive dynamic word segmentation method according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example: as shown in FIG. 1, the invention relates to a self-adaptive dynamic word segmentation method, which comprises the following steps:
S1, inputting the original text directly into a domain-specific vocabulary matching module, and if the text matches the specific dictionary of a certain domain, proceeding directly to word segmentation for that domain;
S2, if domain-specific vocabulary matching fails, entering a domain pattern matching module, matching the text against all domain patterns preset in the module, and then evaluating the matching effect, and if the matching succeeds, proceeding directly to a word segmentation module to complete word segmentation;
S3, if domain pattern matching fails, entering a domain classification module, classifying the domain using a deep learning model together with the pattern matching results, and completing word segmentation according to the classification result.
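The three-stage fallback of steps S1 to S3 can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the four callables and their names are hypothetical stand-ins for the modules described in this example, and the sketch omits the detail that S3 also consumes the pattern features recorded in S2.

```python
from typing import Callable, List, Optional

def segment_adaptively(
    text: str,
    match_vocabulary: Callable[[str], Optional[str]],
    match_patterns: Callable[[str], Optional[str]],
    classify_domain: Callable[[str], str],
    segment_with_domain: Callable[[str, str], List[str]],
) -> List[str]:
    """Pick a domain by the first stage that succeeds, then segment with it."""
    domain = match_vocabulary(text)          # S1: domain-specific vocabulary
    if domain is None:
        domain = match_patterns(text)        # S2: preset domain patterns
    if domain is None:
        domain = classify_domain(text)       # S3: classifier as last resort
    return segment_with_domain(text, domain)

# Toy stand-ins for the four modules (all hypothetical).
def toy_vocabulary(text):
    return "chemistry" if "K powder" in text else None

def toy_segmenter(text, domain):
    if domain == "chemistry":
        return text.replace("K powder", "K_powder").replace("_", " ").split(" ", 0) \
            if False else ["K powder"] + text.replace("K powder", "").split()
    return text.split()

tokens = segment_adaptively(
    "K powder seized",
    match_vocabulary=toy_vocabulary,
    match_patterns=lambda t: None,
    classify_domain=lambda t: "general",
    segment_with_domain=toy_segmenter,
)
```

Passing the stages in as callables keeps the control flow of FIG. 1 visible while leaving each module free to be implemented separately.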
The domain-specific vocabulary matching module in step S1 comprises two processes, domain-specific vocabulary generation and domain-specific vocabulary matching. The flow chart of domain-specific vocabulary generation is shown in FIG. 2, and the specific steps are as follows:
S1.1, preparing a domain corpus and a non-domain corpus;
S1.2, performing preliminary word segmentation on the domain corpus and the non-domain corpus respectively, directly using jieba word segmentation, to obtain a domain word set and a non-domain dictionary;
S1.3, filtering stop words (for example, common function words) out of the domain word set;
S1.4, since the granularity of ordinary jieba word segmentation is usually very fine, several adjacent words may be combined into a new word, as shown in FIG. 3, using the following three methods:
The first method: t-test: $t = \dfrac{\bar{x} - \mu}{\sqrt{s^2 / N}}$
wherein $\bar{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the sample size, and $\mu$ is the mean of the distribution; the null hypothesis is that the words of an n-element phrase occur independently of one another; the t statistic is computed by traversal for all one-element to four-element phrases, and at the level α = 0.005, when the statistic t > 2.576, the null hypothesis can be rejected with 99.5% confidence, that is, the phrase is considered a true word with 99.5% confidence;
The second method: the probability that words X and Y appear together is divided by the probabilities of their respective individual appearances, $\dfrac{P(XY)}{P(X)\,P(Y)}$; the larger this value, the more likely it is that XY forms a word;
The third method: degree-of-freedom model: $H(u) = -\sum_i p_i \log p_i$
The more varied the words that appear on the two sides of XY, that is, the higher the degree of freedom on both sides, the more independent the word XY is; if the degree of freedom on one side is very low, XY does not occur independently and may be part of a longer word XYZ;
S1.5, using the ranking results of the three methods in S1.4, selecting a certain number of words in order from high to low confidence, adding them to the jieba user-defined dictionary, segmenting the corpus again, and computing the space vector of each word with a word2vec model;
S1.6, taking the intersection of the results of the three methods in S1.4 as seed words, and taking the remaining words as candidate new words;
S1.7, for each seed word, ranking the candidate new words by similarity to it and selecting those with higher similarity, then voting and ranking according to the voting results to obtain the domain keywords;
S1.8, taking the difference set of the domain keywords and the non-domain dictionary to finally obtain the domain dictionary.
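The three scoring methods of S1.4 can be sketched for a two-word candidate "X Y" as follows. The sketch rests on assumptions that the text does not state: the sample mean is taken as the observed relative frequency of the pair, the distribution mean as P(X)·P(Y) under independence, and the sample variance as the Bernoulli variance of the pair indicator; p_i is read as the relative frequency of the i-th distinct neighbouring word; and the two sides are combined by taking the smaller entropy. All function names are hypothetical.

```python
import math

def t_statistic(pair_count, count_x, count_y, n_tokens):
    """Method 1: t = (x_bar - mu) / sqrt(s2 / N) for the candidate "X Y"."""
    if pair_count == 0:
        return 0.0
    x_bar = pair_count / n_tokens                        # observed P(XY)
    mu = (count_x / n_tokens) * (count_y / n_tokens)     # P(X)P(Y) if independent
    s2 = x_bar * (1.0 - x_bar)                           # assumed Bernoulli variance
    return (x_bar - mu) / math.sqrt(s2 / n_tokens)

def cooccurrence_ratio(pair_count, count_x, count_y, n_tokens):
    """Method 2: P(XY) / (P(X) P(Y)); larger means a stronger association."""
    p_xy = pair_count / n_tokens
    return p_xy / ((count_x / n_tokens) * (count_y / n_tokens))

def boundary_entropy(neighbor_counts):
    """Method 3: H(u) = -sum_i p_i log p_i over the words adjacent to u."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in neighbor_counts.values())

def degree_of_freedom(left_neighbors, right_neighbors):
    """Assumed combination: a low entropy on either side limits the score."""
    return min(boundary_entropy(left_neighbors),
               boundary_entropy(right_neighbors))

# A candidate is accepted by method 1 when t exceeds the 2.576 threshold.
t_value = t_statistic(pair_count=30, count_x=40, count_y=35, n_tokens=10000)
accepted_by_t_test = t_value > 2.576
```

Each function returns a score that can be sorted to give the rankings used in S1.5, and the sets of accepted candidates from the three methods feed the intersection in S1.6.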
The specific process of domain-specific vocabulary matching is as follows:
if the specific vocabulary of only a single domain is matched, the dictionary of that domain is loaded directly and combined with the common dictionary to complete word segmentation;
if the specific dictionaries of several domains are matched at the same time, the text is split into paragraphs, and if each paragraph matches the dictionary of only one domain, the paragraphs are segmented separately to complete word segmentation;
if an individual paragraph still matches the dictionaries of several domains, domain pattern matching is entered.
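The routing just described can be sketched as follows. Several details are assumptions rather than part of the text: paragraphs are split on newlines, dictionary hits are found by plain substring search, `None` marks a paragraph that must go on to domain pattern matching, and a paragraph with no hit at all is also routed there. The names `domains_hit` and `route_by_vocabulary` are hypothetical.

```python
def domains_hit(text, domain_dictionaries):
    """Return the set of domains whose specific words occur in the text."""
    return {domain
            for domain, words in domain_dictionaries.items()
            if any(word in text for word in words)}

def route_by_vocabulary(text, domain_dictionaries):
    """Return (paragraph, domain) pairs; domain None means 'not decided here'."""
    hits = domains_hit(text, domain_dictionaries)
    if len(hits) == 1:
        # A single domain matched: segment the whole text with its dictionary.
        return [(text, next(iter(hits)))]
    if not hits:
        # Vocabulary matching failed: hand the text to pattern matching.
        return [(text, None)]
    # Several domains matched: decide paragraph by paragraph.
    routed = []
    for paragraph in (p.strip() for p in text.split("\n")):
        if not paragraph:
            continue
        local = domains_hit(paragraph, domain_dictionaries)
        routed.append((paragraph, next(iter(local)) if len(local) == 1 else None))
    return routed

# Toy dictionaries for two domains.
dictionaries = {"drugs": {"ketamine"}, "traffic": {"collision"}}
routed = route_by_vocabulary("ketamine seized\ncollision reported", dictionaries)
```

Returning pairs rather than segmenting immediately keeps the unresolved paragraphs available for the pattern matching stage.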
The domain pattern matching module in step S2 comprises domain matching patterns and pattern matching, wherein the domain matching patterns are a set of matching patterns preset for each domain, in which the distance between the preceding word and the following word is limited to less than 15; for example, matching patterns for the traffic accident domain may include "occurrence of ...", "cause of ...", and the like.
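A preset pattern of this kind can be sketched with Python regular expressions. The text does not say in which unit the distance is measured, so the sketch assumes characters and allows at most 14 characters between the two words; the English word pair and the helper names are hypothetical examples, not patterns taken from the patent.

```python
import re

def compile_pattern(first, second, max_distance=15):
    """Match `first` followed by `second` with fewer than max_distance
    characters in between (assumed unit: characters)."""
    gap = r".{0,%d}?" % (max_distance - 1)
    return re.compile(re.escape(first) + gap + re.escape(second), re.S)

def count_matches(text, domain_patterns):
    """Count, per domain, how many of its preset patterns match the text."""
    return {domain: sum(1 for pattern in patterns if pattern.search(text))
            for domain, patterns in domain_patterns.items()}

# One hypothetical pattern for a traffic accident domain.
traffic_pattern = compile_pattern("collided", "injured")
near = "the truck collided and two were injured"
far = "the truck collided with a parked bus and three pedestrians were injured"
counts = count_matches(near, {"traffic": [traffic_pattern]})
```

The per-domain counts returned by `count_matches` are the quantities that the later scoring step compares across domains.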
The specific process of pattern matching is as follows:
S2.1, loading the preset domain matching patterns and performing pattern matching on the input text;
S2.2, as with domain-specific vocabulary matching, if only the patterns of a single domain are matched, directly judging that the input text belongs to that domain;
S2.3, if the patterns of several domains are matched, splitting the text into paragraphs and performing pattern matching on each paragraph separately;
S2.4, if a domain conflict still remains within a paragraph, applying the domain scoring function designed in this scheme, specifically: let the number of successfully matched patterns of the domain with the most matches be denoted match_max, that of the domain with the second most be denoted match_sec, and that of the domain with the fewest be denoted match_min; the scoring function score is then computed from these three values;
score is adjusted dynamically with the pattern matching results: the larger the gap between match_max and match_sec and match_min, the larger score is, that is, the more prominent the pattern features of one domain are; conversely, the smaller score is, the less distinct the pattern features are;
S2.5, if score is greater than 0.85, judging that the input text belongs to the corresponding domain; otherwise, recording the successfully matched patterns of each domain, adding them to a pattern feature list, and entering the domain classification module.
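The decision in S2.5 can be sketched as below. The scoring function itself is not reproduced here, so the sketch takes an already computed `score` as input and only illustrates the 0.85 threshold and the hand-off to the classifier. The function name, the return convention, and the flat `(domain, pattern)` layout of the pattern feature list are assumptions.

```python
def resolve_conflict(score, best_domain, matched_patterns, threshold=0.85):
    """Return (domain, None) when the score is decisive, otherwise
    (None, pattern_features) for the domain classification module."""
    if score > threshold:
        return best_domain, None
    # Not decisive: record every successfully matched pattern per domain.
    pattern_features = [(domain, pattern)
                        for domain in sorted(matched_patterns)
                        for pattern in matched_patterns[domain]]
    return None, pattern_features

# Hypothetical matched patterns for two competing domains.
matched = {"traffic": ["collided ... injured"], "drugs": ["seized ... grams"]}
decided = resolve_conflict(0.92, "traffic", matched)
undecided = resolve_conflict(0.60, "traffic", matched)
```

Because the comparison is strict, a score of exactly 0.85 is treated as not decisive and the text goes on to the classifier.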
The domain classification module in step S3 is used to classify text that contains no relevant domain-specific words or patterns, and to correct and supplement the pattern matching results. Given the characteristics of the input text, a HAN (hierarchical attention network) is adopted, whose multi-layer attention mechanism can attend both to "words", finding the important words within a sentence, and to "sentences", finding the important sentences within the text; it can be used to classify short texts, and it also addresses the loss of precision that general classification methods suffer on long texts. The network structure of the algorithm is shown in FIG. 4: the scheme concatenates the pattern features obtained by the pattern matching module with the representation of the input text to form the embedding that is input to the HAN network, so that the feature information captured by pattern matching enhances the classification effect of the model.
The word segmentation module in step S2 adopts a word segmentation method based on a double-array Trie, which builds on the idea of compressing the Trie and retains all of its advantages, so that query efficiency is high, storage space is saved, and the range of application is wide.
a) Constructing the double-array Trie:
1) Several important concepts in the Trie:
state: a state;
code: a state transition value;
base: the array giving the base address of a node's successors; a leaf node has no successor, and its entry marks the end of a character sequence;
check: the array identifying the address of the predecessor node.
2) The construction process is as follows:
i. for a received character c that causes a transition from state s to state t, the storage conditions in the improved double array are: base[s] + c = t, check[t] = base[s];
ii. establishing a root node root, and setting base[root] = 1;
iii. finding the child node set {root.children_i} (i = 1...n) of root, so that check[root.children_i] = base[root] = 1;
iv. performing the following operations on each element in root.children:
v. finding the set {element.children_i} (i = 1...n); if a character is at the end of a sequence, its child nodes include a null node whose code value is set to 0; then finding a value begin_i such that every check[begin_i + element.children_i.code] = 0;
vi. setting base[element_i] = begin_i;
vii. recursively executing step iv for each element.children_i; if an element has no child nodes, that is, it is a leaf node, setting base[element] to a negative value;
b) Word segmentation:
reading the text to be segmented and traversing it in order from front to back, computing according to condition i of the double-array construction process; when t = base[s], which indicates c = 0 (a leaf node has been reached), the position index is recorded, and Dic[index] is the matched word in the domain dictionary.
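The construction and lookup above can be illustrated with a compact Python sketch. It follows the stated conditions base[s] + c = t and check[t] = base[s], but several choices are assumptions made for illustration: the transition code of a character is its code point plus one (so that 0 stays reserved for the null node), a `used` set keeps the begin values of different parents distinct, the negative leaf value encodes the dictionary index as -(index) - 1, and segmentation uses forward longest matching with unmatched characters emitted singly.

```python
class DoubleArrayTrie:
    """Sketch of a double-array Trie; the root is state 0 with base[root] = 1."""

    def __init__(self, words):
        self.words = sorted({w for w in words if w})   # Dic, in sorted order
        self.base, self.check = [1], [0]
        self.used = set()                              # begin values already taken
        if self.words:
            self.base[0] = self._insert(self._children(0, 0, len(self.words)), 0)

    def _children(self, depth, left, right):
        """Group words[left:right] by transition code at depth (0 = end of word)."""
        groups = []
        for i in range(left, right):
            word = self.words[i]
            code = ord(word[depth]) + 1 if depth < len(word) else 0
            if groups and groups[-1][0] == code:
                groups[-1][2] = i + 1
            else:
                groups.append([code, i, i + 1])
        return groups

    def _is_free(self, t):
        return t >= len(self.check) or self.check[t] == 0

    def _insert(self, groups, depth):
        """Place one sibling set, return its begin value (the parent's base)."""
        begin = 1
        while begin in self.used or not all(self._is_free(begin + g[0]) for g in groups):
            begin += 1
        self.used.add(begin)
        size = begin + groups[-1][0] + 1               # codes are in ascending order
        if size > len(self.base):
            self.base.extend([0] * (size - len(self.base)))
            self.check.extend([0] * (size - len(self.check)))
        for code, _, _ in groups:                      # reserve all siblings first
            self.check[begin + code] = begin
        for code, left, right in groups:
            if code == 0:                              # null node: leaf, negative base
                self.base[begin] = -left - 1
            else:
                children = self._children(depth + 1, left, right)
                self.base[begin + code] = self._insert(children, depth + 1)
        return begin

    def longest_match(self, text, start):
        """Return (end, index) of the longest word at text[start:], or (start, -1)."""
        b, best = self.base[0], (start, -1)
        for i in range(start, len(text)):
            t = b + ord(text[i]) + 1
            if t >= len(self.check) or self.check[t] != b:
                break
            b = self.base[t]
            # c = 0 gives t = base[s]; if that slot belongs to this node, a word ends here.
            if b < len(self.check) and self.check[b] == b:
                best = (i + 1, -self.base[b] - 1)
        return best


def segment(trie, text):
    """Forward longest matching; characters without a match become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        end, _ = trie.longest_match(text, i)
        end = max(end, i + 1)
        tokens.append(text[i:end])
        i = end
    return tokens


domain_trie = DoubleArrayTrie(["K powder", "powder"])
general_trie = DoubleArrayTrie(["powder"])
domain_tokens = segment(domain_trie, "K powder")
general_tokens = segment(general_trie, "K powder")
```

With the domain dictionary loaded, "K powder" stays a single token, while the general dictionary splits it, which mirrors the granularity difference described in the Background. The linear search for begin is chosen for readability, not speed.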
In conclusion, the technical scheme provided by the invention automatically mines more domain-specific vocabulary through multi-model fusion to enrich the domain dictionary, dynamically calculates a pattern matching score, and combines the features extracted by pattern matching with the text semantics, which improves domain classification; finally, the dictionary on which segmentation depends is adjusted adaptively according to the domain, so that the model is no longer limited to a single segmentation mode and can intelligently select the segmentation granularity according to the domain of the data.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (7)
1. A self-adaptive dynamic word segmentation method, characterized by comprising the following steps:
S1, inputting the original text directly into a domain-specific vocabulary matching module, and if the text matches the specific dictionary of a certain domain, proceeding directly to word segmentation for that domain;
S2, if domain-specific vocabulary matching fails, entering a domain pattern matching module, matching the text against all domain patterns preset in the module, and then evaluating the matching effect, and if the matching succeeds, proceeding directly to a word segmentation module to complete word segmentation;
S3, if domain pattern matching fails, entering a domain classification module, classifying the domain using a deep learning model together with the pattern matching results, and completing word segmentation according to the classification result.
2. The self-adaptive dynamic word segmentation method according to claim 1, wherein the domain-specific vocabulary matching module in step S1 comprises two processes, domain-specific vocabulary generation and domain-specific vocabulary matching, and the specific steps of domain-specific vocabulary generation are as follows:
S1.1, preparing a domain corpus and a non-domain corpus;
S1.2, performing preliminary word segmentation on the domain corpus and the non-domain corpus respectively, directly using jieba word segmentation, to obtain a domain word set and a non-domain dictionary;
S1.3, filtering stop words out of the domain word set;
S1.4, since the granularity of ordinary jieba word segmentation is usually very fine, several adjacent words may be combined into a new word, using the following three methods:
The first method: t-test: $t = \dfrac{\bar{x} - \mu}{\sqrt{s^2 / N}}$
wherein $\bar{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the sample size, and $\mu$ is the mean of the distribution; the null hypothesis is that the words of an n-element phrase occur independently of one another; the t statistic is computed by traversal for all one-element to four-element phrases, and at the level α = 0.005, when the statistic t > 2.576, the null hypothesis can be rejected with 99.5% confidence, that is, the phrase is considered a true word with 99.5% confidence;
The second method: the probability that words X and Y appear together is divided by the probabilities of their respective individual appearances, $\dfrac{P(XY)}{P(X)\,P(Y)}$; the larger this value, the more likely it is that XY forms a word;
The third method: degree-of-freedom model: $H(u) = -\sum_i p_i \log p_i$
The more varied the words that appear on the two sides of XY, that is, the higher the degree of freedom on both sides, the more independent the word XY is; if the degree of freedom on one side is very low, XY does not occur independently and may be part of a longer word XYZ;
S1.5, using the ranking results of the three methods in S1.4, selecting a certain number of words in order from high to low confidence, adding them to the jieba user-defined dictionary, segmenting the corpus again, and computing the space vector of each word with a word2vec model;
S1.6, taking the intersection of the results of the three methods in S1.4 as seed words, and taking the remaining words as candidate new words;
S1.7, for each seed word, ranking the candidate new words by similarity to it and selecting those with higher similarity, then voting and ranking according to the voting results to obtain the domain keywords;
S1.8, taking the difference set of the domain keywords and the non-domain dictionary to finally obtain the domain dictionary.
3. The self-adaptive dynamic word segmentation method according to claim 2, wherein the specific process of domain-specific vocabulary matching is as follows:
if the specific vocabulary of only a single domain is matched, the dictionary of that domain is loaded directly and combined with the common dictionary to complete word segmentation;
if the specific dictionaries of several domains are matched at the same time, the text is split into paragraphs, and if each paragraph matches the dictionary of only one domain, the paragraphs are segmented separately to complete word segmentation;
if an individual paragraph still matches the dictionaries of several domains, domain pattern matching is entered.
4. The self-adaptive dynamic word segmentation method according to claim 1, wherein the domain pattern matching module in step S2 comprises domain matching patterns and pattern matching, and the domain matching patterns are a set of matching patterns preset for each domain, in which the distance between the preceding word and the following word is limited to less than 15.
5. The adaptive dynamic word segmentation method according to claim 4, wherein the specific process of pattern matching is as follows:
S2.1, loading the preset domain matching patterns, and performing pattern matching on the input text;
S2.2, as with domain-specific vocabulary matching, if only the patterns of a single domain are matched, directly determining that the input text belongs to that domain;
S2.3, if the patterns of several domains are matched, splitting the text into segments, and performing pattern matching on each segment separately;
S2.4, if a domain conflict still exists in a segment, a domain scoring function is designed as follows: let the largest number of successfully matched patterns for a domain be denoted match_max, the second largest match_sec, and the smallest match_min; then the scoring function score is:
score is adjusted dynamically with the pattern matching result: the larger the difference between match_max and match_sec, match_min, the larger score is, i.e. the more prominent the pattern features of one domain; conversely, the smaller score is, the less distinct the pattern features;
and S2.5, if score is greater than 0.85, determining that the input text belongs to the corresponding domain; otherwise, recording the successfully matched patterns of each domain, adding them to a pattern feature list, and entering the domain classification module.
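The decision logic of S2.2, S2.4 and S2.5 can be sketched as follows. The scoring formula itself is not given in this text, so the sketch takes it as a caller-supplied function `score_fn(match_max, match_sec, match_min)` rather than guessing it; `decide_domain` and its return convention are hypothetical, and the segment splitting of S2.3 is left out.

```python
def decide_domain(match_counts, score_fn, threshold=0.85):
    """match_counts: domain -> number of successfully matched patterns.
    Returns (domain or None, pattern features for the classifier)."""
    matched = {d: n for d, n in match_counts.items() if n > 0}
    if not matched:
        return None, {}                       # nothing matched at all
    if len(matched) == 1:
        return next(iter(matched)), {}        # S2.2: a single domain matched
    ordered = sorted(matched.values(), reverse=True)
    match_max, match_sec, match_min = ordered[0], ordered[1], ordered[-1]
    score = score_fn(match_max, match_sec, match_min)   # S2.4
    if score > threshold:                     # S2.5: one domain is prominent
        return max(matched, key=matched.get), {}
    # S2.5: otherwise hand the per-domain match record to the classifier.
    return None, matched
```

When two or more domains match, the second return value is empty if a domain was decided, and otherwise carries the per-domain match counts that would be added to the pattern feature list.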
6. The adaptive dynamic word segmentation method according to claim 1, wherein the domain classification module in step S3 classifies text that contains no relevant domain-specific words or patterns, and corrects and supplements the pattern matching results; based on the features of the input text, a HAN (hierarchical attention network) is used, whose multi-level attention mechanism attends both to "words", finding the important words in each sentence, and to "sentences", finding the important sentences in the text.
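The two attention levels can be illustrated with a deliberately reduced Python sketch. It shows only attention pooling over words and then over sentences; the recurrent encoders and learned projections of a full HAN are omitted, and the context vectors, which would be learned during training, are passed in as plain lists. All names are hypothetical.

```python
import math

def attention_pool(vectors, context):
    """Softmax-weighted sum of vectors; weights come from dot products
    with a context vector."""
    scores = [sum(a * b for a, b in zip(v, context)) for v in vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]     # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors))
              for i in range(len(context))]
    return pooled, weights

def document_vector(doc, word_context, sentence_context):
    """doc: list of sentences, each a list of word vectors."""
    sentence_vectors, word_weights = [], []
    for sentence in doc:
        vec, weights = attention_pool(sentence, word_context)   # word level
        sentence_vectors.append(vec)
        word_weights.append(weights)
    doc_vec, sentence_weights = attention_pool(sentence_vectors,
                                               sentence_context)  # sentence level
    return doc_vec, word_weights, sentence_weights
```

The returned weights indicate which words and which sentences contributed most to the document vector that a classifier layer would consume.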
7. The adaptive dynamic word segmentation method according to claim 1, wherein the word segmentation module in step S2 adopts a word segmentation method based on a double-array trie, specifically comprising:
a) the construction process of the double-array trie is as follows:
i. for a transition from state s to state t on the received character c, the improved storage conditions in the double array are: base[s] + c = t, check[t] = base[s];
ii. establishing a root node root, and setting base[root] = 1;
iii. finding the child node set of root, {root.children_i} (i = 1...n), such that check[root.children_i] = base[root] = 1;
iv. performing the following operations on each element in root.children:
finding {element.children_i} (i = 1...n); if a character is at the end of a sequence, its child nodes include a null node whose code value is set to 0; finding a value begin_i such that every check[begin_i + element.children_i.code] = 0;
setting base[element_i] = begin_i;
recursively performing step iv on each element.children_i; if an element has no children, i.e. it is a leaf node, setting base[element] to a negative value;
b) word segmentation:
reading the text to be segmented and traversing it character by character in order, computing transitions according to condition i of the construction process of the double array; when base[s] = t, which indicates c = 0, recording the position index, and Dic[index] is then the matched word in the domain dictionary.
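The construction and lookup of claim 7 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the compact character codes, the encoding of the dictionary index as `-(index + 1)` in a leaf, the requirement that each begin value is used by one parent only, and the use of forward longest matching for segmentation are all assumptions, and the class and method names are hypothetical.

```python
class DoubleArrayTrie:
    """Double-array trie with base[s] + c = t and check[t] = base[s]."""

    def __init__(self, words):
        self.dic = sorted({w for w in words if w})
        chars = sorted({ch for w in self.dic for ch in w})
        self.code = {ch: i + 1 for i, ch in enumerate(chars)}  # 0 = null node
        root = {}
        for index, word in enumerate(self.dic):
            node = root
            for ch in word:
                node = node.setdefault(self.code[ch], {})
            node[0] = index                  # null child marks a word end
        self.base, self.check = [1], [0]     # state 0 is root, base[root] = 1
        self.used = {1}                      # begin values already taken
        self._place(root, 1)

    def _grow(self, size):
        while len(self.base) < size:
            self.base.append(0)
            self.check.append(0)

    def _find_begin(self, codes):
        begin = 1
        while True:                          # first unused begin with free slots
            if begin not in self.used:
                self._grow(begin + max(codes) + 1)
                if all(self.check[begin + c] == 0 for c in codes):
                    return begin
            begin += 1

    def _place(self, node, begin):
        for c in node:                       # reserve slots: check[t] = base[s]
            self._grow(begin + c + 1)
            self.check[begin + c] = begin
        for c, child in node.items():
            t = begin + c
            if c == 0:
                self.base[t] = -child - 1    # leaf: negative value, Dic index
            else:
                child_begin = self._find_begin(list(child))
                self.used.add(child_begin)
                self.base[t] = child_begin
                self._place(child, child_begin)

    def longest_match(self, text, start):
        """Return (Dic index, end position) of the longest word at start."""
        b, best = self.base[0], None
        for pos in range(start, len(text)):
            c = self.code.get(text[pos])
            if c is None or b + c >= len(self.check) or self.check[b + c] != b:
                break
            b = self.base[b + c]
            # Transition on c = 0: t = base[s]; a negative base is a word end.
            if b < len(self.check) and self.check[b] == b and self.base[b] < 0:
                best = (-self.base[b] - 1, pos + 1)
        return best

    def segment(self, text):
        out, i = [], 0
        while i < len(text):
            match = self.longest_match(text, i)
            if match is None:
                out.append(text[i])          # unknown character stays alone
                i += 1
            else:
                out.append(self.dic[match[0]])
                i = match[1]
        return out


trie = DoubleArrayTrie(["机器", "机器学习", "学习", "分词"])
print(trie.segment("机器学习分词好"))  # ['机器学习', '分词', '好']
```

Because check[t] stores the parent's base value rather than the parent's state index, each parent needs its own begin value; the `used` set enforces that here.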
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210441833.6A CN114881017A (en) | 2022-04-25 | 2022-04-25 | Self-adaptive dynamic word segmentation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114881017A (en) | 2022-08-09 |
Family
ID=82671259
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210441833.6A Pending CN114881017A (en) | 2022-04-25 | 2022-04-25 | Self-adaptive dynamic word segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114881017A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110413998A (en) * | 2019-07-16 | 2019-11-05 | 深圳供电局有限公司 | Self-adaptive Chinese word segmentation method, system and medium for power industry |
US20200133962A1 (en) * | 2018-10-25 | 2020-04-30 | Institute For Information Industry | Knowledge graph generating apparatus, method, and non-transitory computer readable storage medium thereof |
CN111241833A (en) * | 2020-01-16 | 2020-06-05 | 支付宝(杭州)信息技术有限公司 | Word segmentation method and device for text data and electronic equipment |
CN112397054A (en) * | 2020-12-17 | 2021-02-23 | 北京中电飞华通信有限公司 | Power dispatching voice recognition method |
CN112632292A (en) * | 2020-12-23 | 2021-04-09 | 深圳壹账通智能科技有限公司 | Method, device and equipment for extracting service keywords and storage medium |
CN113011183A (en) * | 2021-03-23 | 2021-06-22 | 北京科东电力控制系统有限责任公司 | Unstructured text data processing method and system in electric power regulation and control field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||