CN114881017A - Self-adaptive dynamic word segmentation method - Google Patents

Self-adaptive dynamic word segmentation method

Info

Publication number
CN114881017A
CN114881017A (application CN202210441833.6A)
Authority
CN
China
Prior art keywords
domain
matching
word segmentation
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210441833.6A
Other languages
Chinese (zh)
Inventor
王峥
杨梦玲
武志彦
董文君
臧高峰
陈虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fiberhome Telecommunication Technologies Co ltd
Original Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fiberhome Telecommunication Technologies Co ltd filed Critical Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority to CN202210441833.6A priority Critical patent/CN114881017A/en
Publication of CN114881017A publication Critical patent/CN114881017A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a self-adaptive dynamic word segmentation method, which comprises the following steps: S1, the original text is input directly into a domain-specific word matching module; if it matches the specific dictionary of some domain, word segmentation for that domain is entered directly. S2, if domain-specific vocabulary matching fails, a domain pattern matching module is entered, the text is matched against all preset domain patterns in the module, and the matching effect is finally evaluated; if matching succeeds, the word segmentation module is entered directly to complete word segmentation. S3, if domain pattern matching fails, a domain classification module is entered, domain classification is performed using a deep learning model together with the pattern matching results, and word segmentation is finally completed according to the classification result. The self-adaptive dynamic word segmentation method automatically mines more domain-specific words through multi-model fusion, enriching the domain dictionary; it dynamically computes a pattern matching score and combines the features extracted by pattern matching with the text semantics, improving domain classification precision and the word segmentation effect in different domains.

Description

Self-adaptive dynamic word segmentation method
Technical Field
The invention relates to the technical field of word segmentation systems, in particular to a self-adaptive dynamic word segmentation method.
Background
With the popularization of computers, the importance of information grows by the day. Faced with the massive data on the network, how to mine the information implicit in it and let it deliver maximum value is a hot topic of exploration; typical applications include search engines, intelligent question answering, knowledge graphs and the like. All of these applications are built on word segmentation technology, so segmenting words well amounts to clearing the first hurdle. Currently known word segmentation systems include jieba, HIT's LTP, HanLP, Stanford CoreNLP and the like, and the segmentation technologies they adopt include methods based on dictionary matching, statistics, deep learning and so on.
However, in practical applications, data from different fields should be segmented at different granularities. For example, in the general domain the term "K powder" should be split into the two words "K / powder", while in the chemical domain it should be kept as a single word denoting ketamine, an anesthetic that is often abused as a drug, which matters for drug-related law enforcement and similar applications. Yet the dictionary-matching segmentation methods in the industry mostly put all words in one file and segment directly against a single dictionary, or set up several dictionaries but search them in a fixed order during segmentation, so data from different fields are segmented at the same granularity and segmentation generalizes poorly to specialized fields.
To improve the generalization of word segmentation and segment data from different fields at different granularities, a domain classification technique combining domain-specific vocabulary, pattern matching, and deep learning is urgently needed, one that dynamically adjusts the dictionary on which segmentation depends so as to adaptively select a suitable segmentation granularity for data from different fields. We therefore propose an adaptive dynamic word segmentation method.
Disclosure of Invention
In order to solve the technical problems, the invention provides the following technical scheme:
the invention discloses a self-adaptive dynamic word segmentation method, which comprises the following steps:
s1, directly inputting the original text into a domain specific word matching module, and if the original text is matched with a specific dictionary of a certain domain, directly entering word segmentation of the domain;
s2, if domain-specific vocabulary matching fails, entering a domain pattern matching module, matching the text against all preset domain patterns in the module, and finally evaluating the matching effect; if matching succeeds, directly entering the word segmentation module to complete word segmentation;
and S3, if the field pattern matching fails, entering a field classification module, performing field classification by using a deep learning model and a pattern matching effect, and finally completing word segmentation according to a classification result.
As a preferred technical solution of the present invention, the domain-specific word matching module in step S1 includes two processes of domain-specific word generation and domain-specific word matching, and the specific steps of the domain-specific word generation are as follows:
s1.1, preparing a field corpus and a non-field corpus;
s1.2, performing primary word segmentation on the domain linguistic data and the non-domain linguistic data respectively, and obtaining a domain word set and a non-domain dictionary by directly adopting jieba word segmentation;
s1.3, filtering stop words in the field word set;
s1.4, jieba's default segmentation granularity is usually very fine, so adjacent words can be merged into new words, using the following three methods:
Method one: t-test model:

t = (x̄ - u) / sqrt(s² / N)

wherein x̄ is the sample mean, s² is the sample variance, N is the sample size, and u is the mean of the distribution; the null hypothesis is that the components of the n-gram occur independently, and the t statistics of all unigrams through four-grams are computed by traversal; at the significance level α = 0.005, for a statistic t > 2.576 we can reject the null hypothesis with 99.5% confidence, i.e. consider the candidate a true word with 99.5% confidence;
Method two: solidification (cohesion) model:

solid(XY) = p(XY) / (p(X) · p(Y))

i.e. the probability that X and Y appear together divided by the product of the probabilities of their separate appearances; the larger this ratio, the more strongly the string XY coheres into a word;
Method three: degree-of-freedom model: H(U) = -Σᵢ pᵢ log pᵢ
If the characters appearing to the left and right of XY are highly varied, i.e. the branch entropy on both sides is high, then XY is more likely an independent word; if the entropy on one side is very low, XY does not appear independently and may be part of a longer word XYZ;
s1.5, using the ranked results of the three methods in S1.4, selecting a certain number of words from high confidence to low, adding them to the jieba user-defined dictionary, segmenting the corpus data again, and computing a space vector for each word with a word2vec model;
s1.6, taking the intersection of the results of the three methods in S1.4 as a seed word, and taking the rest words as candidate new words;
s1.7, for each seed word, ranking the candidate new words by similarity and selecting the most similar ones; each seed word thus votes for candidates, and the candidates are ranked by vote count to obtain the domain keywords;
and S1.8, taking a difference set of the domain keywords and the non-domain dictionary to finally obtain the domain dictionary.
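The three statistics in step S1.4 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the toy corpus counts, the Bernoulli approximation of the sample variance in the t-test, and all function names are assumptions.

```python
import math

def t_statistic(bigram_count, x_count, y_count, total):
    """t-test for bigram XY (null hypothesis: X and Y occur independently).
    Uses the Bernoulli approximation s^2 ~ p(1 - p) for the sample variance."""
    x_bar = bigram_count / total                 # observed p(XY): the sample mean
    u = (x_count / total) * (y_count / total)    # expected p(XY) under independence
    s2 = x_bar * (1 - x_bar)                     # sample variance approximation
    return (x_bar - u) / math.sqrt(s2 / total)

def cohesion(bigram_count, x_count, y_count, total):
    """Solidification degree: p(XY) / (p(X) * p(Y)); larger means XY coheres more."""
    return (bigram_count / total) / ((x_count / total) * (y_count / total))

def branch_entropy(neighbor_counts):
    """Degree of freedom: H = -sum p_i * log p_i over the left (or right)
    neighbors of a candidate word XY."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())

# A toy corpus of 100 tokens in which "X Y" co-occurs 10 times:
t = t_statistic(10, 10, 10, 100)       # 3.0 > 2.576: reject independence at alpha = 0.005
ratio = cohesion(10, 10, 10, 100)      # 10.0: XY appears 10x more often than chance
h = branch_entropy({"a": 1, "b": 1})   # log(2): two equally likely right neighbors
```

Under the thresholds stated in S1.4, this toy bigram's t = 3.0 exceeds 2.576 and would pass the t-test.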
As a preferred technical solution of the present invention, the specific process of the domain-specific vocabulary matching is as follows:
if the exclusive word of a single specific field is matched, directly loading the field dictionary and the common dictionary to combine to complete word segmentation;
if several domain-exclusive dictionaries are matched at the same time, the text is split into segments; if each segment hits only one specific domain dictionary, each segment is segmented with that dictionary to complete word segmentation;
if an individual segment still hits multiple domain dictionaries, the domain pattern matching module is entered.
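The three rules above amount to a small per-segment dispatch routine. A minimal sketch under stated assumptions: the domain names are hypothetical, and a plain substring test stands in for real dictionary matching.

```python
def route_segment(segment, domain_dicts):
    """Decide the next step for one text segment from domain-exclusive word hits.
    domain_dicts: {domain: set of domain-exclusive words} (illustrative)."""
    hits = [d for d, words in domain_dicts.items()
            if any(w in segment for w in words)]
    if len(hits) == 1:
        return ("segment_with_domain_dict", hits[0])  # load that domain dictionary plus the common one
    # zero hits falls through to pattern matching (step S2); multiple hits within
    # one segment is the conflict case that also enters the pattern matching module
    return ("enter_pattern_matching", None)

dicts = {"chemistry": {"ketamine"}, "traffic": {"rear-end collision"}}
action = route_segment("the sample contained ketamine", dicts)
# -> ("segment_with_domain_dict", "chemistry")
```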
As a preferred technical solution of the present invention, the domain pattern matching module in step S2 includes domain matching patterns and pattern matching, where the domain matching patterns preset a number of matching patterns for each domain and limit the distance between the preceding and following words of a pattern to less than 15.
As a preferred technical solution of the present invention, the specific process of pattern matching is as follows:
s2.1, loading a preset field matching mode, and performing mode matching on the input text;
s2.2, matching with a special vocabulary, and if only matching with the mode of a single field, directly judging that the input text is the field;
s2.3, if the modes of a plurality of fields are in accordance with each other, performing segmentation processing, and performing mode matching on the text of each section independently;
s2.4, if a domain conflict still remains in some text segment, this scheme designs a domain scoring function, specifically: suppose the largest number of successfully matched patterns over all domains is denoted match_max, the second largest match_sec, and the smallest match_min; the score function score is then defined over these counts (the formula is reproduced in the original only as an image);
score adjusts dynamically with the pattern matching results: the larger the gap between match_max and match_sec, match_min, the larger score, i.e. the more prominent that domain's pattern features; conversely, the smaller score, the less distinct the pattern features;
and S2.5, if the score value is more than 0.85, judging that the input text is the corresponding field, otherwise, recording the successfully matched mode of each field, adding a mode feature list, and entering a field classification module.
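The score formula itself appears in the original only as an image. The sketch below is therefore a hypothetical stand-in that reproduces only the stated behavior (score grows as match_max pulls away from match_sec and match_min, and the 0.85 threshold in S2.5 decides acceptance); it is not the patent's actual formula.

```python
def domain_score(match_max, match_sec, match_min):
    """Hypothetical scoring function: 1.0 when one domain matches alone,
    0.0 when all three counts tie. The patent's real formula is not
    reproduced in the text."""
    if match_max == 0:
        return 0.0
    return (2 * match_max - match_sec - match_min) / (2 * match_max)

def accept_domain(match_max, match_sec, match_min, threshold=0.85):
    """Accept the top domain outright only when its pattern features dominate;
    otherwise the text falls through to the domain classification module."""
    return domain_score(match_max, match_sec, match_min) > threshold

# One domain clearly dominates: accept it (score = 1.0).
assert accept_domain(10, 0, 0)
# Three domains nearly tie: defer to the classifier (score = 0.15).
assert not accept_domain(10, 9, 8)
```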
As a preferred technical solution of the present invention, the domain classification module in step S3 classifies text that hits no domain-specific words or patterns, and corrects and complements the pattern matching result. Given the characteristics of the input text, a HAN network is used: its multi-layer attention mechanism attends both to "words", finding the important word components within sentences, and to "sentences", finding the important sentence components within the text.
As a preferred technical solution of the present invention, the word segmentation module in step S2 adopts a word segmentation method based on a double-array Trie tree, which specifically includes:
a) construction of the double-array Trie tree:
i. for a transition from state s to state t on a received character c, the improved storage conditions in the double arrays are: base[s] + c = t, check[t] = base[s];
ii. create a root node root and set base[root] = 1;
iii. find the child node set root.children_i (i = 1..n) of root such that check[root.children_i] = base[root] = 1;
iv. perform the following operations on each element in root.children:
find element.children_i (i = 1..n); if a character is the end of a word, its children include a null node whose code value is set to 0; then find a value begin such that every check[begin + element.children_i.code] = 0;
set base[element_i] = begin_i;
recursively execute step iv on each child element.children_i; if an element has no leaf-node children, set base[element] to a negative value;
b) word segmentation:
read the text to be segmented and traverse it in order, computing by the condition in step i of the double-array construction; when a state whose code value is 0 (a leaf node) is reached, record the position index; Dic[index] is then the matched word in the domain dictionary.
The invention has the beneficial effects that:
The self-adaptive dynamic word segmentation method automatically mines more domain-specific words through multi-model fusion, enriching the domain dictionary; it dynamically computes the pattern matching score and combines the features extracted by pattern matching with the text semantics, improving the domain classification effect; finally, the dictionary relied upon is adaptively adjusted according to the field, so the model is no longer limited to a single segmentation granularity and can intelligently select the segmentation granularity according to the data's field.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a general flow diagram of an adaptive dynamic word segmentation method of the present invention;
FIG. 2 is a flow chart of domain specific vocabulary generation for an adaptive dynamic word segmentation method of the present invention;
FIG. 3 is a diagram illustrating a new word combination in accordance with an adaptive dynamic word segmentation method of the present invention;
FIG. 4 is a schematic diagram of an algorithm network structure of an adaptive dynamic word segmentation method according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example (b): as shown in fig. 1, the invention relates to a self-adaptive dynamic word segmentation method, which comprises the following steps:
s1, directly inputting the original text into a domain specific word matching module, and if the original text is matched with a specific dictionary of a certain domain, directly entering word segmentation of the domain;
s2, if domain-specific vocabulary matching fails, entering a domain pattern matching module, matching the text against all preset domain patterns in the module, and finally evaluating the matching effect; if matching succeeds, directly entering the word segmentation module to complete word segmentation;
and S3, if the field pattern matching fails, entering a field classification module, performing field classification by using a deep learning model and a pattern matching effect, and finally completing word segmentation according to a classification result.
The domain specific word matching module in step S1 includes two processes of domain specific word generation and domain specific word matching, the flow chart of the domain specific word generation is shown in fig. 2, and the specific steps are as follows:
s1.1, preparing a field corpus and a non-field corpus;
s1.2, performing primary word segmentation on the domain linguistic data and the non-domain linguistic data respectively, and obtaining a domain word set and a non-domain dictionary by directly adopting jieba word segmentation;
s1.3, filtering stop words (common function words) from the domain word set;
s1.4, jieba's default segmentation granularity is usually very fine; as shown in FIG. 3, several adjacent words can be merged into new words, using the following three methods:
Method one: t-test model:

t = (x̄ - u) / sqrt(s² / N)

wherein x̄ is the sample mean, s² is the sample variance, N is the sample size, and u is the mean of the distribution; the null hypothesis is that the components of the n-gram occur independently, and the t statistics of all unigrams through four-grams are computed by traversal; at the significance level α = 0.005, for a statistic t > 2.576 we can reject the null hypothesis with 99.5% confidence, i.e. consider the candidate a true word with 99.5% confidence;
Method two: solidification (cohesion) model:

solid(XY) = p(XY) / (p(X) · p(Y))

i.e. the probability that X and Y appear together divided by the product of the probabilities of their separate appearances; the larger this ratio, the more strongly the string XY coheres into a word;
Method three: degree-of-freedom model: H(U) = -Σᵢ pᵢ log pᵢ
If the characters appearing to the left and right of XY are highly varied, i.e. the branch entropy on both sides is high, then XY is more likely an independent word; if the entropy on one side is very low, XY does not appear independently and may be part of a longer word XYZ;
s1.5, using the ranked results of the three methods in S1.4, selecting a certain number of words from high confidence to low, adding them to the jieba user-defined dictionary, segmenting the corpus data again, and computing a space vector for each word with a word2vec model;
s1.6, taking the intersection of the results of the three methods in S1.4 as a seed word, and taking the rest words as candidate new words;
s1.7, for each seed word, ranking the candidate new words by similarity and selecting the most similar ones; each seed word thus votes for candidates, and the candidates are ranked by vote count to obtain the domain keywords;
and S1.8, taking a difference set of the domain keywords and the non-domain dictionary to finally obtain the domain dictionary.
The specific process of the field-specific vocabulary matching is as follows:
if the exclusive word of a single specific field is matched, directly loading the field dictionary and the common dictionary to combine to complete word segmentation;
if several domain-exclusive dictionaries are matched at the same time, the text is split into segments; if each segment hits only one specific domain dictionary, each segment is segmented with that dictionary to complete word segmentation;
if an individual segment still hits multiple domain dictionaries, the domain pattern matching module is entered.
The domain pattern matching module in step S2 includes domain matching patterns and pattern matching; the domain matching patterns preset a number of matching patterns for each domain and limit the distance between the preceding and following words of a pattern to less than 15. For example, matching patterns for the traffic accident domain may be: "occurrence of ...", "cause of ...", and the like.
The specific flow of pattern matching is as follows:
s2.1, loading a preset field matching mode, and performing mode matching on the input text;
s2.2, matching with exclusive vocabularies, and if only matching with the mode of a single field, directly judging that the input text is the field;
s2.3, if the modes of a plurality of fields are in accordance with each other, performing segmentation processing, and performing mode matching on the text of each section independently;
s2.4, if a domain conflict still remains in some text segment, this scheme designs a domain scoring function, specifically: suppose the largest number of successfully matched patterns over all domains is denoted match_max, the second largest match_sec, and the smallest match_min; the score function score is then defined over these counts (the formula is reproduced in the original only as an image);
score adjusts dynamically with the pattern matching results: the larger the gap between match_max and match_sec, match_min, the larger score, i.e. the more prominent that domain's pattern features; conversely, the smaller score, the less distinct the pattern features;
and S2.5, if the score value is more than 0.85, judging that the input text is the corresponding field, otherwise, recording the successfully matched mode of each field, adding a mode feature list, and entering a field classification module.
The domain classification module in step S3 classifies text that hits no domain-specific words or patterns, and corrects and complements the pattern matching result. Given the characteristics of the input text, a HAN network is used: its multi-layer attention mechanism attends both to "words", finding the important word components within sentences, and to "sentences", finding the important sentence components within the text. It can classify short texts and also avoids the accuracy drop that general classification methods suffer on long texts. The algorithm network structure is shown in fig. 4: the scheme concatenates the pattern features obtained by the pattern matching module with the input text representation to obtain the embedding fed into the HAN network, capturing the text's feature information through pattern matching to enhance the model's classification effect.
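The splicing of pattern features onto the text representation described above can be sketched as a multi-hot vector concatenated with the text embedding. The pattern vocabulary, dimensions, and function name below are illustrative assumptions, and the HAN layers themselves are omitted.

```python
def build_han_input(text_embedding, matched_patterns, pattern_vocab):
    """Concatenate a multi-hot vector of successfully matched domain patterns
    onto the text embedding, forming the embedding fed into the HAN classifier."""
    feat = [0.0] * len(pattern_vocab)
    for p in matched_patterns:
        if p in pattern_vocab:            # patterns outside the preset list are ignored
            feat[pattern_vocab[p]] = 1.0
    return list(text_embedding) + feat

vocab = {"occurrence of": 0, "cause of": 1}   # illustrative preset pattern list
emb = build_han_input([0.2, 0.5, 0.1], ["cause of"], vocab)
# emb == [0.2, 0.5, 0.1, 0.0, 1.0]
```

The multi-hot feature block keeps a fixed layout, so the downstream network always sees the pattern evidence at the same positions regardless of which patterns fired.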
The word segmentation module in step S2 adopts a word segmentation method based on a double-array Trie tree, which builds on the idea of compressing a Trie tree and retains all the advantages of a Trie: query efficiency is high, storage space is saved, and the range of application is wide.
a) Construction of the double-array Trie tree:
1) Several important concepts in the Trie tree:
state: a state;
code: a state transition value;
base: the array holding each node's base address; a leaf node has no successors and identifies the end of a character sequence;
check: identifies the address of the predecessor node.
2) The construction process is as follows:
i. for a transition from state s to state t on a received character c, the improved storage conditions in the double arrays are: base[s] + c = t, check[t] = base[s];
ii. create a root node root and set base[root] = 1;
iii. find the child node set root.children_i (i = 1..n) of root such that check[root.children_i] = base[root] = 1;
iv. perform the following operations on each element in root.children:
find element.children_i (i = 1..n); if a character is the end of a word, its children include a null node whose code value is set to 0; then find a value begin such that every check[begin + element.children_i.code] = 0;
set base[element_i] = begin_i;
recursively execute step iv on each child element.children_i; if an element has no leaf-node children, set base[element] to a negative value;
b) word segmentation:
read the text to be segmented and traverse it in order, computing by the condition in step i of the double-array construction; when a state whose code value is 0 is reached (a leaf node is met), record the position index; Dic[index] is then the matched word in the domain dictionary.
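The lookup logic of steps a) and b) can be illustrated with a plain nested-dict trie; the base/check array compression of the actual double-array structure is omitted here, and forward maximum matching is an assumption about how the dictionary words are applied rather than something the patent spells out.

```python
def build_trie(words):
    """Nested-dict trie; the double-array form compresses these dicts into the
    base/check arrays but answers the same lookups."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True        # end-of-word marker, analogous to code value 0
    return root

def segment(text, trie):
    """Forward maximum matching: at each position keep the longest dictionary
    word; characters matching no word pass through as single tokens."""
    out, i = [], 0
    while i < len(text):
        node, j, last = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$" in node:
                last = j        # longest match ending here so far
        if last is None:
            out.append(text[i])
            i += 1
        else:
            out.append(text[i:last])
            i = last
    return out

trie = build_trie(["ab", "abc", "d"])
print(segment("xabcd", trie))   # ['x', 'abc', 'd']
```

Swapping in a domain dictionary here is exactly the "adaptively adjusted dictionary" of the scheme: the matcher is unchanged, only the word list fed to build_trie differs per domain.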
In conclusion, the technical scheme provided by the invention automatically mines more domain-specific vocabulary through multi-model fusion, enriches the domain dictionary, dynamically computes the pattern matching score, and combines the features extracted by pattern matching with the text semantics, improving the domain classification effect; finally, the dictionary relied upon is adaptively adjusted according to the field, so the model is no longer limited to a single segmentation granularity and can intelligently select the segmentation granularity according to the data's field.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A self-adaptive dynamic word segmentation method is characterized by comprising the following steps:
s1, directly inputting the original text into a domain specific word matching module, and if the original text is matched with a specific dictionary of a certain domain, directly entering word segmentation of the domain;
s2, if domain-specific vocabulary matching fails, entering a domain pattern matching module, matching the text against all preset domain patterns in the module, and finally evaluating the matching effect; if matching succeeds, directly entering the word segmentation module to complete word segmentation;
and S3, if the field pattern matching fails, entering a field classification module, performing field classification by using a deep learning model and a pattern matching effect, and finally completing word segmentation according to a classification result.
2. The adaptive dynamic word segmentation method according to claim 1, wherein the domain-specific word matching module in step S1 includes two processes of domain-specific word generation and domain-specific word matching, and the specific steps of the domain-specific word generation are as follows:
s1.1, preparing a field corpus and a non-field corpus;
s1.2, performing primary word segmentation on the domain linguistic data and the non-domain linguistic data respectively, and obtaining a domain word set and a non-domain dictionary by directly adopting jieba word segmentation;
s1.3, filtering stop words in the field word set;
s1.4, jieba's default segmentation granularity is usually very fine, so adjacent words can be merged into new words, using the following three methods:
Method one: t-test model:

t = (x̄ - u) / sqrt(s² / N)

wherein x̄ is the sample mean, s² is the sample variance, N is the sample size, and u is the mean of the distribution; the null hypothesis is that the components of the n-gram occur independently, and the t statistics of all unigrams through four-grams are computed by traversal; at the significance level α = 0.005, for a statistic t > 2.576 we can reject the null hypothesis with 99.5% confidence, i.e. consider the candidate a true word with 99.5% confidence;
Method two: solidification (cohesion) model:

solid(XY) = p(XY) / (p(X) · p(Y))

i.e. the probability that X and Y appear together divided by the product of the probabilities of their separate appearances; the larger this ratio, the more strongly the string XY coheres into a word;
Method three: degree-of-freedom model: H(U) = -Σᵢ pᵢ log pᵢ
If the characters appearing to the left and right of XY are highly varied, i.e. the branch entropy on both sides is high, then XY is more likely an independent word; if the entropy on one side is very low, XY does not appear independently and may be part of a longer word XYZ;
s1.5, using the ranked results of the three methods in S1.4, selecting a certain number of words from high confidence to low, adding them to the jieba user-defined dictionary, segmenting the corpus data again, and computing a space vector for each word with a word2vec model;
s1.6, taking the intersection of the results of the three methods in S1.4 as a seed word, and taking the rest words as candidate new words;
s1.7, for each seed word, ranking the candidate new words by similarity and selecting the most similar ones; each seed word thus votes for candidates, and the candidates are ranked by vote count to obtain the domain keywords;
and S1.8, taking a difference set of the domain keywords and the non-domain dictionary to finally obtain the domain dictionary.
3. The adaptive dynamic word segmentation method according to claim 2, wherein the specific process of the domain-specific vocabulary matching is as follows:
if the exclusive word of a single specific field is matched, directly loading the field dictionary and the common dictionary to combine to complete word segmentation;
if a plurality of domain exclusive dictionaries are matched at the same time, segmenting the text, and if each segment of text only has a specific domain dictionary, segmenting to complete word segmentation;
if there are still multiple domain dictionaries in an individual paragraph, then enter the domain matching mode.
4. The adaptive dynamic word segmentation method according to claim 1, wherein the domain pattern matching module in step S2 includes domain matching patterns and pattern matching, and the domain matching patterns preset a number of matching patterns for each domain and limit the distance between the preceding and following words of a pattern to less than 15.
5. The adaptive dynamic word segmentation method according to claim 4, wherein the specific process of pattern matching is as follows:
S2.1, load the preset domain matching patterns and perform pattern matching on the input text;
S2.2, as with exclusive-vocabulary matching, if only the patterns of a single domain are matched, directly judge that the input text belongs to that domain;
S2.3, if the patterns of multiple domains are matched, split the text into segments and perform pattern matching on each segment separately;
S2.4, if a domain conflict still exists in some segment, a domain scoring function is designed in this scheme, specifically: denote the largest number of successfully matched patterns among the domains as match_max, the second largest as match_sec, and the smallest as match_min; the scoring function score is then:
(the formula for score is given only as an image in the original filing and is not reproduced here)
score is adjusted dynamically with the pattern matching results: the larger the gap between match_max and match_sec, match_min, the larger score, i.e. the more prominent the pattern features of that domain; conversely, the smaller score, the less distinct the pattern features;
and S2.5, if score exceeds 0.85, the input text is judged to belong to the corresponding domain; otherwise, record the successfully matched patterns of each domain, add them to a pattern feature list, and enter the domain classification module.
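The score formula itself appears in the filing only as an image, so it cannot be recovered from this text. The sketch below is therefore a hypothetical reconstruction: one normalized-gap form that is consistent with the stated behaviour (score grows toward 1 as match_max pulls away from match_sec and match_min, and is compared against the 0.85 threshold), not the patented formula:

```python
def domain_score(match_max, match_sec, match_min):
    """Hypothetical reconstruction of the domain scoring function.
    The patent's formula is only available as an image; this form merely
    satisfies the described properties: the larger the gap between
    match_max and match_sec/match_min, the closer score is to 1."""
    if match_max == 0:
        return 0.0
    return (2 * match_max - match_sec - match_min) / (2 * match_max)
```

With this form, a clearly dominant domain (e.g. 10 matched patterns vs. 1 and 0 for the runners-up) scores above the 0.85 threshold, while near-ties score low and fall through to the domain classification module.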
6. The adaptive dynamic word segmentation method according to claim 1, wherein the domain classification module in step S3 classifies text containing no domain-specific words or patterns and corrects and complements the pattern matching results; for the features of the input text, an HAN network is used: its multi-layer attention mechanism attends both to words, finding the important word components in each sentence, and to sentences, finding the important sentence components in the text.
7. The adaptive dynamic word segmentation method according to claim 1, wherein the word segmentation module in step S2 adopts a word segmentation method based on a double-array trie, specifically comprising:
a) the construction process of the double-array trie:
the transition from state s to t for the received character c, the improved storage conditions in the even array are: base [ s ] + c ═ t, check [ t ] ═ base [ s ];
establishing a root node root, and enabling base [ root ] to be 1;
finding child node set of root (root i N) so that check is done i ]=base[root]=1;
And iv, performing the following operation on each element in the root, child:
child ren was found i N, if a character is at the end of the sequence, its child nodes include a null node with the code value set to 0, and a value begin is found such that each check is begin i +element.children i .code]=0;
Set base [ element i ]=begin i
Child element i Step iv is executed recursively, if there is no leaf node child in a certain element, then base [ element ] is set]Is a negative value;
b) word segmentation:
read the text to be segmented and traverse it forward in sequence, computing transitions according to condition i of the double-array construction; when base[s] = t, indicating that the code c = 0 and a word end is reached, record the position index; Dic[index] is then the matched word in the domain dictionary.
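The construction and matching steps above can be sketched in Python. Note two deliberate deviations, both implementation choices of mine rather than the patent's: the code uses the classic double-array condition check[t] = s (the claim's variant stores base[s] in check), and the "arrays" are dictionaries so that the wide Unicode code space of Chinese characters stays cheap to index:

```python
def build_double_array(words):
    """Build base/check from a word list via a nested-dict trie.
    Code 0 is reserved as the word terminator, matching the claim's
    null child with code value 0."""
    END = 0
    code = lambda ch: ord(ch) + 1          # shift so 0 stays free for END

    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(code(ch), {})
        node[END] = {}                     # terminal marker

    base, check, used = {}, {}, {0}        # slot 0 is the root state
    def place(state, node):
        codes = sorted(node)
        begin = 1
        while any(begin + c in used for c in codes):
            begin += 1                     # brute-force search for a free begin
        base[state] = begin
        for c in codes:
            used.add(begin + c)
            check[begin + c] = state       # classic condition: check[t] = s
        for c in codes:
            if node[c]:                    # recurse into non-terminal children
                place(begin + c, node[c])
    place(0, trie)
    return base, check, code

def longest_match(text, i, base, check, code):
    """Length of the longest dictionary word starting at text[i], else 0."""
    s, best = 0, 0
    for j in range(i, len(text)):
        t = base[s] + code(text[j])
        if check.get(t) != s:              # no such transition: stop walking
            break
        s = t
        if check.get(base[s]) == s:        # END transition exists -> word end
            best = j - i + 1
    return best
```

Forward maximum matching then repeatedly takes `longest_match` at the current position (falling back to a single character when it returns 0), which is the claimed traversal with position recording.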
CN202210441833.6A 2022-04-25 2022-04-25 Self-adaptive dynamic word segmentation method Pending CN114881017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210441833.6A CN114881017A (en) 2022-04-25 2022-04-25 Self-adaptive dynamic word segmentation method


Publications (1)

Publication Number Publication Date
CN114881017A true CN114881017A (en) 2022-08-09

Family

ID=82671259


Country Status (1)

Country Link
CN (1) CN114881017A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413998A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 Self-adaptive Chinese word segmentation method, system and medium for power industry
US20200133962A1 (en) * 2018-10-25 2020-04-30 Institute For Information Industry Knowledge graph generating apparatus, method, and non-transitory computer readable storage medium thereof
CN111241833A (en) * 2020-01-16 2020-06-05 支付宝(杭州)信息技术有限公司 Word segmentation method and device for text data and electronic equipment
CN112397054A (en) * 2020-12-17 2021-02-23 北京中电飞华通信有限公司 Power dispatching voice recognition method
CN112632292A (en) * 2020-12-23 2021-04-09 深圳壹账通智能科技有限公司 Method, device and equipment for extracting service keywords and storage medium
CN113011183A (en) * 2021-03-23 2021-06-22 北京科东电力控制***有限责任公司 Unstructured text data processing method and system in electric power regulation and control field



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination