CN113076750B - Cross-domain Chinese word segmentation system and method based on new word discovery - Google Patents


Info

Publication number
CN113076750B
CN113076750B (application CN202110463683.4A)
Authority
CN
China
Prior art keywords
word
module
lexeme
field
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110463683.4A
Other languages
Chinese (zh)
Other versions
CN113076750A (en)
Inventor
张军
李�学
宁更新
杨萃
冯义志
余华
陈芳炯
季飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110463683.4A priority Critical patent/CN113076750B/en
Publication of CN113076750A publication Critical patent/CN113076750A/en
Application granted granted Critical
Publication of CN113076750B publication Critical patent/CN113076750B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a cross-domain Chinese word segmentation system and method based on new word discovery. The system comprises: a new word discovery module, which implements a new word discovery algorithm using enhanced mutual information combining statistical and semantic information, and mines a new word list from the unlabeled corpus; an automatic labeling module, which performs an initial segmentation of the unlabeled corpus by combining the new word list with a reverse maximum matching algorithm to obtain an incompletely segmented corpus, and then completes the segmentation with a word segmentation model to obtain an automatically labeled corpus; and a cross-domain word segmentation module, which implements a cross-domain Chinese word segmentation algorithm using an adversarial method, performing adversarial training with the labeled source domain corpus and the automatically labeled corpus. The invention optimizes the new word discovery algorithm with enhanced mutual information, improving the accuracy of new word discovery and the domain specificity of the word list; it also improves the utilization of the unlabeled corpus in the cross-domain word segmentation algorithm and optimizes the recall and precision of the segmentation.

Description

Cross-domain Chinese word segmentation system and method based on new word discovery
Technical Field
The invention relates to the technical field of natural language processing, in particular to a cross-domain Chinese word segmentation system and method based on new word discovery.
Background
Chinese text takes Chinese characters as its minimum writing unit; characters combine into words, and words in turn form the text. Words are the smallest structural units of Chinese text that carry semantic information and can be used independently. Unlike languages such as English, however, Chinese has no explicit separators between words, so the text must be divided into words by some technical method before a computer can process it; this process is Chinese word segmentation. Chinese word segmentation is the most basic task in Chinese natural language processing and the cornerstone of tasks such as text classification, text generation and sentiment analysis. The quality of the segmentation result therefore directly affects the results of downstream tasks.
Traditional Chinese word segmentation methods fall into two main types: mechanical (dictionary-based) methods and statistics-based methods. Mechanical methods segment text against an existing dictionary combined with hand-crafted rules, and their ability to recognize unknown words (Out Of Vocabulary, OOV: words that do not appear in the segmentation dictionary) is very low. Statistics-based methods are limited to a very small context window, cannot capture global features, and recognize unknown words equally poorly. The accuracy and recall of both are therefore poor, and neither is a practical segmentation technique today. With the development of deep learning, applying deep learning to Chinese word segmentation has become a research hotspot. Existing neural network models treat Chinese word segmentation as a sequence labeling problem: the model is trained on a manually labeled data set, needs no Chinese dictionary, hand-built rules or hand-designed feature templates, and achieves accuracy and recall far higher than traditional methods, making it the current mainstream approach. A model trained on large-scale manually labeled data (called the source domain) performs well on the segmentation task, but its performance drops sharply when it segments a corpus from another domain (the target domain); Chinese word segmentation where the source and target corpora belong to different domains is called cross-domain Chinese word segmentation. Its performance is limited mainly by the expression gap and the unknown words between the target domain and the source domain: the expression gap refers to the same text being segmented differently in different domains, and unknown words are words that appear only in the target domain, mainly person names, place names and technical terms. The best remedy for the expression gap is to manually label target domain corpora and retrain the model on the mixed corpora, but large-scale manual labeling consumes enormous manpower and material resources and cannot cover every domain, so it is not practically feasible. The best remedy for unknown words is to have experts extract words absent from the source domain corpus out of the target domain corpus and add them to the training data, but selecting unknown words likewise consumes great manpower, and since society constantly produces new words, they cannot all be selected by hand. Cross-domain Chinese word segmentation has therefore long struggled to achieve good results.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a cross-domain Chinese word segmentation system and method based on new word discovery. The cross-domain Chinese word segmentation system is divided into three modules: a new word discovery module, an automatic labeling module and a cross-domain word segmentation module. The invention differs from traditional Chinese word segmentation methods in that: (1) semantic relevance is added to the traditional new word discovery algorithm based on mutual information and adjacent entropy, forming vector-enhanced mutual information, which measures the internal cohesion of a character string more accurately and reduces the generation of junk strings; (2) compared with traditional cross-domain Chinese word segmentation trained on a dictionary and labeled source field corpora, the invention automatically labels the unlabeled target field corpus based on the field vocabulary, which markedly reduces the unknown word rate of the test corpus, and then trains the model on the automatically labeled corpus, converting cross-domain segmentation into in-domain segmentation and markedly reducing the influence of the expression gap on the model. The invention can be widely applied in fields that lack large-scale labeled corpora, such as the medical, scientific, biological and novel fields.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a cross-domain Chinese word segmentation system based on new word discovery is composed of a new word discovery module, an automatic labeling module and a cross-domain word segmentation module, wherein the three modules are connected in sequence. The above modules are as follows:
(1) The new word discovery module: used to extract new words that do not appear in the source field, i.e. unknown words, from the unlabeled target field corpus. The new word discovery module is composed of a candidate word extraction sub-module, an enhanced mutual information extraction sub-module, an adjacent entropy extraction sub-module and a candidate word filtering sub-module. The candidate word extraction sub-module, the enhanced mutual information extraction sub-module and the candidate word filtering sub-module are connected in sequence; the candidate word extraction sub-module extracts all candidate words from the target field corpus, the enhanced mutual information extraction sub-module extracts the enhanced mutual information of all candidate words, and the candidate word filtering sub-module filters the candidate words. The candidate word extraction sub-module, the adjacent entropy extraction sub-module and the candidate word filtering sub-module are also connected in sequence; the adjacent entropy extraction sub-module extracts the adjacent entropies of all candidate words. The candidate word extraction sub-module is connected with the candidate word filtering sub-module.
(2) An automatic labeling module: and automatically labeling the linguistic data in the non-labeled target field by using a new word list and a word segmentation algorithm obtained in the new word discovery module. The automatic labeling module is composed of a first Chinese word segmentation sub-module and a second Chinese word segmentation sub-module, the first Chinese word segmentation sub-module matches the linguistic data based on the new word vocabulary, if the matching is successful, segmentation is carried out, otherwise segmentation is not carried out, and incomplete segmentation of the linguistic data in the target field is achieved; and the second Chinese word segmentation sub-module segments the un-segmented corpora in the first Chinese word segmentation sub-module by using a GCNN-CRF word segmentation algorithm based on corpus training of the source field, so that the complete segmentation of the corpora of the target field is realized.
(3) The cross-domain word segmentation module: trains an adversarial deep neural network with the labeled source field corpus and the automatically labeled target field corpus, converting cross-domain segmentation into in-domain segmentation to realize segmentation of the target field. The cross-domain word segmentation module comprises a source field feature extraction submodule, a common feature extraction submodule, a target field feature extraction submodule, a source field lexeme labeling submodule, a text classification submodule and a target field lexeme labeling submodule. The source field feature extraction submodule and the source field lexeme labeling submodule are connected to form branch one. The common feature extraction submodule is connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule respectively to form branch two; the common feature extraction submodule extracts the common features of the source field and target field corpora, the text classification submodule judges which field an input comes from, and the target field lexeme labeling submodule performs lexeme labeling on the target field corpus. The target field feature extraction submodule and the target field lexeme labeling submodule are connected to form branch three; the target field feature extraction submodule extracts the features unique to the target field corpus.
Furthermore, the source field feature extraction submodule, the target field feature extraction submodule and the common feature extraction submodule all adopt a GCNN as the feature extractor. The GCNN comprises 4 CNN layers and 1 activation layer: the input vector enters the 4 CNN layers in parallel, and each CNN layer extracts features, yielding 4 feature vectors. The feature vector of the first CNN layer is fed into the activation layer, which keeps the dimensionality unchanged and limits the numbers in the vector to between 0 and 1 so that it serves as a weight vector; the vectors obtained by multiplying this weight vector with the feature vectors output by the other 3 CNN layers are the final feature vectors. The activation function is the sigmoid.
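As a concrete illustration, a minimal PyTorch sketch of such a gated convolutional feature extractor follows. The class name, the input dimensions and the way the three gated branches are combined are illustrative assumptions, not the patent's reference implementation (the text above specifies only the multiplication by the weight vector).

```python
# A minimal sketch of the GCNN feature extractor: 4 parallel CNN layers, with
# the first layer's output passed through a sigmoid to form a weight vector
# that gates the other 3 feature vectors. Averaging the 3 gated branches into
# one output is an assumption.
import torch
import torch.nn as nn

class GCNN(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, kernel_size: int = 3):
        super().__init__()
        self.gate = nn.Conv1d(in_dim, out_dim, kernel_size, padding="same")
        self.feats = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, kernel_size, padding="same")
            for _ in range(3)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim, seq_len)
        w = torch.sigmoid(self.gate(x))            # weights in (0, 1), same dims
        return w * sum(f(x) for f in self.feats) / 3

if __name__ == "__main__":
    gcnn = GCNN(in_dim=100, out_dim=200)
    chars = torch.randn(8, 100, 50)                # 8 sentences, 50 characters
    print(gcnn(chars).shape)                       # torch.Size([8, 200, 50])
```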
The second purpose of the invention can be achieved by adopting the following technical scheme:
a cross-domain Chinese word segmentation method based on new word discovery is used for achieving word segmentation of linguistic data in different domains by adopting the following steps:
step S1: and mining a new word list of the field from the target field corpus by using a new word discovery module.
Step S2: and (3) automatically labeling the linguistic data of the non-labeled target field by using an automatic labeling module in combination with the new word list of the field obtained in the step (1).
And step S3: extracting the characteristics of the source field and the target field corpora through three branches of the cross-field word segmentation module, wherein the first branch uses the source field characteristic extraction submodule to extract the source field characteristic H from the source field corpora src (ii) a The second branch circuit uses a common feature extraction submodule to extract the common features H of the corpora of the source field and the target field shr (ii) a Branch three-use target field feature extraction submodule for extracting target field features H from target field corpus tgt
And step S4: h obtained in step S3 src And H shr Inputting the predicted data into a source field lexeme labeling submodule to predict a source field lexeme label, and obtaining H in the step S3 tgt And H shr Inputting the predicted target field lexeme label into a target field lexeme labeling submodule to predict a target field lexeme label, and obtaining H in the step S3 shr Input into a text classification sub-module to predict domain labels for the input text.
Further, in step S1, mining the new word list of the field from the target field corpus with the new word discovery module comprises the following steps:
Step S1.1: use the candidate word extraction submodule to extract from the unlabeled target field corpus all candidate words whose length does not exceed n.
Step S1.2: randomly split a candidate word C into a front internal segment A and a rear internal segment B, and count the occurrences of C, A and B as $n_C$, $n_A$ and $n_B$. The mutual information $MI_C$ of C is calculated as follows:

$$MI_C = \log_2 \frac{p(C)}{p(A)\,p(B)}, \qquad p(w) = \frac{n_w}{N}$$

where $n_w$ denotes the number of occurrences of an arbitrary character string w in the corpus and N is the total number of counted strings.
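As an illustration, a short Python sketch of this computation follows. It assumes probabilities are estimated as count divided by a total count, and keeps the minimum score over all front/rear split points, a common conservative choice (the text above speaks of one random split).

```python
# A sketch of step S1.2: mutual information of a candidate word from n-gram
# counts. `counts` maps every candidate string to its occurrence count and
# `total` is the normalizer N; both are assumptions about the data layout.
import math
from collections import Counter

def mutual_information(candidate: str, counts: Counter, total: int) -> float:
    p = lambda w: counts[w] / total
    scores = []
    for cut in range(1, len(candidate)):           # every front/rear split A|B
        a, b = candidate[:cut], candidate[cut:]
        if counts[a] > 0 and counts[b] > 0:
            scores.append(math.log2(p(candidate) / (p(a) * p(b))))
    return min(scores) if scores else 0.0
```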
Step S1.3: train a Word2Vec model on the target field corpus to obtain the word vector $\vec{c}_j = (c_{j1}, c_{j2}, \ldots, c_{jn})$ of an arbitrary character $c_j$. The word vector $Vec_A$ of internal segment A and the word vector $Vec_B$ of internal segment B are calculated as follows:

$$Vec_A = (a_1, \ldots, a_n) = \frac{1}{i}\sum_{j=1}^{i}\vec{c}_j, \qquad Vec_B = (b_1, \ldots, b_n) = \frac{1}{m}\sum_{j=1}^{m}\vec{c}_j$$
where i denotes the number of Chinese characters in A, m the number of Chinese characters in B, $a_p$ and $b_q$ the values of the word vectors at positions p and q, and n the vector dimension;
Step S1.4: from the word vectors $Vec_A$ and $Vec_B$ of internal segments A and B in step S1.3, calculate the semantic relevance sim(A, B) of the two segments as their cosine similarity:

$$sim(A, B) = \frac{\sum_{p=1}^{n} a_p b_p}{\sqrt{\sum_{p=1}^{n} a_p^2}\,\sqrt{\sum_{q=1}^{n} b_q^2}}$$

Step S1.5: from the mutual information $MI_C$ of step S1.2 and the semantic relevance sim(A, B) of step S1.4, calculate the enhanced mutual information $ENMI_C$ of candidate word C as follows:

$$ENMI_C = MI_C + \beta_1 \cdot sim(A, B)$$

where $\beta_1$ denotes the weight coefficient of the semantic relevance in the enhanced mutual information.
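A minimal sketch of steps S1.3 to S1.5 follows. It assumes segment vectors are averages of gensim Word2Vec character vectors (the patent names Word2Vec but not gensim), and that the enhanced mutual information adds the weighted cosine similarity to the mutual information, as in the reconstruction above.

```python
# A sketch of enhanced mutual information: semantic relevance of the two
# internal segments (cosine of averaged character vectors) weighted by beta1
# and added to the statistical mutual information.
import numpy as np
from gensim.models import Word2Vec

def segment_vector(segment: str, w2v: Word2Vec) -> np.ndarray:
    vecs = [w2v.wv[ch] for ch in segment if ch in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def semantic_relevance(a: str, b: str, w2v: Word2Vec) -> float:
    va, vb = segment_vector(a, w2v), segment_vector(b, w2v)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0

def enhanced_mi(mi: float, a: str, b: str, w2v: Word2Vec,
                beta1: float = 300.0) -> float:
    # beta1 = 300 follows the embodiment below; the additive form is assumed
    return mi + beta1 * semantic_relevance(a, b, w2v)
```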
Step S1.6: find in the target field corpus all left adjacent characters $[L_1, \ldots, L_u, \ldots, L_H]$ and all right adjacent characters $[R_1, \ldots, R_v, \ldots, R_D]$ of candidate word C, where H and D denote the numbers of left and right adjacent characters respectively. Record the number of times each left adjacent character appears to the left of the candidate word, $[n(L_1), \ldots, n(L_u), \ldots, n(L_H)]$, and the number of times each right adjacent character appears to the right, $[n(R_1), \ldots, n(R_v), \ldots, n(R_D)]$, then calculate the probability of each adjacent character:

$$p(L_u) = \frac{n(L_u)}{\sum_{u=1}^{H} n(L_u)}, \qquad p(R_v) = \frac{n(R_v)}{\sum_{v=1}^{D} n(R_v)}$$

Step S1.7: from the probabilities $p(L_u)$ and $p(R_v)$ of the left and right adjacent characters in step S1.6, calculate the left adjacent entropy $H_l(C)$ and the right adjacent entropy $H_r(C)$ of candidate word C:

$$H_l(C) = -\sum_{u=1}^{H} p(L_u)\log p(L_u), \qquad H_r(C) = -\sum_{v=1}^{D} p(R_v)\log p(R_v)$$
Step S1.8: from the left adjacent entropy $H_l(C)$ and the right adjacent entropy $H_r(C)$ of candidate word C in step S1.7, calculate the adjacency entropy of candidate word C as a weighted combination of the two:

$$BE_C = \log \frac{H_l(C)\,e^{H_r(C)} + H_r(C)\,e^{H_l(C)}}{\left|H_l(C) - H_r(C)\right|}$$
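The adjacency-entropy computation of steps S1.6 to S1.8 can be sketched as follows; the weighted combination mirrors the formula reconstructed above and is an assumption about the patent's exact form.

```python
# A sketch of steps S1.6-S1.8: entropies of the left/right neighbour
# distributions of a candidate word, combined into one adjacency entropy.
import math
from collections import Counter

def neighbour_entropy(neighbours: Counter) -> float:
    total = sum(neighbours.values())
    return -sum(c / total * math.log(c / total) for c in neighbours.values())

def adjacency_entropy(left: Counter, right: Counter) -> float:
    hl, hr = neighbour_entropy(left), neighbour_entropy(right)
    if abs(hl - hr) < 1e-9:                 # guard the |Hl - Hr| denominator
        return min(hl, hr)
    return math.log((hl * math.exp(hr) + hr * math.exp(hl)) / abs(hl - hr))

# e.g. adjacency_entropy(Counter({"的": 3, "大": 1}), Counter({"器": 2, "到": 2}))
```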
Step S1.9: from the enhanced mutual information $ENMI_C$ of step S1.5 and the adjacency entropy $BE_C$ of step S1.8, calculate the overall score of candidate word C as follows:

$$score(C) = sigmoid(\beta_2 \cdot ENMI_C + BE_C)$$

where $\beta_2$ is the weight of the enhanced mutual information in the overall score, and sigmoid denotes normalization, calculated as:

$$sigmoid(x) = \frac{1}{1 + e^{-x}}$$

Step S1.10: set a candidate word score threshold and compare the overall score(C) of step S1.9 with it; if score(C) is greater than the threshold, the candidate is considered a reasonable word, otherwise it is removed from the candidate word list, yielding the new word list.
Further, in step S2, the automatic labeling module combines the new word list of the field obtained in step S1 with the first Chinese word segmentation module and the second Chinese word segmentation module to realize automatic labeling of the unlabeled target field corpus, as follows:
step S2.1: the first Chinese word segmentation module adopts a reverse maximum matching algorithm, sets a maximum matching length N, takes out a character string with the length N from the tail character of a sentence, inquires whether the character string is in a new word list, if the character string is segmented, moves the current character position to the left by N distances, if the matching length is not reduced by 1, continues to match, and if the matching length is reduced to 1, the matching is still unsuccessful, moves the current character position to the left by a distance, continues to match until the whole sentence is matched, and realizes the primary segmentation of the target corpus;
step S2.2: segmenting labeled source field corpus into words according to spaces, marking each character in the words with a lexeme label according to the length of each word, taking an input text as an input, and taking the lexeme label as an output to construct a training data set, wherein the lexeme label comprises B, M, E and S, the B represents a starting character of a multi-word, the M represents a middle character in the multi-word, the E represents an ending character in the multi-word, and the S represents a character which is independent into words;
step S2.3: the second Chinese word segmentation module calculates the cost function by adopting the following method:
Figure GDA0003801460660000081
wherein, Y = (Y) 1 ,y 2 ,y 3 ,y 4 ) Real lexeme labels, y, representing characters s ∈{0,1},y s A value of 1 indicates that the character tag is the s-th lexeme tag,
Figure GDA0003801460660000082
representing the output lexeme labels through the model,
Figure GDA0003801460660000083
representing the probability that the model predicts the character tag belongs to the s-th lexeme tag;
step S2.4: training data is input into the model for training until loss 1 Is less than a preset value;
step S2.5: and (5) segmenting the unsingulated part in the incompletely-segmented corpus obtained in the first Chinese word segmentation module by using the second Chinese word segmentation module obtained through the training in the step (S2.4) to obtain the completely-segmented target field automatic labeling corpus.
Further, the step S3 process is as follows:
step S3.1: segmenting a source field corpus with labels and a target field corpus with automatic labels into words, marking a lexeme label B, M, E or S for each word according to the length of each word, wherein B represents a starting character of a multi-word, M represents a middle character in the multi-word, E represents an ending character in the multi-word, S represents a character which is independent into words, the corpora is used as input, and the lexeme labels are used as output to respectively construct a source field training set and a target field training set;
step S3.2: inputting the text of the source field training set obtained in the step S1 into a source field feature extraction submodule, wherein the vector output by the source field feature extraction submodule is the unique source field feature H of the source field corpus src
Step S3.3: inputting the text of the target field training set obtained in the step S1 into a target field feature extraction submodule, and extracting the target field featuresThe vector output by the sub-module is the unique target domain feature H of the target domain corpus tgt
Step S3.4: the texts in the source field training set and the target field corpus obtained in the step S1 are sequentially input into a common feature extraction submodule, and the vector output by the common feature extraction submodule is the feature H common to the source field and the target field shr
Further, the step S4 process is as follows:
step S4.1: using source domain feature H src And target Domain characteristic H tgt Respectively with a common feature H shr Splicing is carried out by adopting the following method to obtain the general characteristics of the source field
Figure GDA0003801460660000091
And general characteristics of the target area
Figure GDA0003801460660000092
Figure GDA0003801460660000093
Step S4.2: the cost functions of the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule are $loss_{src}$, $loss_{tgt}$ and $loss_{shr}$ respectively, calculated as follows:

$$loss_{src} = -\sum_{t=1}^{4} y_t \log \hat{y}_t, \qquad loss_{tgt} = -\sum_{k=1}^{4} y'_k \log \hat{y}'_k, \qquad loss_{shr} = -\sum_{l=1}^{2} y''_l \log \hat{y}''_l$$

where $Y = (y_1, y_2, y_3, y_4)$ denotes the true lexeme label of a source field character, $y_t \in \{0, 1\}$, with $y_t = 1$ indicating that the character's label is the t-th lexeme label and $y_t = 0$ that it is not, and $\hat{y}_t$ the probability predicted by the model that the character's label is the t-th lexeme label; $Y' = (y'_1, y'_2, y'_3, y'_4)$ denotes the true lexeme label of a target field character, with $y'_k$ and $\hat{y}'_k$ defined analogously; $Y'' = (y''_1, y''_2)$ denotes the true field label of an input sample, $y''_l \in \{0, 1\}$, with $y''_1 = 1$ indicating that the sample comes from the source field and $y''_2 = 1$ that it comes from the target field, and $\hat{y}''_l$ the probability predicted by the model that the sample belongs to the l-th field;
step S4.3: performing countermeasure training by using a source field lexeme labeling submodule, a target field lexeme labeling submodule and a text classification submodule, and realizing cross-field Chinese word segmentation from the source field to the target field, wherein the total cost function is as follows:
loss=loss src3 *loss tgt4 *loss shr
wherein beta is 3 And beta 4 Respectively represent loss tgt And loss shr The weight occupied in the total cost function;
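One adversarial training step under this objective can be sketched as follows. The model interface and the use of a gradient reversal layer in front of the text classifier are assumptions, since the patent specifies only the weighted total cost.

```python
# A sketch of step S4.3: one optimization step over the weighted total cost.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, src_batch, tgt_batch,
                  beta3: float, beta4: float) -> float:
    # model is assumed to return lexeme logits for both branches plus domain
    # logits from the shared branch (with a gradient reversal layer inside,
    # so minimizing loss_shr drives the shared features domain-invariant)
    src_logits, tgt_logits, dom_logits, dom_labels = model(src_batch, tgt_batch)
    loss_src = F.cross_entropy(src_logits, src_batch["labels"])
    loss_tgt = F.cross_entropy(tgt_logits, tgt_batch["labels"])
    loss_shr = F.cross_entropy(dom_logits, dom_labels)
    loss = loss_src + beta3 * loss_tgt + beta4 * loss_shr
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```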
step S4.4: the model is trained until the variation of loss is less than a predetermined value.
Compared with the prior art, the invention has the following advantages and effects:
1. The invention improves the new word discovery algorithm with vector-enhanced mutual information, effectively combining the statistical information and the semantic information in the corpus; it not only markedly improves the accuracy of new word discovery and reduces the junk strings in the new word list, but also markedly enhances the domain specificity of the new word list.
2. The invention uses the new word discovery algorithm to extract from the unlabeled target field corpus a new word list relative to the source field corpus, also called an unknown word list. Automatically labeling the unlabeled target field corpus based on this new word list markedly reduces the unknown word rate of the test corpus relative to the training corpus.
3. The invention trains the Chinese word segmentation algorithm on the labeled source field corpus and the automatically labeled target field corpus with adversarial training, which reduces the influence of noisy samples in the automatically labeled corpus on model training and improves the cross-domain Chinese word segmentation result beyond that of the same algorithm trained without adversarial training.
Drawings
FIG. 1 is a block diagram of a cross-domain Chinese word segmentation system based on new word discovery as disclosed in the embodiments of the present invention;
FIG. 2 is a block diagram of a new word discovery module according to an embodiment of the present invention;
FIG. 3 is a block diagram of an automatic labeling module according to an embodiment of the present invention;
FIG. 4 is a block diagram of a cross-domain word segmentation module in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a GCNN-CRF model network according to an embodiment of the present invention;
FIG. 6 is a diagram of a network structure of a TextCNN model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
A structural block diagram of a cross-domain chinese word segmentation system based on new word discovery disclosed in this embodiment is shown in fig. 1, and is formed by a new word discovery module, an automatic labeling module, and a cross-domain word segmentation module, where the new word discovery module, the automatic labeling module, and the cross-domain word segmentation module are connected in sequence, and are respectively used to mine new words from non-labeled target domain corpora, automatically label the non-labeled target domain corpora, and train a neural network for cross-domain chinese word segmentation.
A structural block diagram of a new word discovery module in this embodiment is shown in fig. 2, and the new word discovery module is composed of a candidate word extraction sub-module, an enhanced mutual information extraction sub-module, an adjacent entropy extraction sub-module, and a candidate word filtering sub-module, where the candidate word extraction sub-module, the enhanced mutual information extraction sub-module, and the candidate word filtering sub-module are connected in sequence, the candidate word extraction sub-module is used to extract all candidate words from a corpus of a target field, the enhanced mutual information extraction sub-module is used to extract enhanced mutual information of all candidate words, and the candidate word filtering sub-module is used to filter candidate words; the candidate word extraction sub-module, the adjacent entropy extraction sub-module and the candidate word filtering sub-module are sequentially connected, and the adjacent entropy extraction sub-module is used for extracting the adjacent entropy of all candidate words; and the candidate word extracting submodule is connected with the candidate word filtering submodule.
In the above embodiment, the candidate word extracting sub-module extracts all candidate words in the corpus from the unmarked target field corpus; the enhanced mutual information extraction sub-module considers the semantic information of the candidate words on the basis of the mutual information, and adds the semantic similarity of the candidate words into the mutual information to calculate the enhanced mutual information of the candidate words; on the basis of the existing calculation of the adjacent entropy, the adjacent entropy extraction sub-module fully considers the left and right adjacent entropies of the candidate words, gives certain weight to the left and right adjacent entropies, and calculates and obtains the adjacent entropies of all the candidate words by utilizing the information contained in the left and right adjacent entropies to the maximum extent; and the candidate word filtering sub-module adds the enhanced mutual information and the adjacent entropy of the candidate word according to a certain weight, balances the importance of the enhanced mutual information and the adjacent entropy to obtain a final score of the candidate word, and is used for filtering the candidate word to obtain a new word list of the corpus.
The structural block diagram of the auto-tagging module in this embodiment is shown in fig. 3, and is composed of a first chinese participle sub-module and a second chinese participle sub-module, where the first chinese participle sub-module is used to perform incomplete segmentation on a corpus of a target field, and the second chinese participle sub-module is used to perform complete segmentation on the incompletely segmented corpus.
In the above embodiment, the automatic labeling module is composed of a first chinese word segmentation sub-module and a second chinese word segmentation sub-module, the first chinese word segmentation sub-module matches the corpus based on the new word list, if the matching is successful, the corpus is segmented, otherwise, the corpus is not segmented, and incomplete segmentation of the target field corpus is realized; and the second Chinese word segmentation sub-module segments the un-segmented corpora in the first Chinese word segmentation sub-module by using a GCNN-CRF word segmentation algorithm based on corpus training of the source field, so that the complete segmentation of the corpora of the target field is realized. The module realizes automatic labeling of the non-labeled corpus in two steps, can better label the non-labeled target field corpus by using the new word vocabulary, and improves the labeling accuracy.
A cross-domain word segmentation module structure block diagram in this embodiment is shown in fig. 4, and is composed of a source domain feature extraction sub-module, a common feature extraction sub-module, a target domain feature extraction sub-module, a source domain lexeme tagging sub-module, a text classification sub-module, and a target domain lexeme tagging sub-module, where the source domain feature extraction sub-module is connected with the source domain lexeme tagging sub-module, the source domain feature extraction sub-module is used for extracting unique features of a source domain corpus, and the source domain lexeme tagging sub-module is used for performing lexeme tagging on the source domain corpus; the public feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule, the public feature extraction submodule is used for extracting public features of source field linguistic data and target field linguistic data, the text classification submodule is used for judging which field the input comes from, and the target field lexeme labeling submodule is connected and used for performing lexeme labeling on the target field linguistic data; the target field feature extraction submodule is connected with the target field lexeme labeling submodule and is used for extracting the unique features of the target field linguistic data; the public feature extraction submodule is connected with the source field lexeme labeling submodule; the public characteristic extraction submodule is connected with the target field lexeme labeling submodule.
In the above embodiment, the cross-domain word segmentation module introduces countermeasure training, which includes three branches, and the source domain feature extraction sub-module and the source domain lexeme tagging sub-module are connected to form a branch one; the public feature extraction submodule is respectively connected with the text classification submodule, the source field lexeme labeling submodule and the target field lexeme labeling submodule to form a branch II; and the target field feature extraction submodule and the target field lexeme labeling submodule are connected to form a third branch. The cross-domain word segmentation module can effectively reduce the influence of noise in the automatic labeling corpus on model training by giving certain weight to the automatic labeling corpus in model training, and improves the word segmentation effect of the model.
Example two
This embodiment provides a cross-domain Chinese word segmentation method based on the cross-domain Chinese word segmentation system based on new word discovery described above. The method realizes word segmentation of corpora in different fields by the following steps:
step S1: and mining a new word list of the field from the target field corpus by using a new word discovery module. In the step S1, the new word discovery module is used to mine the new word list of the target field from the corpus, and the method includes the following steps:
step S1.1: and extracting all candidate words with the length not exceeding n in the field corpus from the unmarked target field corpus by using a candidate word extraction sub-module.
In this embodiment, the candidate word extraction submodule splits the corpus at non-Chinese characters, sets the maximum candidate word length to 6, extracts all candidate words of length not exceeding 6 from the sentences of the split corpus into the candidate word set, counts the occurrences of the candidate words over the corpus, and stores them in the dictionary D.
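A short sketch of this candidate extraction follows; the CJK character range used to split the corpus is an assumption.

```python
# A sketch of step S1.1 as described here: split the corpus at non-Chinese
# characters, enumerate all substrings of length <= 6 and count them in a
# dictionary D (a Counter).
import re
from collections import Counter

def extract_candidates(corpus: str, max_len: int = 6) -> Counter:
    counts = Counter()
    for sentence in re.split(r"[^\u4e00-\u9fff]+", corpus):
        for start in range(len(sentence)):
            for stop in range(start + 1, min(start + max_len, len(sentence)) + 1):
                counts[sentence[start:stop]] += 1
    return counts
```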
Step S1.2: randomly split a candidate word C into a front internal segment A and a rear internal segment B, and count the occurrences of C, A and B as $n_C$, $n_A$ and $n_B$. The mutual information $MI_C$ of C is calculated as follows:

$$MI_C = \log_2 \frac{p(C)}{p(A)\,p(B)}, \qquad p(w) = \frac{n_w}{N}$$

where $n_w$ denotes the number of occurrences of an arbitrary character string w in the corpus and N is the total number of counted strings.
In this embodiment, the minimum length of both A and B is 1.
Step S1.3: train a Word2Vec model on the target field corpus to obtain the word vector $\vec{c}_j = (c_{j1}, c_{j2}, \ldots, c_{jn})$ of an arbitrary character $c_j$. The word vector $Vec_A$ of internal segment A and the word vector $Vec_B$ of internal segment B are calculated as follows:

$$Vec_A = (a_1, \ldots, a_n) = \frac{1}{i}\sum_{j=1}^{i}\vec{c}_j, \qquad Vec_B = (b_1, \ldots, b_n) = \frac{1}{m}\sum_{j=1}^{m}\vec{c}_j$$
where i denotes the number of Chinese characters in A, m the number of Chinese characters in B, $a_p$ and $b_q$ the values of the word vectors at positions p and q, and n the vector dimension;
Step S1.4: from the word vectors $Vec_A$ and $Vec_B$ of internal segments A and B in step S1.3, calculate the semantic relevance sim(A, B) of the two segments as their cosine similarity:

$$sim(A, B) = \frac{\sum_{p=1}^{n} a_p b_p}{\sqrt{\sum_{p=1}^{n} a_p^2}\,\sqrt{\sum_{q=1}^{n} b_q^2}}$$

Step S1.5: from the mutual information $MI_C$ of step S1.2 and the semantic relevance sim(A, B) of step S1.4, calculate the enhanced mutual information $ENMI_C$ of candidate word C as follows:

$$ENMI_C = MI_C + \beta_1 \cdot sim(A, B)$$

where $\beta_1$ denotes the weight coefficient of the semantic relevance in the enhanced mutual information.
In the above embodiment, the weight coefficient $\beta_1$ is 300.
Step S1.6: find in the target field corpus all left adjacent characters $[L_1, \ldots, L_u, \ldots, L_H]$ and all right adjacent characters $[R_1, \ldots, R_v, \ldots, R_D]$ of candidate word C, where H and D denote the numbers of left and right adjacent characters respectively. Record the number of times each left adjacent character appears to the left of the candidate word, $[n(L_1), \ldots, n(L_u), \ldots, n(L_H)]$, and the number of times each right adjacent character appears to the right, $[n(R_1), \ldots, n(R_v), \ldots, n(R_D)]$, then calculate the probability of each adjacent character:

$$p(L_u) = \frac{n(L_u)}{\sum_{u=1}^{H} n(L_u)}, \qquad p(R_v) = \frac{n(R_v)}{\sum_{v=1}^{D} n(R_v)}$$

Step S1.7: from the probabilities $p(L_u)$ and $p(R_v)$ of the left and right adjacent characters in step S1.6, calculate the left adjacent entropy $H_l(C)$ and the right adjacent entropy $H_r(C)$ of candidate word C:

$$H_l(C) = -\sum_{u=1}^{H} p(L_u)\log p(L_u), \qquad H_r(C) = -\sum_{v=1}^{D} p(R_v)\log p(R_v)$$
Step S1.8: from the left adjacent entropy $H_l(C)$ and the right adjacent entropy $H_r(C)$ of candidate word C in step S1.7, calculate the adjacency entropy of C as a weighted combination of the two:

$$BE_C = \log \frac{H_l(C)\,e^{H_r(C)} + H_r(C)\,e^{H_l(C)}}{\left|H_l(C) - H_r(C)\right|}$$
Step S1.9: from the enhanced mutual information $ENMI_C$ of step S1.5 and the adjacency entropy $BE_C$ of step S1.8, calculate the overall score of candidate word C as follows:

$$score(C) = sigmoid(\beta_2 \cdot ENMI_C + BE_C)$$

where $\beta_2$ is the weight of the enhanced mutual information in the overall score, and sigmoid denotes normalization, calculated as:

$$sigmoid(x) = \frac{1}{1 + e^{-x}}$$

In this embodiment, the weight $\beta_2$ of the enhanced mutual information in the overall score is 60.
Step S1.10: set a candidate word score threshold and compare the overall score(C) of step S1.9 with it; if score(C) is greater than the threshold, the candidate is considered a reasonable word, otherwise it is removed from the candidate word list, yielding the new word list.
In the present embodiment, the score threshold is set to 0.9.
Step S2: and the automatic labeling module is used for automatically labeling the linguistic data of the non-labeled target field by combining the new word list of the field obtained in the step S1.
In the step S2, the automatic tagging of the corpus of the target domain without tagging is realized by using the domain new word list obtained in the step S1 in combination with the first chinese word segmentation module and the second chinese word segmentation module, which includes the following steps:
step S2.1: the first Chinese word segmentation module adopts a reverse maximum matching algorithm and sets a maximum matching length N. Starting from the last character of the sentence, taking out the character string with the length of N, inquiring whether the character string is in a new word list, if the character string is segmented, moving the current character position to the left by N distances, if the character string is not segmented, subtracting 1 from the matching length, continuing to match, if the matching length is not successful after subtracting 1, moving the current character position to the left by a distance, continuing to match until the whole sentence is matched, and realizing the preliminary segmentation of the target corpus.
In the above embodiment, the maximum matching length N is set to 6 (typically set to the maximum length of an entry in a vocabulary).
Step S2.2: divide the labeled source field corpus into words at the spaces and mark each character with a lexeme label B, M, E or S according to the length of its word, where B denotes the starting character of a multi-character word, M a middle character of a multi-character word, E the ending character of a multi-character word, and S a character that forms a word by itself; with the input text as input and the lexeme labels as output, construct the training data set.
Step S2.3: the second Chinese word segmentation module calculates its cost function as follows:

$$loss_1 = -\sum_{s=1}^{4} y_s \log \hat{y}_s$$

where $Y = (y_1, y_2, y_3, y_4)$ denotes the true lexeme label of a character, $y_s \in \{0, 1\}$, with $y_s = 1$ indicating that the character's label is the s-th lexeme label; $\hat{Y} = (\hat{y}_1, \hat{y}_2, \hat{y}_3, \hat{y}_4)$ denotes the lexeme label output by the model, with $\hat{y}_s$ the probability predicted by the model that the character's label is the s-th lexeme label;
in the above embodiment, a structural block diagram of the second chinese word segmentation module is shown in fig. 5, and includes 3 GCNN layers, 1 fully-connected layer, and 1 CRF layer. The input text vector becomes a feature vector with the dimension of 200 after passing through 3 GCNNs, the feature vector becomes an output vector with the dimension of 4 after passing through a full connection layer, and the lexeme label of the character is obtained after the output vector passes through CRF. The size of the convolution kernel of GCNN1 is 3, the number of the convolution kernels is 200, the size of the convolution kernel of GCNN2 is 4, the number of the convolution kernels is 200, the size of the convolution kernel of GCNN3 is 5, the number of the convolution kernels is 200, and the number of nodes of the full connection layer is 4.
Step S2.4: input the training data into the model and train until $loss_1$ is less than the preset value.
In the above embodiment, the preset value is set to 0.01.
Step S2.5: use the second Chinese word segmentation module trained in step S2.4 to segment the unsegmented parts of the incompletely segmented corpus obtained by the first Chinese word segmentation module, obtaining the completely segmented, automatically labeled target field corpus.
Step S3: extract the features of the source field and target field corpora through the three branches of the cross-domain word segmentation module: branch one uses the source field feature extraction submodule to extract the source field feature $H_{src}$ from the source field corpus; branch two uses the common feature extraction submodule to extract the common feature $H_{shr}$ of the source field and target field corpora; branch three uses the target field feature extraction submodule to extract the target field feature $H_{tgt}$ from the target field corpus.
The implementation of the above step S3 includes the following steps:
step S3.1: segmenting a source field corpus with labels and a target field corpus with automatic labels into words, and marking a lexeme label B, M, E or S for each word according to the length of each word, wherein B represents the initial character of a multi-word, M represents the middle character in the multi-word, E represents the end character in the multi-word, S represents the character which is independent into words, the corpus is used as input, and the lexeme label is used as output to respectively construct a source field training set and a target field training set;
step S3.2: inputting the text of the source field training set obtained in the step S1 into a source field feature extraction submodule, wherein the vector output by the source field feature extraction submodule is the unique source field feature H of the source field corpus src
Step S3.3: inputting the text of the target field training set obtained in the step S1 into a target field feature extraction submodule, wherein the vector output by the target field feature extraction submodule is the unique target field feature H of the target field corpus tgt
Step S3.4: the texts in the source field training set and the target field corpus obtained in the step S1 are sequentially input into a common feature extraction submodule, and the vector output by the common feature extraction submodule is the feature H common to the source field and the target field shr
The source field feature extraction submodule, the target field feature extraction submodule and the common feature extraction submodule have identical structures, each comprising 3 GCNN layers and 1 activation layer. The input text vector passes through the 3 GCNNs in sequence to form a 200-dimensional feature vector; after the activation layer its dimensionality remains unchanged and each number in the vector lies between 0 and 1. The convolution kernel sizes of the 3 GCNNs are 3, 4 and 5, each with 200 kernels, and the activation function of the activation layer is the sigmoid.
Step S4: input $H_{src}$ and $H_{shr}$ obtained in step S3 into the source field lexeme labeling submodule to predict source field lexeme labels, input $H_{tgt}$ and $H_{shr}$ obtained in step S3 into the target field lexeme labeling submodule to predict target field lexeme labels, and input $H_{shr}$ obtained in step S3 into the text classification submodule to predict the field label of the input text.
In the step S4, the label prediction using the feature includes the steps of:
step S4.1: using source domain feature H src And target Domain characteristic H tgt Respectively with a common feature H shr Splicing is carried out by adopting the following method to obtain the general characteristics of the source field
Figure GDA0003801460660000181
And general characteristics of the target area
Figure GDA0003801460660000182
Figure GDA0003801460660000183
Step S4.2: the cost functions of the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule are $loss_{src}$, $loss_{tgt}$ and $loss_{shr}$ respectively, calculated as follows:

$$loss_{src} = -\sum_{t=1}^{4} y_t \log \hat{y}_t, \qquad loss_{tgt} = -\sum_{k=1}^{4} y'_k \log \hat{y}'_k, \qquad loss_{shr} = -\sum_{l=1}^{2} y''_l \log \hat{y}''_l$$

where $Y = (y_1, y_2, y_3, y_4)$ denotes the true lexeme label of a source field character, $y_t \in \{0, 1\}$, with $y_t = 1$ indicating that the character's label is the t-th lexeme label and $y_t = 0$ that it is not, and $\hat{y}_t$ the probability predicted by the model that the character's label is the t-th lexeme label; $Y' = (y'_1, y'_2, y'_3, y'_4)$ denotes the true lexeme label of a target field character, with $y'_k$ and $\hat{y}'_k$ defined analogously; $Y'' = (y''_1, y''_2)$ denotes the true field label of an input sample, $y''_l \in \{0, 1\}$, with $y''_1 = 1$ indicating that the sample comes from the source field and $y''_2 = 1$ that it comes from the target field, and $\hat{y}''_l$ the probability predicted by the model that the sample belongs to the l-th field.
In the above embodiment, a structural block diagram of the text classification submodule is shown in FIG. 6: it comprises 3 convolutional layers, 3 pooling layers, a concatenation layer and a fully-connected layer. The input text vector passes through three parallel convolutional layers to obtain three feature vectors, each of which is fed into a pooling layer; the three pooled vectors are concatenated in the concatenation layer, and the concatenated vector is finally fed into the fully-connected layer. The convolution kernel sizes of CNN1, CNN2 and CNN3 are 3, 4 and 5 respectively, each with 200 kernels; the pooling layers use max pooling, and the fully-connected layer has 2 nodes.
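A PyTorch sketch of this TextCNN classifier follows; how the shared features are arranged as its input is an illustrative assumption.

```python
# A sketch of the FIG. 6 classifier: three parallel convolutions (kernel
# sizes 3/4/5, 200 kernels each), max pooling, concatenation, and a 2-way
# fully-connected layer that outputs source-vs-target domain logits.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, feat_dim: int = 200, n_kernels: int = 200):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(feat_dim, n_kernels, k) for k in (3, 4, 5)
        )
        self.fc = nn.Linear(3 * n_kernels, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, seq_len), e.g. the shared features H_shr
        pooled = [conv(x).max(dim=2).values for conv in self.convs]  # max pool
        return self.fc(torch.cat(pooled, dim=1))                     # logits
```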
Step S4.3: perform adversarial training with the source field lexeme labeling submodule, the target field lexeme labeling submodule and the text classification submodule, realizing cross-domain Chinese word segmentation from the source field to the target field, with the total cost function:

$$loss = loss_{src} + \beta_3 \cdot loss_{tgt} + \beta_4 \cdot loss_{shr}$$

where $\beta_3$ and $\beta_4$ denote the weights of $loss_{tgt}$ and $loss_{shr}$ in the total cost function.
Step S4.4: the model is trained until the variation of loss is less than a predetermined value.
In this embodiment, the predetermined value for the variation of loss is set to 0.01.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (8)

1. A cross-domain Chinese word segmentation system based on new word discovery is characterized by comprising a new word discovery module, an automatic labeling module and a cross-domain word segmentation module which are connected in sequence, wherein,
the new word discovery module is used for extracting unknown words which do not appear in the source field from the target field linguistic data without labels to obtain a new word list of the field;
the automatic labeling module is used for carrying out initial segmentation on the non-labeled corpus by using a reverse maximum matching algorithm based on a new word vocabulary to obtain a corpus which is not completely segmented; completely segmenting the part which is not segmented in the initially segmented corpus by using a GCNN-CRF word segmentation algorithm based on corpus training of the source field, and realizing automatic segmentation of the corpus of the label-free target field;
the cross-domain word segmentation module is used for training an adversarial deep neural network with the labeled source field corpus and the automatically labeled target field corpus, converting cross-domain word segmentation into in-domain word segmentation to realize word segmentation of the target field;
a word segmentation method based on a cross-domain Chinese word segmentation system adopts the following steps to realize word segmentation of linguistic data in different domains:
step S1: and (3) excavating a new word list of the field from the corpus of the target field by using a new word discovery module, wherein the process is as follows:
step S1.1: extracting all candidate words with the length not exceeding n on the domain corpus from the unmarked target domain corpus by using a candidate word extraction submodule;
step S1.2: randomly splitting a candidate word C into a front internal segment A and a rear internal segment B, and counting the times of C, A and B as n C 、n A And n B The mutual information MI of C is calculated by the following method C
Figure FDA0003801460650000011
Figure FDA0003801460650000021
Wherein n is w Representing the number of occurrences of any character string w in the corpus;
step S1.3: training a Word2Vec model on the target-domain corpus to obtain the word vector Vec_{c_j} of any character c_j, and calculating the word vector Vec_A of the internal segment A and the word vector Vec_B of the internal segment B as the averages of their character vectors:

Vec_A = (1/i) * Σ_{j=1}^{i} Vec_{c_j},   Vec_B = (1/m) * Σ_{j=1}^{m} Vec_{c_j}

where i denotes the number of Chinese characters in A, m the number of Chinese characters in B, a_p and b_q the values of the two word vectors at positions p and q, and n the vector dimension;
step S1.4: from the word vectors Vec_A and Vec_B of step S1.3, calculating the semantic relevance sim(A, B) of the internal segments A and B as the cosine similarity:

sim(A, B) = ( Σ_{p=1}^{n} a_p * b_p ) / ( sqrt(Σ_{p=1}^{n} a_p²) * sqrt(Σ_{q=1}^{n} b_q²) );
step S1.5: from the mutual information MI_C of step S1.2 and the semantic relevance sim(A, B) of step S1.4, calculating the enhanced mutual information ENMI_C of the candidate word C by weighting the semantic term into the statistical term:

ENMI_C = MI_C + β_1 * sim(A, B)

where β_1 denotes the weight coefficient of the semantic relevance in the enhanced mutual information;
step S1.6: finding in the target-domain corpus all left-adjacent words [L_1, …, L_u, …, L_H] and all right-adjacent words [R_1, …, R_v, …, R_D] of the candidate word C, where H and D denote the numbers of left- and right-adjacent words respectively, together with the number of occurrences of each left-adjacent word on the left side of the candidate word, [n(L_1), …, n(L_u), …, n(L_H)], and of each right-adjacent word on the right side, [n(R_1), …, n(R_v), …, n(R_D)]; the probability of each adjacent word is calculated as:

p(L_u) = n(L_u) / Σ_{u'=1}^{H} n(L_{u'}),   p(R_v) = n(R_v) / Σ_{v'=1}^{D} n(R_{v'});
step S1.7: from the probabilities p(L_u) and p(R_v) of the left and right adjacent words in step S1.6, calculating the left adjacency entropy H_l(C) and the right adjacency entropy H_r(C) of the candidate word C:

H_l(C) = − Σ_{u=1}^{H} p(L_u) * log p(L_u),   H_r(C) = − Σ_{v=1}^{D} p(R_v) * log p(R_v);
Step S1.8: from the left adjacency entropy H_l(C) and the right adjacency entropy H_r(C) of step S1.7, calculating the adjacency entropy BE_C of the candidate word C, taken here as the smaller of the two:

BE_C = min( H_l(C), H_r(C) );
step S1.9: from the enhanced mutual information ENMI_C of step S1.5 and the adjacency entropy BE_C of step S1.8, calculating the overall score of the candidate word C:

score(C) = sigmoid( β_2 * ENMI_C + BE_C )

where β_2 is the weight of the enhanced mutual information in the overall score and sigmoid denotes normalization, calculated as sigmoid(x) = 1 / (1 + e^{−x});
step S1.10: setting a candidate word score threshold and comparing the overall score(C) of step S1.9 with the threshold; if score(C) is greater than the threshold, the candidate is accepted as a reasonable word, otherwise it is removed from the candidate word list, finally yielding the new word list (a code sketch of this scoring procedure follows this claim);
step S2: automatically labeling the unlabeled target-domain corpus with the automatic labeling module, in combination with the domain new word list obtained in step S1;
step S3: extracting the features of the source-domain and target-domain corpora through the three branches of the cross-domain word segmentation module: branch one uses the source-domain feature extraction submodule to extract the source-domain features H_src from the source-domain corpus; branch two uses the common feature extraction submodule to extract the features H_shr common to the source-domain and target-domain corpora; branch three uses the target-domain feature extraction submodule to extract the target-domain features H_tgt from the target-domain corpus;
step S4: inputting H_src and H_shr obtained in step S3 into the source-domain lexeme labeling submodule to predict the source-domain lexeme labels, inputting H_tgt and H_shr obtained in step S3 into the target-domain lexeme labeling submodule to predict the target-domain lexeme labels, and inputting H_shr obtained in step S3 into the text classification submodule to predict the domain label of the input text.
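The following self-contained Python sketch illustrates the candidate scoring of steps S1.2 to S1.10. Where the equations behind the patent's formula images could not be reproduced exactly, standard choices are assumed: pointwise mutual information, an additive semantic bonus weighted by β_1, Shannon entropy of the neighbour distributions, and the minimum of the left and right adjacency entropies; all function names and the β defaults are illustrative.

```python
import math

def mutual_information(n_c, n_a, n_b, total):
    """Pointwise MI of candidate C split into internal segments A and B (step S1.2)."""
    p_c, p_a, p_b = n_c / total, n_a / total, n_b / total
    return math.log(p_c / (p_a * p_b))

def cosine(vec_a, vec_b):
    """Cosine similarity of the segment vectors (step S1.4)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

def neighbour_entropy(counts):
    """Entropy of the left- or right-neighbour distribution of C (step S1.7)."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts)

def candidate_score(n_c, n_a, n_b, total, vec_a, vec_b,
                    left_counts, right_counts, beta1=0.5, beta2=0.5):
    """Steps S1.5-S1.9: enhanced MI plus adjacency entropy, sigmoid-normalized."""
    enmi = mutual_information(n_c, n_a, n_b, total) + beta1 * cosine(vec_a, vec_b)
    be = min(neighbour_entropy(left_counts), neighbour_entropy(right_counts))
    return 1.0 / (1.0 + math.exp(-(beta2 * enmi + be)))  # sigmoid normalization

# e.g. candidate_score(120, 300, 260, 100000, [0.2, 0.1], [0.15, 0.05],
#                      [30, 40, 50], [60, 60]) -> a score in (0, 1)
```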
2. The cross-domain Chinese word segmentation system based on new word discovery according to claim 1, wherein the new word discovery module comprises a candidate word extraction submodule, an enhanced mutual information extraction submodule, an adjacency entropy extraction submodule and a candidate word filtering submodule; the candidate word extraction submodule, the enhanced mutual information extraction submodule and the candidate word filtering submodule are connected in sequence, the candidate word extraction submodule being used for extracting all candidate words from the target-domain corpus, the enhanced mutual information extraction submodule for extracting the enhanced mutual information of all candidate words, and the candidate word filtering submodule for filtering the candidate words; the candidate word extraction submodule, the adjacency entropy extraction submodule and the candidate word filtering submodule are likewise connected in sequence, the adjacency entropy extraction submodule being used for extracting the adjacency entropy of all candidate words; the candidate word extraction submodule is also connected directly to the candidate word filtering submodule.
3. The cross-domain Chinese word segmentation system based on new word discovery according to claim 1, wherein the automatic labeling module comprises a first Chinese word segmentation submodule and a second Chinese word segmentation submodule; the first Chinese word segmentation submodule matches the corpus against the new word list and segments on a successful match, otherwise leaves the text unsegmented, realizing the incomplete segmentation of the target-domain corpus; the second Chinese word segmentation submodule segments the corpus left unsegmented by the first Chinese word segmentation submodule with a GCNN-CRF word segmentation algorithm trained on source-domain corpus, realizing the complete segmentation of the target-domain corpus.
4. The cross-domain Chinese word segmentation system based on new word discovery according to claim 1, wherein the cross-domain word segmentation module comprises a source-domain feature extraction submodule, a common feature extraction submodule, a target-domain feature extraction submodule, a source-domain lexeme labeling submodule, a text classification submodule and a target-domain lexeme labeling submodule; the source-domain feature extraction submodule and the source-domain lexeme labeling submodule are connected to form branch one, the source-domain feature extraction submodule being used for extracting the features unique to the source-domain corpus and the source-domain lexeme labeling submodule for lexeme labeling of the source-domain corpus; the common feature extraction submodule is connected to the text classification submodule, the source-domain lexeme labeling submodule and the target-domain lexeme labeling submodule respectively to form branch two, the common feature extraction submodule being used for extracting the features common to the source-domain and target-domain corpora, the text classification submodule for judging which domain an input comes from, and the target-domain lexeme labeling submodule for lexeme labeling of the target-domain corpus; the target-domain feature extraction submodule and the target-domain lexeme labeling submodule are connected to form branch three, the target-domain feature extraction submodule being used for extracting the features unique to the target-domain corpus.
5. The cross-domain Chinese word segmentation system based on new word discovery according to claim 1, wherein the source-domain feature extraction submodule, the target-domain feature extraction submodule and the common feature extraction submodule all use a GCNN as feature extractor; the GCNN comprises 4 CNN layers and 1 activation layer; the input vector enters the 4 CNN layers in parallel, and feature extraction through the CNNs yields 4 feature vectors; the feature vector of the first CNN layer is fed into the activation layer, which keeps the dimension unchanged and limits the numbers in the vector to between 0 and 1, yielding a weight vector; the vectors obtained by multiplying this weight vector with the feature vectors output by the other 3 CNN layers are the final feature vectors, the activation function being the sigmoid (see the sketch below).
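A minimal PyTorch sketch of this gated extractor follows, assuming the gating described above: the first CNN branch, squashed by a sigmoid, weights the outputs of the other three branches. Channel counts and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GCNN(nn.Module):
    """Gated CNN feature extractor: four parallel CNN layers, the first one
    providing a sigmoid gate for the other three (dimensions are assumptions)."""
    def __init__(self, emb_dim: int = 128, channels: int = 128, k: int = 3):
        super().__init__()
        pad = k // 2  # same-length padding keeps the sequence dimension unchanged
        self.gate_conv = nn.Conv1d(emb_dim, channels, k, padding=pad)
        self.feat_convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, channels, k, padding=pad) for _ in range(3)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)                    # (batch, emb_dim, seq_len)
        gate = torch.sigmoid(self.gate_conv(x))  # weight vector with values in (0, 1)
        feats = [gate * conv(x) for conv in self.feat_convs]
        return torch.cat(feats, dim=1).transpose(1, 2)  # final feature vectors
```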
6. The cross-domain Chinese word segmentation system based on new word discovery according to claim 1, wherein in step S2 the automatic labeling module combines the domain new word list obtained in step S1 with the first Chinese word segmentation submodule and the second Chinese word segmentation submodule to realize automatic labeling of the unlabeled target-domain corpus, the process being as follows:
step S2.1: the first Chinese word segmentation submodule adopts a reverse maximum matching algorithm: a maximum matching length N is set and a character string of length N ending at the last character of the sentence is taken; the new word list is queried for this string; if present, the string is segmented off and the current position moves N characters to the left; if not, the matching length is reduced by 1 and matching continues; if matching still fails once the length has been reduced to 1, the current position moves one character to the left and matching continues, until the whole sentence has been matched, realizing the initial segmentation of the target corpus (a code sketch follows this claim);
step S2.2: segmenting the labeled source-domain corpus into words at the spaces and marking each character in a word with a lexeme label according to the word length, with the input text as input and the lexeme labels as output, to construct a training data set; the lexeme labels comprise B, M, E and S, where B denotes the initial character of a multi-character word, M a middle character of a multi-character word, E the final character of a multi-character word, and S a single-character word (see the labeling sketch after this claim);
step S2.3: the second Chinese word segmentation submodule calculates its cost function as:

loss_1 = − Σ_{s=1}^{4} y_s * log( ŷ_s )

where Y = (y_1, y_2, y_3, y_4) denotes the real lexeme label of a character, y_s ∈ {0, 1}, y_s = 1 indicating that the character label is the s-th lexeme label, and ŷ_s denotes the probability, output by the model, that the character label belongs to the s-th lexeme label;

step S2.4: training data are input into the model for training until loss_1 is less than a preset value;
step S2.5: segmenting the unsegmented parts of the incompletely segmented corpus obtained from the first Chinese word segmentation submodule with the second Chinese word segmentation submodule trained in step S2.4, obtaining the completely segmented, automatically labeled target-domain corpus.
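As a concrete illustration of steps S2.1 and S2.2, the following self-contained Python sketch implements reverse maximum matching against a new-word list and the B/M/E/S lexeme labeling; the function names, the default maximum matching length and the example are illustrative assumptions.

```python
def reverse_maximum_match(sentence: str, lexicon: set, max_len: int = 5):
    """Step S2.1: backward maximum matching. Strings found in the new word
    list are segmented off; otherwise the match length shrinks, falling back
    to single characters, which the GCNN-CRF stage later re-segments."""
    tokens, end = [], len(sentence)
    while end > 0:
        for length in range(min(max_len, end), 0, -1):
            piece = sentence[end - length:end]
            if length == 1 or piece in lexicon:
                tokens.append(piece)
                end -= length
                break
    return list(reversed(tokens))

def bmes_labels(words):
    """Step S2.2: map segmented words to per-character lexeme labels,
    S for single-character words, otherwise B + M*(len-2) + E."""
    labels = []
    for w in words:
        labels.extend("S" if len(w) == 1 else "B" + "M" * (len(w) - 2) + "E")
    return labels

# e.g. bmes_labels(["研究生", "的", "生活"]) -> ['B', 'M', 'E', 'S', 'B', 'E']
```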
7. The cross-domain Chinese word segmentation system based on new word discovery according to claim 1, wherein the process of step S3 is as follows:
step S3.1: segmenting the labeled source-domain corpus and the automatically labeled target-domain corpus into words, and marking each character with a lexeme label B, M, E or S according to the length of its word, where B denotes the initial character of a multi-character word, M a middle character of a multi-character word, E the final character of a multi-character word, and S a single-character word; with the corpus as input and the lexeme labels as output, a source-domain training set and a target-domain training set are constructed respectively;
step S3.2: inputting the text of the source-domain training set obtained in step S3.1 into the source-domain feature extraction submodule, whose output vector is the source-domain-specific feature H_src of the source-domain corpus;
step S3.3: inputting the text of the target-domain training set obtained in step S3.1 into the target-domain feature extraction submodule, whose output vector is the target-domain-specific feature H_tgt of the target-domain corpus;
step S3.4: inputting the texts of the source-domain training set and the target-domain training set obtained in step S3.1 in turn into the common feature extraction submodule, whose output vectors are the features H_shr common to the source and target domains (a wiring sketch follows this claim);
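A schematic wiring of the three feature-extraction branches follows, reusing the GCNN class from the sketch after claim 5; batch and dimension values are illustrative assumptions.

```python
import torch

# Three separate extractors: source-specific, common, target-specific.
gcnn_src, gcnn_shr, gcnn_tgt = GCNN(), GCNN(), GCNN()

x_src = torch.randn(8, 32, 128)   # (batch, seq_len, emb_dim), source batch
x_tgt = torch.randn(8, 32, 128)   # target batch

H_src = gcnn_src(x_src)           # source-domain-specific features (step S3.2)
H_tgt = gcnn_tgt(x_tgt)           # target-domain-specific features (step S3.3)
H_shr_src = gcnn_shr(x_src)       # common features of the source batch (step S3.4)
H_shr_tgt = gcnn_shr(x_tgt)       # common features of the target batch (step S3.4)

# Step S4.1 concatenation: total features for the two lexeme labelers.
H_total_src = torch.cat([H_src, H_shr_src], dim=-1)
H_total_tgt = torch.cat([H_tgt, H_shr_tgt], dim=-1)
```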
8. The cross-domain Chinese word segmentation system based on new word discovery according to claim 1, wherein step S4 comprises the following processes:
step S4.1: concatenating the source-domain features H_src and the target-domain features H_tgt respectively with the common features H_shr to obtain the total source-domain features H'_src and the total target-domain features H'_tgt:

H'_src = [ H_src ; H_shr ],   H'_tgt = [ H_tgt ; H_shr ];
Step S4.2: the cost functions of the source-domain lexeme labeling submodule, the target-domain lexeme labeling submodule and the text classification submodule, loss_src, loss_tgt and loss_shr, are calculated as follows (a code sketch of these cost functions follows this claim):

loss_src = − Σ_{t=1}^{4} y_t * log( ŷ_t )
loss_tgt = − Σ_{k=1}^{4} y'_k * log( ŷ'_k )
loss_shr = − Σ_{l=1}^{2} y''_l * log( ŷ''_l )

where Y = (y_1, y_2, y_3, y_4) denotes the real lexeme label of a character in the source domain, y_t ∈ {0, 1}, y_t = 1 indicating that the character label is the t-th lexeme label and y_t = 0 that it is not, and ŷ_t denotes the probability, output by the model, that the character label belongs to the t-th lexeme label; Y' = (y'_1, y'_2, y'_3, y'_4) denotes the real lexeme label of a character in the target domain, y'_k ∈ {0, 1}, y'_k = 1 indicating that the character label is the k-th lexeme label and y'_k = 0 that it is not, and ŷ'_k denotes the probability, output by the model, that the character label belongs to the k-th lexeme label; Y'' = (y''_1, y''_2) denotes the real domain label of an input sample, y''_l ∈ {0, 1}, with 1 indicating that the sample comes from the source domain and 0 that it comes from the target domain, and ŷ''_l denotes the probability, output by the model, that the sample belongs to the l-th domain;
step S4.3: performing adversarial training with the source-domain lexeme labeling submodule, the target-domain lexeme labeling submodule and the text classification submodule, realizing cross-domain Chinese word segmentation from the source domain to the target domain, with the total cost function:

loss = loss_src + β_3 * loss_tgt + β_4 * loss_shr

where β_3 and β_4 denote the weights of loss_tgt and loss_shr respectively in the total cost function;
step S4.4: the model is trained until the variation of loss is less than a predetermined value.
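A minimal PyTorch sketch of the cost functions of steps S4.2 and S4.3 follows, assuming the standard cross-entropy reading of the formulas above; the helper name, the β defaults and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(src_logits, src_labels, tgt_logits, tgt_labels,
                       dom_logits, dom_labels, beta3=1.0, beta4=1.0):
    """loss_src / loss_tgt over the 4 lexeme classes (B, M, E, S), loss_shr
    over the 2 domain classes, combined as in step S4.3."""
    loss_src = F.cross_entropy(src_logits, src_labels)  # 4-way lexeme tags
    loss_tgt = F.cross_entropy(tgt_logits, tgt_labels)  # 4-way lexeme tags
    loss_shr = F.cross_entropy(dom_logits, dom_labels)  # 2-way domain tag
    return loss_src + beta3 * loss_tgt + beta4 * loss_shr

# Example with random tensors: logits (batch, classes), labels (batch,).
loss = adversarial_losses(torch.randn(8, 4), torch.randint(0, 4, (8,)),
                          torch.randn(8, 4), torch.randint(0, 4, (8,)),
                          torch.randn(16, 2), torch.randint(0, 2, (16,)))
```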
CN202110463683.4A 2021-04-26 2021-04-26 Cross-domain Chinese word segmentation system and method based on new word discovery Expired - Fee Related CN113076750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110463683.4A CN113076750B (en) 2021-04-26 2021-04-26 Cross-domain Chinese word segmentation system and method based on new word discovery

Publications (2)

Publication Number Publication Date
CN113076750A CN113076750A (en) 2021-07-06
CN113076750B true CN113076750B (en) 2022-12-16

Family

ID=76618905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110463683.4A Expired - Fee Related CN113076750B (en) 2021-04-26 2021-04-26 Cross-domain Chinese word segmentation system and method based on new word discovery

Country Status (1)

Country Link
CN (1) CN113076750B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196980A (en) * 2019-06-05 2019-09-03 北京邮电大学 A kind of field migration based on convolutional network in Chinese word segmentation task

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001043221A (en) * 1999-07-29 2001-02-16 Matsushita Electric Ind Co Ltd Chinese word dividing device
CN108509425B (en) * 2018-04-10 2021-08-24 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
US10915792B2 (en) * 2018-09-06 2021-02-09 Nec Corporation Domain adaptation for instance detection and segmentation
CN110008338B (en) * 2019-03-04 2021-01-19 华南理工大学 E-commerce evaluation emotion analysis method integrating GAN and transfer learning
CN110377686B (en) * 2019-07-04 2021-09-17 浙江大学 Address information feature extraction method based on deep neural network model
CN111507103B (en) * 2020-03-09 2020-12-29 杭州电子科技大学 Self-training neural network word segmentation model using partial label set
CN111950274A (en) * 2020-07-31 2020-11-17 中国工商银行股份有限公司 Chinese word segmentation method and device for linguistic data in professional field

Also Published As

Publication number Publication date
CN113076750A (en) 2021-07-06

Similar Documents

Publication Publication Date Title
CN110135457B (en) Event trigger word extraction method and system based on self-encoder fusion document information
Cao et al. A joint model for word embedding and word morphology
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN109086267B (en) Chinese word segmentation method based on deep learning
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
CN106886580B (en) Image emotion polarity analysis method based on deep learning
CN108984526A (en) A kind of document subject matter vector abstracting method based on deep learning
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN108363691B (en) Domain term recognition system and method for power 95598 work order
CN111967267B (en) XLNET-based news text region extraction method and system
CN113505200A (en) Sentence-level Chinese event detection method combining document key information
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Ye et al. Improving cross-domain Chinese word segmentation with word embeddings
CN111159405B (en) Irony detection method based on background knowledge
CN115392254A (en) Interpretable cognitive prediction and discrimination method and system based on target task
CN112528653A (en) Short text entity identification method and system
CN115186670B (en) Method and system for identifying domain named entities based on active learning
CN112015903A (en) Question duplication judging method and device, storage medium and computer equipment
CN113076750B (en) Cross-domain Chinese word segmentation system and method based on new word discovery
CN114970537B (en) Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy
Ahmed et al. Resource-size matters: Improving neural named entity recognition with optimized large corpora
CN115878800A (en) Double-graph neural network fusing co-occurrence graph and dependency graph and construction method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221216