CN113723088A - Natural language processing method, natural language processing device, text processing method, text processing equipment and medium - Google Patents


Info

Publication number: CN113723088A
Application number: CN202010447042.5A
Authority: CN (China)
Prior art keywords: data, labeled, target, target field, text
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 丁宁, 龙定坤, 徐光伟, 王潇斌, 谢朋峻
Current assignee: Alibaba Group Holding Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd; priority to CN202010447042.5A; publication of CN113723088A (the priority date is an assumption and is not a legal conclusion)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure provide a natural language processing method, a natural language processing apparatus, a text processing method, a device, and a medium. The natural language processing method includes: acquiring labeled source-domain data and unlabeled target-domain data; machine-labeling the unlabeled target-domain data using the labeled source-domain data; and inputting the labeled source-domain data and the machine-labeled target-domain data into an adversarial network to obtain optimized target-domain data. Because the adversarial network optimizes the target-domain data on the basis of both the labeled source-domain data and the machine-labeled target-domain data, the accuracy of word segmentation on the target-domain data is improved.

Description

Natural language processing method, natural language processing device, text processing method, text processing equipment and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a natural language processing method, an apparatus, a text processing method, a device, and a medium.
Background
In Chinese, there are no explicit delimiters between words. Therefore, if a machine needs to access information at the word level, the character sequence must first be re-segmented into a word sequence. Word segmentation has thus become an essential step in natural language processing: for example, to correctly recognize named entities such as "AAA city" and "BBB region" in the address "AAA city BBB region CCC path", correct word segmentation is a necessary preprocessing step.
At present, deep learning models achieve good results on supervised word segmentation within a single domain. However, the inventors found that the segmentation performance of otherwise strong algorithms drops sharply in cross-domain settings, exposing a serious problem. This is mainly because each domain has its own specific vocabulary, while manually labeled corpora are very limited and cannot cover every domain. The lack of labeled corpora means a model cannot recognize words it never saw during training: for example, a model trained on news corpora has difficulty recognizing the word "streptococcus", which appears almost only in medical corpora. Cross-domain settings can therefore significantly impair the performance of a model trained on a supervised corpus.
Disclosure of Invention
In order to solve the problems in the related art, embodiments of the present disclosure provide a natural language processing method, apparatus, text processing method, device, and medium.
In a first aspect, an embodiment of the present disclosure provides a natural language processing method, including:
acquiring labeled source-domain data and unlabeled target-domain data;
machine-labeling the unlabeled target-domain data using the labeled source-domain data;
and inputting the labeled source-domain data and the machine-labeled target-domain data into an adversarial network to obtain optimized target-domain data.
With reference to the first aspect, in a first implementation manner of the first aspect, machine-labeling the unlabeled target-domain data using the labeled source-domain data includes:
matching the labeled source-domain data against the unlabeled target-domain data to obtain source-domain-specific words, shared words, and target-domain data to be labeled;
building a shared vocabulary from the shared words;
building a target-domain-specific dictionary from the target-domain data to be labeled;
and obtaining machine-labeled target-domain data according to the shared vocabulary and the target-domain-specific dictionary.
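The machine-labeling step above can be sketched as dictionary matching: merge the shared vocabulary with the target-domain-specific dictionary, then segment the target text against the merged lexicon. The sketch below uses greedy forward maximum matching as a minimal stand-in; the function name and the matching strategy are illustrative assumptions, not the patent's exact algorithm.

```python
def machine_label(shared_vocab, target_dict, target_text):
    """Segment target_text by greedy forward maximum matching against the
    merged lexicon -- a simplified sketch of the machine-labeling step."""
    lexicon = set(shared_vocab) | set(target_dict)
    max_len = max((len(w) for w in lexicon), default=1)
    tokens, i = [], 0
    while i < len(target_text):
        # try the longest dictionary word first, fall back to one character
        for length in range(min(max_len, len(target_text) - i), 0, -1):
            candidate = target_text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens
```

A word found in the target-domain-specific dictionary (e.g. a medical term absent from the source corpus) is kept whole rather than split into characters, which is exactly the gap the machine labeling is meant to close.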
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, building the target-domain-specific dictionary from the target-domain data to be labeled includes:
identifying words in the target-domain data to be labeled by computing one or more of the following indicators:
a cohesion score, a freedom score, a word frequency, and a term frequency-inverse document frequency (TF-IDF) index.
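The patent does not give a formula for the frequency indicators; under the conventional reading of TF-IDF (treating the target corpus as a collection of documents of candidate segments, with a smoothed IDF — both choices are my assumptions), the last indicator can be sketched as:

```python
import math

def tf_idf(term, doc, docs):
    """Term frequency of `term` in `doc`, scaled by its inverse document
    frequency over `docs`. `doc` is a list of candidate segments and
    `docs` a list of such lists."""
    tf = doc.count(term) / max(len(doc), 1)
    df = sum(1 for d in docs if term in d)    # documents containing the term
    idf = math.log(len(docs) / (1 + df)) + 1  # smoothed IDF
    return tf * idf
```

A segment that is frequent in one target document but rare across the corpus scores high, which is the usual signal for a domain-specific word.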
With reference to the second implementation manner of the first aspect, in a third implementation manner of the first aspect, computing the cohesion score includes:
acquiring a first text segment from the target-domain data to be labeled;
acquiring a first probability that the first text segment appears as a whole in the target-domain data to be labeled;
splitting the first text segment into two sub-segments, and determining the first split that maximizes the product of the probabilities of the two sub-segments appearing in the target-domain data to be labeled; and
determining the target-domain-specific dictionary based on the ratio of the first probability to the product corresponding to the first split.
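Under one common reading of these steps — treating the corpus as a single character string and estimating probabilities by relative n-gram frequency, both assumptions of mine rather than the patent's — the cohesion score can be sketched as:

```python
def cohesion_score(segment, corpus):
    """Ratio of the probability of `segment` appearing as a whole to the
    largest probability product over all two-way splits of `segment`.
    High values suggest the characters belong together as one word."""
    def prob(s):
        windows = len(corpus) - len(s) + 1
        hits = sum(corpus[i:i + len(s)] == s for i in range(windows))
        return hits / windows if windows > 0 else 0.0
    p_whole = prob(segment)
    # the split maximizing the product of sub-segment probabilities
    best = max(prob(segment[:i]) * prob(segment[i:])
               for i in range(1, len(segment)))
    return p_whole / best if best > 0 else 0.0
```

Using the maximal product in the denominator makes the score conservative: a segment is cohesive only if it beats its most plausible decomposition.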
With reference to the second implementation manner of the first aspect, in a fourth implementation manner of the first aspect, computing the freedom score includes:
acquiring a second text segment from the target-domain data to be labeled;
determining the left-neighbor-set entropy and the right-neighbor-set entropy of the second text segment;
and determining the target-domain-specific dictionary based on the smaller of the left-neighbor-set entropy and the right-neighbor-set entropy.
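A minimal sketch of the freedom score, again treating the corpus as one character string (an assumption) and using Shannon entropy over the characters adjacent to each occurrence of the segment:

```python
import math
from collections import Counter

def freedom_score(segment, corpus):
    """The smaller of the left- and right-neighbor entropies of `segment`.
    A low value means the segment is glued to fixed context on one side
    and is unlikely to stand alone as a word."""
    k = len(segment)
    left, right = Counter(), Counter()
    for i in range(len(corpus) - k + 1):
        if corpus[i:i + k] == segment:
            if i > 0:
                left[corpus[i - 1]] += 1
            if i + k < len(corpus):
                right[corpus[i + k]] += 1
    def entropy(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log(c / total)
                    for c in counts.values()) if total else 0.0
    return min(entropy(left), entropy(right))
```

In the test corpus below, "ab" is preceded and followed by varied characters (high entropy on both sides), whereas "xa" is always followed by "b", so its freedom score collapses to zero.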
With reference to the first implementation manner of the first aspect, in a fifth implementation manner of the first aspect, building the target-domain-specific dictionary from the target-domain data to be labeled includes:
processing the target-domain data to be labeled with a preset word segmentation model to obtain the target-domain-specific dictionary.
With reference to the first aspect and any one of the first to fifth implementation manners of the first aspect, in a sixth implementation manner of the first aspect, the adversarial network includes a shared encoder, a discriminator, and a target-domain segmenter, and inputting the labeled source-domain data and the machine-labeled target-domain data into the adversarial network to obtain the optimized target-domain data includes:
encoding a third text segment in the labeled source-domain data to obtain a first hidden vector;
encoding a fourth text segment in the machine-labeled target-domain data to obtain a second hidden vector;
the shared encoder randomly selecting the first hidden vector or the second hidden vector as a first code and inputting it into the discriminator, so that the discriminator judges whether the first code is the first hidden vector or the second hidden vector;
and the target-domain segmenter obtaining optimized target-domain data according to the result of the discriminator.
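The adversarial setup above can be sketched with toy stand-ins in plain Python. The class names, dimensions, and the linear/logistic forms are illustrative assumptions — the patent does not fix the network architectures — and the sketch shows only the data flow of one round, not the training updates:

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

class SharedEncoder:
    """Toy stand-in for the shared encoder: a fixed linear map plus tanh."""
    def __init__(self, dim_in, dim_out):
        self.w = rand_matrix(dim_in, dim_out)
    def encode(self, x):
        return [math.tanh(sum(xi * wij for xi, wij in zip(x, col)))
                for col in zip(*self.w)]

class Discriminator:
    """Logistic scorer guessing whether a code came from the source
    domain (output near 1) or the machine-labeled target domain (near 0)."""
    def __init__(self, dim):
        self.w = [random.gauss(0, 1) for _ in range(dim)]
    def predict(self, h):
        z = sum(hi * wi for hi, wi in zip(h, self.w))
        return 1.0 / (1.0 + math.exp(-z))

# One adversarial round: encode a segment from each domain, randomly pick
# one code, and let the discriminator guess which domain produced it.
# In training, the encoder would be updated to fool the discriminator so
# that the two domains share a common representation.
encoder = SharedEncoder(8, 4)
discriminator = Discriminator(4)
source_embedding = [random.gauss(0, 1) for _ in range(8)]  # third text segment
target_embedding = [random.gauss(0, 1) for _ in range(8)]  # fourth text segment
h_source = encoder.encode(source_embedding)   # first hidden vector
h_target = encoder.encode(target_embedding)   # second hidden vector
first_code = random.choice([h_source, h_target])
guess = discriminator.predict(first_code)     # P(code is source-domain)
```

When the discriminator can no longer tell the two codes apart, the shared encoder has learned domain-invariant features, which is what lets the target-domain segmenter benefit from source-domain labels.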
With reference to the sixth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the target-domain segmenter includes a target-domain encoder and a first segmenter, and obtaining the optimized target-domain data according to the result of the discriminator includes:
acquiring a second code through the target-domain encoder, the second code being obtained from the second hidden vector;
and acquiring an optimized word segmentation result of the fourth text segment through the first segmenter based on the first code and the second code.
With reference to the sixth implementation manner of the first aspect, in an eighth implementation manner of the first aspect, the adversarial network further includes a source-domain encoder and a second segmenter, and the method further includes:
the source-domain encoder acquiring a third code, the third code being obtained from the first hidden vector;
and obtaining an optimized word segmentation result of the third text segment through the second segmenter based on the first code and the third code.
In a second aspect, an embodiment of the present disclosure provides a natural language processing apparatus, including:
an acquisition module configured to acquire labeled source-domain data and unlabeled target-domain data;
a labeling module configured to machine-label the unlabeled target-domain data using the labeled source-domain data;
and an input module configured to input the labeled source-domain data and the machine-labeled target-domain data into an adversarial network and acquire optimized target-domain data.
With reference to the second aspect, in a first implementation manner of the second aspect, the labeling module includes:
a matching sub-module configured to match the labeled source-domain data against the unlabeled target-domain data to obtain source-domain-specific words, shared words, and target-domain data to be labeled;
a shared-vocabulary acquisition sub-module configured to build a shared vocabulary from the shared words;
a target-domain-specific dictionary acquisition sub-module configured to build a target-domain-specific dictionary from the target-domain data to be labeled;
and a target-domain data acquisition sub-module configured to acquire the machine-labeled target-domain data according to the shared vocabulary and the target-domain-specific dictionary.
With reference to the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the target-domain-specific dictionary acquisition sub-module is further configured to:
identify words in the target-domain data to be labeled by computing one or more of the following indicators:
a cohesion score, a freedom score, a word frequency, and a TF-IDF index.
With reference to the second implementation manner of the second aspect, in a third implementation manner of the second aspect, the cohesion score is computed by:
acquiring a first text segment from the target-domain data to be labeled;
acquiring a first probability that the first text segment appears as a whole in the target-domain data to be labeled;
splitting the first text segment into two sub-segments, and determining the first split that maximizes the product of the probabilities of the two sub-segments appearing in the target-domain data to be labeled; and
determining the target-domain-specific dictionary based on the ratio of the first probability to the product corresponding to the first split.
With reference to the second implementation manner of the second aspect, in a fourth implementation manner of the second aspect, the freedom score is computed by:
acquiring a second text segment from the target-domain data to be labeled;
determining the left-neighbor-set entropy and the right-neighbor-set entropy of the second text segment;
and determining the target-domain-specific dictionary based on the smaller of the two entropies.
With reference to the first implementation manner of the second aspect, in a fifth implementation manner of the second aspect, the target-domain-specific dictionary acquisition sub-module is further configured to:
process the target-domain data to be labeled with a preset word segmentation model to obtain the target-domain-specific dictionary.
With reference to the second aspect or any one of the first to fifth implementation manners of the second aspect, in a sixth implementation manner of the second aspect, the adversarial network includes a shared encoder, a discriminator, and a target-domain segmenter, and the input module includes:
a first encoding sub-module configured to encode a third text segment in the labeled source-domain data to obtain a first hidden vector;
and a second encoding sub-module configured to encode a fourth text segment in the machine-labeled target-domain data to obtain a second hidden vector,
wherein the shared encoder randomly selects the first hidden vector or the second hidden vector as a first code and inputs it into the discriminator, so that the discriminator judges whether the first code is the first hidden vector or the second hidden vector;
and the target-domain segmenter acquires optimized target-domain data according to the result of the discriminator.
With reference to the sixth implementation manner of the second aspect, in a seventh implementation manner of the second aspect, the target-domain segmenter includes a target-domain encoder and a first segmenter, and the target-domain segmenter is configured to:
acquire a second code through the target-domain encoder, the second code being obtained from the second hidden vector;
and acquire an optimized word segmentation result of the fourth text segment through the first segmenter based on the first code and the second code.
With reference to the sixth implementation manner of the second aspect, in an eighth implementation manner of the second aspect, the adversarial network further includes a source-domain encoder and a second segmenter, wherein:
the source-domain encoder acquires a third code, the third code being obtained from the first hidden vector;
and based on the first code and the third code, the second segmenter obtains an optimized word segmentation result of the third text segment.
In a third aspect, an embodiment of the present disclosure provides a text processing method, including:
acquiring labeled first text data and unlabeled second text data;
machine-labeling the unlabeled second text data using the labeled first text data;
inputting the labeled first text data and the machine-labeled second text data into an adversarial network, and acquiring optimized second text data;
and outputting an optimized word segmentation result of the second text data based on the optimized second text data.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor, wherein:
the memory is configured to store one or more computer instructions, which are executed by the processor to implement the method according to the first aspect, any one of the first to eighth implementation manners of the first aspect, or the third aspect.
In a fifth aspect, an embodiment of the present disclosure provides a readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method according to the first aspect, any one of the first to eighth implementation manners of the first aspect, or the third aspect.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
According to the technical solution provided by the embodiments of the present disclosure, labeled source-domain data and unlabeled target-domain data are acquired; the unlabeled target-domain data is machine-labeled using the labeled source-domain data; and the labeled source-domain data and the machine-labeled target-domain data are input into an adversarial network to acquire optimized target-domain data. The adversarial network optimizes the target-domain data on the basis of both the labeled source-domain data and the machine-labeled target-domain data, improving the accuracy of word segmentation on the target-domain data.
According to the technical solution provided by the embodiments of the present disclosure, machine-labeling the unlabeled target-domain data using the labeled source-domain data includes: matching the labeled source-domain data against the unlabeled target-domain data to obtain source-domain-specific words, shared words, and target-domain data to be labeled; building a shared vocabulary from the shared words; building a target-domain-specific dictionary from the target-domain data to be labeled; and obtaining machine-labeled target-domain data according to the shared vocabulary and the target-domain-specific dictionary. Machine-labeled target-domain data can thus be obtained for target-domain data that lacks manual labels, improving the accuracy of word segmentation on the target-domain data.
According to the technical solution provided by the embodiments of the present disclosure, building the target-domain-specific dictionary from the target-domain data to be labeled includes identifying words in that data by computing one or more of the following indicators: a cohesion score, a freedom score, a word frequency, and a TF-IDF index. Machine-labeled target-domain data can thus be obtained for target-domain data that lacks manual labels, improving the accuracy of word segmentation on the target-domain data.
According to the technical solution provided by the embodiments of the present disclosure, the cohesion score is computed by: acquiring a first text segment from the target-domain data to be labeled; acquiring a first probability that the first text segment appears as a whole in that data; splitting the first text segment into two sub-segments and determining the first split that maximizes the product of the probabilities of the two sub-segments appearing in that data; and determining the target-domain-specific dictionary based on the ratio of the first probability to the product corresponding to the first split. Machine-labeled target-domain data can thus be obtained for target-domain data that lacks manual labels, improving the accuracy of word segmentation on the target-domain data.
According to the technical solution provided by the embodiments of the present disclosure, the freedom score is computed by: acquiring a second text segment from the target-domain data to be labeled; determining the left-neighbor-set entropy and the right-neighbor-set entropy of the second text segment; and determining the target-domain-specific dictionary based on the smaller of the two entropies. The adversarial network can then optimize the target-domain data on the basis of the labeled source-domain data and the machine-labeled target-domain data, improving the accuracy of word segmentation on the target-domain data.
According to the technical solution provided by the embodiments of the present disclosure, building the target-domain-specific dictionary from the target-domain data to be labeled includes processing that data with a preset word segmentation model. The adversarial network can then optimize the target-domain data on the basis of the labeled source-domain data and the machine-labeled target-domain data, improving the accuracy of word segmentation on the target-domain data.
According to the technical solution provided by the embodiments of the present disclosure, the adversarial network includes a shared encoder, a discriminator, and a target-domain segmenter, and inputting the labeled source-domain data and the machine-labeled target-domain data into the adversarial network to obtain the optimized target-domain data includes: encoding a third text segment in the labeled source-domain data to obtain a first hidden vector; encoding a fourth text segment in the machine-labeled target-domain data to obtain a second hidden vector; the shared encoder randomly selecting the first or second hidden vector as a first code and inputting it into the discriminator, so that the discriminator judges which of the two it is; and the target-domain segmenter obtaining optimized target-domain data according to the result of the discriminator. The adversarial network can thus optimize the target-domain data on the basis of the labeled source-domain data and the machine-labeled target-domain data, improving the accuracy of word segmentation on the target-domain data.
According to the technical solution provided by the embodiments of the present disclosure, the target-domain segmenter includes a target-domain encoder and a first segmenter, and obtaining the optimized target-domain data according to the result of the discriminator includes: acquiring a second code through the target-domain encoder, the second code being obtained from the second hidden vector; and acquiring an optimized word segmentation result of the fourth text segment through the first segmenter based on the first code and the second code, improving the accuracy of word segmentation on the target-domain data.
According to the technical solution provided by the embodiments of the present disclosure, the adversarial network further includes a source-domain encoder and a second segmenter: the source-domain encoder acquires a third code obtained from the first hidden vector, and the second segmenter obtains an optimized word segmentation result of the third text segment based on the first code and the third code, improving the accuracy of word segmentation on the source-domain data.
According to the technical scheme provided by the embodiment of the disclosure, the acquisition module is configured to acquire labeled source field data and unlabeled target field data; the labeling module is configured to perform machine labeling on the label-free target domain data by using the labeled source domain data; the input module is configured to input the labeled source field data and the machine labeled target field data to the confrontation network, acquire optimized target field data, and optimize the target field data by using the confrontation network based on the labeled source field data and the machine labeled target field data, so as to improve the accuracy of word segmentation of the target field data.
According to the technical scheme provided by the embodiment of the disclosure, the marking module comprises: the matching sub-module is configured to match the labeled source field data with the unlabeled target field data to obtain source field specific words, shared words and target field data to be labeled; the shared word list acquisition sub-module is configured to acquire a shared word list by using the shared words; the target field special dictionary word acquisition sub-module is configured to acquire a target field special dictionary by using the target field data to be labeled; and the target field data acquisition sub-module is configured to acquire the target field data labeled by the machine according to the shared vocabulary and the specific dictionary of the target field, and can acquire the target field data labeled by the machine aiming at the target field data without artificial labeling, so that the word segmentation accuracy of the target field data is improved.
According to the technical scheme provided by the embodiment of the present disclosure, the unique dictionary vocabulary acquiring submodule is further configured to: identifying words in the data to be labeled in the target field by calculating one or more of the following indexes: the aggregate score, the free score, the word frequency and the word frequency-inverse text frequency index can be used for acquiring the target field data labeled by a machine aiming at the target field data without artificial labeling, so that the accuracy of word segmentation of the target field data is improved.
According to the technical scheme provided by the embodiment of the disclosure, the calculation of the condensation fraction is carried out in the following way, including: acquiring a first text segment from the data to be labeled in the target field; acquiring a first probability of the first text segment appearing in the data to be labeled in the target field as a whole; dividing the first text segment into two sub-segments, and determining a first division mode which enables the product of the probabilities of the two sub-segments appearing in the data to be labeled in the target field to be the largest; and determining the dictionary specific to the target field based on the ratio of the first probability to the product corresponding to the first segmentation mode, so that target field data labeled by a machine can be acquired aiming at the target field data without manual labeling, and the accuracy of word segmentation of the target field data is improved.
According to the technical scheme provided by the embodiment of the disclosure, the calculation is carried out through the free fraction in the following way, including: acquiring a second text segment from the data to be labeled in the target field; determining left neighbor set entropy and right neighbor set entropy of the second text segment; and determining the target field specific dictionary based on the smaller of the left neighbor set entropy and the right neighbor set entropy, and optimizing target field data by using a confrontation network based on labeled source field data and machine labeled target field data to improve the accuracy of word segmentation of the target field data.
According to the technical scheme provided by the embodiment of the disclosure, the target field specific dictionary word obtaining submodule is further configured to: the target field data to be labeled is processed through a preset word segmentation model, a special dictionary of the target field is obtained, the target field data can be optimized through a countermeasure network based on labeled source field data and machine labeled target field data, and the accuracy of word segmentation of the target field data is improved.
According to the technical scheme provided by the embodiment of the disclosure, the countermeasure network comprises a shared encoder, a discriminator and a target field participler, and the input module comprises: the first coding submodule is configured to code the third text segment in the labeled source field data to obtain a first implicit vector; the second coding submodule is configured to code a fourth text segment in the target field data labeled by the machine to obtain a second implicit vector, wherein the shared encoder randomly selects a first implicit vector or a second implicit vector as a first code and inputs the first implicit vector or the second implicit vector into the discriminator so that the discriminator judges whether the first code is the first implicit vector or the second implicit vector; the target field word segmentation device obtains optimized target field data according to the result of the discriminator, and can optimize the target field data by using a confrontation network based on labeled source field data and machine labeled target field data, so that the accuracy of word segmentation of the target field data is improved.
According to the technical scheme provided by the embodiment of the disclosure, the target domain participler comprises a target domain encoder and a first participler, and the target domain participler is configured to: acquiring a second code through the target domain encoder, wherein the second code is acquired according to the second implicit vector; based on the first code and the second code, the optimized word segmentation result of the fourth text segment is obtained through the first word segmentation device, and the target field data can be optimized by using a confrontation network based on the labeled source field data and the machine labeled target field data, so that the word segmentation accuracy of the target field data is improved.
According to the technical solution provided by the embodiments of the present disclosure, the adversarial network further comprises a source-domain encoder and a second segmenter, wherein: the source-domain encoder acquires a third code, the third code being acquired according to the first hidden vector; and, based on the first code and the third code, the second segmenter obtains the optimized word segmentation result of the third text segment. The target-domain data can thus be optimized by the adversarial network based on the labeled source-domain data and the machine-labeled target-domain data, improving the word segmentation accuracy on the source-domain data.
According to the technical solution provided by the embodiments of the present disclosure, labeled first text data and unlabeled second text data are acquired; machine labeling is performed on the unlabeled second text data by using the labeled first text data; the labeled first text data and the machine-labeled second text data are input into an adversarial network to acquire optimized second text data; and an optimized word segmentation result of the second text data is output based on the optimized second text data. Optimizing the word segmentation result of the second text data with the adversarial network, based on the labeled first text data and the machine-labeled second text data, improves the word segmentation accuracy on the second text data.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when read in conjunction with the accompanying drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a natural language processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a data processing flow in a natural language processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a manner of acquiring machine-annotated target domain data in a natural language processing method according to an embodiment of the disclosure;
FIG. 4 shows a schematic diagram of a countermeasure network in accordance with an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a natural language processing device according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of an electronic device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a computer system suitable for implementing a natural language processing method or a text processing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having" are intended to indicate the presence of the features, numbers, steps, actions, components, parts, or combinations thereof disclosed in the present specification, and are not intended to preclude the possibility that one or more other features, numbers, steps, actions, components, parts, or combinations thereof are present or added.
It should be further noted that the embodiments of the present disclosure and the features in the embodiments may be combined with each other in the absence of conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
According to the technical solution provided by the embodiments of the present disclosure, labeled source-domain data and unlabeled target-domain data are acquired; machine labeling is performed on the unlabeled target-domain data by using the labeled source-domain data; and the labeled source-domain data and the machine-labeled target-domain data are input into an adversarial network to acquire optimized target-domain data. Optimizing the target-domain data with the adversarial network, based on the labeled source-domain data and the machine-labeled target-domain data, improves the word segmentation accuracy on the target-domain data. The adversarial network denoises the data automatically labeled by the machine without requiring any supervision information for the target domain (such as a domain dictionary or labeling information), so that the semantic information of the source and target domains is effectively utilized, the impact of insufficient dictionary coverage is reduced, and the extensibility of the model to new domains is greatly improved.
Fig. 1 illustrates a flow diagram of a natural language processing method according to an embodiment of the present disclosure. As shown in fig. 1, the natural language processing method includes the following steps S110, S120, and S130:
in step S110, labeled source domain data and unlabeled target domain data are acquired.
In step S120, machine labeling is performed on the unlabeled target domain data by using the labeled source domain data.
In step S130, the labeled source domain data and the machine-labeled target domain data are input into the adversarial network, and the optimized target domain data is obtained.
In one embodiment of the present disclosure, the source domain data and the target domain data belong to a natural language and may be data in text form. In one embodiment of the present disclosure, the labeled source domain data comprises labeling information for the source domain data. In one embodiment of the present disclosure, the optimized target domain data may include labeling information for the target domain data. For example, labeling may refer to tagging each character with a label that indicates its position within a word: the beginning, middle, and end of a word may be represented by B, M, and E, respectively, with S representing a character that forms a word by itself.
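The B/M/E/S labeling scheme described above can be sketched as follows; this is an illustrative Python snippet (not part of the disclosure) that converts an already-segmented sentence into per-character labels:

```python
def bmes_tags(words):
    """Convert an already-segmented sentence (a list of words) into the
    per-character B/M/E/S labels described above."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                     # single-character word
        else:
            tags.append("B")                     # beginning of the word
            tags.extend("M" * (len(word) - 2))   # middle characters, if any
            tags.append("E")                     # end of the word
    return "".join(tags)

# Illustrative segmented sentence: 研究/发现/了/溶菌酶 ("research found lysozyme")
print(bmes_tags(["研究", "发现", "了", "溶菌酶"]))  # → BEBESBME
```

A labeled corpus in this scheme is simply each sentence paired with such a tag string, which is also the output format a segmentation model predicts.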
In one embodiment of the present disclosure, the source domain and the target domain are two different domains. There may be some domain-independent shared words between them, such as "is", "studies", "findings", and so on. There are also domain-specific words: words that are common in the source domain are very rare or even absent in the target domain, and words that are common in the target domain are very rare or even absent in the source domain. For example, the source domain may be the news domain and the target domain the medical domain; the words common in each of the two domains differ greatly. A model trained on source domain text data in the prior art is therefore difficult to apply to target domain text data.
In one embodiment of the present disclosure, the labeled source-domain text data is text data carrying labels; for example, a labeled sentence may carry label information such as "BMEBEBEBEBEBEBE", and this label information constitutes the word segmentation result of the labeled source-domain text data. The unlabeled target-domain data is text data to be segmented that carries no such labels.
Fig. 2 is a schematic diagram illustrating a data processing flow in a natural language processing method according to an embodiment of the present disclosure.
As shown in fig. 2, in one embodiment of the present disclosure, the natural language processing method uses two data sets. One data set includes labeled (manually annotated) source-domain text data, which comprises the source-domain text data and its word segmentation result. The other data set includes unlabeled target-domain text data. The goal of model training is to perform word segmentation on the target-domain data. First, the source-domain text data and the target-domain text data are passed through a processing module configured to execute step S120 described above, which constructs a machine-labeled word segmentation result for the target-domain text sequences by using source-domain knowledge and statistical information. Then, the labeled source-domain text data and the machine-labeled target-domain text data are passed together through an adversarial network for denoising and information reuse, yielding the optimized target-domain text data, i.e., the optimized word segmentation result of the target-domain text data. The optimized word segmentation result may be output in various ways: for example, the word segmentation labels corresponding one-to-one to the characters of the target-domain text data may be output directly, or separators may be inserted into the target-domain text data according to the optimized result and the separated text output. Throughout the process, no manually prepared dictionary or labeling information for the target domain is needed.
Fig. 3 is a schematic diagram illustrating a manner of acquiring machine-annotated target domain data in a natural language processing method according to an embodiment of the present disclosure.
As shown in fig. 3, the labeled source-domain data and the unlabeled target-domain data are matched to obtain source-domain-specific words, shared words, and target-domain data to be labeled. Although target-domain-specific words are shown in fig. 3, this is only to indicate that, in the matching result, the part of the unlabeled target-domain data other than the shared words consists of target-domain-specific words. At this point, however, the target-domain-specific words have not actually been marked, so "target-domain-specific words" here refers to the target-domain data to be labeled other than the shared words. Mining the shared words yields a shared word vocabulary. New-word mining is performed on the target-domain data to be labeled to obtain a target-domain-specific dictionary; this dictionary is acquired automatically. The machine-labeled target-domain data is then acquired according to the shared vocabulary and the target-domain-specific dictionary.
In one embodiment of the present disclosure, the source domain specific word refers to a word that is specific in the source domain, such as a specific word in the aforementioned news domain, e.g., "reporter", "report", and so on. Shared words such as the aforementioned "is," "studies," "findings," and the like. The target domain data to be labeled may be, for example, text of the aforementioned medical domain. By matching the labeled source field data with the unlabeled target field data, the source field specific words, the shared words and the target field data to be labeled can be identified from the matching result because the source field specific words and the shared words are labeled.
In one embodiment of the present disclosure, step S120 may include: matching the labeled source field data with the unlabeled target field data to obtain source field specific words, shared words and target field data to be labeled; obtaining a shared word vocabulary by using the shared words; acquiring a specific dictionary of the target field by using the data to be labeled of the target field; and acquiring target field data labeled by a machine according to the shared word list and the target field specific dictionary.
According to the technical solution provided by the embodiments of the present disclosure, performing machine labeling on the unlabeled target-domain data by using the labeled source-domain data includes: matching the labeled source-domain data with the unlabeled target-domain data to obtain source-domain-specific words, shared words, and target-domain data to be labeled; obtaining a shared word vocabulary by using the shared words; acquiring a target-domain-specific dictionary by using the target-domain data to be labeled; and acquiring machine-labeled target-domain data according to the shared word vocabulary and the target-domain-specific dictionary. Machine-labeled target-domain data can thus be acquired for target-domain data without manual labeling, improving the word segmentation accuracy on the target-domain data.
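The matching step above can be illustrated with a minimal sketch; the word lists and the membership test used here (simple substring matching against the raw target text) are illustrative assumptions, not the disclosure's exact procedure:

```python
def match_vocabularies(source_words, target_text):
    """Split the labeled source-domain vocabulary into words shared with the
    unlabeled target-domain text and words specific to the source domain."""
    shared, source_specific = set(), set()
    for word in source_words:
        (shared if word in target_text else source_specific).add(word)
    return shared, source_specific

# Toy example: a news-domain (source) vocabulary matched against an unlabeled
# medical-domain (target) sentence; all strings are illustrative.
source_vocab = {"研究", "发现", "记者", "报道"}   # study, find, reporter, report
target_text = "研究发现溶菌酶具有抗菌作用"        # "studies find lysozyme is antibacterial"
shared, specific = match_vocabularies(source_vocab, target_text)
print(sorted(shared))    # ['发现', '研究']
print(sorted(specific))  # ['报道', '记者']
```

The portions of the target text not covered by shared words (here, the specific term 溶菌酶 "lysozyme") remain as target-domain data to be labeled via new-word mining.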
In an embodiment of the present disclosure, a target-domain specific dictionary may be obtained by processing the target-domain text data based on a statistical manner, that is, labeling of a target-domain specific word is completed. In one embodiment of the present disclosure, the tagging of the shared word in the target domain text data may be achieved by processing the shared word through a predetermined word segmentation model.
In one embodiment of the present disclosure, the predetermined word segmentation model may be, for example, a GCNN-CRF model (a gated convolutional neural network with a conditional random field classification layer). This is a supervised Chinese word segmentation model. For an input sentence s = {c_1, c_2, …, c_n}, the model assigns each character c_i a label from the set {B, M, E, S}: B indicates that the character begins a word, M that it is in the middle of a word, E that it ends a word, and S that it forms a word by itself. Each character c_i is first converted into a character vector e_i, so that the sentence is represented as e_s = {e_1, e_2, …, e_n}. The GCNN then encodes e_s into a sentence-level hidden vector. Finally, the hidden vector is fed into the CRF (conditional random field) classification layer, which computes a label score for each character, obtains a probability distribution over the label set, and selects the label with the highest probability as the character's label. Applying a GCNN-CRF model trained on training data from any one or more other domains to the target-domain text data can effectively identify domain-independent words in the target-domain text data, which may also be referred to as shared words, e.g., "find", "have", "study", "be", "of", etc.
In an embodiment of the present disclosure, the specific word of the target field text data is obtained by processing the target field text data based on a statistical manner, which may be based on one or more statistical indexes, so as to determine whether a segment of text constitutes a word.
In an embodiment of the present disclosure, acquiring the target-domain-specific dictionary by using the target-domain data to be labeled includes: identifying words in the target-domain data to be labeled by calculating one or more of the following indexes: a condensation score, a free score, a word frequency, and a word frequency-inverse text frequency index.
In one embodiment of the present disclosure, the condensation score may be used to measure the degree of cohesion of a text segment. In one embodiment of the present disclosure, the free score may be used as a feature value measuring whether the text segment can be used flexibly in context. In one embodiment of the present disclosure, the word frequency or the word frequency-inverse text frequency index may be used as an index of whether the text segment constitutes a word. In one embodiment of the present disclosure, at least one of the condensation score, the free score, the word frequency, and the word frequency-inverse text frequency index may be used as the aforementioned statistical index.
According to the technical solution provided by the embodiments of the present disclosure, acquiring the target-domain-specific dictionary by using the target-domain data to be labeled includes: identifying words in the target-domain data to be labeled by calculating one or more of the following indexes: the condensation score, the free score, the word frequency, and the word frequency-inverse text frequency index. Machine-labeled target-domain data can thus be acquired for target-domain data without manual labeling, improving the word segmentation accuracy on the target-domain data.
In one embodiment of the present disclosure, the condensation score is calculated by: acquiring a first text segment from the target-domain data to be labeled; acquiring a first probability that the first text segment appears as a whole in the target-domain data to be labeled; dividing the first text segment into two sub-segments and determining a first segmentation manner, i.e., the division that maximizes the product of the probabilities of the two sub-segments appearing in the target-domain data to be labeled; and determining the target-domain-specific dictionary based on the ratio of the first probability to the product corresponding to the first segmentation manner.
In an embodiment of the present disclosure, the probability that a first text segment in the target-domain data to be labeled appears as a whole in the target-domain text may be acquired; this may be called the first probability. The first text segment may also be divided into two sub-segments in several ways; in particular, all possible two-way splits can be traversed. For each split, the probabilities of the two sub-segments appearing in the target-domain text data are determined, and their product is taken as a feature value. Since there may be multiple splits, the embodiment of the present disclosure selects the split with the largest product, i.e., the first segmentation manner. The ratio of the first probability to the product corresponding to the first segmentation manner is taken as the condensation score. If a fixed text segment occurs frequently (its overall probability of occurrence is large), the likelihood that it is a word is large; meanwhile, among all splits, the first segmentation manner is the one most likely to occur, so this ratio measures the degree of cohesion of the text segment. For example, a three-character fragment such as "lysozyme" can be split after its first character or after its second character; for each split, the product of the occurrence probabilities of the two resulting sub-fragments is computed. A larger product indicates a higher probability that the split occurs in the target-domain text, and thus that the fragment is more likely to be two separate words divided according to that split.
Since the probability that a given text segment appears as a whole in the same target-domain text is fixed, the larger the product corresponding to the first segmentation manner, the smaller the condensation score, and the lower the likelihood that the text segment is an independent word. In an embodiment of the present disclosure, a threshold may be set on the condensation score to judge the degree of cohesion of the text segment.
According to the technical solution provided by the embodiments of the present disclosure, the condensation score is calculated by: acquiring a first text segment from the target-domain data to be labeled; acquiring a first probability that the first text segment appears as a whole in the target-domain data to be labeled; dividing the first text segment into two sub-segments and determining a first segmentation manner that maximizes the product of the probabilities of the two sub-segments appearing in the target-domain data to be labeled; and determining the target-domain-specific dictionary based on the ratio of the first probability to the product corresponding to the first segmentation manner. Machine-labeled target-domain data can thus be acquired for target-domain data without manual labeling, improving the word segmentation accuracy on the target-domain data.
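A minimal sketch of the condensation score: probabilities are estimated from raw (non-overlapping) substring counts in a toy corpus, which is an illustrative simplification rather than the disclosure's exact estimator:

```python
def cohesion_score(fragment, corpus):
    """Condensation score: the fragment's whole-string probability divided by
    the largest probability product over all two-way splits."""
    total = len(corpus)

    def prob(s):
        # crude probability estimate from raw substring counts
        return corpus.count(s) / total

    p_whole = prob(fragment)
    best = max(prob(fragment[:i]) * prob(fragment[i:])
               for i in range(1, len(fragment)))
    return p_whole / best if best > 0 else float("inf")

# Toy corpus: 溶菌酶 ("lysozyme") recurs as a unit, so it scores higher than a
# fragment straddling a word boundary, such as 酶研.
corpus = "溶菌酶研究溶菌酶应用研究溶菌酶的研究"
print(round(cohesion_score("溶菌酶", corpus), 6))  # 6.0
print(round(cohesion_score("酶研", corpus), 6))    # 2.0
```

A threshold on this ratio, as described above, separates cohesive word candidates from incidental character sequences.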
In one embodiment of the present disclosure, the free score is calculated by: acquiring a second text segment from the target-domain data to be labeled; determining the left-neighbor-set entropy and the right-neighbor-set entropy of the second text segment; and determining the target-domain-specific dictionary based on the smaller of the left-neighbor-set entropy and the right-neighbor-set entropy.
In an embodiment of the present disclosure, for a second text segment acquired from the target-domain data to be labeled, the entropies of its left-neighbor and right-neighbor character sets in the target-domain text data may be determined, and the smaller of the two taken as the free score. A text segment with a high condensation score is not necessarily a word — it may also be a fixed combination of two or more words — whereas a true word should be flexibly usable in context. The free score therefore serves as a feature value measuring whether the text segment can be used flexibly; whether the segment is an independent, flexibly usable word can be judged by comparison with a threshold.
The entropy ES of a neighbor character set can be determined by:

ES = -Σ_{j=1}^{k} P(j) · log P(j)
where k is the number of distinct elements in the character set and j ranges over those elements. As a simple example, consider the tongue-twister text "eat grapes without spitting out grape skins; don't eat grapes yet spit out grape skins". The word "grape" appears four times; its left-adjacent character set is {eat, spit, eat, spit} and its right-adjacent character set is {without, skin, yet, skin}. According to the formula, the entropy of the left-adjacent characters of "grape" is -(1/2)·log(1/2) - (1/2)·log(1/2) ≈ 0.693, and the entropy of its right-adjacent characters is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4) ≈ 1.04. The smaller of the two, the left-neighbor entropy 0.693, can be taken as the free score of "grape". If the free score is large enough, the text fragment can be used flexibly, i.e., it connects to a variety of different characters. If instead a fragment such as "grape skin" (in the source text, the second character of "grape" together with "skin") is taken as a text segment, its left-adjacent character set is {grape, grape}, its left-neighbor entropy is -1·log(1) = 0, and the fragment is not free: it can only follow a fixed character and cannot be an independent word.
According to the technical solution provided by the embodiments of the present disclosure, the free score is calculated by: acquiring a second text segment from the target-domain data to be labeled; determining the left-neighbor-set entropy and the right-neighbor-set entropy of the second text segment; and determining the target-domain-specific dictionary based on the smaller of the left-neighbor-set entropy and the right-neighbor-set entropy. The target-domain data can thus be optimized by the adversarial network based on the labeled source-domain data and the machine-labeled target-domain data, improving the word segmentation accuracy on the target-domain data.
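The free score can be sketched as follows; the snippet reproduces the left-neighbor entropy ≈ 0.693 from the grape example above (the example text is the Chinese tongue twister rendered in translation there):

```python
import math
from collections import Counter

def set_entropy(chars):
    """Information entropy of a multiset of neighboring characters."""
    counts = Counter(chars)
    n = len(chars)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def free_score(fragment, text):
    """The smaller of the left- and right-neighbor entropies of `fragment` in `text`."""
    left, right = [], []
    start = text.find(fragment)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])
        end = start + len(fragment)
        if end < len(text):
            right.append(text[end])
        start = text.find(fragment, start + 1)
    return min(set_entropy(left), set_entropy(right))

# The tongue twister from the example above, in the original Chinese.
text = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
print(round(free_score("葡萄", text), 3))  # 0.693 — "grape" is a free word
print(free_score("萄皮", text))            # 0.0 — always preceded by 葡, not free
```

A fragment whose free score is (near) zero is glued to a fixed neighbor and is discarded as a word candidate, exactly as the threshold comparison described above.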
In an embodiment of the present disclosure, acquiring the target-domain-specific dictionary by using the target-domain data to be labeled includes: processing the target-domain data to be labeled through a predetermined word segmentation model to obtain the target-domain-specific dictionary.
In one embodiment of the present disclosure, a predetermined word segmentation model may be utilized to perform new word mining on data to be labeled in a target field to obtain a dictionary specific to the target field.
According to the technical solution provided by the embodiments of the present disclosure, acquiring the target-domain-specific dictionary by using the target-domain data to be labeled includes: processing the target-domain data to be labeled through a predetermined word segmentation model to obtain the target-domain-specific dictionary. The target-domain data can thus be optimized by the adversarial network based on the labeled source-domain data and the machine-labeled target-domain data, improving the word segmentation accuracy on the target-domain data.
In one embodiment of the present disclosure, for target-domain data containing the sentence "scientific research on lysozyme", the specific word "lysozyme" may first be mined out in a statistics-based manner and labeled "BME"; the remainder is then segmented using the trained predetermined word segmentation model and labeled "SBEBE". In this way, automatic machine labeling can be achieved on the unlabeled target-domain text data.
In the embodiments of the present disclosure, the statistics-based manner is combined with the trained predetermined word segmentation model, so that the target-domain vocabulary — including both domain-specific words and shared words — can be mined accurately without any domain dictionary or labeling information. This effectively solves the prior-art problem that a model cannot accurately recognize words at test time because of "out-of-vocabulary words" absent from the training data.
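One plausible way to apply the mined dictionary when producing the machine labels is greedy forward maximum matching, sketched below; the disclosure does not prescribe this exact mechanism, and the dictionary and sentence are illustrative:

```python
def machine_label(sentence, dictionary, max_len=5):
    """Greedy forward-maximum-matching segmentation against a word dictionary,
    emitting one B/M/E/S label per character; characters covered by no
    dictionary word are labeled 'S'."""
    tags, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            if sentence[i:i + length] in dictionary:
                tags.append("B")
                tags.extend("M" * (length - 2))
                tags.append("E")
                i += length
                break
        else:
            tags.append("S")   # no dictionary word starts here
            i += 1
    return "".join(tags)

# 溶菌酶 ("lysozyme") comes from the mined target-domain-specific dictionary;
# 科学 ("science") and 研究 ("research") are shared words.
vocab = {"溶菌酶", "科学", "研究"}
print(machine_label("对溶菌酶的科学研究", vocab))  # → SBMESBEBE
```

Longest-match-first lookup ensures that a mined multi-character term like 溶菌酶 is kept whole instead of being broken into shared-word fragments.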
In an embodiment of the present disclosure, a text segment may be obtained from the data to be labeled in the target field, a word frequency or a word frequency-inverse text frequency index of the text segment is determined as feature information of the text segment, and a word segmentation result of the data in the target field is determined based on the feature information.
In an embodiment of the present disclosure, for a text segment in the target-domain data to be labeled, the word frequency or the word frequency-inverse text frequency index may be used to measure whether the text segment constitutes a word. Term frequency (TF) is the number of times the text segment appears in the target-domain text data; the inverse document frequency (IDF) may be obtained by dividing the total number of documents by the number of documents containing the text segment and taking the logarithm of the quotient. The two may be used independently or in combination. For example, the word frequency may be used alone: if a text segment is a word, it should appear frequently in the corpus, and a threshold on the frequency can serve as a reference index for deciding whether the segment forms a word. As another example, the importance of the text segment may be measured by the word frequency-inverse text frequency index ("TF-IDF", i.e., TF × IDF) and used to judge whether the segment is a target-domain-specific word; a threshold may likewise be set for comparison.
In the embodiment of the present disclosure, part or all of the aggregation score, the free score, the word frequency, and the word frequency-inverse text frequency index discussed above may be used as a basis for processing the target domain text data to obtain the word segmentation result, as needed. It will be appreciated by those skilled in the art in light of the teachings of the present disclosure that various thresholds may be set on the condensation score, the free score, the word frequency, and the word frequency-inverse text frequency index to obtain the word segmentation results.
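The word frequency and TF-IDF indexes can be sketched as follows; treating each document as a raw string and counting substrings is an illustrative simplification of the description above:

```python
import math

def tf_idf(fragment, target_doc, background_docs):
    """Term frequency of `fragment` in the target-domain text, weighted by the
    inverse document frequency over a document collection."""
    tf = target_doc.count(fragment)                  # word frequency (TF)
    docs = background_docs + [target_doc]
    df = sum(1 for doc in docs if fragment in doc)   # document frequency
    if df == 0:
        return 0.0                                   # fragment appears nowhere
    idf = math.log(len(docs) / df)                   # inverse document frequency
    return tf * idf

# Illustrative documents: the target text mentions 溶菌酶 ("lysozyme")
# repeatedly, while the background (source-domain) documents never do.
target = "溶菌酶研究表明溶菌酶可用于溶菌酶制剂"
background = ["记者研究报道", "研究发现"]
print(tf_idf("溶菌酶", target, background) > 0)  # True — target-specific word
print(tf_idf("研究", target, background))        # 0.0 — shared word, idf = 0
```

A shared word that appears in every document gets idf = log(1) = 0 and drops out, which is precisely why TF-IDF highlights target-domain-specific candidates.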
Fig. 4 shows a schematic diagram of an adversarial network according to an embodiment of the present disclosure. As shown in fig. 4, in an embodiment of the present disclosure in which the adversarial network includes a shared encoder, a discriminator, and a target-domain word segmenter, step S130 may include the following steps: encoding the third text segment in the labeled source-domain data to obtain a first hidden vector; encoding the fourth text segment in the machine-labeled target-domain data to obtain a second hidden vector; the shared encoder randomly selecting the first hidden vector or the second hidden vector as a first code and inputting it into the discriminator, so that the discriminator judges whether the first code is the first hidden vector or the second hidden vector; and the target-domain word segmenter acquiring optimized target-domain data according to the result of the discriminator.
According to the technical solution provided by the embodiments of the present disclosure, the adversarial network comprises a shared encoder, a discriminator, and a target-domain word segmenter, and inputting the labeled source-domain data and the machine-labeled target-domain data into the adversarial network to acquire the optimized target-domain data comprises the following steps: encoding the third text segment in the labeled source-domain data to obtain a first hidden vector; encoding the fourth text segment in the machine-labeled target-domain data to obtain a second hidden vector; the shared encoder randomly selecting the first hidden vector or the second hidden vector as a first code and inputting it into the discriminator, so that the discriminator judges whether the first code is the first hidden vector or the second hidden vector; and the target-domain word segmenter acquiring the optimized target-domain data according to the result of the discriminator. The target-domain data can thus be optimized by the adversarial network based on the labeled source-domain data and the machine-labeled target-domain data, improving the word segmentation accuracy on the target-domain data.
In an embodiment of the present disclosure, the input to the adversarial network comes from the source-domain data and the target-domain data: the source-domain data originally carries labeling information, i.e., the word segmentation result of the source-domain text data, while the target-domain text data is automatically labeled by the machine through the operation S120 described above, which yields the word segmentation result of the target-domain text data. The third text segment and the fourth text segment input into the adversarial network are extracted from the source-domain data and the target-domain data, respectively, for example at the granularity of a whole paragraph, a whole sentence, or a part of a sentence.
In one embodiment of the present disclosure, the shared encoder and the discriminator are trained as a pair of adversaries. The shared encoder may be, for example, a GCNN model, or another neural network model such as an RNN (recurrent neural network) or a Transformer language model. The discriminator is a binary classifier. The shared encoder randomly takes a text segment s (the third or fourth text segment) together with its word segmentation result from the source-domain or target-domain text data and encodes it into a hidden vector H (the first or second hidden vector) as the first code. The discriminator then judges from the first code whether it is the first hidden vector or the second hidden vector, i.e., whether the corresponding text segment is a third text segment from the source-domain data or a fourth text segment from the target-domain data. For example, the discriminator may output the probability that the input text segment comes from the source domain or the target domain. If the discriminator judges correctly, the parameters of the shared encoder are updated, in the hope that the discriminator can no longer judge correctly; if it judges incorrectly, the parameters of the discriminator are updated, in the hope that it can correctly discriminate the origin of the text segment. In this way, the shared encoder is trained to focus on domain-independent features, which facilitates correct word segmentation of the target-domain text data. During this adversarial training of the shared encoder and the discriminator, the source-domain and target-domain text data can be reused repeatedly, so that optimized target-domain data is obtained and the word segmentation performance of the model is improved.
In one embodiment of the present disclosure, before a text segment s enters the shared encoder, s needs to be converted into a coded sequence X (e.g., a sequence of word vectors), which is further encoded by the shared encoder to obtain H.
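As a minimal illustration of that conversion step, each character of s can be looked up in an embedding table to form X. The toy vocabulary and embedding width below are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {ch: i for i, ch in enumerate("abcde")}  # toy character vocabulary
emb = rng.normal(size=(len(vocab), 8))           # toy embedding table, width 8

def to_coded_sequence(segment):
    """Convert text segment s into the coded sequence X: one vector per character."""
    return emb[[vocab[ch] for ch in segment]]

X = to_coded_sequence("abc")                     # X has shape (3, 8)
```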
In one embodiment of the present disclosure, for a text segment s (denoted s_tgt) from the target domain text data, the trained shared encoder outputs the generated first code H to the target field word segmentation device, and an optimized word segmentation result is obtained under the combined action of the first code H and the word segmentation result of the text segment.
In an embodiment of the disclosure, the target domain segmenter includes a target domain encoder and a first segmenter, and obtaining optimized target domain data according to the result of the discriminator may include the following steps: acquiring a second code through the target domain encoder, wherein the second code is acquired according to the second implicit vector; and acquiring an optimized word segmentation result of the fourth text segment through a first word segmentation device based on the first code and the second code.
According to the technical scheme provided by the embodiment of the present disclosure, the target domain word segmentation device includes a target domain encoder and a first word segmentation device, and the obtaining of the optimized target domain data according to the result of the discriminator includes: acquiring a second code through the target domain encoder, wherein the second code is acquired according to the second implicit vector; based on the first code and the second code, the optimized word segmentation result of the fourth text segment is obtained through the first word segmentation device, and the target field data can be optimized by using a confrontation network based on the labeled source field data and the machine labeled target field data, so that the word segmentation accuracy of the target field data is improved.
In one embodiment of the present disclosure, the target domain encoder may be a GCNN model, but may also be another neural network model, such as an RNN or a Transformer language model. The first word segmentation device may be a CRF (conditional random field) model. A text segment s_tgt from the target domain text data is converted into a coded sequence X_t. The target domain encoder further encodes X_t into a second code H_t. The text segment s_tgt is also input to the shared encoder as s to generate the first code H. The first word segmentation device splices the second code H_t and the first code H into a vector h(x_t) and inputs it into the conditional random field model to obtain the optimized target domain data, that is, the updated segmentation result y_t.
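The splicing step, h(x_t) = [H_t; H] fed to a tag scorer, can be sketched as follows. The dimensions are illustrative, and the linear emission layer merely stands in for the CRF, which would additionally learn tag-transition scores and decode with Viterbi:

```python
import numpy as np

rng = np.random.default_rng(0)
TAGS = ["B", "M", "E", "S"]

H_t = rng.normal(size=(5, 8))   # second code from the target domain encoder
H = rng.normal(size=(5, 8))     # first code from the shared encoder

h_x = np.concatenate([H_t, H], axis=-1)      # h(x_t): per-character splice

W_emit = rng.normal(size=(16, len(TAGS)))    # stand-in for the CRF's emission layer
y_t = [TAGS[i] for i in (h_x @ W_emit).argmax(axis=-1)]  # updated tags y_t
```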
In one embodiment of the present disclosure, since the parameters of the shared encoder change during training, the first code H generated for the same text segment differs over time, and therefore the segmentation result output by the first word segmentation device may also differ. The results output by the first word segmentation device during training can be used to update the segmentation result in the target domain text data, and the updated segmentation result can in turn be used for further training. After training is finished, the segmentation result output by the first word segmentation device is the finally determined optimized segmentation result.
In one embodiment of the disclosure, the countermeasure network further includes a source domain encoder and a second tokenizer, and the natural language processing method may further include the steps of: the source domain encoder acquires a third code, and the third code is acquired according to the first implicit vector; and obtaining an optimized word segmentation result of the third text segment through a second word segmentation device based on the first code and the third code.
According to the technical scheme provided by the embodiment of the disclosure, the countermeasure network further comprises a source domain encoder and a second word splitter, and the method further comprises: the source domain encoder acquires a third code, and the third code is acquired according to the first implicit vector; based on the first code and the third code, the optimized word segmentation result of the third text segment is obtained through the second word segmentation device, and the target field data can be optimized by using a confrontation network based on the labeled source field data and the labeled target field data, so that the accuracy of word segmentation of the source field data is improved.
In one embodiment of the present disclosure, the source domain encoder may be a GCNN model, but may also be another neural network model, such as an RNN or a Transformer language model. The second word segmentation device may be a CRF model. A text segment s_src from the source domain text data is converted into a coded sequence X_s. The source domain encoder further encodes X_s into a third code H_s. The text segment s_src is also input to the shared encoder as s to generate the first code H. The second word segmentation device splices the third code H_s and the first code H into a vector h(x_s) and inputs it into the conditional random field model to obtain the optimized segmentation result y_s for the text segment of the source domain text data.
It should be noted that, for natural language word segmentation of the target domain (text) data, the embodiments of the present disclosure adopt a complete scheme, from obtaining the source domain (text) data and its segmentation result, through machine labeling of the target domain data, to optimizing the target domain data through the adversarial network, thereby improving the accuracy of target domain word segmentation.
The embodiment of the present disclosure further provides a text processing method, including: acquiring first text data with a label and second text data without the label; performing machine labeling on the second text data without the label by using the first text data with the label; inputting the labeled first text data and the machine labeled second text data into a countermeasure network, and acquiring optimized second text data; and outputting the optimized word segmentation result of the second text data based on the optimized second text data.
According to the technical scheme provided by the embodiment of the disclosure, the marked first text data and the unmarked second text data are obtained; performing machine labeling on the second text data without the label by using the first text data with the label; inputting the labeled first text data and the machine labeled second text data into a countermeasure network, and acquiring optimized second text data; and outputting an optimized word segmentation result of the second text data based on the optimized second text data, and optimizing a word segmentation result of the second text data by using an adversarial network based on the labeled first text data and the machine-labeled second text data to improve the accuracy of word segmentation of the second text data.
In one embodiment of the present disclosure, the annotation may be made, for example, in the manner of the above-described {B, M, E, S} annotation. In one embodiment of the present disclosure, the first text data may refer to the labeled source domain text data in the previous embodiments.
In one embodiment of the present disclosure, the second text data may refer to the unlabeled target domain text data in the foregoing embodiments. In an embodiment of the present disclosure, the label of the second text data may be obtained, for example, by processing the second text data in a statistical manner to obtain its word segmentation result; and/or by processing the second text data through a predetermined word segmentation model to obtain its word segmentation result. The predetermined word segmentation model may be, for example, the GCNN-CRF model described above.
In an embodiment of the present disclosure, the countermeasure network may be implemented, for example, as a countermeasure network having a structure similar to that illustrated in fig. 4, and the result of the countermeasure network and the process of word segmentation optimization are similar to those of the foregoing embodiment and will not be described herein again.
In one embodiment of the present disclosure, the obtained optimized word segmentation result may be output in various ways: for example, word segmentation tags corresponding one to one to the characters in the second text data may be output directly, or separators may be added to the second text data according to the obtained optimized word segmentation result and the second text data with separators may be output.
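The two output forms mentioned above (per-character tags, or text with separators inserted) are interconvertible. A small sketch, using "/" as an assumed separator:

```python
def tags_to_separated(text, tags, sep="/"):
    """Insert sep after each word boundary implied by {B, M, E, S} tags:
    a word ends at an E (word end) or S (single-character word) tag."""
    out = []
    for ch, tag in zip(text, tags):
        out.append(ch)
        if tag in ("E", "S"):
            out.append(sep)
    return "".join(out).rstrip(sep)

# "abcde" tagged as the words "ab", "c", "de"
segmented = tags_to_separated("abcde", ["B", "E", "S", "B", "E"])  # "ab/c/de"
```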
A natural language processing apparatus according to an embodiment of the present disclosure is described below with reference to fig. 5.
Fig. 5 illustrates a block diagram of a natural language processing apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the natural language processing apparatus 500 includes an acquisition module 510, a labeling module 520, and an input module 530.
The acquisition module 510 is configured to acquire labeled source domain data and unlabeled target domain data.
The labeling module 520 is configured to perform machine labeling on the label-free target domain data using the labeled source domain data.
The input module 530 is configured to input the labeled source domain data and the machine labeled target domain data into the countermeasure network, and obtain optimized target domain data.
According to the technical scheme provided by the embodiment of the disclosure, the acquisition module is configured to acquire labeled source field data and unlabeled target field data; the labeling module is configured to perform machine labeling on the label-free target domain data by using the labeled source domain data; the input module is configured to input the labeled source field data and the machine labeled target field data to the confrontation network, acquire optimized target field data, and optimize the target field data by using the confrontation network based on the labeled source field data and the machine labeled target field data, so as to improve the accuracy of word segmentation of the target field data.
In one embodiment of the present disclosure, the labeling module 520 includes:
the matching sub-module is configured to match the labeled source field data with the unlabeled target field data to obtain source field specific words, shared words and target field data to be labeled;
the shared word list acquisition sub-module is configured to acquire a shared word list by using the shared words;
the target field special dictionary word acquisition sub-module is configured to acquire a target field special dictionary by using the target field data to be labeled;
and the target field data acquisition sub-module is configured to acquire the target field data labeled by the machine according to the shared vocabulary and the target field specific dictionary.
According to the technical scheme provided by the embodiment of the disclosure, the marking module comprises: the matching sub-module is configured to match the labeled source field data with the unlabeled target field data to obtain source field specific words, shared words and target field data to be labeled; the shared word list acquisition sub-module is configured to acquire a shared word list by using the shared words; the target field special dictionary word acquisition sub-module is configured to acquire a target field special dictionary by using the target field data to be labeled; and the target field data acquisition sub-module is configured to acquire the target field data labeled by the machine according to the shared vocabulary and the specific dictionary of the target field, and can acquire the target field data labeled by the machine aiming at the target field data without artificial labeling, so that the word segmentation accuracy of the target field data is improved.
In one embodiment of the present disclosure, the target field specific dictionary word acquisition sub-module is further configured to: identify words in the target field data to be labeled by calculating one or more of the following indexes: a cohesion score, a freedom score, a word frequency, and a term frequency-inverse document frequency (TF-IDF) index.
According to the technical scheme provided by the embodiment of the present disclosure, the target field specific dictionary word acquisition sub-module is further configured to: identify words in the target field data to be labeled by calculating one or more of the following indexes: a cohesion score, a freedom score, a word frequency, and a TF-IDF index. In this way, machine labeled target field data can be acquired for target field data without manual labeling, so that the accuracy of word segmentation of the target field data is improved.
In one embodiment of the present disclosure, the cohesion score is calculated as follows: acquiring a first text segment from the target field data to be labeled; acquiring a first probability that the first text segment as a whole appears in the target field data to be labeled; dividing the first text segment into two sub-segments and determining the first division manner that maximizes the product of the probabilities of the two sub-segments appearing in the target field data to be labeled; and determining the target field specific dictionary based on the ratio of the first probability to the product corresponding to the first division manner.
According to the technical scheme provided by the embodiment of the disclosure, the cohesion score is calculated as follows: acquiring a first text segment from the target field data to be labeled; acquiring a first probability that the first text segment as a whole appears in the target field data to be labeled; dividing the first text segment into two sub-segments and determining the first division manner that maximizes the product of the probabilities of the two sub-segments appearing in the target field data to be labeled; and determining the target field specific dictionary based on the ratio of the first probability to the product corresponding to the first division manner. In this way, machine labeled target field data can be acquired for target field data without manual labeling, so that the accuracy of word segmentation of the target field data is improved.
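The steps above amount to comparing the probability of the candidate as a whole against its best binary split. A sketch under simple assumptions that the text does not specify (probabilities estimated as n-gram counts over possible positions in a toy corpus):

```python
from collections import Counter

def make_prob(corpus, max_n=4):
    """Estimate P(segment) as count / number of possible positions."""
    counts = Counter(corpus[i:i + n]
                     for n in range(1, max_n + 1)
                     for i in range(len(corpus) - n + 1))
    def prob(seg):
        positions = len(corpus) - len(seg) + 1
        return counts[seg] / positions if positions > 0 else 0.0
    return prob

def cohesion_score(segment, prob):
    """First probability (the whole segment) divided by the largest product
    of probabilities over all ways of dividing it into two sub-segments."""
    best_split = max(prob(segment[:i]) * prob(segment[i:])
                     for i in range(1, len(segment)))
    return prob(segment) / best_split if best_split > 0 else 0.0

prob = make_prob("ababxabyab")
# "ab" always occurs as a unit, so it is cohesive; "ba" straddles boundaries
```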
In one embodiment of the present disclosure, the freedom score is calculated as follows: acquiring a second text segment from the target field data to be labeled; determining the left-neighbor set entropy and the right-neighbor set entropy of the second text segment; and determining the target field specific dictionary based on the smaller of the left-neighbor set entropy and the right-neighbor set entropy.
According to the technical scheme provided by the embodiment of the disclosure, the freedom score is calculated as follows: acquiring a second text segment from the target field data to be labeled; determining the left-neighbor set entropy and the right-neighbor set entropy of the second text segment; and determining the target field specific dictionary based on the smaller of the left-neighbor set entropy and the right-neighbor set entropy. In this way, machine labeled target field data can be acquired for target field data without manual labeling, so that the accuracy of word segmentation of the target field data is improved.
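A sketch of the left/right-neighbor entropy computation (natural-log entropy and the toy corpus are assumptions):

```python
import math
from collections import Counter

def entropy(symbols):
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def freedom_score(segment, corpus):
    """Min of the entropies of the segment's left- and right-neighbor sets:
    low entropy suggests the segment rarely stands alone as a word."""
    left, right = [], []
    i = corpus.find(segment)
    while i != -1:
        if i > 0:
            left.append(corpus[i - 1])
        if i + len(segment) < len(corpus):
            right.append(corpus[i + len(segment)])
        i = corpus.find(segment, i + 1)
    if not left or not right:
        return 0.0
    return min(entropy(left), entropy(right))
```

On the toy corpus "cabd eabf gabh", the segment "ab" is seen with three distinct neighbors on each side, so its freedom score is ln 3, while "ca" occurs once at the start of the corpus and scores 0.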
In one embodiment of the present disclosure, the target-domain specific dictionary word retrieval sub-module is further configured to: and processing the data to be labeled in the target field through a preset word segmentation model to obtain a dictionary specific to the target field.
According to the technical scheme provided by the embodiment of the disclosure, the target field specific dictionary word obtaining submodule is further configured to: the target field data to be labeled is processed through a preset word segmentation model, a special dictionary of the target field is obtained, the target field data can be optimized through a countermeasure network based on labeled source field data and machine labeled target field data, and the accuracy of word segmentation of the target field data is improved.
In one embodiment of the present disclosure, the countermeasure network includes a shared encoder, a discriminator, and a target domain tokenizer, and the input module includes: the first coding submodule is configured to code the third text segment in the labeled source field data to obtain a first implicit vector; the second coding submodule is configured to code a fourth text segment in the target field data labeled by the machine to obtain a second implicit vector, wherein the shared encoder randomly selects a first implicit vector or a second implicit vector as a first code and inputs the first implicit vector or the second implicit vector into the discriminator so that the discriminator judges whether the first code is the first implicit vector or the second implicit vector; and the target field word segmentation device acquires optimized target field data according to the result of the discriminator.
According to the technical scheme provided by the embodiment of the disclosure, the countermeasure network comprises a shared encoder, a discriminator and a target field participler, and the input module comprises: the first coding submodule is configured to code the third text segment in the labeled source field data to obtain a first implicit vector; the second coding submodule is configured to code a fourth text segment in the target field data labeled by the machine to obtain a second implicit vector, wherein the shared encoder randomly selects a first implicit vector or a second implicit vector as a first code and inputs the first implicit vector or the second implicit vector into the discriminator so that the discriminator judges whether the first code is the first implicit vector or the second implicit vector; the target field word segmentation device obtains optimized target field data according to the result of the discriminator, and can optimize the target field data by using a confrontation network based on labeled source field data and machine labeled target field data, so that the accuracy of word segmentation of the target field data is improved.
In one embodiment of the disclosure, the target domain tokenizer comprises a target domain encoder and a first tokenizer, the target domain tokenizer configured to: acquiring a second code through the target domain encoder, wherein the second code is acquired according to the second implicit vector; and acquiring an optimized word segmentation result of the fourth text segment through a first word segmentation device based on the first code and the second code.
According to the technical scheme provided by the embodiment of the disclosure, the target domain participler comprises a target domain encoder and a first participler, and the target domain participler is configured to: acquiring a second code through the target domain encoder, wherein the second code is acquired according to the second implicit vector; based on the first code and the second code, the optimized word segmentation result of the fourth text segment is obtained through the first word segmentation device, and the target field data can be optimized by using a confrontation network based on the labeled source field data and the machine labeled target field data, so that the word segmentation accuracy of the target field data is improved.
In one embodiment of the present disclosure, the countermeasure network further includes a source domain encoder and a second word segmentation device, wherein: the source domain encoder acquires a third code, and the third code is acquired according to the first implicit vector; and based on the first code and the third code, the second word segmentation device obtains an optimized word segmentation result of the third text segment.
According to the technical scheme provided by the embodiment of the disclosure, the countermeasure network further comprises a source domain encoder and a second word splitter, wherein: the source domain encoder acquires a third code, and the third code is acquired according to the first implicit vector; based on the first code and the third code, the second word segmentation device obtains the optimized word segmentation result of the third text segment, and can optimize target field data by using a confrontation network based on labeled source field data and machine labeled target field data, so that the accuracy of word segmentation of the source field data is improved.
It will be understood by those skilled in the art that the technical solution described with reference to fig. 5 may be combined with the embodiment described with reference to fig. 1 to 4, so as to have the technical effects achieved by the embodiment described with reference to fig. 1 to 4. For details, reference may be made to the description made above with reference to fig. 1 to 4, and details thereof are not repeated herein.
The foregoing embodiments describe the internal functions and structures of the natural language processing apparatus, and in one possible design, the structures of the natural language processing apparatus may be implemented as an electronic device, such as shown in fig. 6, and the electronic device 600 may include a processor 601 and a memory 602.
The memory 602 is used to store a program that supports the electronic device in executing the natural language processing method or the text processing method in any of the above embodiments, and the processor 601 is configured to execute the program stored in the memory 602.
In one embodiment of the present disclosure, the memory 602 is used to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor 601 to implement the steps of:
acquiring marked source field data and unmarked target field data;
performing machine labeling on the target field data without the label by using the labeled source field data;
and inputting the labeled source field data and the machine labeled target field data into a countermeasure network to obtain optimized target field data.
In an embodiment of the present disclosure, the performing machine labeling on the label-free target domain data by using the labeled source domain data includes:
matching the labeled source field data with the unlabeled target field data to obtain source field specific words, shared words and target field data to be labeled;
obtaining a shared word vocabulary by using the shared words;
acquiring a specific dictionary of the target field by using the data to be labeled of the target field;
and acquiring target field data labeled by a machine according to the shared word list and the target field specific dictionary.
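The last step above, labeling the target field data against the shared word list and the specific dictionary, can be realized, for example, by greedy forward maximum matching. This particular matching strategy and the {B, M, E, S} output are illustrative choices, not mandated by the text:

```python
def machine_label(text, vocabulary, max_len=4):
    """Tag text with {B, M, E, S} by greedily matching the longest entry of
    the combined vocabulary (shared word list + target-specific dictionary)."""
    tags, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in vocabulary:
                tags += ["S"] if n == 1 else ["B"] + ["M"] * (n - 2) + ["E"]
                i += n
                break
    return tags

# shared word list united with the target field specific dictionary
labels = machine_label("abcde", {"ab", "cde"})   # ['B', 'E', 'B', 'M', 'E']
```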
In an embodiment of the present disclosure, the obtaining a target domain specific dictionary by using the target domain to-be-labeled data includes:
identifying words in the data to be labeled in the target field by calculating one or more of the following indexes:
a cohesion score, a freedom score, a word frequency, and a term frequency-inverse document frequency (TF-IDF) index.
In one embodiment of the present disclosure, the cohesion score is calculated as follows:
acquiring a first text segment from the data to be labeled in the target field;
acquiring a first probability of the first text segment appearing in the data to be labeled in the target field as a whole;
dividing the first text segment into two sub-segments, and determining a first division mode which enables the product of the probabilities of the two sub-segments appearing in the data to be labeled in the target field to be the largest; and
determining the target-domain-specific dictionary based on a ratio of the first probability to a product corresponding to the first segmentation manner.
In one embodiment of the present disclosure, the freedom score is calculated as follows:
acquiring a second text segment from the data to be labeled in the target field;
determining left neighbor set entropy and right neighbor set entropy of the second text segment;
determining the target-domain-specific dictionary based on the lesser of the left-neighbor-set entropy and the right-neighbor-set entropy.
In an embodiment of the present disclosure, the obtaining a target domain specific dictionary by using the target domain to-be-labeled data includes:
and processing the data to be labeled in the target field through a preset word segmentation model to obtain a dictionary specific to the target field.
In an embodiment of the present disclosure, the countermeasure network includes a shared encoder, a discriminator, and a target domain tokenizer, and the inputting the labeled source domain data and the machine labeled target domain data into the countermeasure network to obtain optimized target domain data includes:
coding the third text segment in the marked source field data to obtain a first hidden vector;
coding a fourth text segment in the target field data labeled by the machine to obtain a second implicit vector;
the shared encoder randomly selects a first implicit vector or a second implicit vector as a first code to be input into the discriminator so that the discriminator judges whether the first code is the first implicit vector or the second implicit vector;
and the target field word segmentation device acquires optimized target field data according to the result of the discriminator.
In an embodiment of the present disclosure, the obtaining optimized target domain data according to the result of the discriminator includes:
acquiring a second code through the target domain encoder, wherein the second code is acquired according to the second implicit vector;
and acquiring an optimized word segmentation result of the fourth text segment through a first word segmentation device based on the first code and the second code.
In one embodiment of the disclosure, the countermeasure network further includes a source domain encoder and a second word segmentation device, and the method further includes:
the source domain encoder acquires a third code, and the third code is acquired according to the first implicit vector;
and obtaining an optimized word segmentation result of the third text segment through a second word segmentation device based on the first code and the third code.
In one embodiment of the present disclosure, the memory 602 is used to store one or more computer instructions, wherein the one or more computer instructions are further executed by the processor 601 to implement the steps of:
acquiring first text data with a label and second text data without the label;
performing machine labeling on the second text data without the label by using the first text data with the label;
inputting the labeled first text data and the machine labeled second text data into a countermeasure network, and acquiring optimized second text data;
and outputting the optimized word segmentation result of the second text data based on the optimized second text data.
Exemplary embodiments of the present disclosure also provide a computer storage medium for storing computer software instructions for the natural language processing apparatus described above, which include a program for executing the method in any of the above embodiments, thereby providing the technical effects brought by the method.
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing a natural language processing method or a text processing method according to an embodiment of the present disclosure.
As shown in fig. 7, the computer system 700 includes a processor (CPU, GPU, TPU, FPGA, etc.) 701, which can execute various processes in the embodiments shown in the above-described figures according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to embodiments of the present disclosure, the methods described above with reference to the figures may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the methods illustrated in the figures. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer-readable storage medium stores one or more programs which are used by one or more processors to perform the methods described in the present disclosure, thereby providing technical effects brought by the methods.
The foregoing description presents only the preferred embodiments of the disclosure and illustrates the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention is not limited to the specific combinations of the above-mentioned features, but also covers other embodiments formed by any combination of those features or their equivalents without departing from the inventive concept, for example, solutions in which the above features are interchanged with features disclosed herein having similar functions.

Claims (21)

1. A natural language processing method, comprising:
acquiring labeled source domain data and unlabeled target domain data;
performing machine labeling on the unlabeled target domain data by using the labeled source domain data;
and inputting the labeled source domain data and the machine-labeled target domain data into a countermeasure network to obtain optimized target domain data.
2. The method of claim 1, wherein the machine labeling the unlabeled target domain data with the labeled source domain data comprises:
matching the labeled source domain data with the unlabeled target domain data to obtain source-domain-specific words, shared words, and target domain data to be labeled;
obtaining a shared vocabulary by using the shared words;
obtaining a target-domain-specific dictionary by using the target domain data to be labeled;
and obtaining machine-labeled target domain data according to the shared vocabulary and the target-domain-specific dictionary.
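The matching step of claim 2 can be sketched in a few lines; the substring-based matching and the function name `split_vocabulary` are illustrative assumptions, not the patent's actual implementation:

```python
def split_vocabulary(source_words, target_text):
    """Illustrative sketch of the matching step in claim 2: words from the
    labeled source-domain data that also occur in the unlabeled target-domain
    text become 'shared words'; the remainder are source-domain-specific.
    Plain substring matching is a simplifying assumption; a real system would
    likely match against tokenized or candidate-segmented target text."""
    shared = {w for w in source_words if w in target_text}
    specific = set(source_words) - shared
    return shared, specific
```

The shared words give the shared vocabulary directly, while the target-domain text left over after matching is the "data to be labeled" from which the target-domain-specific dictionary is later mined.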
3. The method according to claim 2, wherein the obtaining a target-domain-specific dictionary by using the target domain data to be labeled comprises:
identifying words in the target domain data to be labeled by calculating one or more of the following indexes:
a condensation score, a freedom score, a term frequency, and a term frequency-inverse document frequency (TF-IDF) index.
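Of the indexes claim 3 lists, the term frequency-inverse document frequency is the most standardized; a minimal sketch follows. The `log(N / (1 + df))` smoothing is one common variant chosen here as an assumption, since the patent does not fix a formula:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    """TF-IDF of `term` in one tokenized document relative to a collection.
    A high value marks a term that is frequent in this document but rare
    across the collection, i.e. a candidate domain-specific word."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)   # term frequency
    df = sum(1 for d in all_docs if term in d)         # document frequency
    idf = math.log(len(all_docs) / (1 + df))           # smoothed inverse document frequency
    return tf * idf
```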
4. The method of claim 3, wherein the condensation score is calculated by:
acquiring a first text segment from the data to be labeled in the target field;
acquiring a first probability of the first text segment appearing in the data to be labeled in the target field as a whole;
dividing the first text segment into two sub-segments, and determining a first division mode which enables the product of the probabilities of the two sub-segments appearing in the data to be labeled in the target field to be the largest; and
determining the target-domain-specific dictionary based on a ratio of the first probability to a product corresponding to the first segmentation manner.
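The condensation score of claim 4 can be sketched as follows, assuming raw substring counts stand in for corpus probabilities; the names are illustrative:

```python
def cohesion_score(segment, corpus):
    """Condensation score per claim 4: the probability of the whole segment
    appearing in the target-domain data, divided by the largest product of
    probabilities over all two-way splits of the segment. Substring counts
    over the raw corpus string approximate the probabilities here, which is
    a simplifying assumption."""
    total = len(corpus)

    def prob(s):
        return corpus.count(s) / total

    p_whole = prob(segment)
    if p_whole == 0 or len(segment) < 2:
        return 0.0
    # the first segmentation manner: the split maximizing the probability product
    best_split = max(
        prob(segment[:i]) * prob(segment[i:]) for i in range(1, len(segment))
    )
    return p_whole / best_split if best_split else 0.0
```

A high ratio means the segment occurs as a unit far more often than chance concatenation of its parts would predict, making it a good candidate for the target-domain-specific dictionary.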
5. The method of claim 3, wherein the freedom score is calculated by:
acquiring a second text segment from the data to be labeled in the target field;
determining left neighbor set entropy and right neighbor set entropy of the second text segment;
determining the target-domain-specific dictionary based on the lesser of the left-neighbor-set entropy and the right-neighbor-set entropy.
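The freedom score of claim 5 reduces to the smaller of the two boundary entropies; character-level scanning of a raw corpus string is an illustrative assumption in this sketch:

```python
import math
from collections import Counter

def freedom_score(segment, corpus):
    """Freedom score per claim 5: entropy of the left-neighbor set and of the
    right-neighbor set over every occurrence of `segment`, taking the smaller
    of the two. Low entropy on either side suggests the segment is glued to
    fixed context and is not a free-standing word."""
    left, right = Counter(), Counter()
    start = corpus.find(segment)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(segment)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(segment, start + 1)

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    return min(entropy(left), entropy(right))
```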
6. The method according to claim 2, wherein the obtaining a target domain specific dictionary by using the target domain data to be labeled comprises:
and processing the data to be labeled in the target field through a preset word segmentation model to obtain a dictionary specific to the target field.
7. The method according to any one of claims 1 to 6, wherein the countermeasure network comprises a shared encoder, a discriminator and a target domain tokenizer, and the inputting the labeled source domain data and the machine labeled target domain data into the countermeasure network to obtain the optimized target domain data comprises:
encoding a third text segment in the labeled source domain data to obtain a first implicit vector;
encoding a fourth text segment in the machine-labeled target domain data to obtain a second implicit vector;
the shared encoder randomly selecting the first implicit vector or the second implicit vector as a first code and inputting it into the discriminator, so that the discriminator judges whether the first code is the first implicit vector or the second implicit vector;
and the target domain tokenizer obtaining optimized target domain data according to the result of the discriminator.
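The control flow of claim 7's adversarial routine can be illustrated with plain Python; the `discriminator` callable and the boolean protocol are hypothetical stand-ins for real neural components:

```python
import random

def adversarial_step(first_implicit, second_implicit, discriminator):
    """Sketch of one adversarial round per claim 7: the shared encoder picks
    either the first implicit vector (source domain) or the second implicit
    vector (target domain) at random as the first code, and the discriminator
    must judge which one it received. Returns True when it judges correctly."""
    took_first = random.random() < 0.5
    first_code = first_implicit if took_first else second_implicit
    judged_first = discriminator(first_code)  # hypothetical callable
    return judged_first == took_first
```

In training, the shared encoder would be updated to make the discriminator wrong, pushing source and target encodings toward a domain-invariant space that the target domain tokenizer can then exploit.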
8. The method of claim 7, wherein the target domain tokenizer comprises a target domain coder and a first tokenizer, and the obtaining optimized target domain data according to the result of the discriminator comprises:
acquiring a second code through the target domain encoder, wherein the second code is acquired according to the second implicit vector;
and acquiring an optimized word segmentation result of the fourth text segment through the first tokenizer based on the first code and the second code.
9. The method of claim 7, wherein the countermeasure network further comprises a source domain encoder and a second tokenizer, the method further comprising:
the source domain encoder acquires a third code, and the third code is acquired according to the first implicit vector;
and obtaining an optimized word segmentation result of the third text segment through the second tokenizer based on the first code and the third code.
10. A natural language processing apparatus, comprising:
an acquisition module configured to acquire labeled source domain data and unlabeled target domain data;
the labeling module is configured to perform machine labeling on the label-free target domain data by using the labeled source domain data;
and the input module is configured to input the labeled source domain data and the machine labeled target domain data into the countermeasure network, and acquire the optimized target domain data.
11. The apparatus of claim 10, wherein the labeling module comprises:
the matching sub-module is configured to match the labeled source domain data with the unlabeled target domain data to obtain source-domain-specific words, shared words, and target domain data to be labeled;
the shared vocabulary acquisition sub-module is configured to obtain a shared vocabulary by using the shared words;
the target-domain-specific dictionary acquisition sub-module is configured to obtain a target-domain-specific dictionary by using the target domain data to be labeled;
and the target domain data acquisition sub-module is configured to obtain machine-labeled target domain data according to the shared vocabulary and the target-domain-specific dictionary.
12. The apparatus of claim 11, wherein the target-domain-specific dictionary acquisition sub-module is further configured to:
identifying words in the target domain data to be labeled by calculating one or more of the following indexes:
a condensation score, a freedom score, a term frequency, and a term frequency-inverse document frequency (TF-IDF) index.
13. The apparatus of claim 12, wherein the condensation score is calculated by:
acquiring a first text segment from the data to be labeled in the target field;
acquiring a first probability of the first text segment appearing in the data to be labeled in the target field as a whole;
dividing the first text segment into two sub-segments, and determining a first division mode which enables the product of the probabilities of the two sub-segments appearing in the data to be labeled in the target field to be the largest; and
determining the target-domain-specific dictionary based on a ratio of the first probability to a product corresponding to the first segmentation manner.
14. The apparatus of claim 12, wherein the freedom score is calculated by:
acquiring a second text segment from the data to be labeled in the target field;
determining left neighbor set entropy and right neighbor set entropy of the second text segment;
determining the target-domain-specific dictionary based on the lesser of the left-neighbor-set entropy and the right-neighbor-set entropy.
15. The apparatus of claim 11, wherein the target-domain-specific dictionary acquisition sub-module is further configured to:
and processing the data to be labeled in the target field through a preset word segmentation model to obtain a dictionary specific to the target field.
16. The apparatus of any one of claims 10-15, wherein the countermeasure network comprises a shared encoder, a discriminator, and a target domain tokenizer, and the input module comprises:
the first coding submodule is configured to code the third text segment in the labeled source field data to obtain a first implicit vector;
a second encoding submodule configured to encode a fourth text segment in the target domain data of the machine annotation to obtain a second implicit vector,
the shared encoder randomly selects a first implicit vector or a second implicit vector as a first code to be input into the discriminator, so that the discriminator judges whether the first code is the first implicit vector or the second implicit vector;
and the target domain tokenizer acquires optimized target domain data according to the result of the discriminator.
17. The apparatus of claim 16, wherein the target domain tokenizer comprises a target domain encoder and a first tokenizer, and wherein the target domain tokenizer is configured to:
acquiring a second code through the target domain encoder, wherein the second code is acquired according to the second implicit vector;
and acquiring an optimized word segmentation result of the fourth text segment through the first tokenizer based on the first code and the second code.
18. The apparatus of claim 16, wherein the countermeasure network further comprises a source domain encoder and a second tokenizer, wherein:
the source domain encoder acquires a third code, and the third code is acquired according to the first implicit vector;
and based on the first code and the third code, the second tokenizer obtains an optimized word segmentation result of the third text segment.
19. A method of text processing, comprising:
acquiring labeled first text data and unlabeled second text data;
performing machine labeling on the unlabeled second text data by using the labeled first text data;
inputting the labeled first text data and the machine-labeled second text data into a countermeasure network to obtain optimized second text data;
and outputting an optimized word segmentation result of the second text data based on the optimized second text data.
20. An electronic device, comprising a memory and a processor; wherein
the memory is to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of any of claims 1-9 or claim 19.
21. A readable storage medium having stored thereon computer instructions, characterized in that the computer instructions, when executed by a processor, implement the method according to any one of claims 1-9 or claim 19.
CN202010447042.5A 2020-05-25 2020-05-25 Natural language processing method, natural language processing device, text processing method, text processing equipment and medium Pending CN113723088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010447042.5A CN113723088A (en) 2020-05-25 2020-05-25 Natural language processing method, natural language processing device, text processing method, text processing equipment and medium


Publications (1)

Publication Number Publication Date
CN113723088A true CN113723088A (en) 2021-11-30

Family

ID=78671445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010447042.5A Pending CN113723088A (en) 2020-05-25 2020-05-25 Natural language processing method, natural language processing device, text processing method, text processing equipment and medium

Country Status (1)

Country Link
CN (1) CN113723088A (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103166830A (en) * 2011-12-14 2013-06-19 中国电信股份有限公司 Spam email filtering system and method capable of intelligently selecting training samples
CN108009633A (en) * 2017-12-15 2018-05-08 清华大学 A kind of Multi net voting towards cross-cutting intellectual analysis resists learning method and system
CN108062753A (en) * 2017-12-29 2018-05-22 重庆理工大学 The adaptive brain tumor semantic segmentation method in unsupervised domain based on depth confrontation study
CN109308318A (en) * 2018-08-14 2019-02-05 深圳大学 Training method, device, equipment and the medium of cross-domain texts sentiment classification model
US20190130903A1 (en) * 2017-10-27 2019-05-02 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks
CN109918510A (en) * 2019-03-26 2019-06-21 中国科学技术大学 Cross-cutting keyword extracting method
CN110008338A (en) * 2019-03-04 2019-07-12 华南理工大学 A kind of electric business evaluation sentiment analysis method of fusion GAN and transfer learning
CN110009038A (en) * 2019-04-04 2019-07-12 北京百度网讯科技有限公司 Training method, device and the storage medium of screening model
CN110110337A (en) * 2019-05-08 2019-08-09 网易有道信息技术(北京)有限公司 Translation model training method, medium, device and calculating equipment
CN110119448A (en) * 2019-05-08 2019-08-13 合肥工业大学 Semi-supervised cross-domain texts classification method based on dual autocoder
CN110135336A (en) * 2019-05-14 2019-08-16 腾讯科技(深圳)有限公司 Training method, device and the storage medium of pedestrian's generation model
CN110442758A (en) * 2019-07-23 2019-11-12 腾讯科技(深圳)有限公司 A kind of figure alignment schemes, device and storage medium
CN110728295A (en) * 2019-09-02 2020-01-24 深圳中科保泰科技有限公司 Semi-supervised landform classification model training and landform graph construction method
CN110825914A (en) * 2019-10-31 2020-02-21 广州市百果园信息技术有限公司 Resource marking management system
CN111091127A (en) * 2019-12-16 2020-05-01 腾讯科技(深圳)有限公司 Image detection method, network model training method and related device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANG CHEN et al.: "Mocycle-GAN: Unpaired Video-to-Video Translation", MM '19: Proceedings of the 27th ACM International Conference on Multimedia, 15 October 2019 (2019-10-15), pages 647, XP058639270, DOI: 10.1145/3343031.3350937 *
ZHOU Lijun et al.: "A sample generation method based on GAN and adaptive transfer learning", Journal of Applied Optics, 31 January 2020 (2020-01-31), pages 120-126 *

Similar Documents

Publication Publication Date Title
US11501182B2 (en) Method and apparatus for generating model
US11151177B2 (en) Search method and apparatus based on artificial intelligence
CN110532554B (en) Chinese abstract generation method, system and storage medium
CN109902145B (en) Attention mechanism-based entity relationship joint extraction method and system
US20190057164A1 (en) Search method and apparatus based on artificial intelligence
US10630798B2 (en) Artificial intelligence based method and apparatus for pushing news
US11349680B2 (en) Method and apparatus for pushing information based on artificial intelligence
CN112164391A (en) Statement processing method and device, electronic equipment and storage medium
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN113051356B (en) Open relation extraction method and device, electronic equipment and storage medium
JP2017068833A (en) Apparatus and method for extracting keywords from single document
US11645447B2 (en) Encoding textual information for text analysis
US11507746B2 (en) Method and apparatus for generating context information
CN111753086A (en) Junk mail identification method and device
CN110543637B (en) Chinese word segmentation method and device
CN112188312B (en) Method and device for determining video material of news
CN115495555A (en) Document retrieval method and system based on deep learning
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN112507190B (en) Method and system for extracting keywords of financial and economic news
CN113282729B (en) Knowledge graph-based question and answer method and device
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
CN110674635A (en) Method and device for text paragraph division
CN113255319B (en) Model training method, text segmentation method, abstract extraction method and device
CN113723088A (en) Natural language processing method, natural language processing device, text processing method, text processing equipment and medium
CN110941713A (en) Self-optimization financial information plate classification method based on topic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination