CN111597809B - Training sample acquisition method, model training method, device and equipment


Info

Publication number
CN111597809B
CN111597809B
Authority
CN
China
Prior art keywords
word segmentation
sentence
segmentation result
training
sample
Prior art date
Legal status
Active
Application number
CN202010519680.3A
Other languages
Chinese (zh)
Other versions
CN111597809A (en
Inventor
郑孙聪
徐程程
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010519680.3A
Publication of CN111597809A
Application granted
Publication of CN111597809B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The present application discloses a training sample acquisition method, a model training method, an apparatus, and a device, and belongs to the technical field of natural language processing. The method includes: acquiring a first training sample set; inputting the sentence of a first sample of the first training sample set into a predictive word segmentation model to obtain a predicted word segmentation result of the sentence; and, according to the state relation between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, combining the first sentence and its predicted word segmentation result into a second sample, or combining the first sentence and its standard word segmentation result into the second sample. With this method, training samples of different granularities can be fused to obtain a plurality of samples of similar granularity, which improves the speed at which training samples are acquired and enriches the training samples.

Description

Training sample acquisition method, model training method, device and equipment
Technical Field
The present application relates to the field of natural language processing technologies, and in particular to a training sample acquisition method, a model training method, an apparatus, and a device.
Background
A word segmentation model is a model that splits a sentence into words. A server can train the word segmentation model on training samples to improve the accuracy of its word segmentation results.
In a training sample acquisition method in the related art, a terminal first controls a server to acquire a plurality of sentences; a worker then segments each sentence at a certain granularity (granularity is a quantity describing how finely a sentence is segmented) to obtain a standard word segmentation result for each sentence, and each sentence together with its standard word segmentation result forms a sample for training the word segmentation model.
However, this process of acquiring training samples is slow, and it is difficult to acquire training samples rapidly.
Disclosure of Invention
The embodiments of the present application provide a training sample acquisition method, a model training method, an apparatus, and a device. The technical solutions are as follows:
According to an aspect of the present application, a training sample acquisition method is provided, including:
acquiring a first training sample set, where the first training sample set includes one or more first samples, and each first sample includes a sentence and a standard word segmentation result of the sentence;
inputting the sentence of a first sample of the first training sample set into a predictive word segmentation model to obtain a predicted word segmentation result of the sentence; and
combining, according to the state relation between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, the first sentence and its predicted word segmentation result into a second sample, or the first sentence and its standard word segmentation result into the second sample.
In another aspect, a model training method is provided, including:
acquiring a first training sample set, where the first training sample set includes one or more first samples, and each first sample includes a sentence and a standard word segmentation result of the sentence;
inputting the sentence of a first sample of the first training sample set into a predictive word segmentation model to obtain a predicted word segmentation result of the sentence;
combining, according to the state relation between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, the first sentence and its predicted word segmentation result into a second sample, or the first sentence and its standard word segmentation result into the second sample; and
training the predictive word segmentation model with the second sample.
In another aspect, a training sample acquisition apparatus is provided, including:
a first acquisition module, configured to acquire a first training sample set, where the first training sample set includes one or more first samples, and each first sample includes a sentence and a standard word segmentation result of the sentence;
a second acquisition module, configured to input the sentence of a first sample of the first training sample set into a predictive word segmentation model to obtain a predicted word segmentation result of the sentence; and
a combination module, configured to combine, according to the state relation between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, the first sentence and its predicted word segmentation result into a second sample, or the first sentence and its standard word segmentation result into the second sample.
In another aspect, a model training apparatus is provided, including:
a third acquisition module, configured to acquire a first training sample set, where the first training sample set includes one or more first samples, and each first sample includes a sentence and a standard word segmentation result of the sentence;
a fourth acquisition module, configured to input the sentence of a first sample of the first training sample set into a predictive word segmentation model to obtain a predicted word segmentation result of the sentence;
a sample generation module, configured to combine, according to the state relation between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, the first sentence and its predicted word segmentation result into a second sample, or the first sentence and its standard word segmentation result into the second sample; and
a training module, configured to train the predictive word segmentation model with the second sample.
In another aspect, a training sample acquisition device is provided, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by the processor to implement the method described above.
In another aspect, a computer storage medium is provided, in which at least one instruction, at least one program, a code set, or an instruction set is stored, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by a processor to implement the method described above.
The beneficial effects of the technical solutions provided in the embodiments of the present application include at least the following:
the sentence of a first sample in a first training sample set is input into a predictive word segmentation model to obtain a predicted word segmentation result, and, according to the state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence, either the predicted word segmentation result or the standard word segmentation result is combined with the first sentence, yielding a second sample whose granularity is similar to that of the predictive word segmentation model's output. Training samples of different granularities can therefore be fused into a plurality of samples of similar granularity, which improves the speed at which training samples are acquired and enriches the training samples.
Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an implementation environment of a training sample acquisition method provided in an embodiment of the present application;
FIG. 2 is a flowchart of a training sample acquisition method according to an embodiment of the present application;
FIG. 3 is a flowchart of another training sample acquisition method provided by an embodiment of the present application;
FIG. 4 is a flow chart of a model training method provided in an embodiment of the present application;
FIG. 5 is a flow chart of another model training method provided by an embodiment of the present application;
FIG. 6 is a block diagram of a training sample acquisition device provided in an embodiment of the present application;
FIG. 7 is a block diagram of a model training apparatus provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training sample acquisition device according to an embodiment of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The training sample acquisition method provided in the embodiments of the present application relates to natural language processing (Natural Language Processing, NLP) technology. NLP is an important direction in the fields of computer science and artificial intelligence; it studies theories and methods that enable effective communication between humans and computers in natural language. NLP is a science that integrates linguistics, computer science, and mathematics, so research in this field involves natural language, i.e., the language people use daily, and is closely related to the study of linguistics. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
The training samples obtained by the training sample acquisition method provided in the embodiments of the present application can be used to train a word segmentation model. The trained word segmentation model can segment a sentence or phrase, that is, perform text processing, so that semantic understanding can then be performed on the segmented words or sentences.
The embodiments of the present application provide a training sample acquisition method, a model training method, an apparatus, and a device.
Fig. 1 is a schematic diagram of an implementation environment of the training sample acquisition method according to an embodiment of the present application; the implementation environment may include a server 11 and a terminal 12.
The server 11 may be a single server or a server cluster.
The terminal 12 may be a mobile phone, a tablet computer, a notebook computer, a smart wearable device, or another terminal. The terminal 12 may be connected to the server by wire or wirelessly (Fig. 1 shows a wireless connection). An operator may control the server 11 through the terminal 12 so that the server 11 performs the training sample acquisition method provided in the embodiments of the present application.
Fig. 2 is a flowchart of a training sample acquisition method according to an embodiment of the present application. The method can be applied to the server in the above implementation environment, and the embodiments of the present application are described here taking application to the server as an example. The training sample acquisition method may include:
step 201, a first set of training samples is obtained. The first training sample set includes one or more first samples including sentences and standard word segmentation results for the sentences.
Step 202, inputting the sentences of the first sample of the first training sample set into a predictive word segmentation model to obtain the predictive word segmentation result of the sentences of the first sample.
Step 203, combining the predicted word segmentation result of the first sentence and the predicted word segmentation result of the first sentence into a second sample or combining the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence into a second sample according to the state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence in the first sample.
In summary, the embodiments of the present application provide a training sample acquisition method: the sentence of a first sample in a first training sample set is input into a predictive word segmentation model to obtain a predicted word segmentation result, and, according to the state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence, either the predicted word segmentation result or the standard word segmentation result is combined with the first sentence, yielding a second sample whose granularity is similar to that of the predictive word segmentation model's output. Training samples of different granularities can therefore be fused into a plurality of samples of similar granularity, which improves the speed at which training samples are acquired and enriches the training samples.
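To make the three steps concrete, the following is a minimal Python sketch, not the patent's implementation; the pair representation of a first sample and the names build_second_samples, predict, and choose are illustrative assumptions (predict stands for the step 202 model, and choose stands for the step 203 state decision, a concrete version of which is sketched after step 306 below).

def build_second_samples(first_sample_set, predict, choose):
    # first_sample_set: iterable of (sentence, standard_words) pairs (step 201).
    # predict: sentence -> predicted word list (the predictive word segmentation
    # model of step 202).
    # choose: (standard_words, predicted_words) -> the word list to keep (step 203).
    second_samples = []
    for sentence, standard_words in first_sample_set:
        predicted_words = predict(sentence)
        second_samples.append((sentence, choose(standard_words, predicted_words)))
    return second_samples

Each resulting second sample keeps the sentence and whichever word segmentation result the state relation favors, so the set as a whole tracks the granularity of the predictive word segmentation model's output.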
Fig. 3 is a flowchart of another training sample acquisition method according to an embodiment of the present application. The method can be applied to the server in the above implementation environment, and the embodiments of the present application are described here taking application to the server as an example. Referring to Fig. 3, the training sample acquisition method may include:
step 301, a second set of training samples is obtained. The second training sample set includes a plurality of initial samples of a target granularity.
The plurality of initial samples in the second training sample set include sentences and standard word segmentation results of the sentences, and the standard word segmentation results can be considered as a correct word segmentation result. The granularity of the plurality of initial samples in the second training sample set may all be a target granularity.
In an exemplary embodiment, the second training sample set may belong to an open source training sample set (the open source training sample set is a public sample set, which may include a plurality of training samples), and the server may obtain the second training sample set from the open source training sample set, so that the time for obtaining the training sample may be reduced, and the training sample may be obtained more simply.
By way of example, the second set of training samples may include a modern chinese corpus and a chinese language processing package (Han Language Processing, hanLP).
In the embodiments of the present application, word segmentation refers to the process of dividing a continuous character sequence (a character sequence is a sentence) into a sequence of words according to a certain specification. A word is the smallest meaningful language component that can be used independently. Chinese is written in characters, with no explicit delimiter between words, so Chinese character sequences are processed by word segmentation. Word segmentation, i.e., the process of cutting a Chinese character sequence into individual words, is a basic text processing technique in NLP.
It should be noted that the same character string expressing the same meaning in a Chinese character sequence may have different word segmentation results, that is, different word segmentation granularities; in the embodiments of the present application this is also called a granularity difference. For example, for the same string 'central restaurant', a first word segmentation result is 'central restaurant /' and a second word segmentation result is 'central / meal / store /'; both are correct, but the second has a smaller granularity than the first. If the granularity is too large, corresponding results can be found only for the exact character sequence; if the granularity is too small, the accuracy of subsequent text processing is affected.
The target granularity in step 301 may be selected by an operator and is a word segmentation granularity suited to the practical situation (such as the scenario in which the word segmentation model will be applied).
The embodiments of the present application are described taking application to a server as an example, but are not limited thereto.
Step 302, an initial word segmentation model is trained on the second training sample set to obtain a predictive word segmentation model.
The server can train the initial word segmentation model on the second training sample set to obtain the predictive word segmentation model. Because the predictive word segmentation model is trained on a plurality of initial samples of the target granularity, it can segment sentences at the target granularity and produce word segmentation results whose granularity approximates the target granularity.
The initial word segmentation model may be a Bidirectional Encoder Representations from Transformers (BERT) model, a Long Short-Term Memory (LSTM) model, or a Conditional Random Field (CRF) model. In one exemplary embodiment, the initial word segmentation model is a BERT model.
The number of samples in the second training sample set may be small, which reduces the difficulty of obtaining samples. Accordingly, the predictive word segmentation model may be only a preliminarily trained model whose word segmentation capability does not yet meet the design requirement; it can be trained further with samples that match the target granularity.
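As a concrete illustration of step 302, the following is a minimal training sketch, not the patent's implementation: it casts word segmentation as character-level B/I/E/S sequence labeling on a BERT model using the HuggingFace transformers library. The checkpoint name bert-base-chinese, the tag scheme, and the learning rate are assumptions, and each sample is assumed to satisfy "".join(gold_words) == sentence with one token per character.

import torch
from transformers import BertTokenizerFast, BertForTokenClassification

TAGS = ["B", "I", "E", "S"]  # begin / inside / end of a word, or a single-character word

def words_to_tags(words):
    # Map a gold segmentation such as ["AB", "C"] to per-character tags ["B", "E", "S"].
    tags = []
    for w in words:
        tags.extend(["S"] if len(w) == 1 else ["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=len(TAGS))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(sentence, gold_words):
    # One gradient step on a single (sentence, standard result) sample.
    enc = tokenizer(list(sentence), is_split_into_words=True, return_tensors="pt")
    tags = words_to_tags(gold_words)
    # One label per character; special tokens ([CLS], [SEP]) are masked with -100.
    labels = [-100 if i is None else TAGS.index(tags[i]) for i in enc.word_ids()]
    loss = model(**enc, labels=torch.tensor([labels])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

In this framing the target granularity is not a separate parameter: the model absorbs it from the granularity of the gold segmentations in the second training sample set.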
Step 303, at least two sub-training sets are obtained. The at least two sub-training sets correspond one-to-one with at least two different granularities, and any one of the at least two sub-training sets includes at least one first sample of the granularity corresponding to that sub-training set.
Each first sample includes a sentence and a standard word segmentation result of the sentence.
The granularity corresponding to any one of the at least two sub-training sets may be the same as or different from the target granularity. Any sub-training set may be any open-source training sample set.
Illustratively, sub-training set A includes 50 first samples of a first granularity, and sub-training set B includes 100 first samples of a second granularity, where the first granularity and the second granularity are different granularities.
Step 304, the at least two sub-training sets are combined into a first training sample set.
The first training sample set thus includes at least two first samples. In the embodiments of the present application, the server may combine a plurality of sub-training sets of different granularities into the first training sample set.
For example, if sub-training set A includes 50 first samples of the first granularity and sub-training set B includes 100 first samples of the second granularity, the first training sample set obtained by combining them includes the 50 first samples of the first granularity and the 100 first samples of the second granularity.
Step 305, the sentence of each first sample in the first training sample set is input into the predictive word segmentation model to obtain a predicted word segmentation result of the sentence.
The server inputs the sentences of the at least two first samples of different granularities in the first training sample set into the predictive word segmentation model to obtain predicted word segmentation results for those sentences. The granularity of these predicted word segmentation results is close to the target granularity.
For example, the sentences of each of the 150 first samples (the 50 samples of the first granularity plus the 100 samples of the second granularity) can be input into the predictive word segmentation model to obtain 150 word segmentation results consistent with the target granularity.
The granularity of the samples in the first training sample set may differ from the target granularity, so those samples cannot be used directly to train the predictive word segmentation model.
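Continuing the step 302 sketch above (with the same assumed tokenizer, model, and TAGS), the prediction in step 305 can be sketched as decoding the per-character tags back into a word list:

@torch.no_grad()
def predict(sentence):
    # Segment `sentence` with the predictive word segmentation model; assumes one
    # token per character, as in the training sketch.
    model.eval()
    enc = tokenizer(list(sentence), is_split_into_words=True, return_tensors="pt")
    tag_ids = model(**enc).logits.argmax(-1)[0].tolist()
    words, current = [], ""
    for pos, char_idx in enumerate(enc.word_ids()):
        if char_idx is None:  # skip [CLS] / [SEP]
            continue
        current += sentence[char_idx]
        if TAGS[tag_ids[pos]] in ("E", "S"):  # the current word ends here
            words.append(current)
            current = ""
    if current:  # flush a trailing unfinished word
        words.append(current)
    return words

Because the tags were learned from target-granularity samples, the word lists this function returns approximate the target granularity, which is exactly why the predicted results can stand in for hand-labeled results in the non-conflict cases of step 306.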
Step 306, according to the state relation between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, the first sentence and its predicted word segmentation result are combined into a second sample, or the first sentence and its standard word segmentation result are combined into the second sample.
The first sentence is any one of the sentences of the at least two first samples. The standard word segmentation results of the first samples may differ in granularity from the predicted word segmentation results, so combining them as above yields second samples whose word segmentation granularity is closer to the target granularity; the second samples can then be used to continue training the predictive word segmentation model and improve its word segmentation performance.
The number of second samples can equal the number of first samples in the first training sample set. Because the first training sample set is composed of samples of different granularities (it places essentially no restriction on sample granularity), a large number of first samples can be obtained easily, and fusing them with the predicted word segmentation results of the predictive word segmentation model in step 306 yields a large number of second samples of approximately the target granularity without manual annotation, which greatly improves the efficiency of the training sample acquisition method.
The state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence is at least one of the following states:
a consistent state, a split state, a merge state, and a conflict state.
In step 306, the combinations that produce the second sample include the following cases:
First case: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the consistent state, the first sentence and its predicted word segmentation result are combined into a second sample; in the consistent state, the predicted word segmentation result of the first sentence is identical to its standard word segmentation result.
The consistent state indicates that the word segmentation granularity of the standard word segmentation result of the first sentence is the same as that of the predicted word segmentation result, and the same string expressing the same meaning is segmented identically; in this case the first sentence and its predicted word segmentation result can be combined into a second sample.
For example, the first sentence may be 'the sharpshooter Luban is about to cast his ultimate', and both the standard and the predicted word segmentation result may be 'sharpshooter / Luban / is-about-to / cast / ultimate'; the two results are in the consistent state, and the first sentence and the predicted word segmentation result can be combined into a second sample.
Second case: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the split state, the first sentence and its predicted word segmentation result are combined into a second sample; in the split state, the predicted word segmentation result of the first sentence contains at least two words that belong to one word in the standard word segmentation result of the first sentence.
The split state indicates that the predicted word segmentation result of the first sentence has at least two words belonging to one word in the standard word segmentation result. The split state usually arises because the word segmentation granularity of the standard word segmentation result is larger than the target granularity; it is not a word segmentation error in the strict sense, so the first sentence and the predicted word segmentation result can be combined into a second sample.
For example, the first sentence may be 'the sharpshooter Luban is about to cast his ultimate', the standard word segmentation result 'sharpshooter / Luban / is-about-to / cast / ultimate', and the predicted word segmentation result 'sharpshooter / Lu / Ban / is-about-to / cast / ultimate'; 'Lu' and 'Ban' in the predicted result belong to the single word 'Luban' in the standard result, that is, the name 'Luban' in the standard word segmentation result is split into a surname 'Lu' and a given name 'Ban' in the predicted word segmentation result.
Third case: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the merge state, the first sentence and its predicted word segmentation result are combined into a second sample; in the merge state, the standard word segmentation result of the first sentence contains at least two words that belong to one word in the predicted word segmentation result of the first sentence.
The merge state indicates that the standard word segmentation result of the first sentence has at least two words belonging to one word in the predicted word segmentation result. The merge state usually arises because the word segmentation granularity of the standard word segmentation result is smaller than the target granularity; it is not a word segmentation error in the strict sense, so the first sentence and the predicted word segmentation result can be combined into a second sample.
For example, the first sentence may be 'the sharpshooter Luban is about to cast his ultimate', the standard word segmentation result 'sharpshooter / Luban / is-about-to / cast / ultimate', and the predicted word segmentation result 'sharpshooter / Luban / is-about-to / cast-ultimate'; 'cast' and 'ultimate' in the standard result belong to the single word 'cast-ultimate' in the predicted result, that is, the word 'cast-ultimate' in the predicted word segmentation result is composed of the two consecutive words 'cast' and 'ultimate' in the standard word segmentation result.
Fourth case: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the conflict state, the first sentence and its standard word segmentation result are combined into a second sample; in the conflict state, the predicted word segmentation result of the first sentence contains a wrong word compared with the standard word segmentation result of the first sentence.
The conflict state indicates that the predicted word segmentation result of the first sentence has a wrong word compared with the standard word segmentation result. The conflict state is an erroneous word segmentation state: because the wrong words carry no semantic information in the context of the sentence, the predicted word segmentation result in the conflict state is incorrect and hard to use, so in this case the first sentence and the standard word segmentation result are combined into a second sample.
For example, the first sentence may be 'the sharpshooter Luban is about to cast his ultimate', the standard word segmentation result 'sharpshooter / Luban / is-about-to / cast / ultimate', and the predicted word segmentation result 'sharp-shot / hand-Lu / Ban / is-about-to / cast / ultimate'; the word 'sharpshooter' is cut in the wrong place and its stray fragment is merged with part of the name into a word that exists in neither result, so the structures of the words 'sharpshooter' and 'Luban' are destroyed and erroneous word segmentation fragments such as 'sharp-shot' and 'hand-Lu' are produced.
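Putting the four cases together, the following is a minimal sketch of the step 306 decision, an interpretation rather than the patent's exact algorithm; all names are illustrative, and both word lists are assumed to concatenate to the same sentence string. A predicted word that is neither a fragment of a single standard word (split) nor a union of whole standard words (merge) is treated as a wrong word, triggering the conflict state.

def spans(words):
    # Map ["AB", "C"] to character spans [(0, 2), (2, 3)].
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out

def state_relation(standard_words, predicted_words):
    std, prd = spans(standard_words), spans(predicted_words)
    if std == prd:
        return "consistent"
    std_cuts = {0} | {e for _, e in std}  # standard word boundaries
    prd_cuts = {0} | {e for _, e in prd}
    for s, e in prd:
        fragment = any(s0 <= s and e <= e0 for s0, e0 in std)  # inside one standard word
        union = s in std_cuts and e in std_cuts  # union of whole standard words
        if not (fragment or union):
            return "conflict"  # a genuinely wrong word
    # No wrong word: strictly finer boundaries mean split; otherwise treat the
    # result as merge (a mixed split/merge result is handled the same way here,
    # since every non-conflict state keeps the predicted result anyway).
    return "split" if std_cuts <= prd_cuts else "merge"

def make_second_sample(sentence, standard_words, predicted_words):
    # Step 306: keep the predicted result unless the state relation is a conflict.
    state = state_relation(standard_words, predicted_words)
    kept = standard_words if state == "conflict" else predicted_words
    return sentence, kept

For instance, a prediction that only cuts the name 'Luban' into 'Lu' plus 'Ban' adds boundaries inside one standard word, so state_relation returns 'split' and the predicted result is kept; a prediction containing a fragment like 'hand-Lu' crosses a standard word boundary without landing on whole-word cuts, so it returns 'conflict' and the standard result is kept instead. This make_second_sample plays the role of the choose callback in the pipeline sketch after step 203: choose(standard_words, predicted_words) corresponds to make_second_sample(sentence, standard_words, predicted_words)[1].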
With the training sample acquisition method provided in the embodiments of the present application, first samples whose granularities differ from the target granularity can be fused, via the predictive word segmentation model, into second samples whose granularity is similar to the target granularity. Training the initial word segmentation model with the second samples and the initial samples as training samples yields a word segmentation model whose output word segmentation results have a granularity similar to the target granularity, so that these results can be used in the subsequent steps of NLP processing. The method improves the acquisition speed of training samples and enriches the training samples.
After a plurality of second samples are obtained by the method provided in the embodiments of the present application, training of the predictive word segmentation model can be continued with these second samples.
In summary, the embodiments of the present application provide a training sample acquisition method: the sentence of a first sample in a first training sample set is input into a predictive word segmentation model to obtain a predicted word segmentation result, and, according to the state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence, either the predicted word segmentation result or the standard word segmentation result is combined with the first sentence, yielding a second sample whose granularity is similar to that of the predictive word segmentation model's output. Training samples of different granularities can therefore be fused into a plurality of samples of similar granularity, which improves the speed at which training samples are acquired and enriches the training samples.
Fig. 4 is a flowchart of a model training method according to an embodiment of the present application. The model training method can be applied to the server in the above implementation environment, and the embodiments of the present application are described here taking application to the server as an example. The model training method may include:
Step 401, a first training sample set is obtained. The first training sample set includes one or more first samples, and each first sample includes a sentence and a standard word segmentation result of the sentence.
Step 402, the sentence of a first sample of the first training sample set is input into a predictive word segmentation model to obtain a predicted word segmentation result of the sentence.
Step 403, according to the state relation between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, the first sentence and its predicted word segmentation result are combined into a second sample, or the first sentence and its standard word segmentation result are combined into the second sample.
Step 404, the predictive word segmentation model is trained with the second sample.
In summary, the embodiments of the present application provide a model training method: the sentence of a first sample in a first training sample set is input into a predictive word segmentation model to obtain a predicted word segmentation result; according to the state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence, either the predicted word segmentation result or the standard word segmentation result is combined with the first sentence, yielding a second sample whose granularity is similar to that of the predictive word segmentation model's output; and the predictive word segmentation model is trained further with the second sample. Training samples of different granularities can therefore be fused into a plurality of samples of similar granularity, and the predictive word segmentation model continues to be trained on them, which enriches the training samples and improves model accuracy.
Fig. 5 is a flowchart of another model training method according to an embodiment of the present application. The model training method can be applied to the server in the above implementation environment, and the embodiments of the present application are described here taking application to the server as an example. Referring to Fig. 5, the model training method may include:
Step 501, a second training sample set is obtained. The second training sample set includes a plurality of initial samples of a target granularity.
Each initial sample in the second training sample set includes a sentence and a standard word segmentation result of the sentence, where the standard word segmentation result can be regarded as a correct word segmentation result. The granularity of all the initial samples in the second training sample set may be the target granularity.
In an exemplary embodiment, the second training sample set may come from an open-source training sample set; the server may obtain the second training sample set from the open-source training sample set, which reduces the time needed to obtain training samples and makes acquisition simpler.
By way of example, the second training sample set may include a modern Chinese corpus and the Chinese language processing package (HanLP).
The target granularity in step 501 may be selected by an operator and is a word segmentation granularity suited to the practical situation (such as the scenario in which the word segmentation model will be applied).
Step 502, an initial word segmentation model is trained on the second training sample set to obtain a predictive word segmentation model.
The server can train the initial word segmentation model on the second training sample set to obtain the predictive word segmentation model. Because the predictive word segmentation model is trained on a plurality of initial samples of the target granularity, it can segment sentences at the target granularity and produce word segmentation results whose granularity approximates the target granularity.
The number of samples in the second training sample set may be small, which reduces the difficulty of obtaining samples. Accordingly, the predictive word segmentation model may be only a preliminarily trained model whose word segmentation capability does not yet meet the design requirement; it can be trained further with samples that match the target granularity.
Step 503, a first training sample set is obtained. The first training sample set includes one or more first samples, and each first sample includes a sentence and a standard word segmentation result of the sentence.
The first training sample set may include any open-source training sample set. The granularity of the first samples in the first training sample set may be the same as or different from the target granularity.
For the manner of obtaining the first training sample set in step 503, reference may be made to steps 303 and 304 in the embodiment shown in Fig. 3, which is not repeated here.
Step 504, the sentence of each first sample in the first training sample set is input into the predictive word segmentation model to obtain a predicted word segmentation result of the sentence.
The server inputs the sentences of the at least two first samples of different granularities in the first training sample set into the predictive word segmentation model to obtain predicted word segmentation results for those sentences. The granularity of these predicted word segmentation results is close to the target granularity.
The granularity of the samples in the first training sample set may differ from the target granularity, so those samples cannot be used directly to train the predictive word segmentation model.
Step 505, according to the state relation between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, the first sentence and its predicted word segmentation result are combined into a second sample, or the first sentence and its standard word segmentation result are combined into the second sample.
The first sentence is any one of the sentences of the plurality of first samples. The standard word segmentation results of the first samples may differ in granularity from the predicted word segmentation results, so combining them as above yields second samples whose word segmentation granularity is closer to the target granularity; the predictive word segmentation model can then be trained further with the second samples to improve its word segmentation performance.
The number of second samples can equal the number of first samples in the first training sample set. Because the first training sample set can be composed of samples of different granularities (it places essentially no restriction on sample granularity), a large number of first samples can be obtained easily, and fusing them with the predicted word segmentation results of the predictive word segmentation model in step 505 yields a large number of second samples of approximately the target granularity without manual annotation, which greatly improves the efficiency of the training sample acquisition method.
The state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence is at least one of the following states:
a consistent state, a split state, a merge state, and a conflict state.
In step 505, the combinations that produce the second sample include the following cases:
First case: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the consistent state, the first sentence and its predicted word segmentation result are combined into a second sample; in the consistent state, the predicted word segmentation result of the first sentence is identical to its standard word segmentation result.
The consistent state indicates that the word segmentation granularity of the standard word segmentation result of the first sentence is the same as that of the predicted word segmentation result, and the same string expressing the same meaning is segmented identically; in this case the first sentence and its predicted word segmentation result can be combined into a second sample.
Second case: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the split state, the first sentence and its predicted word segmentation result are combined into a second sample; in the split state, the predicted word segmentation result of the first sentence contains at least two words that belong to one word in the standard word segmentation result of the first sentence.
The split state indicates that the predicted word segmentation result of the first sentence has at least two words belonging to one word in the standard word segmentation result; it usually arises because the word segmentation granularity of the standard word segmentation result is larger than the target granularity, and it is not a word segmentation error in the strict sense, so the first sentence and the predicted word segmentation result can be combined into a second sample.
Third case: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the merge state, the first sentence and its predicted word segmentation result are combined into a second sample; in the merge state, the standard word segmentation result of the first sentence contains at least two words that belong to one word in the predicted word segmentation result of the first sentence.
The merge state indicates that the standard word segmentation result of the first sentence has at least two words belonging to one word in the predicted word segmentation result; it usually arises because the word segmentation granularity of the standard word segmentation result is smaller than the target granularity, and it is not a word segmentation error in the strict sense, so the first sentence and the predicted word segmentation result can be combined into a second sample.
Fourth case: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the conflict state, the first sentence and its standard word segmentation result are combined into a second sample; in the conflict state, the predicted word segmentation result of the first sentence contains a wrong word compared with the standard word segmentation result of the first sentence.
The conflict state indicates that the predicted word segmentation result of the first sentence has a wrong word compared with the standard word segmentation result. The conflict state is an erroneous word segmentation state: because the wrong words carry no semantic information in the context of the sentence, the predicted word segmentation result in the conflict state is incorrect and hard to use, so in this case the first sentence and the standard word segmentation result are combined into a second sample.
With the training sample acquisition method provided in the embodiments of the present application, first samples whose granularities differ from the target granularity can be fused, via the predictive word segmentation model, into second samples whose granularity is similar to the target granularity. Training the initial word segmentation model with the second samples and the initial samples as training samples yields a word segmentation model whose output word segmentation results have a granularity similar to the target granularity, so that these results can be used in the subsequent steps of NLP processing. The method improves the acquisition speed of training samples and enriches the training samples.
For step 505, reference may also be made to step 306 in the embodiment shown in Fig. 3, which is not repeated here.
Step 506, the predictive word segmentation model is trained with the second samples.
Through steps 503 to 505, the server can obtain a plurality of second samples from the plurality of first samples in the first training sample set. The server can then train the predictive word segmentation model with the obtained second samples to get a word segmentation model; this word segmentation model can segment a phrase or sentence into a word segmentation result whose granularity is similar to that of the second samples (i.e., the target granularity). When the number of second samples is greater than the number of initial samples in the second training sample set, the server may instead train the initial word segmentation model with the second samples to obtain the word segmentation model.
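Under the same assumptions as the earlier sketches, the continued training of step 506 reduces to running the step 302 training routine over the fused samples; the epoch count here is an arbitrary illustration.

def continue_training(second_samples, epochs=2):
    # Each second sample is (sentence, the word segmentation result chosen in step 505).
    for _ in range(epochs):
        for sentence, words in second_samples:
            train_step(sentence, words)  # train_step from the step 302 sketch

Because every second sample's result is at (approximately) the target granularity, this continued training sharpens the predictive word segmentation model at that granularity instead of pulling it toward the first samples' original, mixed granularities.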
In summary, the embodiments of the present application provide a model training method: the sentence of a first sample in a first training sample set is input into a predictive word segmentation model to obtain a predicted word segmentation result; according to the state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence, either the predicted word segmentation result or the standard word segmentation result is combined with the first sentence, yielding a second sample whose granularity is similar to that of the predictive word segmentation model's output; and the predictive word segmentation model is trained further with the second sample. Training samples of different granularities can therefore be fused into a plurality of samples of similar granularity, and the predictive word segmentation model continues to be trained on them, which enriches the training samples and improves model accuracy.
In an exemplary embodiment, the training sample acquisition method provided in the embodiments of the present application is applied in a server. A second training sample set is obtained, which includes a plurality of initial samples of a target granularity; in this embodiment both the initial samples and the first samples include sentences and the standard word segmentation results of the sentences. The initial word segmentation model is trained on the second training sample set to obtain a predictive word segmentation model, and the granularity of the predicted word segmentation results output by this model is similar to the target granularity. At least two sub-training sets are obtained, corresponding one-to-one with at least two different granularities, where the granularity of at least one of the sub-training sets differs from the target granularity of the second training sample set; the sub-training sets are combined into a first training sample set that includes at least two first samples.
The sentences of the at least two first samples in the first training sample set are then input into the predictive word segmentation model to obtain predicted word segmentation results whose word segmentation granularity is similar to the target granularity.
For example, the first sentence may be 'the sharpshooter Luban is about to cast his ultimate', the standard word segmentation result 'sharpshooter / Luban / is-about-to / cast / ultimate', and the predicted word segmentation result 'sharpshooter / Lu / Ban / is-about-to / cast / ultimate'. The predicted result splits the word 'Luban' in the standard result into a surname and a given name, so the predicted and standard word segmentation results of the first sentence are in the split state, and the first sentence and the predicted word segmentation result are combined into a second sample.
The server can obtain a plurality of second samples in this way and form them into a combined training sample set. The granularity of the samples in the combined training sample set is similar to the target granularity, so a large number of training samples of the target granularity are obtained from training samples of different granularities.
The server can continue training the predictive word segmentation model with the combined training sample set to obtain a word segmentation model that segments phrases or sentences into word segmentation results whose granularity is similar to that of the second samples (i.e., the target granularity). Because the combined training sample set is rich in training samples, the word segmentation model handles ambiguous words and new words well, that is, it generalizes well.
Fig. 6 is a block diagram of a training sample acquiring device according to an embodiment of the present application. As can be seen with reference to fig. 6, the training sample acquisition device 600 may include:
a first obtaining module 601, configured to obtain a first training sample set, where the first training sample set includes one or more first samples, and the first samples include sentences and standard word segmentation results of the sentences.
The second obtaining module 602 is configured to input a sentence of a first sample of the first training sample set into the predictive word segmentation model, and obtain a predictive word segmentation result of the sentence of the first sample.
The combining module 603 is configured to, according to a state relationship between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, combine the first sentence and the predicted word segmentation result of the first sentence into a second sample, or combine the first sentence and the standard word segmentation result of the first sentence into a second sample.
The state relation between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence comprises at least one of the following states:
a consistent state, a split state, a merge state, and a conflict state.
The consistent state is used for indicating that the predicted word segmentation result of the first sentence is consistent with the standard word segmentation result; the split state is used for indicating that at least two words in the predicted word segmentation result belong to one word in the standard word segmentation result of the first sentence; the merge state is used for indicating that at least two words in the standard word segmentation result belong to one word in the predicted word segmentation result of the first sentence; and the conflict state is used for indicating that the predicted word segmentation result of the first sentence contains an erroneous word compared with the standard word segmentation result of the first sentence.
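These four definitions can be illustrated with toy pairs over the five-character sentence "abcde". The sketch below is a self-contained illustration under the boundary-set reading used earlier; the example segmentations are invented for illustration only.

```python
def cuts(words):
    """Internal boundary offsets of a segmentation (same idea as above)."""
    out, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        out.add(pos)
    return out

def state(std, pred):
    s, p = cuts(std), cuts(pred)
    return ("consistent" if p == s else
            "split"      if s <  p else
            "merge"      if p <  s else
            "conflict")

# One toy pair per state, over the five-character sentence "abcde".
assert state(["ab", "cde"],     ["ab", "cde"])     == "consistent"
assert state(["ab", "cde"],     ["ab", "c", "de"]) == "split"     # "cde" split further
assert state(["ab", "c", "de"], ["ab", "cde"])     == "merge"     # "c" + "de" merged
assert state(["ab", "cde"],     ["abc", "de"])     == "conflict"  # cut crosses "cde"
```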
In summary, the embodiments of the present application provide a training sample acquisition device. The device inputs a sentence of a first sample in a first training sample set into a predictive word segmentation model to obtain a predicted word segmentation result, and, according to the state relationship between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence, combines either the predicted word segmentation result or the standard word segmentation result with the first sentence to obtain a second sample whose granularity is similar to that of the predicted word segmentation results output by the predictive word segmentation model. Training samples with different granularities can therefore be fused into a plurality of samples with similar granularities, which improves the acquisition speed of training samples and enriches the training samples.
The combining module 603 is further configured to: in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in any one of the consistent state, the split state, and the merge state, combine the first sentence and the predicted word segmentation result of the first sentence into a second sample; and in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in the conflict state, combine the first sentence and the standard word segmentation result of the first sentence into a second sample.
The training sample acquiring device 600 may further include:
and the sample set acquisition module is used for acquiring a second training sample set, and the second training sample set comprises a plurality of initial samples with target granularity.
And the initial model training module is used for training the initial word segmentation model according to the second training sample set to obtain a prediction word segmentation model.
Fig. 7 is a block diagram of a model training apparatus according to an embodiment of the present application. As can be seen with reference to fig. 7, the model training apparatus 700 may include:
a third obtaining module 701, configured to obtain a first training sample set, where the first training sample set includes one or more first samples, and the first samples include sentences and standard word segmentation results of the sentences.
A fourth obtaining module 702, configured to input the sentence of the first sample in the first training sample set into the predictive word segmentation model, and obtain a predictive word segmentation result of the sentence of the first sample.
The sample generation module 703 is configured to, according to a state relationship between the predicted word segmentation result of a first sentence in the first sample and the standard word segmentation result of the first sentence, combine the first sentence and the predicted word segmentation result of the first sentence into a second sample, or combine the first sentence and the standard word segmentation result of the first sentence into a second sample.
And a training module 704, configured to train the predictive word segmentation model according to the second sample.
In summary, the embodiments of the present application provide a model training apparatus. The apparatus inputs a sentence of a first sample in a first training sample set into a predictive word segmentation model to obtain a predicted word segmentation result; according to the state relationship between the predicted word segmentation result of the first sentence and the standard word segmentation result of the first sentence, it combines either the predicted word segmentation result or the standard word segmentation result with the first sentence to obtain a second sample whose granularity is similar to that of the predicted word segmentation results output by the predictive word segmentation model, and it continues training the predictive word segmentation model according to the second sample. Training samples with different granularities can therefore be fused into a plurality of samples with similar granularities, and the predictive word segmentation model is further trained on those samples, which enriches the training samples and improves model precision.
Model training apparatus 700 may further comprise:
and a fifth acquisition module, configured to acquire a second training sample set, where the second training sample set includes a plurality of initial samples with a target granularity.
And the model training module is used for training the initial word segmentation model according to the second training sample set to obtain a prediction word segmentation model.
Fig. 8 is a schematic structural diagram of a training sample acquisition device 800 according to an embodiment of the present application, where the training sample acquisition device 800 may be a server. By way of example, as shown in fig. 8, the training sample acquisition device 800 includes a central processing unit (Central Processing Unit, CPU) 801, a memory 802, and a system bus 803 connecting the memory 802 and the central processing unit 801. The memory 802 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM).
Computer-readable storage media may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include random access memory (Random Access Memory, RAM), read-only memory (Read-Only Memory, ROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), flash memory or other solid-state memory technology, CD-ROM, digital versatile discs (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above.
The memory 802 further includes one or more programs, where the one or more programs are stored in the memory and configured to be executed by the CPU to implement the training sample acquiring method provided in the embodiments of the present application.
The embodiments of the present application also provide a training sample acquisition device, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the training sample acquisition method provided by the above method embodiments.
The present application also provides a computer storage medium having at least one instruction, at least one program, a code set, or an instruction set stored therein, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement a training sample acquisition method as provided in the above method embodiments.
The foregoing description is merely exemplary embodiments of the present application and is not intended to limit the present application. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (8)

1. A method of training sample acquisition, the method comprising:
acquiring a first training sample set, wherein the first training sample set comprises one or more first samples, and the first samples comprise sentences and standard word segmentation results of the sentences;
inputting sentences of the first sample of the first training sample set into a predictive word segmentation model to obtain a predictive word segmentation result of the sentences of the first sample;
in response to the predicted word segmentation result and the standard word segmentation result of a first sentence in the first sample being in any one of a consistent state, a split state, and a merge state, combining the first sentence and the predicted word segmentation result of the first sentence into a second sample;
in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in a conflict state, combining the first sentence and the standard word segmentation result of the first sentence into a second sample;
the consistent state is used for indicating that the predicted word segmentation result of the first sentence is consistent with the standard word segmentation result; the split state is used for indicating that at least two words in the predicted word segmentation result belong to one word in the standard word segmentation result of the first sentence; the merge state is used for indicating that at least two words in the standard word segmentation result belong to one word in the predicted word segmentation result of the first sentence; and the conflict state is used for indicating that the predicted word segmentation result of the first sentence has an erroneous word compared with the standard word segmentation result of the first sentence.
2. The method of claim 1, wherein prior to the acquiring the first set of training samples, the method further comprises:
acquiring a second training sample set, wherein the second training sample set comprises a plurality of initial samples with target granularity;
and training an initial word segmentation model according to the second training sample set to obtain the prediction word segmentation model.
3. A method of model training, the method comprising:
acquiring a first training sample set, wherein the first training sample set comprises one or more first samples, and the first samples comprise sentences and standard word segmentation results of the sentences;
inputting sentences of the first sample of the first training sample set into a predictive word segmentation model to obtain a predictive word segmentation result of the sentences of the first sample;
in response to the predicted word segmentation result and the standard word segmentation result of a first sentence in the first sample being in any one of a consistent state, a split state, and a merge state, combining the first sentence and the predicted word segmentation result of the first sentence into a second sample;
in response to the predicted word segmentation result and the standard word segmentation result of the first sentence being in a conflict state, combining the first sentence and the standard word segmentation result of the first sentence into a second sample;
training the predictive word segmentation model according to the second sample;
the consistent state is used for indicating that the predicted word segmentation result of the first sentence is consistent with the standard word segmentation result; the split state is used for indicating that at least two words in the predicted word segmentation result belong to one word in the standard word segmentation result of the first sentence; the merge state is used for indicating that at least two words in the standard word segmentation result belong to one word in the predicted word segmentation result of the first sentence; and the conflict state is used for indicating that the predicted word segmentation result of the first sentence has an erroneous word compared with the standard word segmentation result of the first sentence.
4. The method of claim 3, wherein prior to the acquiring the first set of training samples, the method further comprises:
acquiring a second training sample set, wherein the second training sample set comprises a plurality of initial samples with target granularity;
and training an initial word segmentation model according to the second training sample set to obtain the prediction word segmentation model.
5. A training sample acquisition device, the training sample acquisition device comprising:
the first acquisition module is used for acquiring a first training sample set, wherein the first training sample set comprises one or more first samples, and the first samples comprise sentences and standard word segmentation results of the sentences;
the second acquisition module is used for inputting sentences of the first sample of the first training sample set into a predictive word segmentation model to obtain a predictive word segmentation result of the sentences of the first sample;
the combination module is used for responding to the fact that the predicted word segmentation result and the standard word segmentation result of the first sentence in the first sample are in a consistent state, splitting the state and combining any one state of the state, and combining the first sentence and the predicted word segmentation result of the first sentence into a second sample; responding to the predicted word segmentation result and the standard word segmentation result of the first sentence to be in a conflict state, and combining the first sentence and the standard word segmentation result of the first sentence into a second sample;
the consistent state is used for indicating that the predicted word segmentation result of the first sentence is consistent with the standard word segmentation result; the split state is used for indicating that at least two words in the predicted word segmentation result belong to one word in the standard word segmentation result of the first sentence; the merge state is used for indicating that at least two words in the standard word segmentation result belong to one word in the predicted word segmentation result of the first sentence; and the conflict state is used for indicating that the predicted word segmentation result of the first sentence has an erroneous word compared with the standard word segmentation result of the first sentence.
6. A model training apparatus, characterized in that the model training apparatus comprises:
a third obtaining module, configured to obtain a first training sample set, where the first training sample set includes one or more first samples, and the first samples include sentences and standard word segmentation results of the sentences;
a fourth obtaining module, configured to input a sentence of the first sample in the first training sample set into a predictive word segmentation model, and obtain a predictive word segmentation result of the sentence of the first sample;
the sample generation module is used for responding to the fact that the predicted word segmentation result and the standard word segmentation result of the first sentence in the first sample are in a consistent state, splitting any one of the state and the merging state, and combining the first sentence and the predicted word segmentation result of the first sentence into a second sample; responding to the predicted word segmentation result and the standard word segmentation result of the first sentence to be in a conflict state, and combining the first sentence and the standard word segmentation result of the first sentence into a second sample;
the training module is used for training the predictive word segmentation model according to the second sample;
the consistent state is used for indicating that the predicted word segmentation result of the first sentence is consistent with the standard word segmentation result; the split state is used for indicating that at least two words in the predicted word segmentation result belong to one word in the standard word segmentation result of the first sentence; the merge state is used for indicating that at least two words in the standard word segmentation result belong to one word in the predicted word segmentation result of the first sentence; and the conflict state is used for indicating that the predicted word segmentation result of the first sentence has an erroneous word compared with the standard word segmentation result of the first sentence.
7. A training sample acquisition device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set that is loaded and executed by the processor to implement the method of any one of claims 1 to 4.
8. A computer storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of any of claims 1 to 4.
CN202010519680.3A 2020-06-09 2020-06-09 Training sample acquisition method, model training method, device and equipment Active CN111597809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010519680.3A CN111597809B (en) 2020-06-09 2020-06-09 Training sample acquisition method, model training method, device and equipment

Publications (2)

Publication Number Publication Date
CN111597809A CN111597809A (en) 2020-08-28
CN111597809B (en) 2023-08-08

Family

ID=72190055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010519680.3A Active CN111597809B (en) 2020-06-09 2020-06-09 Training sample acquisition method, model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN111597809B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133622A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 The dividing method and device of a kind of word
CN107622056A (en) * 2016-07-13 2018-01-23 百度在线网络技术(北京)有限公司 The generation method and device of training sample
CN110032650A (en) * 2019-04-18 2019-07-19 腾讯科技(深圳)有限公司 A kind of generation method, device and the electronic equipment of training sample data
CN110795552A (en) * 2019-10-22 2020-02-14 腾讯科技(深圳)有限公司 Training sample generation method and device, electronic equipment and storage medium
CN111209377A (en) * 2020-04-23 2020-05-29 腾讯科技(深圳)有限公司 Text processing method, device, equipment and medium based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Neural Word Segmentation Learning for Chinese; Deng Cai et al.; arXiv; 1-11 *

Similar Documents

Publication Publication Date Title
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN111222317B (en) Sequence labeling method, system and computer equipment
CN110795552B (en) Training sample generation method and device, electronic equipment and storage medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN112364660B (en) Corpus text processing method, corpus text processing device, computer equipment and storage medium
CN111079431A (en) Entity relation joint extraction method based on transfer learning
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN109726397B (en) Labeling method and device for Chinese named entities, storage medium and electronic equipment
Shi et al. Watch it twice: Video captioning with a refocused video encoder
CN112188311B (en) Method and apparatus for determining video material of news
US20220139386A1 (en) System and method for chinese punctuation restoration using sub-character information
CN116187282B (en) Training method of text review model, text review method and device
CN111597800A (en) Method, device, equipment and storage medium for obtaining synonyms
CN114333838A (en) Method and system for correcting voice recognition text
CN111597809B (en) Training sample acquisition method, model training method, device and equipment
CN115116427B (en) Labeling method, voice synthesis method, training method and training device
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN113591493B (en) Translation model training method and translation model device
CN111291576B (en) Method, device, equipment and medium for determining internal representation information quantity of neural network
CN114625759A (en) Model training method, intelligent question answering method, device, medium, and program product
CN114611529A (en) Intention recognition method and device, electronic equipment and storage medium
CN114241279A (en) Image-text combined error correction method and device, storage medium and computer equipment
CN114386480A (en) Training method, application method, device and medium of video content description model
CN114490935A (en) Abnormal text detection method and device, computer readable medium and electronic equipment
CN111723188A (en) Sentence display method and electronic equipment based on artificial intelligence for question-answering system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant