CN112749544B - Training method and system of paragraph segmentation model


Info

Publication number
CN112749544B
CN112749544B (application CN202011583136.1A)
Authority
CN
China
Prior art keywords
data
model
field
segmentation
punctuation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011583136.1A
Other languages
Chinese (zh)
Other versions
CN112749544A (en)
Inventor
秦文杰 (Qin Wenjie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd
Priority to CN202011583136.1A
Publication of CN112749544A
Application granted
Publication of CN112749544B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

An embodiment of the invention provides a training method for a paragraph segmentation model. The method comprises the following steps: pre-training a neural network model of the paragraph segmentation model on general segmentation data; then training the coding layer related to feature extraction in the pre-trained paragraph segmentation model on domain segmentation data, to obtain a domain-adapted paragraph segmentation model. An embodiment of the invention also provides a training system for the paragraph segmentation model. To address the problem that a specific domain requires large amounts of finely annotated training data, the embodiment trains on plentiful, easily acquired general segmentation data and finally fine-tunes on a small amount of finely annotated domain data, which effectively reduces the cost of domain adaptation. To address the model's sensitivity to the output of the upstream punctuation model, the robustness of the segmentation model is improved, its dependence on upstream punctuation is reduced, and at the same time the upstream punctuation output can be corrected.

Description

Training method and system of paragraph segmentation model
Technical Field
The invention relates to the field of intelligent speech, and in particular to a training method and system for a paragraph segmentation model.
Background
Paragraph segmentation is increasingly useful. For example, when a recording of a teacher's lesson is converted into text, the converted text arrives as one undivided block. Paragraph segmentation splits such a block into multiple paragraphs, giving the user a much better experience when reviewing the text.
The market currently offers paragraph segmentation methods based on traditional machine learning, such as SVM (Support Vector Machine), and methods based on neural networks, such as LSTM (Long Short-Term Memory).
Paragraph segmentation is essentially a classification task: the model predicts, for each sentence in a document, whether a line break should follow it, thereby completing the paragraph segmentation of the text.
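To make this framing concrete, the sketch below applies a list of per-sentence break predictions to rebuild paragraphs. The sentences and predictions are hypothetical stand-ins for real model output.

```python
def apply_segmentation(sentences, break_after):
    """Join sentences into paragraphs, starting a new paragraph after
    every sentence whose break prediction is True."""
    paragraphs, current = [], []
    for sentence, brk in zip(sentences, break_after):
        current.append(sentence)
        if brk:
            paragraphs.append(" ".join(current))
            current = []
    if current:  # flush a trailing paragraph with no final break
        paragraphs.append(" ".join(current))
    return paragraphs

sentences = ["First point.", "More detail.", "New topic.", "Closing remark."]
predictions = [False, True, False, True]  # hypothetical model output
print(apply_segmentation(sentences, predictions))
# → ['First point. More detail.', 'New topic. Closing remark.']
```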
An SVM-based paragraph segmentation method mainly learns a hyperplane that separates segmenting sentences from non-segmenting sentences in a high-dimensional space.
An LSTM-based paragraph segmentation method uses the encoder of a deep learning model, typified by the LSTM, to extract text features and, based on those features, predicts whether each sentence requires a line break.
In the process of implementing the present invention, the inventor finds that at least the following problems exist in the related art:
1. The cost of domain adaptation is high. Text carrying paragraph segmentation information is typically well-formatted news copy; such data is easy to obtain and considerable in size. A model trained on it, however, segments poorly in a new domain, and large amounts of text in that domain must be manually annotated for retraining. This is because the trained model contains no general text-processing knowledge and can only learn from scratch on large amounts of manually annotated data.
2. Sensitivity to upstream punctuation output. The upstream punctuation model performs poorly on some domain texts; in particular, its F1 score on sentence-ending punctuation such as the period strongly affects the downstream segmentation model's performance. In other words, the segmentation model is not robust.
Disclosure of Invention
Embodiments of the present invention are provided to at least solve the prior-art problems of high domain-adaptation cost and sensitivity to upstream punctuation output.
In a first aspect, an embodiment of the present invention provides a training method for a paragraph segmentation model, including:
pre-training a neural network model of the paragraph segmentation model by utilizing general segmentation data;
and training a coding layer related to feature extraction in the pre-trained paragraph segmentation model based on domain segmentation data, to obtain a domain-adapted paragraph segmentation model.
In a second aspect, an embodiment of the present invention provides a training system for a paragraph segmentation model, including:
a model pre-training program module, configured to pre-train the neural network model of the paragraph segmentation model by utilizing general segmentation data;
and a segmentation model training program module, configured to train a coding layer related to feature extraction in the pre-trained paragraph segmentation model based on domain segmentation data, to obtain a domain-adapted paragraph segmentation model.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the training method of the paragraph segmentation model of any of the embodiments of the invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the training method of the paragraph segmentation model of any of the embodiments of the present invention.
The embodiments of the invention have the following beneficial effects. For the problem that a specific domain requires large amounts of finely annotated training data, a pre-trained model such as BERT is trained on plentiful, easily acquired general segmentation data and finally fine-tuned on a small amount of finely annotated domain data, effectively reducing the cost of domain adaptation. For the problem of sensitivity to the output of the upstream punctuation model, the segmentation information is combined with the upstream punctuation output to construct new segmentation training data, and the distribution of punctuation marks appearing before segmentation labels is counted to introduce new sentence-splitting punctuation. This improves the robustness of the segmentation model, reduces its dependence on upstream punctuation, and at the same time allows the upstream punctuation output to be corrected.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a training method of a paragraph segmentation model according to an embodiment of the present invention;
FIG. 2 is a flowchart showing the overall steps of the training method of a paragraph segmentation model according to an embodiment of the present invention;
FIG. 3 is a structural data diagram for the training method of a paragraph segmentation model according to an embodiment of the present invention;
FIG. 4 is a data diagram of the segmentation model's error-correction effect on punctuation model results, for the training method of a paragraph segmentation model according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a training system for a paragraph segmentation model according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a training method of a paragraph segmentation model according to an embodiment of the present invention; the method includes the following steps:
S11: pre-training a neural network model of the paragraph segmentation model by utilizing general segmentation data;
S12: training a coding layer related to feature extraction in the pre-trained paragraph segmentation model based on domain segmentation data, to obtain a domain-adapted paragraph segmentation model.
The coding layer related to feature extraction in the neural network model of the paragraph segmentation model and the coding layer related to feature extraction in the domain-adapted paragraph segmentation model are shared, and are used to learn and extract lexical, syntactic and grammatical features.
In this embodiment, adapting an existing segmentation model to a new domain demands heavy data annotation, mainly because conventional training schemes do not consider that the low-level feature-extraction part of natural language processing (NLP) technology can be shared across tasks.
For step S11, general corpora are relatively easy to obtain. Taking the Transformer, currently the mainstream neural network in NLP, as an example: the network typically has several layers. The bottom coding layers generally learn general linguistic knowledge such as lexis, syntax and grammar for feature extraction, while the higher coding layers learn knowledge specific to particular tasks. Therefore, a Transformer model trained on massive data for one task can lend its bottom coding layers to other small-data NLP tasks to reduce training cost. In this way, the neural network model of the paragraph segmentation model is pre-trained on massive general segmentation data.
As an embodiment, the neural network model comprises a BERT model.
The Transformer encoder is considered because its self-attention mechanism supports bidirectional training and yields semantic representations at the sentence level, a higher level than words. To adapt to transfer learning across multiple tasks, BERT additionally designs more general input and output layers. BERT is further chosen because its fine-tuning cost is small.
For step S12, only a small amount of domain segmentation data is needed to fine-tune the feature-extraction coding layer (for example, the bottom coding layers described above) of the paragraph segmentation model trained in step S11, which effectively reduces the cost of domain adaptation.
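The two-stage scheme of steps S11 and S12 can be illustrated with a deliberately tiny numeric sketch: "pre-train" a one-parameter model on plentiful general data, then fine-tune it on a few domain examples. This is only an analogy for the pretrain-then-fine-tune idea; the real model is a BERT-style network, not a scalar regression.

```python
def sgd_fit(w, data, lr, epochs):
    """Minimise mean squared error of y ≈ w * x with plain SGD."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

# Stage 1: "pre-train" on plentiful general data (true slope 2.0).
general = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w = sgd_fit(0.0, general, lr=0.01, epochs=200)

# Stage 2: fine-tune on a handful of domain examples (true slope 2.5).
domain = [(1.0, 2.5), (2.0, 5.0)]
w = sgd_fit(w, domain, lr=0.01, epochs=200)
print(round(w, 2))  # → 2.5
```

The pre-trained parameter gives fine-tuning a good starting point, so only a few domain examples are needed to reach the domain-specific optimum.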
According to this embodiment, for the problem that a specific domain requires large amounts of finely labeled training data, a pre-trained model such as BERT is trained on plentiful, easily acquired general segmentation data and then fine-tuned on a small amount of finely labeled domain data, effectively reducing the cost of domain adaptation.
As an implementation, in this embodiment the domain segmentation data is generated using the upstream punctuation model together with manually annotated segmentation data, as follows:
inputting the raw domain data into the upstream punctuation model to obtain segmented domain punctuation data;
receiving segmented manual annotation data produced by manually annotating the raw domain data;
determining a sentence-ending symbol set based on the punctuation types in the segmented manual annotation data, the set being used to split the raw domain data into sentences to obtain segmented manual domain punctuation data;
and generating domain segmentation data carrying both punctuation information and segmentation information, based on the domain punctuation data and the manual domain punctuation data.
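The data-generation steps above can be sketched as follows. The upstream punctuation model's output is simulated here by an already-punctuated string, and the sentence-ending set is given directly; both are hypothetical stand-ins.

```python
import re

def build_training_data(punctuated, manual_paragraphs, end_set):
    """Split upstream-punctuated text into sentence units on the
    sentence-ending symbol set, then label each unit 1 if the manual
    annotation places a paragraph break right after it, else 0."""
    pattern = "([" + re.escape("".join(end_set)) + "])"
    parts = re.split(pattern, punctuated)
    units = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    # Character positions where the manual annotation ends a paragraph.
    break_points, pos = set(), 0
    for para in manual_paragraphs:
        pos += len(para)
        break_points.add(pos)
    samples, pos = [], 0
    for unit in units:
        pos += len(unit)
        samples.append((unit, 1 if pos in break_points else 0))
    return samples

punctuated = "今天讲第一章，先看定义。下面进入第二章，请看例子。"
manual_paragraphs = ["今天讲第一章，先看定义。", "下面进入第二章，请看例子。"]
samples = build_training_data(punctuated, manual_paragraphs, {"。", "，"})
print([label for _, label in samples])  # → [0, 1, 0, 1]
```

Each (sentence unit, label) pair then serves as one training example carrying both punctuation information and segmentation information.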
As an embodiment, before inputting the raw domain data into the upstream punctuation model, the method further comprises: performing punctuation removal processing on the raw domain data.
In this embodiment, existing segmentation models depend heavily on upstream punctuation output, mainly because the prior art assumes the upstream punctuation is of high quality, so the division into sentence units relies entirely on upstream sentence-ending punctuation such as the period. In actual business scenarios such as spoken dialogue, however, the output quality of the upstream punctuation model is poor; in particular, its prediction of sentence-ending marks is inaccurate. Statistics show that in such scenarios the punctuation model's predicted positions are generally accurate but its predicted categories are wrong. We tried both complete and partial decoupling of the segmentation model from the punctuation model and, for practical reasons, finally selected partial decoupling as the final scheme. Its specific flow is shown in Fig. 2:
First, some domain data is prepared. Whether punctuation removal is needed depends on whether the domain data carries punctuation: if it does not, the data is fed directly into the upstream punctuation model; if it does, punctuation removal is performed first. Because punctuation is restored in the subsequent step, the prepared domain data need not carry punctuation at this stage.
After punctuation removal, the domain data is fed into the upstream punctuation model to obtain domain data carrying the upstream punctuation output; at the same time, segmentation labels are added manually, yielding manually annotated segmentation data.
Combining the manually annotated segmentation data with the output of the upstream punctuation model, the types of punctuation marks appearing immediately before the manual segmentation labels are counted to form a sentence-ending symbol set. This set is used to split the input text into sentences and to construct domain training data carrying both punctuation information and segmentation information.
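The counting step can be sketched as below; the paragraphs, carrying simulated upstream punctuation, are hypothetical.

```python
from collections import Counter

def ending_symbol_set(annotated_paragraphs, min_count=1):
    """Count the punctuation mark the upstream model left at each manually
    annotated paragraph boundary; keep the frequent marks as the
    sentence-ending symbol set."""
    counts = Counter(p[-1] for p in annotated_paragraphs if p)
    return {mark for mark, n in counts.items() if n >= min_count}, counts

# Some manual paragraph boundaries fall after a comma, because the
# upstream model mis-predicted the sentence-ending mark there.
paragraphs = ["第一段结束。", "第二段结束，", "第三段结束。", "第四段结束，"]
end_set, counts = ending_symbol_set(paragraphs)
print(counts["。"], counts["，"])  # → 2 2
print(sorted(end_set))            # → ['。', '，']
```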
Whether a first round of fine-tuning on the general segmentation corpus, starting from the pre-trained model, is needed is decided by business requirements. In general, if the model being trained is a dedicated business model for a specific domain, training on the general segmentation corpus may be skipped; otherwise it is performed by default. After this pre-training is completed, the domain segmentation data is used for a second round of fine-tuning on top of the pre-trained model; this part has already been described in steps S11 and S12 and is not repeated here.
After the paragraph segmentation model is trained, it can be used: a large block of text input by a user is received and segmented by the model. If the punctuation at the end of a segmented paragraph is not a conventional sentence-ending mark such as a period or question mark (for example, it is a comma), it is uniformly changed to a period. The segmented text then conforms to punctuation usage rules and is returned to the user.
According to this embodiment, to address sensitivity to the upstream punctuation model's output, the segmentation information is combined with the upstream punctuation output to construct new segmentation training data, and the distribution of punctuation marks before segmentation labels is counted to introduce new sentence-splitting punctuation. For example, statistics show that besides sentence-ending marks such as the period, commas also appear in large numbers before segmentation labels, so commas are used as sentence-splitting punctuation as well, and training and prediction are performed on sentences divided by this new punctuation set. If the final prediction requires a segmentation at a comma, the comma is changed to a period and the paragraph is wrapped. This improves the robustness of the segmentation model, reduces its dependence on upstream punctuation, and at the same time allows the upstream punctuation output to be corrected.
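The comma-to-period rewrite at predicted breaks can be sketched as follows; the sample text is hypothetical.

```python
SENTENCE_END = {"。", "？", "！"}  # conventional sentence-ending marks

def normalize_breaks(paragraphs):
    """After segmentation, replace any non-conventional paragraph-final
    mark (e.g. a comma) with a period, then join paragraphs with newlines."""
    fixed = []
    for p in paragraphs:
        if p and p[-1] not in SENTENCE_END:
            p = p[:-1] + "。"
        fixed.append(p)
    return "\n".join(fixed)

result = normalize_breaks(["大家好，今天开始上课，", "首先复习上一节的内容。"])
print(result)
# The trailing comma of the first paragraph becomes a period:
# 大家好，今天开始上课。
# 首先复习上一节的内容。
```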
The method was tested. Objective evaluation (F1 score):
Paragraph segmentation model not trained with this method: 24
Paragraph segmentation model trained with this method: 94
Subjective evaluation (manual scoring, full score 42):
Paragraph segmentation model not trained with this method: 20.3
Paragraph segmentation model trained with this method: 33.3
Conclusion: the segmentation quality can be obviously improved, and only a small amount of annotation corpus is needed.
Partial decoupling of paragraph segmentation model from punctuation model:
As can be seen from Fig. 3, a segmentation model that is not decoupled from the upstream punctuation model is strongly affected by the upstream punctuation output: when the punctuation is switched from manual punctuation to system punctuation, its F1 score at segment boundaries drops sharply from 88 to 36.
With the partially decoupled segmentation model, performance is stable: the F1 scores at segment boundaries are 92 with manual punctuation and 94 with system punctuation.
Conclusion: the partial decoupling scheme can remarkably improve the robustness of the model, and can bring improvement benefits of the segmentation quality.
A further effect is shown in Fig. 4: evaluating the segmentation model's error correction of the punctuation model's results leads to the conclusion that the method can additionally improve punctuation-model performance and thus further improve the user's experience when reading the text.
On the other hand, complete decoupling serves as our alternative scheme:
Slicing the text to obtain a plurality of sliced texts;
Judging whether the sliced text needs to be segmented or not based on a paragraph segmentation model;
And if the segmentation is needed, inputting the sliced text into an upstream punctuation model, and determining the position of the sliced text segmentation based on the output of the upstream punctuation model.
In this embodiment, the principle of complete decoupling is as follows: the input of the segmentation model is kept consistent with that of the punctuation model. The text to be segmented is sliced by a fixed window size, and the model predicts whether each slice requires a segmentation. The punctuation model's output within each window is then combined to determine the exact position of the break.
Further, with an appropriately chosen window size, the punctuation model's result places at most one punctuation mark near the end of each slice, so the exact break position can be determined more accurately.
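A minimal sketch of this complete-decoupling flow, with the segmentation decision and the punctuation positions both supplied by hypothetical stand-in inputs:

```python
def slice_text(text, window):
    """Cut the text into fixed-size windows."""
    return [text[i:i + window] for i in range(0, len(text), window)]

def segment_positions(text, window, needs_break, punct_positions):
    """The segmentation model decides *whether* a slice ends a paragraph;
    the upstream punctuation output inside that slice supplies *where*
    the break goes (here: the last punctuation position in the slice)."""
    breaks = []
    for idx, piece in enumerate(slice_text(text, window)):
        if needs_break(piece):
            start = idx * window
            in_slice = [p for p in punct_positions if start <= p < start + len(piece)]
            if in_slice:
                breaks.append(max(in_slice))
    return breaks

text = "abcdefghij"
# Hypothetical stand-ins: break after any slice containing "e";
# the punctuation model marked positions 4 and 9.
print(segment_positions(text, window=5,
                        needs_break=lambda s: "e" in s,
                        punct_positions=[4, 9]))  # → [4]
```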
According to this embodiment, paragraph segmentation is completely decoupled from the upstream punctuation model: the segmentation decision is made entirely by the paragraph segmentation model, and the upstream punctuation model only supplies the exact break position when a break is needed. The segmentation result is thus entirely unaffected by upstream punctuation, and the robustness of the model improves markedly.
Fig. 5 is a schematic structural diagram of a training system for a paragraph segmentation model according to an embodiment of the present invention. The system can execute the training method of the paragraph segmentation model according to any of the foregoing embodiments and is configured in a terminal.
The training system 10 of the paragraph segmentation model provided in this embodiment includes: a model pre-training program module 11 and a segmentation model training program module 12.
The model pre-training program module 11 is configured to pre-train a neural network model of the paragraph segmentation model by utilizing general segmentation data; the segmentation model training program module 12 is configured to train the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on domain segmentation data, to obtain a domain-adapted paragraph segmentation model.
Further, the coding layer related to feature extraction in the neural network model of the paragraph segmentation model and the coding layer related to feature extraction in the domain-adapted paragraph segmentation model are shared, and are used to learn and extract lexical, syntactic and grammatical features.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions can execute the training method of the paragraph segmentation model in any method embodiment;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
Pre-training a neural network model of the paragraph segmentation model by utilizing general segmentation data;
and training a coding layer related to feature extraction in the pre-trained paragraph segmentation model based on domain segmentation data, to obtain a domain-adapted paragraph segmentation model.
As a non-volatile computer readable storage medium, it may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the training method of the paragraph segmentation model in any of the method embodiments described above.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium may optionally include memory remotely located relative to the processor, which may be connected to the apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the invention also provides electronic equipment, which comprises: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the training method of the paragraph segmentation model of any of the embodiments of the invention.
The electronic device of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID, and UMPC devices, etc., such as tablet computers.
(3) Portable entertainment devices such devices can display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Other electronic devices with data processing functions.
The terms "comprises," "comprising," or any other variation thereof, herein are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of training a paragraph segmentation model, comprising:
pre-training a neural network model of the paragraph segmentation model with general segmentation data;
performing punctuation-removal processing on original domain data and inputting the result into a punctuation model to obtain segmented domain punctuation data;
receiving segmented manual annotation data obtained by manually annotating the original domain data;
determining a sentence-ending symbol set based on the punctuation types in the segmented manual annotation data, the sentence-ending symbol set being used to segment the original domain data to obtain segmented manual domain punctuation data;
generating domain segmentation data carrying punctuation information and segmentation information based on the domain punctuation data and the manual domain punctuation data; and
training the feature-extraction-related coding layer in the pre-trained paragraph segmentation model based on the domain segmentation data to obtain a domain-adapted paragraph segmentation model.
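The data-preparation steps of claim 1 (removing punctuation from raw domain text, deriving a sentence-ending symbol set from the manual annotations, and segmenting at those symbols) can be sketched as follows. This is a minimal illustrative sketch, not code from the patent: the function names (`remove_punctuation`, `derive_sentence_end_set`, `segment`) and the candidate symbol set are hypothetical, and the real punctuation model of the claim is assumed to exist elsewhere.

```python
import re

# Hypothetical sketch of the claim-1 data-preparation steps.
# A real system would feed `remove_punctuation(...)` output to the
# pre-existing punctuation model mentioned in the claim.

# Assumed candidate sentence-ending symbols (not specified in the patent).
SENTENCE_END_CANDIDATES = set("。！？.!?")

def remove_punctuation(text: str) -> str:
    """Strip all punctuation from raw domain text, keeping words and spaces."""
    return re.sub(r"[^\w\s]", "", text)

def derive_sentence_end_set(annotated: str) -> set:
    """Collect the sentence-ending punctuation types that actually occur
    in the manually annotated data."""
    return {ch for ch in annotated if ch in SENTENCE_END_CANDIDATES}

def segment(annotated: str, end_set: set) -> list:
    """Split annotated text into segments at sentence-ending symbols."""
    segments, buf = [], []
    for ch in annotated:
        buf.append(ch)
        if ch in end_set:
            segments.append("".join(buf).strip())
            buf = []
    tail = "".join(buf).strip()
    if tail:
        segments.append(tail)
    return segments

annotated = "今天开会讨论预算。下周提交报告！"
end_set = derive_sentence_end_set(annotated)   # symbols seen in the annotations
raw = remove_punctuation(annotated)            # punctuation-free model input
print(segment(annotated, end_set))
```

The sentence-ending set is derived from the annotations rather than fixed in advance, mirroring the claim's "determining a sentence-ending symbol set based on punctuation types in the segmented manual annotation data".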
2. The method of claim 1, wherein the feature-extraction-related coding layers in the neural network model of the paragraph segmentation model and the feature-extraction-related coding layers of the domain-adapted paragraph segmentation model are shared, so as to learn to extract lexical, syntactic, and grammatical features.
3. The method of claim 1, wherein the neural network model comprises a BERT model.
4. The method according to any one of claims 1-3, wherein the data volume of the domain segmentation data is smaller than the data volume of the general segmentation data.
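One way to represent the "domain segmentation data carrying punctuation information and segmentation information" of claims 1 and 5 is as per-character training labels, where a punctuation mark becomes a label on the preceding character and sentence-ending marks additionally close a segment. The labeling scheme and function below are illustrative assumptions, not the patent's specification.

```python
# Hypothetical per-character labeling scheme: "O" = no mark after this
# character, "PUNC_x" = non-ending punctuation x follows, "SEG_x" =
# sentence-ending punctuation x follows and a segment boundary occurs.

def build_labels(annotated: str, end_set: set) -> tuple:
    """Turn punctuated text into (characters, labels) suitable as
    joint punctuation + segmentation training targets."""
    chars, labels = [], []
    for ch in annotated:
        if ch in end_set:
            if labels:
                labels[-1] = "SEG_" + ch        # segment boundary
        elif not ch.isalnum() and not ch.isspace():
            if labels:
                labels[-1] = "PUNC_" + ch       # intra-segment punctuation
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels

chars, labels = build_labels("今天，开会。", {"。"})
print(chars)   # characters without punctuation
print(labels)  # one label per character
```

Labels of this shape let a single token-classification head over the shared coding layer (e.g. a BERT encoder, per claim 3) predict punctuation and paragraph boundaries jointly, which is consistent with claim 4's point that a small amount of domain data suffices for fine-tuning.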
5. A training system for a paragraph segmentation model, comprising:
a model pre-training program module configured to pre-train a neural network model of the paragraph segmentation model with general segmentation data; and
a segmentation model training program module configured to: perform punctuation-removal processing on original domain data and input the result into a punctuation model to obtain segmented domain punctuation data; receive segmented manual annotation data obtained by manually annotating the original domain data; determine a sentence-ending symbol set based on the punctuation types in the segmented manual annotation data, and segment the original domain data to obtain segmented manual domain punctuation data; generate domain segmentation data carrying punctuation information and segmentation information based on the domain punctuation data and the manual domain punctuation data; and train the feature-extraction-related coding layer in the pre-trained paragraph segmentation model based on the domain segmentation data to obtain a domain-adapted paragraph segmentation model.
6. The system of claim 5, wherein the feature-extraction-related coding layers in the neural network model of the paragraph segmentation model and the feature-extraction-related coding layers of the domain-adapted paragraph segmentation model are shared, so as to learn to extract lexical, syntactic, and grammatical features.
7. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
8. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1-4.
CN202011583136.1A 2020-12-28 2020-12-28 Training method and system of paragraph segmentation model Active CN112749544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011583136.1A CN112749544B (en) 2020-12-28 2020-12-28 Training method and system of paragraph segmentation model

Publications (2)

Publication Number Publication Date
CN112749544A CN112749544A (en) 2021-05-04
CN112749544B true CN112749544B (en) 2024-04-30

Family

ID=75646287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011583136.1A Active CN112749544B (en) 2020-12-28 2020-12-28 Training method and system of paragraph segmentation model

Country Status (1)

Country Link
CN (1) CN112749544B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641793B (en) * 2021-08-16 2024-05-07 国网安徽省电力有限公司电力科学研究院 Retrieval system for long text matching optimization aiming at electric power standard
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN110427482A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 A kind of abstracting method and relevant device of object content
CN111553147A (en) * 2020-03-27 2020-08-18 南京工业大学 BERT model based on N-gram and semantic segmentation method
CN111931482A (en) * 2020-09-22 2020-11-13 苏州思必驰信息科技有限公司 Text segmentation method and device
CN111930937A (en) * 2020-06-28 2020-11-13 山东师范大学 BERT-based intelligent government affair text multi-classification method and system


Similar Documents

Publication Publication Date Title
CN111027331B (en) Method and apparatus for evaluating translation quality
CN103635963B (en) Language model across languages initialization
CN106534548B (en) Voice error correction method and device
CN110852087A (en) Chinese error correction method and device, storage medium and electronic device
CN111090727B (en) Language conversion processing method and device and dialect voice interaction system
CA3009758A1 (en) Systems and methods for suggesting emoji
CN107204184A (en) Audio recognition method and system
CN109949799B (en) Semantic parsing method and system
CN103268339A (en) Recognition method and system of named entities in microblog messages
CN112749544B (en) Training method and system of paragraph segmentation model
CN112036184A (en) Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN113282701B (en) Composition material generation method and device, electronic equipment and readable storage medium
CN112686051B (en) Semantic recognition model training method, recognition method, electronic device and storage medium
CN111160026B (en) Model training method and device, and text processing method and device
CN111128122B (en) Method and system for optimizing rhythm prediction model
CN117252217A (en) Verification method and related device for translation text
CN112151019A (en) Text processing method and device and computing equipment
CN111090970B (en) Text standardization processing method after voice recognition
CN110969005A (en) Method and device for determining similarity between entity corpora
CN112528628A (en) Text processing method and device and electronic equipment
CN109783648B (en) Method for improving ASR language model by using ASR recognition result
CN111462734A (en) Semantic slot filling model training method and system
CN111046674A (en) Semantic understanding method and device, electronic equipment and storage medium
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium
CN111428475A (en) Word segmentation word bank construction method, word segmentation method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant