CN113240485A - Training method of text generation model, and text generation method and device

Training method of text generation model, and text generation method and device

Info

Publication number
CN113240485A
CN113240485A
Authority
CN
China
Prior art keywords
selling
selling point
preset
text
phrases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110506426.4A
Other languages
Chinese (zh)
Inventor
王艳花
刘朋樟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202110506426.4A priority Critical patent/CN113240485A/en
Publication of CN113240485A publication Critical patent/CN113240485A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/06 - Buying, selling or leasing transactions
    • G06Q30/0601 - Electronic shopping [e-shopping]
    • G06Q30/0641 - Shopping interfaces
    • G06Q30/0643 - Graphical representation of items or shoppers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a training method for a text generation model, a text generation method, and a device. The training method includes the following steps: extracting candidate selling points from a target text according to a preset Chinese language model, constructing a selling point candidate set from the candidate selling points, selecting target selling point phrases from the target text, and training a preset text generation model according to the selling point candidate set and the target selling point phrases. In this technical scheme, phrase extraction is performed on the target text in advance with the preset Chinese language model to obtain the selling point candidate set, and training data pairs are formed from the selling point candidate set and the target selling point phrases, so the preset text generation model can be trained without a large amount of training data; the trained preset text generation model does not hallucinate and has a good output effect.

Description

Training method of text generation model, and text generation method and device
Technical Field
The application relates to the technical field of machine learning, and in particular to a training method for a text generation model, a text generation method, and a text generation device.
Background
Usually, a user browses article information through the display interface of a terminal device. To help the user learn about an article quickly, key phrases describing the article's characteristics or core attributes, i.e., selling points, are extracted from the article's detailed introduction text and displayed on the display interface. By browsing these key phrases, the user can quickly get to know the article.
In the prior art, automatic key phrase extraction mainly trains a model on massive key phrases as the training corpus, then inputs long article introduction texts into the trained model, which completes the extraction automatically and outputs the key phrases.
However, the models used in the prior art require a large training corpus to perform well, and because that corpus is obtained from an existing knowledge database, it is difficult for the trained model to extract new key phrases, so its extraction effect on key phrases is poor.
Disclosure of Invention
The application provides a training method of a text generation model, a text generation method and a text generation device, which are used for solving the problem that the output effect of the existing text generation model is poor.
In a first aspect, an embodiment of the present application provides a method for training a text generation model, including:
extracting candidate selling points from the target text according to a preset Chinese language model, wherein the preset Chinese language model is obtained by utilizing data training of a preset selling point knowledge base, and the candidate selling points are used for describing first characteristics of the article;
constructing a selling point candidate set according to the candidate selling points;
selecting a target selling point phrase from the target text;
and training a preset text generation model according to the selling point candidate set and the target selling point phrase.
In a possible design of the first aspect, the extracting candidate selling points from the target text according to a preset Chinese language model includes:
the target text is divided into sentences to obtain candidate phrases;
scoring the candidate phrases by using the preset Chinese language model to obtain scores of the candidate phrases;
and acquiring candidate selling points from the candidate phrases according to the scores of the candidate phrases and a preset score threshold value.
In another possible design of the first aspect, the scoring the candidate phrases by using the preset Chinese language model to obtain scores of the candidate phrases includes:
constructing unigrams, bigrams, and trigrams of the candidate phrases;
calculating the scores of the unigrams, bigrams, and trigrams by using the preset Chinese language model;
and determining the scores of the candidate phrases according to the scores of the unigrams, bigrams, and trigrams.
In yet another possible design of the first aspect, the selecting a target selling point phrase from the target text includes:
performing text classification on the target text by using a preset model, and determining whether phrases in the target text contain selling point words, wherein the selling point words are words for describing characteristics of the article;
and if the phrase contains a selling point word, taking the phrase as the target selling point phrase.
In yet another possible design of the first aspect, before the extracting candidate selling points from the target text according to the preset Chinese language model, the method further includes:
acquiring preset selling point knowledge base data, wherein the preset selling point knowledge base data comprise selling point words;
and training according to preset selling point knowledge base data to obtain the preset Chinese language model.
In another possible design of the first aspect, the training to obtain the preset Chinese language model according to preset selling point knowledge base data includes:
performing word segmentation, stop word processing and symbol filtering on the preset selling point knowledge base data to obtain training data;
and training to obtain the preset Chinese language model according to the training data.
In yet another possible design of the first aspect, the constructing a candidate set of selling points according to the candidate selling points includes:
acquiring a labeling selling point of the target text, wherein the labeling selling point is used for describing a second characteristic of an article, and the first characteristic is different from the second characteristic;
and constructing a selling point candidate set according to the marked selling points and the candidate selling points.
In a second aspect, an embodiment of the present application provides a text generation method, including:
acquiring a text to be extracted, and extracting a selling point phrase from the text to be extracted by using a preset text generation model, wherein the preset text generation model is trained using a selling point candidate set and target selling point phrases;
calculating the similarity between the selling point phrases;
and merging all the selling point phrases according to the similarity to obtain a target phrase.
In a possible design of the second aspect, the extracting, by using the preset text generation model, a selling point phrase from the text to be extracted includes:
the text to be extracted is divided into sentences to obtain phrases to be extracted;
determining whether the phrases to be extracted contain selling point words, wherein the selling point words are words for describing characteristics of the articles;
if the phrases to be extracted contain selling point words, extracting the selling point phrases from the phrases to be extracted by using the preset text generation model;
and if the phrases to be extracted do not contain the selling point words, screening out the phrases to be extracted.
In another possible design of the second aspect, if a phrase to be extracted includes a selling point word, extracting, by using the preset text generation model, the selling point phrase from the phrase to be extracted includes:
if the phrases to be extracted contain selling point words, performing word segmentation, stop word processing and symbol filtering on the phrases to be extracted to obtain input texts;
and inputting the input text into the preset text generation model, and extracting the selling point phrase from the input text by using the preset text generation model.
In a third aspect, an embodiment of the present application provides a training apparatus for a text generation model, including:
the extraction module is used for extracting candidate selling points from the target text according to a preset Chinese language model, the preset Chinese language model is obtained by utilizing data training of a preset selling point knowledge base, and the candidate selling points are used for describing first characteristics of the article;
the building module is used for building a selling point candidate set according to the candidate selling points;
the selecting module is used for selecting target selling point phrases from the target texts;
and the training module is used for training a preset text generation model according to the selling point candidate set and the target selling point phrase.
In a fourth aspect, an embodiment of the present application provides a text generation apparatus, including:
the acquisition module is used for acquiring a text to be extracted and extracting a selling point phrase from the text to be extracted by using a preset text generation model, wherein the preset text generation model is trained using a selling point candidate set and target selling point phrases;
the calculating module is used for calculating the similarity between the selling point phrases;
and the merging module is used for merging all the selling point phrases according to the similarity to obtain the target phrase.
In a fifth aspect, embodiments of the present application provide a computer device, including a memory and at least one processor;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform the method as described above.
In a sixth aspect, the present application provides a computer-readable storage medium, in which computer instructions are stored, and when executed by a processor, the computer instructions are used to implement the method described above.
According to the training method and device for the text generation model provided by the application, phrase extraction is first performed on the target text with the preset Chinese language model to obtain the selling point candidate set; the selling points in the candidate set and the target selling point phrases containing them in the target text form training data pairs, with which the preset text generation model is trained. A large amount of training data is not needed, the trained preset text generation model does not hallucinate, and a better output effect is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application;
fig. 1 is a schematic view of a display interface of a terminal device according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a first embodiment of a training method for a text generation model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a network structure of a preset text generation model;
fig. 4 is a schematic flowchart of a second embodiment of a training method for a text generation model according to the present application;
fig. 5 is a schematic flowchart of a text generation method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training apparatus for generating a model from a text according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a text generation apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms referred to in this application are explained first:
N-gram:
An N-gram Chinese language model is an algorithm based on statistical language modeling; it can score the phrases in a data source to obtain a probability value for each phrase.
LaserTagger model:
The LaserTagger model is a text generation model that infers a sequence of editing operations, mainly including deletion, retention, and addition operations, which convert a source text into a target text.
Fig. 1 is a schematic view of a display interface of a terminal device according to an embodiment of the present application. As shown in fig. 1, on a terminal device with a display interface 10, such as a mobile phone, when a user browses articles, a picture of each article and its selling points can be shown on the display interface 10. The selling points of the first article in the display interface 10 are "high definition screen" and "web class business learning"; the second article has no selling point. A selling point is a textual description of the article; because of the limited size of the display interface, a selling point should be as brief and refined as possible while still fully expressing the characteristics or core attributes of the article, so that the user can understand the article more intuitively and accurately. On some online trading platforms, displaying the selling points of articles can improve their transaction rate to a certain extent.
In practice, different articles have different selling points, which are generally provided by the article's owner or a third party and must be extracted from a long article introduction text. Manual extraction would consume very large labor and time costs, and when there are many kinds of articles, the articles also need to be classified: for example, articles of the same color but different models are divided into two classes with different selling points, while articles of the same model but different colors are grouped into one class and can share the same selling point. At present, the prior art mainly offers the following ways to automatically extract an article's selling points from a long introduction text:
1. Selling point generation based on an N-gram language model. An N-gram language model is trained on the phrase combinations in an existing selling point knowledge base; the trained language model then scores phrases from other data sources to obtain the probability of each phrase being a selling point phrase, and low-quality commodity selling points are filtered out by setting a threshold to obtain the candidate phrases of the commodity.
2. Text generation based on sequence-to-sequence (seq2seq) generation models. A seq2seq model encodes the input sequence into an intermediate vector with a multilayer Long Short-Term Memory (LSTM) network and decodes the intermediate vector into the target sequence with another LSTM. Training a seq2seq model requires a large amount of training data, and the model suffers from slow inference, uncontrollable inference results, and heavy post-processing.
In summary, the prior art lacks a text extraction method that both avoids using a large training corpus and enables the trained model to extract high-quality target text from the source text.
To address this, the embodiments of the present application provide a training method for a text generation model, a text generation method, and a device. A Chinese language model is used to extract high-quality candidate selling points from the target text; a selling point candidate set is then constructed from the candidate selling points, and a preset LaserTagger model is trained with the selling point candidate set and the target selling point phrases. On the one hand, this reduces the difficulty of obtaining a training corpus; on the other hand, the trained LaserTagger model has a good output effect, achieving accurate extraction of selling point phrases, avoiding manual extraction, improving the extraction effect, and saving labor cost.
The technical solution of the present application will be described in detail below with reference to specific examples. It should be noted that the following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 2 is a schematic flowchart of a first embodiment of a training method for a text generation model according to an embodiment of the present application. The method can be applied to a processing device such as a computer. As shown in fig. 2, the method may include the steps of:
s201, extracting candidate selling points from the target text according to a preset Chinese language model.
The preset Chinese language model may be an N-gram language model, trained on the massive selling point phrases contained in the preset selling point knowledge base data. The candidate selling points are used to describe a first feature of an article, where the first feature may be a core attribute or characteristic of the article; for example, a core attribute of a notebook computer is its high definition screen.
Specifically, the target text includes promotional articles about an article, comments on the article, and the like. Taking a promotional article as an example, it may be a long text describing the details of the article. The candidate selling points extracted from the target text by the preset Chinese language model contain relatively little text; illustratively, a candidate selling point contains 3 to 7 characters, e.g., "web class business learning" contains 6 characters in the original Chinese.
In this embodiment, the target text is input into the preset Chinese language model, which may divide it into a plurality of phrases and score each phrase, outputting a probability score for each; phrases whose probability satisfies a set threshold are then selected as candidate selling points.
For example, the number of candidate selling points may be adjusted through the set threshold: a lower threshold yields more candidate selling points, and a higher threshold yields fewer.
S202, constructing a selling point candidate set according to the candidate selling points.
In this embodiment, the candidate set of selling points includes a plurality of candidate selling points, and the candidate set of selling points may be used as target data for training a preset text generation model.
Illustratively, labeled selling points can be added to the selling point candidate set. Because the preset Chinese language model is trained on the existing preset selling point knowledge base data, it may fail to extract new selling points that appear in the target text; such new selling points are labeled as labeled selling points and combined with the candidate selling points to construct the selling point candidate set.
S203, selecting the target selling point phrase from the target text.
In this embodiment, the selling point candidate set is used as target data, the target selling point phrase is used as source data, and a training data pair is formed and used for training the preset text generation model.
For example, a fastText model may be used to classify the text, determining whether each phrase in the target text contains a selling point word; if so, the phrase is selected as a target selling point phrase, as sketched below.
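As a non-limiting sketch of this selection step (Python; the fastText library calls are real, but the training-file name and label strings are hypothetical, not taken from the patent):

```python
import fasttext

# A binary classifier trained offline on phrases labeled as containing or not
# containing selling point words. "train_phrases.txt" and the "__label__..."
# strings are illustrative assumptions.
model = fasttext.train_supervised(input="train_phrases.txt")

def select_target_phrases(phrases):
    """Keep only phrases the classifier judges to contain a selling point word."""
    targets = []
    for phrase in phrases:
        labels, _ = model.predict(phrase)
        if labels[0] == "__label__selling_point":
            targets.append(phrase)
    return targets
```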
And S204, training a preset text generation model according to the selling point candidate set and the target selling point phrase.
Specifically, the target selling point phrase is used as input data and input into a preset text generation model, and the selling point candidate set is used as a true value and used for comparison and verification of an output value of the preset text generation model.
For example, the target selling point phrase may be "carries a powerful indoor unit with sufficient power, capable of rapid heating", and the corresponding candidate selling point in the selling point candidate set may be "powerful, rapid heating".
In this embodiment, the preset text generation model is a LaserTagger model. LaserTagger performs text generation by text editing: the generation task is converted into a tagging task, i.e., a sequence labeling model that deletes, retains, or adds characters. The core idea of the LaserTagger model is to label the character sequence of the input source text, assigning each character a tag. The tags are divided into base tags, addition tags, and swap tags: a base tag deletes or retains a character; an addition tag inserts a character that does not exist in the source text, such as a punctuation mark or a special token (e.g., <s>, </s>); and a swap tag swaps characters or sentences of the source text.
Illustratively, fig. 3 is a schematic diagram of the network structure of the preset text generation model. The LaserTagger model is divided into two parts. One part is the encoder, which adopts a BERT-base architecture as the embedding layer 301 and includes 12 self-attention layers 302 to construct the hidden vectors of the input. The other part is the decoder, which uses a single-layer Transformer 311 on top of BERT; when predicting the tag of the n-th character, the embeddings of the first n-1 generated words are added.
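A minimal PyTorch sketch of this encoder-decoder split, assuming Hugging Face's bert-base checkpoint as the encoder and a single torch.nn.TransformerDecoderLayer as the decoder (the tag vocabulary size and head count are illustrative assumptions, not values from the patent):

```python
import torch.nn as nn
from transformers import BertModel

class TaggerSketch(nn.Module):
    def __init__(self, num_tags=32):  # tag vocabulary size is an assumption
        super().__init__()
        # Encoder: BERT-base builds the hidden vectors of the input characters.
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        hidden = self.encoder.config.hidden_size  # 768 for BERT-base
        # Decoder: a single Transformer layer on top of BERT, as in fig. 3.
        self.decoder = nn.TransformerDecoderLayer(d_model=hidden, nhead=8,
                                                  batch_first=True)
        self.tag_head = nn.Linear(hidden, num_tags)

    def forward(self, input_ids, attention_mask, prev_embeds):
        # prev_embeds stands in for the embeddings of the first n-1 generated
        # words that are added when predicting the tag of the n-th character.
        memory = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        out = self.decoder(tgt=prev_embeds, memory=memory)
        return self.tag_head(out)  # per-position tag logits
```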
According to the embodiment of the application, the N-gram model is used to extract high-quality candidate selling points from the target text, and the candidate selling points and the target selling point phrases are used to construct training data pairs for training the preset LaserTagger model. This reduces the difficulty of obtaining a training corpus, since a large amount of training data is not needed; meanwhile, the trained LaserTagger model is not prone to hallucination when generating the target text, predicts quickly, achieves a good output effect, and improves text generation efficiency.
For example, on the basis of the above embodiments, in some embodiments, the step S201 may be specifically implemented by the following steps:
sentence dividing is carried out on the target text to obtain candidate phrases;
scoring the candidate phrases by using a preset Chinese language model to obtain scores of the candidate phrases;
and acquiring candidate selling points from the candidate phrases according to the scores of the candidate phrases and a preset score threshold value.
Specifically, the target text is a long text composed of a plurality of sentences. The target text can be split into sentences according to the punctuation of each sentence, and the resulting sentences are edited to finally obtain a plurality of candidate phrases.
Illustratively, the preset Chinese language model is an N-gram model. Taking "smart voice control" as a candidate phrase, its score is calculated by the following formula:
P("smart voice control") = P(smart | <s>) × P(voice | smart) × P(control | voice) × P(</s> | control)
In the above formula, P("smart voice control") represents the score of the candidate phrase "smart voice control"; P(smart | <s>) represents the number of candidate phrases beginning with the word "smart" divided by the total number of candidate phrases; P(voice | smart) represents the number of co-occurrences of the two words "smart voice" divided by the number of occurrences of "smart", and so on; and P(</s> | control) represents the number of candidate phrases ending with the word "control" divided by the total number of candidate phrases.
Optionally, the preset score threshold may be set according to actual conditions, and the higher the score of each candidate phrase is, the higher the probability that the candidate phrase is a candidate selling point is.
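The scoring and threshold filtering can be sketched as follows (Python; the n-gram counts, total, and threshold are all made-up illustrative numbers, with <s>/</s> marking phrase start and end as in the formula above):

```python
from collections import Counter

# Hypothetical counts gathered from the selling point knowledge base data.
unigram_counts = Counter({"smart": 40, "voice": 25, "control": 30})
bigram_counts = Counter({("<s>", "smart"): 12, ("smart", "voice"): 8,
                         ("voice", "control"): 9, ("control", "</s>"): 11})
total_phrases = 100  # total number of candidate phrases (illustrative)

def phrase_score(words):
    """P(w1..wn) = P(w1|<s>) * P(w2|w1) * ... * P(</s>|wn)."""
    score = bigram_counts[("<s>", words[0])] / total_phrases
    for prev, cur in zip(words, words[1:]):
        score *= bigram_counts[(prev, cur)] / max(unigram_counts[prev], 1)
    score *= bigram_counts[(words[-1], "</s>")] / total_phrases
    return score

# Phrases whose score exceeds the preset score threshold become candidates.
candidates = [p for p in [["smart", "voice", "control"], ["three", "day", "ship"]]
              if phrase_score(p) > 1e-4]  # threshold value is illustrative
```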
According to the embodiment of the application, the candidate phrases in the target text are scored by using the N-gram model, the candidate selling points with higher scores are selected from the candidate phrases according to the scores of the candidate phrases and used as high-quality training data, a large amount of training data is not needed, and the difficulty in acquiring the training data is reduced.
Further, on the basis of the above embodiments, in some embodiments, the step "scoring the candidate phrases by using a preset Chinese language model to obtain scores of the candidate phrases" may specifically be implemented by the following steps:
constructing unigrams, bigrams, and trigrams of the candidate phrases;
calculating the scores of the unigrams, bigrams, and trigrams by using the preset Chinese language model;
and determining the scores of the candidate phrases according to the scores of the unigrams, bigrams, and trigrams.
In this embodiment, taking "smart voice control" as the candidate phrase, the constructed unigrams are "smart", "voice", and "control"; the bigrams are "smart voice" and "voice control"; and the trigram is "smart voice control". The scores of these unigrams, bigrams, and trigrams are calculated and combined to finally obtain the score of the candidate phrase.
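Constructing the n-grams themselves is mechanical; a minimal sketch over a word-segmented phrase follows (the combination weights in the final comment are an assumption, since the patent does not specify how the three scores are merged):

```python
def build_ngrams(words, n):
    """Return all n-grams (as tuples) over a word-segmented candidate phrase."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = ["smart", "voice", "control"]  # word-segmented candidate phrase
unigrams = build_ngrams(words, 1)      # [('smart',), ('voice',), ('control',)]
bigrams = build_ngrams(words, 2)       # [('smart', 'voice'), ('voice', 'control')]
trigrams = build_ngrams(words, 3)      # [('smart', 'voice', 'control')]

# The patent does not specify how the three scores are combined; a weighted
# average such as 0.2*s_uni + 0.3*s_bi + 0.5*s_tri is one plausible choice.
```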
According to the embodiment of the application, determining the score of a candidate phrase from the scores of its unigrams, bigrams, and trigrams improves the accuracy of the score, so candidate phrases of higher quality can be obtained.
Fig. 4 is a schematic flow diagram of a second embodiment of a training method for a text generation model provided in the embodiment of the present application, and as shown in fig. 4, the method specifically includes the following steps:
s401, extracting candidate selling points from the target text according to a preset Chinese language model.
S402, constructing a selling point candidate set according to the candidate selling points.
And S403, selecting the target selling point phrase from the target text.
S404, performing text classification on the target text by using a preset model, and determining whether phrases in the target text contain the selling point words.
S405, if the phrase contains the selling point words, the phrase is used as the target selling point phrase.
S406, training a preset text generation model according to the selling point candidate set and the target selling point phrase.
For the above steps S401 to S403 and step S406, refer to the explanation of steps S201 to S204 in the above embodiment, which is not described herein again, and this embodiment mainly describes steps S404 to S405.
Wherein, the selling point words are words for describing the characteristics of the articles.
In this embodiment, when the preset text generation model is trained, the training data set serves as the given input text, and the preset text generation model produces its generation result directly from that input text; since the generation result is drawn from the input text, if the input text contains no selling point word, the generation result will not contain one either.
For example, a fast text classification model (e.g., the fastText model) may be used to binary-classify the phrases in the target text and determine whether each phrase contains a selling point word.
According to the embodiment of the application, phrases in the target text are judged, and the phrases which do not contain the selling point words are screened out of the training data set, so that the quality of the training data set is better, and the subsequent training effect on the preset text generation model is improved.
Optionally, on the basis of the foregoing embodiment, in some embodiments, before the foregoing step S201, the following step may be further included:
and acquiring preset selling point knowledge base data.
And training according to the preset selling point knowledge base data to obtain a preset Chinese language model.
Wherein, the preset selling point knowledge base data comprises selling point words. Illustratively, the preset selling point knowledge database stores a large amount of existing selling point words.
In this embodiment, the predetermined Chinese language model is an N-gram language model.
According to the embodiment of the application, the N-gram model is trained by using the existing preset selling point knowledge base data, and the training data are derived from the existing preset selling point knowledge base data, so that the acquisition difficulty of the training data can be reduced, the selling point words do not need to be manually extracted to serve as the training data, and the labor cost is reduced.
Further, on the basis of the above embodiments, in some embodiments, the step "training to obtain the preset Chinese language model according to the preset selling point knowledge base data" may be specifically implemented by the following steps:
performing word segmentation, stop word processing and symbol filtering on preset selling point knowledge base data to obtain training data;
and training according to the training data to obtain a preset Chinese language model.
In this embodiment, the preset selling point knowledge base data is used as the original database; its selling point words are obtained by extraction from the detailed introduction texts of various articles.
Illustratively, word segmentation means dividing a phrase into a plurality of words; for example, after segmenting the phrase "I am a student", the obtained words are "I", "am", "a", "student".
Stop word processing means filtering out common, widely used words with little meaning from the preset selling point knowledge base; illustratively, stop words include modal particles, adverbs, prepositions, conjunctions, and the like, such as "at", "and", "yes".
Symbol filtering refers to filtering some punctuation marks (e.g., commas, periods, etc.) in a preset selling point database.
In this embodiment, word segmentation, stop word processing, and symbol filtering are performed on the preset selling point knowledge base data, and the processed data are used as training data to train the N-gram model, which improves the training effect of the model, as sketched below.
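A hedged sketch of this preprocessing (the patent does not name a segmenter; jieba, the stop-word list, the symbol set, and the file name below are all illustrative assumptions):

```python
import re
import jieba  # assumed Chinese word segmenter; not specified by the patent

STOP_WORDS = {"的", "了", "和", "是", "在"}  # illustrative stop-word list

def preprocess(phrase):
    """Word segmentation, stop word processing, and symbol filtering."""
    phrase = re.sub(r"[，。！？、,.!?;:]", " ", phrase)  # symbol filtering
    words = jieba.lcut(phrase)                          # word segmentation
    return [w for w in words if w.strip() and w not in STOP_WORDS]

# "selling_points.txt" is a hypothetical dump of the knowledge base data.
with open("selling_points.txt", encoding="utf-8") as f:
    training_data = [preprocess(line) for line in f]
```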
For example, on the basis of the above embodiments, in some embodiments, the step S202 may be specifically implemented by the following steps:
and acquiring a marked selling point of the target text.
And constructing a selling point candidate set according to the marked selling points and the candidate selling points.
Wherein the label selling point is used for describing a second characteristic of the article, and the first characteristic is different from the second characteristic.
In this embodiment, the labeled selling points may be phrases different from the candidate selling points. After the N-gram model extracts the candidate selling point phrases from the target text, the target text may additionally be labeled, e.g., manually; selling points that the N-gram model cannot extract are marked and used as labeled selling points.
According to the embodiment of the application, constructing the selling point candidate set from the labeled selling points together with the candidate selling points reduces the cost of manually labeling all selling points of the target text and improves the efficiency of obtaining the selling point candidate set.
Fig. 5 is a schematic flowchart of a text generation method provided in an embodiment of the present application, where the method may be applied to a terminal device such as a computer, as shown in fig. 5, the method includes the following steps:
s501, obtaining a text to be extracted, and extracting a selling point phrase from the text to be extracted by utilizing a preset text generation model.
The preset text generation model is trained using a selling point candidate set and target selling point phrases.
In this embodiment, the preset text generation model may be the above-mentioned LaserTagger model; the text to be extracted is input into the model and edited to obtain a plurality of selling point phrases, such as the two selling point phrases "web class business learning" and "high definition screen".
As an example, taking "the air conditioner can dehumidify independently and expel damp air, so there is no worry about indoor dampness in the rainy season" as the text to be extracted, the trained LaserTagger model extracts "independent dehumidification" as the selling point phrase; taking "carries a powerful indoor unit with sufficient power for rapid heating" as the text to be extracted, the trained LaserTagger model extracts "rapid heating" as the selling point phrase.
For example, take the English input "Dylan won Nobel prize, Dylan is an American musician" as an example. The LaserTagger model performs the corresponding editing operations on the input text and finally outputs a text result. The specific editing operations are shown in the following table:
[Table: the Source row lists the input text token by token, and the Tags row gives the tag assigned to each token, where DELETE deletes the token, KEEP retains it, and SWAP swaps tokens into the indicated order.]
Finally, the output text result is "Dylan, an American musician, won Nobel prize".
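A sketch of how such a tag sequence could be realized (the tag strings and the example sequence below are illustrative; the SWAP reordering from the table is omitted, and added phrases are encoded as KEEP|<phrase>, inserted before the tagged token):

```python
def apply_tags(tokens, tags):
    """Realize KEEP/DELETE tags; a tag of the form 'KEEP|<phrase>' inserts
    <phrase> before the kept token (an assumed encoding for addition tags)."""
    out = []
    for token, tag in zip(tokens, tags):
        base, _, added = tag.partition("|")
        if added:
            out.append(added)
        if base == "KEEP":
            out.append(token)
    return " ".join(out)

tokens = ["Dylan", "won", "Nobel", "prize", ",", "Dylan", "is", "an",
          "American", "musician", "."]
tags = ["KEEP", "DELETE", "DELETE", "DELETE", "DELETE", "DELETE", "DELETE",
        "KEEP|,", "KEEP", "KEEP", "KEEP|, won Nobel prize"]
print(apply_tags(tokens, tags))
# -> "Dylan , an American musician , won Nobel prize ." (illustrative tags)
```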
S502, calculating the similarity between the selling point phrases.
In this embodiment, the similarity between selling points may be calculated by the following formula:
S = (S1 + S2 + S3 + S4) / 4
In the above formula, S represents the similarity. S1 represents the Levenshtein ratio, which is based on the Levenshtein distance, i.e., the minimum number of editing operations required to convert one character string into the other, where the editing operations include replacing one character with another, inserting a character, and deleting a character. S2 denotes the Jaro-Winkler similarity, S3 denotes the longest-common-substring score, and S4 denotes the edit-distance score.
The calculation formula of S1 is as follows:
S1 = (sum - ldist) / sum
In the above formula, sum is the total length of the two character strings a and b, and ldist is the weighted edit distance, in which deleting a character costs 1, inserting a character costs 1, and replacing one character with another costs 2.
The calculation formula of S2 is as follows:
dj = (1/3) × (m/|s1| + m/|s2| + (m - t)/m)
S2 = dj + l × p × (1 - dj)
In the above formulas, s1 and s2 are the two character strings to be compared, m is the number of matching characters between s1 and s2, t is the number of transpositions, dj represents the Jaro score, p is a factor that rewards a matching prefix, and l is the length of the matching prefix.
The transposition number t is determined according to the matching window: two characters are considered matched when the distance between their positions is smaller than the matching window, and a transposition is counted when their positions differ. The matching window is calculated by the following formula:
MW = max(|s1|, |s2|) / 2 - 1
Illustratively, for the pair of strings AECFR and AMECFDR, MW = 2.5 and m = 5; the matched characters A-E-C-F-R appear in the same order in both strings, so no transposition is needed and t = 0.
S3 can be calculated by the following formula:
S3 = L2 / avg(|s1|, |s2|)
In the above formula, s1 and s2 are the two character strings to be compared, and L2 represents the length of their longest common substring.
S4 can be calculated by the following formula: S4 = (sum - ldist) / sum, where ldist here is the plain (unweighted) edit distance.
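The four scores can be computed with a short self-contained sketch (pure-Python dynamic programming for the distances; jellyfish is an assumed third-party dependency for Jaro-Winkler, and the example strings are illustrative):

```python
def levenshtein(a, b, sub_cost=1):
    """Edit distance by dynamic programming. sub_cost=2 gives the weighted
    distance of the Levenshtein ratio (S1); sub_cost=1 gives the plain
    edit distance (S4)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def longest_common_substring(a, b):
    best = 0
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
                best = max(best, d[i][j])
    return best

def similarity(a, b, jaro_winkler):
    total = len(a) + len(b)
    s1 = (total - levenshtein(a, b, sub_cost=2)) / total  # Levenshtein ratio
    s2 = jaro_winkler(a, b)                               # Jaro-Winkler
    s3 = longest_common_substring(a, b) / (total / 2)     # LCS / avg length
    s4 = (total - levenshtein(a, b, sub_cost=1)) / total  # edit distance
    return (s1 + s2 + s3 + s4) / 4

import jellyfish  # assumed dependency; function name may vary by version
s = similarity("rapid heating", "quick heating",
               jellyfish.jaro_winkler_similarity)
```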
And S503, merging the phrases of the selling points according to the similarity to obtain a target phrase.
Specifically, selling point phrases with high mutual similarity are merged into one phrase; after all highly similar selling point phrases have been merged, the target phrases, i.e., the selling point phrases describing the characteristics of the article, are finally obtained, as sketched below.
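One plausible reading of the merge rule, sketched with a stand-in similarity function (the patent does not fix the grouping strategy, the threshold, or the choice of representative; all three are assumptions here):

```python
from difflib import SequenceMatcher

def merge_selling_points(phrases, similarity, threshold=0.8):
    """Greedily group phrases whose similarity to a group's first member
    exceeds the threshold, then keep the shortest member of each group as
    its representative selling point phrase."""
    groups = []
    for phrase in phrases:
        for group in groups:
            if similarity(phrase, group[0]) >= threshold:
                group.append(phrase)
                break
        else:
            groups.append([phrase])
    return [min(group, key=len) for group in groups]

# difflib's ratio is used here only as a stand-in for the combined score S.
target_phrases = merge_selling_points(
    ["rapid heating", "rapid heating power", "independent dehumidification"],
    lambda a, b: SequenceMatcher(None, a, b).ratio())
```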
According to the embodiment of the application, the target text is generated with the LaserTagger model; since the generated result comes from the text to be extracted, the method is highly controllable, the result does not hallucinate, the prediction speed is fast, the adaptability is strong, and the output effect is good.
For example, on the basis of the above embodiments, in some embodiments, the step S501 may be specifically implemented by the following steps:
the method comprises the steps of performing sentence division on a text to be extracted to obtain a phrase to be extracted;
determining whether the phrases to be extracted contain selling point words or not;
if the phrases to be extracted contain the selling point words, extracting the selling point phrases from the phrases to be extracted by utilizing a preset text generation model;
and if the phrases to be extracted do not contain the selling point words, screening out the phrases to be extracted.
Wherein, the selling point words are words for describing the characteristics of the article.
In this embodiment, the text to be extracted may be divided into sentences according to punctuations of the text to be extracted, so as to obtain a plurality of phrases to be extracted.
Exemplarily, a fastText model can be used for text classification to judge whether a phrase to be extracted contains a selling point word; if so, the phrase is input into the preset text generation model, and otherwise it is not.
In the embodiment of the application, splitting the text into phrases to be extracted and determining whether each contains a selling point word improves the output effect of the preset text generation model and avoids generating a target text that contains no selling point word.
Further, on the basis of the above embodiments, in some embodiments, the step "if a phrase to be extracted includes a selling point word, extracting the selling point phrase from the phrase to be extracted by using a preset text generation model" may specifically be implemented by the following steps:
if the phrases to be extracted contain the selling point words, performing word segmentation, stop word processing and symbol filtering on the phrases to be extracted to obtain input texts;
and inputting the input text into a preset text generation model, and extracting the selling point phrase from the input text by using the preset text generation model.
In this embodiment, word segmentation means dividing a phrase into a plurality of words; for example, after segmenting the phrase "I am a student", the obtained words are "I", "am", "a", "student". Stop word processing refers to filtering out common words with little meaning, such as modal particles, adverbs, prepositions, and conjunctions. Symbol filtering refers to filtering out punctuation marks (e.g., commas, periods).
The input text is obtained by performing word segmentation, stop word processing and symbol filtering on the phrases to be extracted, so that the preset text generation model can output a better target text, and the generation quality of the selling point phrases is improved.
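Tying these inference steps together, a hedged end-to-end sketch (the classifier, generator, and preprocessor are passed in as callables, since the patent fixes no concrete implementations at this point; the toy stand-ins are for demonstration only):

```python
import re

def generate_selling_points(text, contains_selling_word, generate, preprocess):
    """Sentence-split the text to be extracted, screen out phrases without
    selling point words, preprocess the rest, and run the generation model."""
    phrases = [p.strip() for p in re.split(r"[，。！？,.!?]", text) if p.strip()]
    return [generate(preprocess(p)) for p in phrases if contains_selling_word(p)]

points = generate_selling_points(
    "Carries a powerful indoor unit. Rapid heating! Ships in three days.",
    contains_selling_word=lambda p: "heating" in p or "powerful" in p,
    generate=lambda words: " ".join(words),  # stand-in for LaserTagger
    preprocess=lambda p: p.lower().split())  # stand-in for preprocessing
```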
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 6 is a schematic structural diagram of a training apparatus for a text-generating model according to an embodiment of the present application, which may be integrated on a computer or independent from the computer, and cooperates with the computer to perform the above method, as shown in fig. 6, the training apparatus 60 for a text-generating model includes an extracting module 61, a constructing module 62, a selecting module 63, and a training module 64.
The extraction module 61 is configured to extract candidate selling points from the target text according to a preset chinese language model. The construction module 62 is configured to construct a candidate set of selling points according to the candidate selling points. The selection module 63 is used to select a target selling point phrase from the target text. The training module 64 is configured to train the preset text generation model according to the candidate set of selling points and the target selling point phrase.
The preset Chinese language model is obtained by utilizing preset selling point knowledge base data training, and the candidate selling points are used for describing first characteristics of the articles.
In some embodiments, the extracting module 61 is specifically configured to:
sentence dividing is carried out on the target text to obtain candidate phrases;
scoring the candidate phrases by using a preset Chinese language model to obtain scores of the candidate phrases;
and acquiring candidate selling points from the candidate phrases according to the scores of the candidate phrases and a preset score threshold value.
Optionally, in some embodiments, the extracting module 61 is specifically configured to:
constructing unary participles, binary participles and ternary participles of the candidate phrases;
calculating scores of unary participles, binary participles and ternary participles by using a preset Chinese language model;
and determining the scores of the candidate phrases according to the scores of the unary participles, the binary participles and the ternary participles.
In some embodiments, the training apparatus for generating a text model further includes a filtering module, configured to:
performing text classification on the target text by using a preset model, and determining whether phrases in the target text contain selling point words or not;
and if the phrase contains the selling point words, taking the phrase as the target selling point phrase.
Wherein, the selling point words are words for describing the characteristics of the articles.
In some embodiments, the training device 60 for generating a text model further includes: a data acquisition module to:
acquiring preset selling point knowledge base data, wherein the preset selling point knowledge base data comprise selling point words;
and training according to the preset selling point knowledge base data to obtain a preset Chinese language model.
In some embodiments, the data obtaining module is specifically configured to:
performing word segmentation, stop word processing and symbol filtering on preset selling point knowledge base data to obtain training data;
and training according to the training data to obtain a preset Chinese language model.
In some embodiments, the building module 62 is specifically configured to:
acquiring a marked selling point of a target text;
and constructing a selling point candidate set according to the marked selling points and the candidate selling points.
Wherein the label selling point is used for describing a second characteristic of the article, and the first characteristic is different from the second characteristic.
The device provided by the embodiment of the application can be used for executing the training method of the text generation model in the above embodiment, and the implementation principle and the technical effect are similar, and are not described again here.
Fig. 7 is a schematic structural diagram of a text generating apparatus according to an embodiment of the present application, and as shown in fig. 7, the text generating apparatus 70 includes: an acquisition module 71, a calculation module 72 and a merging module 73.
The obtaining module 71 is configured to obtain a text to be extracted, and extract a selling point phrase from the text to be extracted by using a preset text generation model. The calculating module 72 is used for calculating the similarity between the selling point phrases. The merging module 73 is configured to merge the selling point phrases according to the similarity to obtain a target phrase.
In some embodiments, the obtaining module 71 is specifically configured to:
the method comprises the steps of performing sentence division on a text to be extracted to obtain a phrase to be extracted;
determining whether the phrases to be extracted contain selling point words or not;
if the phrases to be extracted contain the selling point words, extracting the selling point phrases from the phrases to be extracted by utilizing a preset text generation model;
and if the phrases to be extracted do not contain the selling point words, screening out the phrases to be extracted.
Wherein, the selling point words are words for describing the characteristics of the article.
Optionally, in some embodiments, the obtaining module 71 is specifically configured to:
if the phrases to be extracted contain the selling point words, performing word segmentation, stop word processing and symbol filtering on the phrases to be extracted to obtain input texts;
the input text is input into a preset text generation model, and the selling point phrase is extracted from the input text by using the preset text generation model.
The apparatus provided in the embodiment of the present application may be configured to execute the text generation method in the foregoing embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.
It should be noted that the division of the above apparatus into modules is only a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity or physically separated. These modules may all be implemented as software invoked by a processing element, all be implemented as hardware, or partly as software invoked by a processing element and partly as hardware. For example, the determining module may be a separately arranged processing element, may be integrated into a chip of the apparatus, or may be stored in the memory of the apparatus in the form of program code that a processing element of the apparatus calls to execute the module's function. The other modules are implemented similarly. In addition, all or part of the modules may be integrated together or implemented independently. The processing element here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
Fig. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application. As shown in fig. 8, the computer device includes a memory 81 and at least one processor 82; the memory 81 stores computer-executable instructions, and the at least one processor 82 executes the computer-executable instructions stored by the memory 81, causing the at least one processor 82 to perform the methods described above.
Illustratively, the computer device further comprises a bus 83, wherein the memory 81 is connected to the processor 82 via the bus 83.
For example, the memory 81 may also be used to store a preset chinese language model, a trained preset text generation model, and the like, and the memory 81 may be cloud storage or local storage.
Illustratively, the bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
Embodiments of the present application further provide a computer-readable storage medium, in which computer instructions are stored, and when executed by a processor, the computer instructions are used to implement the method as described above.
Embodiments of the present application also provide a computer program product comprising a computer program/instructions, which when executed by a processor, implement the method as above.
In the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship; in the formula, the character "/" indicates that the preceding and following related objects are in a relationship of "division". "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for convenience of description and distinction and are not intended to limit the scope of the embodiments of the present application. In the embodiment of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (15)

1. A training method for a text generation model, characterized by comprising the following steps:
extracting candidate selling points from a target text according to a preset Chinese language model, wherein the preset Chinese language model is obtained by training with data of a preset selling point knowledge base, and the candidate selling points are used for describing a first characteristic of an article;
constructing a selling point candidate set according to the candidate selling points;
selecting a target selling point phrase from the target text;
and training a preset text generation model according to the selling point candidate set and the target selling point phrase.
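For illustration only and not as part of the claims, the following minimal Python sketch shows one way the training flow of claim 1 could be wired together end to end. The sentence split on "。", the constant scorer standing in for the preset Chinese language model, the selling word list, and every function name are assumptions of this sketch.

    # Hypothetical end-to-end training flow (claim 1); names are illustrative.

    def extract_candidates(text, score_fn, threshold=0.5):
        # Split the target text into sentence-level phrases and keep those
        # the language model scores at or above a preset threshold.
        phrases = [p.strip() for p in text.split("。") if p.strip()]
        return [p for p in phrases if score_fn(p) >= threshold]

    def select_targets(text, selling_words):
        # A phrase containing a selling point word becomes a training target.
        phrases = [p.strip() for p in text.split("。") if p.strip()]
        return [p for p in phrases if any(w in p for w in selling_words)]

    def build_training_pairs(texts, score_fn, selling_words):
        pairs = []
        for text in texts:
            candidate_set = set(extract_candidates(text, score_fn))
            targets = select_targets(text, selling_words)
            if candidate_set and targets:
                pairs.append((candidate_set, targets))
        return pairs  # (selling point candidate set, target phrases) pairs

    # Toy usage: a constant scorer stands in for the preset Chinese language model.
    print(build_training_pairs(["轻薄机身。超长续航。"], lambda p: 1.0, ["续航"]))

The resulting (candidate set, target phrase) pairs would then be fed to whatever trainer the preset text generation model uses.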
2. The method of claim 1, wherein extracting the candidate selling points from the target text according to the preset Chinese language model comprises:
segmenting the target text into sentences to obtain candidate phrases;
scoring the candidate phrases by using the preset Chinese language model to obtain scores of the candidate phrases;
and acquiring candidate selling points from the candidate phrases according to the scores of the candidate phrases and a preset score threshold value.
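A minimal sketch of the scoring-and-threshold step of claim 2, for illustration only; the keyword-based toy scorer stands in for the preset Chinese language model, and the 0.5 threshold is an assumed value.

    def score_candidates(candidate_phrases, score_fn):
        # Score every candidate phrase with the language model's scorer.
        return {p: score_fn(p) for p in candidate_phrases}

    def pick_selling_points(scores, threshold):
        # Keep only candidates whose score reaches the preset threshold.
        return [p for p, s in scores.items() if s >= threshold]

    toy_score = lambda p: 0.9 if "续航" in p else 0.1  # stand-in scorer
    scores = score_candidates(["超长续航", "今天天气不错"], toy_score)
    print(pick_selling_points(scores, threshold=0.5))  # ['超长续航']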
3. The method of claim 2, wherein scoring the candidate phrases by using the preset Chinese language model to obtain the scores of the candidate phrases comprises:
constructing unigrams, bigrams and trigrams of the candidate phrases;
calculating scores of the unigrams, bigrams and trigrams by using the preset Chinese language model;
and determining the scores of the candidate phrases according to the scores of the unigrams, the bigrams and the trigrams.
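The n-gram scoring of claim 3 can be pictured as interpolating average per-n-gram log-probabilities for n = 1, 2, 3. The sketch below is illustrative only; the interpolation weights and the uniform stand-in for the language model's log-probability are assumptions, not part of the claimed method.

    import math

    def ngrams(tokens, n):
        # All contiguous n-grams of a token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def phrase_score(tokens, lm_logprob, weights=(0.2, 0.3, 0.5)):
        # Interpolate the average log-probability of the unigrams, bigrams
        # and trigrams to obtain the candidate phrase's overall score.
        score = 0.0
        for n, w in zip((1, 2, 3), weights):
            grams = ngrams(tokens, n)
            if grams:
                score += w * sum(lm_logprob(g) for g in grams) / len(grams)
        return score

    # Toy usage: a uniform log-probability stands in for the trained model.
    print(phrase_score(["轻薄", "机身", "设计"], lambda g: math.log(0.01)))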
4. The method of claim 1, wherein selecting a target selling point phrase from the target text comprises:
performing text classification on the target text by using a preset model to determine whether phrases in the target text contain selling point words, wherein a selling point word is a word describing a characteristic of an article;
and if the phrase contains a selling point word, taking the phrase as the target selling point phrase.
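For illustration only, the check of claim 4 reduces to testing a phrase against a selling point vocabulary; in the claimed method a preset classification model plays this role, so the plain keyword lookup below is a simplifying assumption.

    # Assumed selling point vocabulary; a trained binary text classifier
    # would normally stand where this lookup table stands.
    SELLING_WORDS = {"续航", "轻薄", "降噪"}

    def is_target_phrase(phrase, selling_words=SELLING_WORDS):
        # A phrase qualifies as a target selling point phrase if it
        # contains at least one selling point word.
        return any(word in phrase for word in selling_words)

    print([p for p in ["超长续航", "外观时尚"] if is_target_phrase(p)])  # ['超长续航']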
5. The method of claim 1, wherein before extracting the candidate selling points from the target text according to the preset Chinese language model, the method further comprises:
acquiring preset selling point knowledge base data, wherein the preset selling point knowledge base data comprises selling point words;
and obtaining the preset Chinese language model by training according to the preset selling point knowledge base data.
6. The method of claim 5, wherein obtaining the preset Chinese language model by training according to the preset selling point knowledge base data comprises:
performing word segmentation, stop word processing and symbol filtering on the preset selling point knowledge base data to obtain training data;
and obtaining the preset Chinese language model by training according to the training data.
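A minimal sketch of the preprocessing in claim 6, assuming the third-party jieba segmenter and an illustrative stop word list; neither choice is required by the claim, and the resulting token lists could be fed to any n-gram language model trainer.

    import re
    import jieba  # third-party Chinese word segmenter: pip install jieba

    STOP_WORDS = {"的", "了", "和"}  # illustrative stop word list

    def preprocess(sentence):
        # Symbol filtering: keep only word characters (incl. CJK ideographs).
        sentence = re.sub(r"[^\w]", "", sentence)
        # Word segmentation followed by stop word removal.
        return [t for t in jieba.lcut(sentence) if t and t not in STOP_WORDS]

    training_data = [preprocess(s) for s in ["超长的续航!", "轻薄机身,手感好。"]]
    print(training_data)  # token lists ready for language model training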
7. The method of claim 1, wherein constructing the selling point candidate set according to the candidate selling points comprises:
acquiring a labeled selling point of the target text, wherein the labeled selling point is used for describing a second characteristic of the article, and the first characteristic is different from the second characteristic;
and constructing the selling point candidate set according to the labeled selling point and the candidate selling points.
8. A text generation method, comprising:
acquiring a text to be extracted, and extracting selling point phrases from the text to be extracted by using a preset text generation model, wherein the preset text generation model is obtained by training with a selling point candidate set and a target selling point phrase;
calculating the similarity between the selling point phrases;
and merging the selling point phrases according to the similarity to obtain a target phrase.
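For illustration only, the similarity-and-merge step of claim 8 might be sketched as below; difflib's character-level ratio and the 0.7 threshold are assumptions standing in for whichever similarity measure an implementation adopts.

    from difflib import SequenceMatcher

    def similarity(a, b):
        # Character-level similarity ratio in [0, 1].
        return SequenceMatcher(None, a, b).ratio()

    def merge_phrases(phrases, threshold=0.7):
        merged = []
        for phrase in phrases:
            # Keep a phrase only if it is not too similar to one already
            # kept, so near-duplicate selling points collapse together.
            if all(similarity(phrase, kept) < threshold for kept in merged):
                merged.append(phrase)
        return merged

    print(merge_phrases(["超长续航", "超长的续航", "轻薄机身"]))
    # -> ['超长续航', '轻薄机身']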
9. The method according to claim 8, wherein extracting the selling point phrases from the text to be extracted by using the preset text generation model comprises:
segmenting the text to be extracted into sentences to obtain phrases to be extracted;
determining whether the phrases to be extracted contain selling point words, wherein the selling point words are words for describing characteristics of the articles;
if the phrases to be extracted contain selling point words, extracting the phrases to be extracted to obtain the selling point phrases by utilizing the preset text generation model;
and if the phrases to be extracted do not contain the selling point words, screening out the phrases to be extracted.
10. The method according to claim 9, wherein, if the phrases to be extracted contain selling point words, extracting the selling point phrases from the phrases to be extracted by using the preset text generation model comprises:
if the phrases to be extracted contain selling point words, performing word segmentation, stop word processing and symbol filtering on the phrases to be extracted to obtain an input text;
and inputting the input text into the preset text generation model, and extracting the selling point phrases from the input text by using the preset text generation model.
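Putting claims 9 and 10 together at inference time, a minimal sketch could look as follows; the identity function standing in for the preset text generation model is an assumption, and word segmentation and stop word processing are elided for brevity.

    def extract_selling_points(text, selling_words, model):
        phrases = [p.strip() for p in text.split("。") if p.strip()]
        results = []
        for phrase in phrases:
            # Claim 9: screen out phrases without any selling point word.
            if not any(w in phrase for w in selling_words):
                continue
            # Claim 10: symbol filtering before the model sees the phrase.
            cleaned = "".join(ch for ch in phrase if ch.isalnum())
            results.append(model(cleaned))
        return results

    # Toy usage: an identity function stands in for the trained model.
    print(extract_selling_points("超长续航。外观时尚。", {"续航"}, lambda s: s))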
11. An apparatus for training a text generation model, comprising:
the extraction module is used for extracting candidate selling points from a target text according to a preset Chinese language model, wherein the preset Chinese language model is obtained by training with data of a preset selling point knowledge base, and the candidate selling points are used for describing a first characteristic of an article;
the building module is used for building a selling point candidate set according to the candidate selling points;
the selecting module is used for selecting a target selling point phrase from the target text;
and the training module is used for training a preset text generation model according to the selling point candidate set and the target selling point phrase.
12. A text generation apparatus, comprising:
the acquisition module is used for acquiring a text to be extracted and extracting selling point phrases from the text to be extracted by using a preset text generation model, wherein the preset text generation model is obtained by training with a selling point candidate set and a target selling point phrase;
the calculating module is used for calculating the similarity between the selling point phrases;
and the merging module is used for merging all the selling point phrases according to the similarity to obtain the target phrase.
13. A computer device comprising a memory and at least one processor;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the method of any one of claims 1-10.
14. A computer-readable storage medium having computer instructions stored therein, wherein the computer instructions, when executed by a processor, implement the method of any one of claims 1-10.
15. A computer program product comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the method of any one of claims 1-10.
CN202110506426.4A 2021-05-10 2021-05-10 Training method of text generation model, and text generation method and device Pending CN113240485A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110506426.4A CN113240485A (en) 2021-05-10 2021-05-10 Training method of text generation model, and text generation method and device


Publications (1)

Publication Number Publication Date
CN113240485A true CN113240485A (en) 2021-08-10

Family

ID=77132954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110506426.4A Pending CN113240485A (en) 2021-05-10 2021-05-10 Training method of text generation model, and text generation method and device

Country Status (1)

Country Link
CN (1) CN113240485A (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
WO2019149200A1 (en) * 2018-02-01 2019-08-08 腾讯科技(深圳)有限公司 Text classification method, computer device, and storage medium
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN109325226A (en) * 2018-09-10 2019-02-12 广州杰赛科技股份有限公司 Term extraction method, apparatus and storage medium based on deep learning network
CN110347799A (en) * 2019-07-12 2019-10-18 腾讯科技(深圳)有限公司 Language model training method, device and computer equipment
CN111738791A (en) * 2020-01-20 2020-10-02 北京沃东天骏信息技术有限公司 Text processing method, device, equipment and storage medium
CN111831821A (en) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 Training sample generation method and device of text classification model and electronic equipment
CN112364624A (en) * 2020-11-04 2021-02-12 重庆邮电大学 Keyword extraction method based on deep learning language model fusion semantic features
CN112434151A (en) * 2020-11-26 2021-03-02 重庆知识产权大数据研究院有限公司 Patent recommendation method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU, Yong; LIU, Jingping; XIAO, Yanghua; ZHU, Muhua: "Phrase mining in the e-commerce domain based on co-training", Computer Engineering, No. 04 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879450A (en) * 2023-01-06 2023-03-31 广东爱因智能科技有限公司 Step-by-step text generation method, system, computer equipment and storage medium
CN115879450B (en) * 2023-01-06 2023-09-01 广东爱因智能科技有限公司 Gradual text generation method, system, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN108287858B (en) Semantic extraction method and device for natural language
Jin et al. A novel lexicalized HMM-based learning framework for web opinion mining
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
CN110263325A (en) Chinese automatic word-cut
CN114818891B (en) Small sample multi-label text classification model training method and text classification method
CN111444372A (en) System and method for image processing
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
JP3831357B2 (en) Parallel translation information creation device and parallel translation information search device
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
CN113033204A (en) Information entity extraction method and device, electronic equipment and storage medium
CN111859950A (en) Method for automatically generating lecture notes
CN101271448A (en) Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus
CN113240485A (en) Training method of text generation model, and text generation method and device
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN117290515A (en) Training method of text annotation model, method and device for generating text graph
CN110874408B (en) Model training method, text recognition device and computing equipment
CN107168950B (en) Event phrase learning method and device based on bilingual semantic mapping
CN112990388B (en) Text clustering method based on concept words
CN114332476A (en) Method, device, electronic equipment, storage medium and product for identifying dimensional language
JP2006072787A (en) Automatic question answering method, model production method therefor and computer program
CN112257410A (en) Similarity calculation method for unbalanced text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination