WO2023093372A1 - Method and apparatus for generating text - Google Patents

Method and apparatus for generating text

Info

Publication number
WO2023093372A1
WO2023093372A1 PCT/CN2022/125780 CN2022125780W WO2023093372A1
Authority
WO
WIPO (PCT)
Prior art keywords
data items
data
text
data item
items
Prior art date
Application number
PCT/CN2022/125780
Other languages
English (en)
French (fr)
Inventor
陈家泽
李云昊
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023093372A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

According to embodiments of the present disclosure, a method, apparatus, device, storage medium, and program product for generating text are provided. The method described herein includes: acquiring data items in a data table, each data item representing the value of an entity for a corresponding attribute; determining, based on the data items, labels corresponding to the data items according to a label classification model, each label indicating the importance of the corresponding data item; and selecting, based on the labels, a first group of data items from the data items for generating a piece of text, the text describing an event associated with the data table. According to embodiments of the present disclosure, selecting value-related data items of low importance and omitting data items of high importance can be avoided.

Description

METHOD AND APPARATUS FOR GENERATING TEXT
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to Chinese invention patent application No. 202111406999.6, entitled "Method and apparatus for generating text" and filed on November 24, 2021, which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
Implementations of the present disclosure relate to the field of computers, and more specifically, to a method, apparatus, device, and computer storage medium for generating text.
BACKGROUND
Automatic text generation is an important research direction in the field of natural language processing (NLP), and achieving automatic text generation is also an important sign of artificial intelligence maturing. In today's era of rapid information dissemination, how to quickly and accurately use the data tables produced in certain events to automatically generate text describing those events has received wide attention in the natural language field.
For example, when a sporting event concludes, the media publish reports on it. A report introduces the relevant circumstances of the event, such as its background, the participants, the results, and the performance of the athletes, and people want to obtain such reports as soon as possible. Traditionally, briefings on sporting events have been written by sports journalists; with the development of natural language processing, NLP techniques can be used to write them.
So far, neural network models for generating event reports generally first treat the data items in the data table as "words" and then serialize these "words", converting the entire table into a sequence of words. A neural network model that maps the data table sequence to the event report sequence is then trained, and the trained model is used to generate the event report.
SUMMARY
In a first aspect of the present disclosure, a computer-implemented method is provided. The method includes: acquiring data items in a data table, each data item representing the value of an entity for a corresponding attribute; determining, based on the data items, labels corresponding to the data items according to a label classification model, each label indicating the importance of the corresponding data item; and selecting, based on the labels, a first group of data items from the data items for generating a piece of text, the text describing an event associated with the data table.
In some embodiments, selecting the first group of data items from the data items includes: selecting, according to the labels, those data items whose importance is higher than an importance threshold.
In some embodiments, the method further includes: generating the text according to a text generation model based on the first group of data items and the entities and attributes represented by the first group of data items, where the words in the text that are associated with the first group of data items are selected from a corpus in a predetermined order by the text generation model.
In some embodiments, generating the text according to the text generation model includes: selecting a second group of data items from the data items such that the sum of the number of data items in the second group and the number of data items in the first group is a predetermined value; merging the first group of data items and the second group of data items to generate a merged group; and generating the text according to the text generation model based on the data items in the merged group, the attributes and entities represented by the data items in the merged group, and the labels corresponding to the data items in the merged group.
In some embodiments, generating the text according to the text generation model includes: determining, respectively based on the data items in the merged group, the attributes represented by the data items in the merged group, and the labels corresponding to the data items in the merged group, quantized representations of the data items, of the attributes, and of the labels under the same metric according to a quantization model; and generating the text according to the text generation model based on the quantized representations of the data items, of the attributes, and of the labels, as well as the positional order of the data items in the merged group.
In some embodiments, the text generation model is trained such that a group of data items appears in the text according to rules based on the magnitudes of the data values.
In some embodiments, the rules include at least one of the following: the data item with the largest value among the data items for the same attribute; the data items whose values are greater than a first threshold among the data items for the same attribute; the data items whose values are smaller than the first threshold among the data items for the same attribute; the values of the data items for the same attribute arranged in ascending order; and the values of the data items for the same attribute arranged in descending order.
In some embodiments, the training set used for training the text generation model includes training data items in a training data table for the event, as well as the data items in the training data table that are associated with the rules, where the training data items in the training data set are divided into groups according to the entities they represent, and the data items associated with the rules are included in the training data set as an additional group.
In some embodiments, the data items include character data items and numerical data items, and determining, based on the data items, the labels corresponding to the data items according to the label classification model includes: assigning corresponding numerical values to the character data items; and determining the labels corresponding to the data items according to the label classification model based on the numerical data items and the assigned numerical values.
In a second aspect of the present disclosure, an apparatus for generating text is provided. The apparatus includes: an acquisition module configured to acquire data items in a data table, each data item representing the value of an entity for a corresponding attribute; a determination module configured to determine, based on the data items, labels corresponding to the data items according to a label classification model, each label indicating the importance of the corresponding data item; and a selection module configured to select, based on the labels, a group of data items from the data items for generating a piece of text, the text describing an event associated with the data table.
In some embodiments, the selection module is further configured to: select, according to the labels, those data items whose importance is higher than an importance threshold.
In some embodiments, the apparatus further includes: a generation module configured to generate the text according to a text generation model based on the first group of data items and the entities and attributes represented by the first group of data items, where the words in the text that are associated with the first group of data items are selected from a corpus in a predetermined order by the text generation model.
In some embodiments, the generation module is further configured to: select a second group of data items from the data items such that the sum of the number of data items in the second group and the number of data items in the first group is a predetermined value; merge the first group of data items and the second group of data items to generate a merged group; and generate the text according to the text generation model based on the data items in the merged group, the attributes and entities represented by the data items in the merged group, and the labels corresponding to the data items in the merged group.
In some embodiments, the generation module is further configured to: determine, respectively based on the data items in the merged group, the attributes represented by the data items in the merged group, and the labels corresponding to the data items in the merged group, quantized representations of the data items, of the attributes, and of the labels under the same metric according to a quantization model; and generate the text according to the text generation model based on the quantized representations of the data items, of the attributes, and of the labels, as well as the positional order of the data items in the merged group.
In some embodiments, the text generation model is trained such that a group of data items appears in the text according to rules based on the magnitudes of the data values.
In some embodiments, the rules include at least one of the following: the data item with the largest value among the data items for the same attribute; the data items whose values are greater than a first threshold among the data items for the same attribute; the data items whose values are smaller than the first threshold among the data items for the same attribute; the values of the data items for the same attribute arranged in ascending order; and the values of the data items for the same attribute arranged in descending order.
In some embodiments, the training set used for training the text generation model includes training data items in a training data table for the event, as well as the data items in the training data table that are associated with the rules, where the training data items in the training data set are divided into groups according to the entities they represent, and the data items associated with the rules are included in the training data set as an additional group.
In some embodiments, the data items include character data items and numerical data items, and the determination module is further configured to: assign corresponding numerical values to the character data items; and determine the labels corresponding to the data items according to the label classification model based on the numerical data items and the assigned numerical values.
In a third aspect of the present disclosure, an electronic device is provided, including: a memory and a processor, where the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, having one or more computer instructions stored thereon, where the one or more computer instructions are executed by a processor to implement the method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, a computer program product is provided, including one or more computer instructions, where the one or more computer instructions are executed by a processor to implement the method according to the first aspect of the present disclosure.
According to various embodiments of the present disclosure, selecting value-related data items of low importance and omitting data items of high importance can be avoided when generating text.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, identical or similar reference numerals denote identical or similar elements, where:
FIG. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a flowchart of an example method of selecting data items according to some embodiments of the present disclosure;
FIG. 3 shows a schematic block diagram of an example process of determining labels according to some embodiments of the present disclosure;
FIG. 4 shows a schematic block diagram of an example process of generating text according to some embodiments of the present disclosure;
FIG. 5 shows a schematic block diagram of an example overall process of selecting data items and generating text according to some embodiments of the present disclosure;
FIG. 6 shows a block diagram of an apparatus for selecting data items according to some embodiments of the present disclosure; and
FIG. 7 shows a block diagram of a device capable of implementing multiple embodiments of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
In the description of embodiments of the present disclosure, the term "include" and its variants should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The terms "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
As used herein, a "model" can learn the association between corresponding inputs and outputs from training data, so that after training is completed it can generate a corresponding output for a given input. The generation of a model may be based on machine learning techniques. Deep learning is a class of machine learning algorithms that uses multiple layers of processing units to process inputs and provide corresponding outputs. A neural network model is one example of a deep-learning-based model. Herein, a "model" may also be referred to as a "machine learning model", "learning model", "machine learning network", or "learning network"; these terms are used interchangeably herein. As used herein, "determining parameters of a model" or similar expressions refer to determining the values of the parameters of a model (also called parameter values), including specific values, sets of values, or ranges of values.
A "neural network" is a machine learning network based on deep learning. A neural network is capable of processing an input and providing a corresponding output, and typically includes an input layer, an output layer, and one or more hidden layers between them. Neural networks used in deep learning applications usually include many hidden layers, which increases the depth of the network. The layers of a neural network are connected in sequence, so that the output of a preceding layer is provided as the input of a following layer, where the input layer receives the input of the neural network and the output of the output layer serves as the final output of the neural network. Each layer of a neural network includes one or more nodes (also called processing nodes or neurons), and each node processes the input from the preceding layer.
In general, machine learning can roughly include three phases: a training phase, a testing phase, and an application phase (also called an inference phase). In the training phase, a given model can be trained with a large amount of training data, iteratively updating the parameter values of the model until the model can consistently draw inferences from the training data that satisfy the expected objective. Through training, the model can be regarded as able to learn the association from input to output (also called the input-to-output mapping) from the training data. The parameter values of the trained model are determined. In the testing phase, test inputs are applied to the trained model to test whether the model can provide correct outputs, thereby determining the performance of the model. In the application phase, the model can be used to process actual inputs and to determine corresponding outputs based on the parameter values obtained by training.
In some machine learning schemes, the training phase can in turn include pre-training and fine-tuning. Pre-training refers to training the model on a general task, i.e., iteratively updating its parameter values. A pre-trained model has a wide range of applications and can be applied to many different downstream tasks. Fine-tuning refers to training the pre-trained model on the specific downstream task to which the model will be applied. The fine-tuned model is better suited to that specific downstream task.
Example Environment
FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the environment 100 of FIG. 1, a computing device 140 receives a data table 110 for an event and processes the data table 110 to generate a text 150 describing the event. In the example shown in FIG. 1, the event is a basketball game between the Overlord team and the Sword team. The data table 110 includes multiple entities, such as individual players and the two teams. A data item in the data table 110 represents the value of an entity for a certain attribute; for example, the attribute "score" of the entity "Bob" has the value 22. It should be understood that once the computing device 140 has received the data items in the data table 110, it knows the two-dimensional information corresponding to each data item, i.e., the entity and attribute to which the data item corresponds.
In order to report the content people care most about, the content of the text needs to be organized. The computing device 140 is therefore provided with a content selection module 120 and a content ordering module 130. The content selection module 120 is used to filter out the important information, and the content ordering module 130 is used to make the generated text better conform to linguistic logic.
As shown in FIG. 1, the generated example text 150 reports the outcome between the two teams and various statistics of multiple players. However, the sentence "the Overlord team lost to the Sword team 101-118" is not well formed; people usually put the winner, who receives more attention, first. Moreover, the best-performing player on the Sword team is Casper, yet the text 150 first introduces the better-known Steve, probably because Steve usually appeared before Casper when the model was trained. Judging from the data table 110, however, Casper performed better and should be introduced first. For the Overlord team, the data table 110 shows that the player with the most rebounds is Jerry, yet Jerry does not appear in the text 150.
Although the machine learning model for generating event reports shown in FIG. 1 has a content selection module 120 and a content ordering module 130, when both the numerical and non-numerical data items in the data table 110 are treated as "words", i.e., characters, and converted into a sequence, the numerical information carried by the data items in the table is lost. For example, when the data table 110 is processed, a player's name is mapped to an integer ID; for instance, "Bob" is mapped to "1". Similarly, his score "22" is mapped to another ID, such as "2". These IDs are in turn converted into vectors or matrices by the word embedding layer of the machine learning model, so the sequence of words is mapped to a sequence of vectors. The resulting vector sequence can then be processed by a text generation model. In this process, the specific numerical information of the score "22" is lost. As a result, the generated text does not select Jerry, who had better statistics but perhaps less fame, and instead selects Michael, who started the game but had worse statistics.
It can thus be seen that in the text generation process, numerical information has a great influence on content selection and on the ordering and organization of content. For example, how good a player's statistics are matters not only when selecting content, but is also an important consideration when further ordering the selected data.
According to implementations of the present disclosure, a scheme is provided for selecting the data items to be mentioned and for generating text. In this scheme, after the data items in a data table for an event are acquired, the data items are processed with a label classification model to obtain labels that correspond to the data items and are associated with the importance of the data items. Finally, data items are selected via the labels for generating the text describing the event.
By performing label classification based on the importance of data items to select the data items to be mentioned in the text, embodiments of the present disclosure can select the text content to be mentioned more logically based on the importance of the data items themselves, thereby preventing the generated text from missing content of high importance while avoiding the inclusion of content of low importance.
Data Item Selection
FIG. 2 shows a flowchart of an example method 200 of selecting data items according to some embodiments of the present disclosure. The method 200 may be implemented, for example, by the computing device 140 in FIG. 1 or by a different computing device.
As shown in FIG. 2, at block 202, the computing device 140 acquires the data items in a data table. Each data item in the data table represents the value of an entity for a corresponding attribute. As discussed above, the data table may be the statistics of the individual players in a basketball game. Correspondingly, each data item in the data table represents one statistic of a player or team (i.e., the entity), such as the numerical value of the player's score (i.e., the attribute).
At block 204, the computing device 140 determines, based on the data items, the labels corresponding to the data items according to a label classification model. Each label indicates the importance of the corresponding data item. For example, importance is associated with the value of a data item. In some embodiments, a data item has high importance if its value is above a predetermined threshold that depends on the attribute the data item represents. In some embodiments, a data item has high importance if its value is higher than the values of other entities' data items for the same attribute. In some embodiments, a data item has high importance if its value deviates from the entity's historical data by more than a predetermined threshold. For example, in a basketball game, when a player scores more than 20 points, that data item has high importance; when a player achieves the highest score on the team, that data item has high importance; and when a player's score is 20 points higher than his own historical score, that data item likewise has high importance.
In some embodiments, the data items in the data table include character data items (for example, the position a player plays, such as forward, guard, or center) and numerical data items (for example, the player's points, assists, and rebounds). In some embodiments, before the data items are processed, the character data items may be assigned corresponding numerical values; for example, "forward" may be assigned the value "0". Alternatively, in some embodiments, character data items may be discarded. An example process for selecting data items is described in detail below with reference to FIG. 3.
FIG. 3 shows a schematic block diagram of an example process 300 of determining labels according to some embodiments of the present disclosure. The process 300 may be implemented, for example, by the computing device 140 in FIG. 1 or by a different computing device.
As shown in FIG. 3, the data table received by the computing device 140 contains two types of entities, namely players and teams. Accordingly, the data table includes a player data item part 310 and a team data item part 330. After the data items are acquired, the character data items are assigned corresponding numerical values according to predetermined rules. As shown in FIG. 3, the player positions "forward", "center", and "guard" are assigned the values "0", "2", and "1", respectively. Then all the data items in the player data item part 310 and the team data item part 320 are extracted to obtain a player data item set 320 and a team data item set 330, and all the data items are concatenated into a data vector 350. The data vector 350 is then input into the label classification model 360 to determine a label vector 370 corresponding to the data item of each dimension in the data vector 350. Each dimension of the label vector 370 indicates the importance of its corresponding data item. In the example shown in FIG. 3, a label takes one of two values, "0" and "1", where "1" indicates that the corresponding data item is to be mentioned and "0" indicates that the corresponding data item is not to be mentioned.
In this way, data item selection is modeled as a multi-label classification problem. Based on the data vector 350, the computing device 140 predicts a label vector 370 with values "0" and "1" according to the label classification model 360, and then determines from the label vector 370 which data items in the data table need to be mentioned. In some embodiments, the label classification model 360 may be any vector-based classifier model. In some embodiments, the label classification model is an XGBoost model.
It should be understood that concatenating the data items into a vector is only one example implementation of the present disclosure. The data items may also be input into the corresponding classifier model in other suitable forms; the present disclosure is not intended to be limiting in this respect.
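The following is a minimal sketch, not the patent's code, of modeling data item selection as multi-label classification with XGBoost (the classifier the text names), using scikit-learn's one-classifier-per-label wrapper and synthetic stand-ins for the flattened table vectors and the 0/1 mention labels.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

# 40 toy "tables", each flattened to a 6-dim data vector with 5 labels.
X, Y = make_multilabel_classification(
    n_samples=40, n_features=6, n_classes=5, random_state=0
)
clf = MultiOutputClassifier(XGBClassifier(n_estimators=50))
clf.fit(X, Y)

label_vector = clf.predict(X[:1])[0]        # one 0/1 flag per data item
selected = [i for i, flag in enumerate(label_vector) if flag == 1]
print(selected)                              # indices of items to mention
```

Consistent with the statement that the label classification model 360 may be any vector-based classifier, any other classifier could be substituted for XGBClassifier here.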
Returning to FIG. 2, at block 206, the computing device 140 selects, based on the labels, a first group of data items from the data items for generating a piece of text. The text describes the event associated with the data table. In some embodiments, the computing device 140 may select, according to the determined labels, those data items whose importance is higher than an importance threshold.
In this way, by classifying the data items to determine the importance of each data item, it can be ensured that the numerical information of the data items themselves is taken into account when selecting data items, thereby avoiding selecting data items of low actual importance and avoiding failing to select data items of high actual importance.
Text Generation
As discussed above, the group of data items selected with the method 200 can be used to generate text. In some embodiments, the computing device 140 may generate the text according to a text generation model based on the selected first group of data items and the entities and attributes represented by the first group of data items. In these embodiments, the words contained in the generated text are associated with the first group of data items and are selected from a corpus in a predetermined order by the text generation model. Here, the predetermined order may be an order conforming to the rules of natural language, so that the generated text is presented in the form of natural language text. In some embodiments, the text generation model is a sequence prediction model that is trained to select, from a corpus obtained through training, the words associated with an input character sequence by determining the relationships between the items of the input sequence, and to determine the mutual order of the words according to natural language rules so as to generate text conforming to those rules. It should be understood that "corpus" denotes the collection of words contained in the trained text generation model for generating text, and is not limited to a specific database. In some embodiments, the corpus is included in the text generation model and is obtained when the text generation model is trained. In some embodiments, the data sets used during training may be in different languages, yielding corpora in different languages, so that the generated text can be presented in different languages.
Alternatively or additionally, in some embodiments, the text is generated by filling words associated with the data items into a preset template.
In some embodiments, in addition to the selected first group of data items, the computing device 140 may also select a second group of data items from the data items such that the sum of the number of data items in the second group and the number of data items in the first group is a predetermined value. After the second group is selected, the computing device 140 may merge the first group of data items and the second group of data items to generate a merged group. The computing device 140 may then generate the text according to the text generation model based on the data items in the merged group, the attributes and entities represented by the data items in the merged group, and the labels corresponding to the data items in the merged group. Here, in addition to the information in the data table, the labels corresponding to the data items are also input, because the second group of data items, being of low importance, will generally not be mentioned; inputting the corresponding labels ensures that the generated text includes only the content that is expected to be mentioned.
Because the number of data items in the first group selected each time is random, the number of data items used to generate text differs from the number of data items used when training the text generation model, which leads to exposure bias and makes the generated results less than ideal. According to implementations of the present disclosure, a second group of data items is selected so that the number of data items in the merged group of the two groups is a fixed value; in particular, when the text generation model is trained, the same number of data items is also used, making the distribution of the input data as consistent as possible between training and actual use, thereby alleviating exposure bias.
In some embodiments, the sum of the number of data items in the first group and the number of data items in the second group equals the number of all data items in the data table; that is, all data items are input into the text generation model, which improves the consistency of the input distribution between training and prediction and thus further alleviates the exposure bias problem.
In some embodiments, the computing device 140 may, according to a quantization model, respectively convert the data items in the merged group, the attributes represented by the data items, and the corresponding labels into quantized representations based on the same metric. Then, when generating text according to the text generation model, in addition to the quantized representations of the data items, of the attributes, and of the labels, a positional quantized representation indicating the positional order of the data items in the merged group is also introduced. For example, a quantization model can be used to transform the sequences of data items, attributes, and labels into corresponding matrices of the same dimensionality, so that the correlations among these pieces of information can be measured. Similarly, the quantized representation of the positional order may be a matrix of the same dimensionality, with each dimension of the matrix indicating the position of a data item in the data table.
In this way, by quantizing the data items, the attributes they represent, and the corresponding labels, the various pieces of information associated with the data items can be fused together, making the input to the text generation model more comprehensive.
An example process of text generation is described in detail below with reference to FIG. 4.
FIG. 4 shows a schematic block diagram of an example process 400 of generating text according to some embodiments of the present disclosure. The process 400 may be implemented, for example, by the computing device 140 in FIG. 1 or by a different computing device.
As shown in FIG. 4, a data set 410 is obtained by combining the previously determined label vector 370 with the data table 440. Each row of the data set 410 represents a sequence of one type of data, and the data set 410 includes four types of data. The first type (i.e., the first row in the figure) is the value of each data item. The second type (i.e., the second row in the figure) is the attribute represented by the data item. The third type (i.e., the third row in the figure) is the label, where "T" indicates that the data item is of high importance and is to be mentioned, while "F" indicates that the data item is of low importance and will not be mentioned. The fourth type (i.e., the fourth row in the figure) is the position, which indicates the position of the data item in the sequence. In addition, each sequence also includes markers [ENT] indicating the entity to which the data items belong. For example, the sequence of values of the data items may be:
V = [ENT] V_11 V_12 ..., [ENT] V_21 ...,   (1)
where V_ij denotes the value of the data item in row i and column j of the data table 440.
The sequence of attributes is:
K = [ENT] k_11 k_12 ..., [ENT] k_21 ...,   (2)
where k_ij denotes the attribute of the data item in row i and column j of the data table 440.
The sequence of labels is:
F = [ENT] f_11 f_12 ..., [ENT] f_21 ...,   (3)
where f_ij ∈ {F, T} indicates whether the data item in row i and column j of the data table is mentioned, F indicating that the data item is not mentioned and T indicating that the data item is to be mentioned. [ENT] denotes the entity corresponding to the data items.
After the three sequences are obtained, a quantization model is used to convert them respectively into three matrices E_V, E_K, and E_F of the same dimensionality. The quantization model may be, for example, a word2vec model. Then the weighted sum matrix H_0 of the three resulting matrices and the position matrix P is computed:
H_0 = (E_V + E_K + E_F) / 3 + P   (4)
The fusion of the data is thus achieved, and a text generation model (for example, an encoder-decoder model) can be used to process the matrix H_0 to obtain the text 430.
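Equation (4) is simple enough to state directly in code. The numpy sketch below uses random matrices as stand-ins for the word2vec-style embeddings E_V, E_K, E_F and the position matrix P; only the final line reflects the formula itself.

```python
import numpy as np

seq_len, dim = 12, 64
rng = np.random.default_rng(0)
E_V = rng.normal(size=(seq_len, dim))  # embedded value sequence
E_K = rng.normal(size=(seq_len, dim))  # embedded attribute sequence
E_F = rng.normal(size=(seq_len, dim))  # embedded label sequence
P = rng.normal(size=(seq_len, dim))    # position matrix

H0 = (E_V + E_K + E_F) / 3 + P         # equation (4): fused model input
```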
The previously determined labels are input to the model as features together with the data table, so that in both the training and prediction phases the input consists of all the data items in the data table. This minimizes the inconsistency of the input data distribution between model training and prediction as much as possible, thereby greatly alleviating the exposure bias problem.
Data Item Ordering
As discussed above, the ordering of the text content also affects the readability of the text. The following describes how the text content is ordered by ordering the data items. In some embodiments, the text generation model may be trained so that a group of data items appears in the text according to rules based on the magnitudes of the data values. In reports on a particular kind of event, some entities usually appear according to rules based on the magnitudes of the values of the same attribute. For example, the rules may include: the data item with the largest value among the data items for the same attribute; the data items whose values are greater than a first threshold among the data items for the same attribute; the data items whose values are smaller than the first threshold among the data items for the same attribute; the values of the data items for the same attribute arranged in ascending order; and the values of the data items for the same attribute arranged in descending order. Specifically, for a basketball game, for example, the rules of interest roughly include the following six categories:
1. The name of the player with the largest value of some attribute among several players, for example the highest scorer, or the player with the most assists, among the closely watched players of the two teams;
2. The attribute value of the player with the largest value of some attribute among several players, for example the score of the highest scorer;
3. The names of the players whose value of some attribute is greater than a preset threshold, for example the names of players who scored more than 10 points;
4. The attribute values of the players whose value of some attribute is greater than a preset threshold, for example the specific scores of players who scored more than 10 points;
5. Several players ordered by the value of some attribute, for example the closely watched players ordered by ascending score;
6. The values of some attribute ordered by magnitude, for example the scores sorted in ascending order.
In some implementations, in order to enable the text generation model to present data items according to predetermined rules, a pre-training task is set up for the text generation model, for example based on the idea of transfer learning: the text generation model is first trained using sampled data items as input and the data items satisfying a rule as the result, and it is then fine-tuned on the downstream task so that the text generation model learns the predetermined rules. In this way, the text generation model can determine the mutual order of the first group of data items according to the predetermined rules, thereby improving the logical coherence of the generated text.
In some embodiments, the training set used when training the text generation model includes not only the training data items in the training data table for the event, but also the data items in the training data table that are associated with the rules. Here, the training data items in the training data set are divided by entity, and the data items associated with the rules are included in the training data set as an additional group. Specifically, the value sequence of the training data items may be:
V = [ENT] V_11 V_12 ..., [ENT] V_21 ..., [ENT] v̂_1 v̂_2 ...,   (5)
where v̂_i is the i-th data item in the sequence of data items associated with the rule. The attribute sequence of the training data items is:
K = [ENT] k_11 k_12 ..., [ENT] k_21 ..., [ENT] k̂_1 k̂_2 ...,   (6)
where k̂_i is the attribute of the i-th data item in the sequence of data items associated with the rule. In addition, the [ENT] preceding v̂_1 and k̂_1 denotes the specific rule.
For example, when the rule is the name of the player with the highest score among several players, the additional group of the value sequence is "[Max_Name] 20 22 34 18" and the additional group of the attribute sequence is "[Max_Name] Tom.Score Bob.Score Casper.Score John.Score". The training data set is thereby transformed into sequences with the same structure as discussed previously. Meanwhile, the corresponding sequences satisfying the rule are "34" and "Casper.Score". The text generation model is fine-tuned with reference to the rule-satisfying sequences so that the text generation model learns the rules.
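A hedged sketch of assembling one such pre-training example for the Max_Name rule: the rule group is appended as an additional [ENT]-style group, and the rule-satisfying tokens serve as the training targets. The token formats follow the example above; everything else is an assumption.

```python
scores = {"Tom": 20, "Bob": 22, "Casper": 34, "John": 18}

# Additional groups of equations (5) and (6), headed by the rule marker.
value_group = ["[Max_Name]"] + [str(v) for v in scores.values()]
attr_group = ["[Max_Name]"] + [f"{name}.Score" for name in scores]

best = max(scores, key=scores.get)
targets = [str(scores[best]), f"{best}.Score"]  # ["34", "Casper.Score"]
```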
In this way, by introducing the data items associated with the preset rules into the data item sequence and the attribute sequence, the text generation model can learn the preset rules and present the data items according to them, making the generated text more reasonable.
FIG. 5 shows a schematic block diagram of an example overall process 500 of selecting data items and generating text according to some embodiments of the present disclosure. The process 500 may be implemented, for example, by the computing device 140 in FIG. 1 or by a different computing device.
As shown in FIG. 5, after the computing device 140 receives the data table 440, it processes the data table 440 with the label classification model 360 to generate the label vector 370. The text generation model 420 then processes the data table 440 and the generated label vector 370, thereby generating the text 510. Unlike the text 150 shown in FIG. 1, the text 510 conforms better to linguistic logic. For example, the closely watched winner, the Sword team, is introduced first, and the players of both the Sword team and the Overlord team are introduced in descending order of their scores. In addition, more players of the winning Sword team are introduced, and none of the poorly performing players of the losing Overlord team are introduced.
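Putting the pieces together, process 500 can be summarized by the following sketch, which chains the illustrative helpers from the earlier examples; the generate() call is an assumed stand-in for the trained encoder-decoder's interface, not an actual API.

```python
def generate_report(data_vector, table, clf, text_model):
    """Classify items, then condition generation on the table plus labels."""
    label_vector = clf.predict(data_vector[None, :])[0]   # label vector 370
    keys = [(e, a) for e, items in table.items() for a, _ in items]
    mention = {k: ("T" if flag == 1 else "F")
               for k, flag in zip(keys, label_vector)}
    seqs = build_sequences(table, mention)                # sequences V, K, F
    return text_model.generate(*seqs)                     # assumed interface
```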
Example Apparatus and Device
FIG. 6 shows a block diagram of an apparatus 600 for selecting data items according to some embodiments of the present disclosure. The apparatus 600 may be implemented as or included in, for example, the computing device 140. Each module/component in the apparatus 600 may be implemented by hardware, software, firmware, or any combination thereof.
As shown in the figure, the apparatus 600 includes an acquisition module 610 configured to acquire the data items in a data table, each data item representing the value of an entity for a corresponding attribute.
The apparatus 600 further includes a determination module 620 configured to determine, based on the data items, labels corresponding to the data items according to a label classification model, each label indicating the importance of the corresponding data item.
The apparatus 600 further includes a selection module 630 configured to select, based on the labels, a group of data items from the data items for generating a piece of text, the text describing an event associated with the data table.
FIG. 7 shows a block diagram of a computing device 700 in which one or more embodiments of the present disclosure can be implemented. It should be understood that the computing device 700 shown in FIG. 7 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein. The computing device 700 shown in FIG. 7 may be used to implement the computing device 140 of FIG. 1.
As shown in FIG. 7, the computing device 700 is in the form of a general-purpose computing device. The components of the computing device 700 may include, but are not limited to, one or more processors or processing units 710, a memory 720, a storage device 730, one or more communication units 740, one or more input devices 750, and one or more output devices 760. The processing unit 710 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 720. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capability of the computing device 700.
The computing device 700 typically includes multiple computer storage media. Such media may be any available media accessible to the computing device 700, including but not limited to volatile and non-volatile media, and removable and non-removable media. The memory 720 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 730 may be removable or non-removable media and may include machine-readable media, such as flash drives, magnetic disks, or any other media that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the computing device 700.
The computing device 700 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in FIG. 7, a magnetic disk drive for reading from or writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 720 may include a computer program product 725 having one or more program modules configured to perform the various methods or actions of the various embodiments of the present disclosure.
The communication unit 740 enables communication with other computing devices over communication media. Additionally, the functions of the components of the computing device 700 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over communication links. Accordingly, the computing device 700 may operate in a networked environment using logical connections to one or more other servers, a network personal computer (PC), or another network node.
The input device 750 may be one or more input devices, such as a mouse, keyboard, or trackball. The output device 760 may be one or more output devices, such as a display, speakers, or printer. The computing device 700 may also, as needed, communicate through the communication unit 740 with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable users to interact with the computing device 700, or with any device (e.g., a network card, a modem) that enables the computing device 700 to communicate with one or more other computing devices. Such communication may be performed via input/output (I/O) interfaces (not shown).
According to exemplary implementations of the present disclosure, a computer-readable storage medium is provided on which computer-executable instructions are stored, where the computer-executable instructions are executed by a processor to implement the method described above. According to exemplary implementations of the present disclosure, a computer program product is also provided, tangibly stored on a non-transitory computer-readable medium and including computer-executable instructions that are executed by a processor to implement the method described above.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses, devices, and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to the processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause computers, programmable data processing apparatuses, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other device, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the drawings show the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or part of an instruction, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings; for example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above; the above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The choice of terms used herein is intended to best explain the principles of the embodiments, their practical applications, or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (13)

  1. A computer-implemented method, comprising:
    acquiring data items in a data table, each of the data items representing the value of an entity for a corresponding attribute;
    determining, based on the data items, labels corresponding to the data items according to a label classification model, each label indicating the importance of the corresponding data item; and
    selecting, based on the labels, a first group of data items from the data items for generating a piece of text, the text describing an event associated with the data table.
  2. The method of claim 1, wherein selecting the first group of data items from the data items comprises:
    selecting, according to the labels, those of the data items whose importance is higher than an importance threshold.
  3. The method of claim 1, further comprising:
    generating the text according to a text generation model based on the first group of data items and the entities and attributes represented by the first group of data items,
    wherein the words in the text that are associated with the first group of data items are selected from a corpus in a predetermined order by the text generation model.
  4. The method of claim 3, wherein generating the text according to the text generation model comprises:
    selecting a second group of data items from the data items such that the sum of the number of data items in the second group of data items and the number of data items in the first group of data items is a predetermined value;
    merging the first group of data items and the second group of data items to generate a merged group; and
    generating the text according to the text generation model based on the data items in the merged group, the attributes and entities represented by the data items in the merged group, and the labels corresponding to the data items in the merged group.
  5. The method of claim 4, wherein generating the text according to the text generation model comprises:
    determining, respectively based on the data items in the merged group, the attributes represented by the data items in the merged group, and the labels corresponding to the data items in the merged group, quantized representations of the data items, of the attributes, and of the labels under the same metric according to a quantization model; and
    generating the text according to the text generation model based on the quantized representations of the data items, of the attributes, and of the labels, as well as the positional order of the data items in the merged group.
  6. The method of claim 3, wherein the text generation model is trained such that the group of data items appears in the text according to rules based on the magnitudes of the data values.
  7. The method of claim 6, wherein the rules comprise at least one of the following:
    the data item with the largest value among the data items for the same attribute;
    the data items whose values are greater than a first threshold among the data items for the same attribute;
    the data items whose values are smaller than the first threshold among the data items for the same attribute;
    the values of the data items for the same attribute arranged in ascending order; and
    the values of the data items for the same attribute arranged in descending order.
  8. The method of claim 6, wherein the training set used for training the text generation model comprises training data items in a training data table for the event, as well as data items in the training data table that are associated with the rules;
    wherein the training data items in the training data set are divided into groups according to the entities they represent, and the data items associated with the rules are included in the training data set as an additional group.
  9. The method of claim 1, wherein the data items comprise character data items and numerical data items, and determining, based on the data items, the labels corresponding to the data items according to the label classification model comprises:
    assigning corresponding numerical values to the character data items; and
    determining the labels corresponding to the data items according to the label classification model based on the numerical data items and the assigned corresponding numerical values.
  10. An apparatus for generating text, comprising:
    an acquisition module configured to acquire data items in a data table, each of the data items representing the value of an entity for a corresponding attribute;
    a determination module configured to determine, based on the data items, labels corresponding to the data items according to a label classification model, each label indicating the importance of the corresponding data item; and
    a selection module configured to select, based on the labels, a group of data items from the data items for generating a piece of text, the text describing an event associated with the data table.
  11. An electronic device, comprising:
    a memory and a processor;
    wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method of any one of claims 1 to 9.
  12. A computer-readable storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processor to implement the method of any one of claims 1 to 9.
  13. A computer program product, comprising one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method of any one of claims 1 to 9.
PCT/CN2022/125780 2021-11-24 2022-10-17 Method and apparatus for generating text WO2023093372A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111406999.6 2021-11-24
CN202111406999.6A CN114091446A (zh) 2021-11-24 2021-11-24 Method and apparatus for generating text

Publications (1)

Publication Number Publication Date
WO2023093372A1 (zh)

Family

ID=80304151

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125780 WO2023093372A1 (zh) 2021-11-24 2022-10-17 Method and apparatus for generating text

Country Status (2)

Country Link
CN (1) CN114091446A (zh)
WO (1) WO2023093372A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091446A (zh) * 2021-11-24 2022-02-25 北京有竹居网络技术有限公司 Method and apparatus for generating text

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5930803A (en) * 1997-04-30 1999-07-27 Silicon Graphics, Inc. Method, system, and computer program product for visualizing an evidence classifier
US20100211535A1 (en) * 2009-02-17 2010-08-19 Rosenberger Mark Elliot Methods and systems for management of data
CN108897857A (zh) * 2018-06-28 2018-11-27 东华大学 Domain-oriented Chinese text topic sentence generation method
CN110532451A (zh) * 2019-06-26 2019-12-03 平安科技(深圳)有限公司 Retrieval method and apparatus for policy texts, storage medium, and electronic apparatus
CN114091446A (zh) * 2021-11-24 2022-02-25 北京有竹居网络技术有限公司 Method and apparatus for generating text

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078825A (zh) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Structured processing method and apparatus, computer device, and medium
CN111310927B (zh) * 2020-01-19 2022-04-15 哈尔滨工业大学 Text generation method incorporating a reasoning mechanism
CN112069321B (zh) * 2020-11-11 2021-02-12 震坤行网络技术(南京)有限公司 Method, electronic device, and storage medium for hierarchical text classification

Also Published As

Publication number Publication date
CN114091446A (zh) 2022-02-25

Similar Documents

Publication Publication Date Title
Yu et al. Hierarchical deep click feature prediction for fine-grained image recognition
Zhang et al. Multilabel image classification with regional latent semantic dependencies
Zhang et al. Zero-shot sketch-based image retrieval via graph convolution network
CN111753101B (zh) Knowledge graph representation learning method fusing entity descriptions and types
Quteineh et al. Textual data augmentation for efficient active learning on tiny datasets
US11373117B1 (en) Artificial intelligence service for scalable classification using features of unlabeled data and class descriptors
CN105593849A (zh) Database access
Xu et al. Mdan: Multi-level dependent attention network for visual emotion analysis
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
JP2015230570A (ja) 学習モデル作成装置、判定システムおよび学習モデル作成方法
WO2021238279A1 (zh) Data classification method, classifier training method, and system
Li et al. Publication date estimation for printed historical documents using convolutional neural networks
WO2023093372A1 (zh) Method and apparatus for generating text
CN109062958B (zh) Automatic classification method for primary school compositions based on TextRank and convolutional neural networks
CN110134965A (zh) Method, apparatus, device, and computer-readable storage medium for information processing
WO2020135054A1 (zh) Video recommendation method, apparatus, device, and storage medium
Eyal et al. Large scale substitution-based word sense induction
Bi et al. Simple or complex? complexity-controllable question generation with soft templates and deep mixture of experts model
CN115860009B (zh) Sentence embedding method and system that introduces auxiliary samples for contrastive learning
Revindasari et al. Traceability between business process and software component using Probabilistic Latent Semantic Analysis
Cai et al. Semantic and correlation disentangled graph convolutions for multilabel image recognition
CN109657013A (zh) Method and system for systematically generating labels
Singh et al. Visual content generation from textual description using improved adversarial network
CN114462673A (zh) Method, system, computing device, and readable medium for predicting future events
Wang et al. Centermatch: A center matching method for semi-supervised facial expression recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897455

Country of ref document: EP

Kind code of ref document: A1