CN111414735B - Text data generation method and device - Google Patents
- Publication number: CN111414735B
- Application number: CN202010166957.9A
- Authority
- CN
- China
- Prior art keywords
- text data
- theme
- sentences
- deep learning
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
- G06N3/045 — Neural networks; Combinations of networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/084 — Learning methods; Backpropagation, e.g. using gradient descent
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Embodiments of the present application disclose a method and an apparatus for generating text data. The method comprises the following steps: acquiring text data corresponding to a preset theme to obtain a sample data set of the theme; establishing, from the sample data set of the theme, a deep learning model for generating text data of the theme, wherein the deep learning model records language logic expression relations; and, after receiving a text generation request for the theme, generating the corresponding text data with the theme's deep learning model.
Description
Technical Field
The embodiment of the application relates to the field of information processing, in particular to a method and a device for generating text data.
Background
In business scenarios in the e-commerce and new-media industries, large amounts of text such as news, introductions, and promotional copy are needed as an important basis for spreading and diffusing information. In the related art, a result text with specified content is generated from a given text set by randomly sampling existing text data from the set and splicing and integrating it; alternatively, the text is segmented and the segments are randomly selected and integrated to generate the text.
Text generated by random sampling reads poorly, and its content overlaps heavily with that of other texts.
Disclosure of Invention
To solve the above technical problems, the embodiments of the present application provide a method and an apparatus for generating text data.
In order to achieve the purpose of the embodiment of the present application, the embodiment of the present application provides a method for generating text data, including:
acquiring text data corresponding to a preset theme, and obtaining a sample data set of the theme;
establishing a deep learning model for generating text data of the theme by using the sample data set of the theme, wherein the deep learning model records language logic expression relations;
and after receiving a text generation request for the theme, generating corresponding text data by using a deep learning model of the theme.
In an exemplary embodiment, the obtaining text data corresponding to a preset theme includes:
acquiring text data on a website by using a preset text acquisition tool;
classifying the acquired text data according to the subject of the text data to obtain the text data corresponding to the subject.
In an exemplary embodiment, the establishing of a deep learning model for generating text data of the theme by using the sample data set of the theme includes:
identifying language logic expression relations among words and sentences in each text data in the sample data set;
according to the language logic expression relationship, performing cross combination on words and sentences in at least two text data to obtain new words and sentences;
and establishing the deep learning model according to the language logic expression relation and the new words and sentences.
In an exemplary embodiment, the generating corresponding text data using the deep learning model of the topic includes:
acquiring keyword information of the theme;
inquiring target words and sentences conforming to the preset description content of the keywords from words and sentences stored in advance in the deep learning model of the theme;
and controlling the deep learning model to arrange and combine the target words and sentences according to the language logic expression relation obtained in advance to obtain text data corresponding to the keywords.
In an exemplary embodiment, before controlling the deep learning model to permute and combine the target words and sentences according to the pre-acquired language logic expression relations, the method further includes:
acquiring target text data comprising the keywords;
and identifying the target text data by using the deep learning model to obtain the language logic expression relation of the target text information, and carrying out cross combination on words and sentences in at least two target text data according to the language logic expression relation of the target text information to obtain new target words and sentences.
A text data generating apparatus comprising:
the acquisition module is used for acquiring text data corresponding to a preset theme and obtaining a sample data set of the theme;
the establishing module is used for establishing a deep learning model for generating text data of the theme by using the sample data set of the theme, wherein the deep learning model records a language logic expression relation;
and the generation module is used for generating corresponding text data by utilizing the deep learning model of the theme after receiving the text generation request of the theme.
In one exemplary embodiment, the acquisition module includes:
the acquisition unit is used for acquiring text data on a website by using a preset text acquisition tool;
and the classification unit is used for classifying the acquired text data according to the subject of the text data to obtain the text data corresponding to the subject.
In one exemplary embodiment, the establishing module includes:
the identifying unit is used for identifying the language logic expression relation among words and sentences in each text data in the sample data set;
the combination unit is used for carrying out cross combination on words and sentences in at least two text data according to the language logic expression relationship to obtain new words and sentences;
and the establishing unit is used for establishing the deep learning model according to the language logic expression relation and the new words and sentences.
In one exemplary embodiment, the generating module includes:
a first obtaining unit, configured to obtain keyword information of the subject;
the query unit is used for querying, from words and sentences stored in advance in the deep learning model of the theme, target words and sentences conforming to the preset description content of the keywords;
and the control unit is used for controlling the deep learning model to arrange and combine the target words and sentences according to the language logic expression relation acquired in advance to obtain text data corresponding to the keywords.
In an exemplary embodiment, the generating module further includes:
the second acquisition unit is used for acquiring target text data comprising the keywords before the target words and sentences are arranged and combined;
and the processing unit is used for identifying the target text data by utilizing the deep learning model to obtain the language logic expression relation of the target text information, and carrying out cross combination on words and sentences in at least two target text data according to the language logic expression relation of the target text information to obtain new target words and sentences.
According to the scheme provided by the embodiments of the present application, text data corresponding to a preset theme is acquired to obtain a sample data set of the theme; a deep learning model for generating text data of the theme is established from that sample data set, with language logic expression relations recorded in the model; and after a text generation request for the theme is received, the corresponding text data is generated with the theme's deep learning model. Because the deep learning model is trained on the acquired sample data set and the text of the specified theme is generated using the language logic expression relations in the model, the readability of the text is improved.
Additional features and advantages of embodiments of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of embodiments of the application. The objectives and other advantages of the embodiments of the present application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the technical solutions of the embodiments of the present application; they are incorporated in and constitute a part of this specification, illustrate those technical solutions, and do not constitute a limitation on them.
Fig. 1 is a flowchart of a method for generating text data according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a two-layer LSTM framework provided by an embodiment of the present application;
fig. 3 is a block diagram of a text data generating apparatus according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in detail hereinafter with reference to the accompanying drawings. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be arbitrarily combined with each other.
In implementing the present application, the inventors found that the related-art method of generating text by randomly splicing a given text set has the following problems:
1. Limited content: because existing schemes draw only from a given text set, the generated results are confined to that set, and with a small text set the generated text is highly repetitive.
2. Poor coherence: random splicing does not consider the semantic consistency between adjacent parts of the text, so the result is often incoherent and hard to read.
In summary, the results of random splicing over a given text set are poor, and these limitations and incoherence are difficult to eliminate within that approach, so the present application proposes the following solution:
fig. 1 is a flowchart of a method for generating text data according to an embodiment of the present application. As shown in fig. 1, the method includes:
Step 101, acquiring text data corresponding to a preset theme, and obtaining a sample data set of the theme;
in one exemplary embodiment, the subject is a particular word, which is an expanded description of the text content around that word; for example, a subject word corresponding to text data describing a river may be the name of the river.
In an exemplary embodiment, the text data corresponding to the subject may be manually read, extracted, or acquired with a preset text reading tool.
In an exemplary embodiment, the obtaining text data corresponding to a preset theme includes:
acquiring text data on a website by using a preset text acquisition tool;
classifying the acquired text data according to the subject of the text data to obtain the text data corresponding to the subject.
The text collection tool may be a web crawler tool.
The text collection tool acquires data from preset websites in real time to obtain text data, and the text data are classified by theme to obtain the text data of each theme. This keeps the sample data set updated in real time and reduces the repetitiveness of the subsequently generated text data.
Step 102, establishing a deep learning model for generating text data of the theme by using the sample data set of the theme, wherein the deep learning model records a language logic expression relation;
To address the poor readability of randomly generated texts in the related art, the expression rules of language logic can be learned by training on the sample data set, providing an operational basis for the subsequent generation of readable text.
In an exemplary embodiment, the establishing of a deep learning model for generating text data of the theme by using the sample data set of the theme includes:
identifying language logic expression relations among words and sentences in each text data in the sample data set;
according to the language logic expression relationship, performing cross combination on words and sentences in at least two text data to obtain new words and sentences;
and establishing the deep learning model according to the language logic expression relation and the new words and sentences.
After the language logic expression relations are obtained, words and sentences from at least two pieces of text data can be cross-combined to obtain new words and sentences that remain readable yet are not duplicates, providing data support for the subsequent generation of new text data.
The cross combination mode comprises the following steps:
selecting at least one word or sentence from each piece of text data and combining them into a new combination; or,
selecting at least one word or sentence from each piece of text data, combining them, modifying the combined content, and using the modified combination as the new combination.
For example, the content describing word X in text 1 is ABC, in text 2 is DEF, and in text 3 is GH, where A to H each represent a different term. After the contents of the 3 texts are cross-combined, the content describing word X can take various expressions such as AEH and CFG. Alternatively, once expressions such as AEH and CFG are obtained, part of a new combination may be adjusted; for example, A in AEH may be modified into A' and then combined with EH to obtain A'EH.
As the example shows, the overlap between the recombined text data and the earlier text data is significantly reduced.
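The cross-combination described above can be sketched as taking one term from each text and forming every ordered combination. A minimal illustration (the term lists A..H are the placeholders from the example, not real data):

```python
from itertools import product

# Contents describing word X in three sample texts; A..H stand for
# distinct terms, as in the example above.
text1 = ["A", "B", "C"]
text2 = ["D", "E", "F"]
text3 = ["G", "H"]

# Cross-combine: pick one term from each text to form a new description.
combinations = ["".join(terms) for terms in product(text1, text2, text3)]

print(len(combinations))      # 3 * 3 * 2 = 18 new combinations
print("AEH" in combinations)  # True
print("CFG" in combinations)  # True
```

In the real method the combinations would additionally be filtered by the learned language logic expression relations, so that only readable combinations survive.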
Step 103, after receiving a text generation request for the theme, generating corresponding text data by using the deep learning model of the theme.
In an exemplary embodiment, new text data is generated from the text data in the collected sample data set by means of the language logic expression relations recorded in the deep learning model. This ensures that the content of the new text data conforms to those relations and is therefore readable.
In an exemplary embodiment, the generating corresponding text data using the deep learning model of the topic includes:
acquiring keyword information of the theme;
inquiring target words and sentences conforming to the preset description content of the keywords from words and sentences stored in advance in the deep learning model of the theme;
and controlling the deep learning model to arrange and combine the target words and sentences according to the language logic expression relation obtained in advance to obtain text data corresponding to the keywords.
The keyword information can narrow the theme and determine the aspect on which the description focuses; when the theme word is a scenic spot, the keyword information may be, for example, history and culture, scenery, or dining and accommodation. Filtering the pre-stored words and sentences with the keywords brings the generated text data closer to the user's needs and improves the content accuracy of the text data.
In an exemplary embodiment, before controlling the deep learning model to permute and combine the target words and sentences according to the pre-acquired language logic expression relations, the method further includes:
acquiring target text data comprising the keywords;
and identifying the target text data by using the deep learning model to obtain the language logic expression relation of the target text information, and carrying out cross combination on words and sentences in at least two target text data according to the language logic expression relation of the target text information to obtain new target words and sentences.
Target text data containing the keywords better matches the user's needs; by analyzing such target text data, text data that better fits the user's needs can be obtained.
According to the method provided by the embodiments of the present application, text data corresponding to a preset theme is acquired to obtain a sample data set of the theme; a deep learning model for generating text data of the theme is established from that sample data set, with language logic expression relations recorded in the model; and after a text generation request for the theme is received, the corresponding text data is generated with the theme's deep learning model. Because the deep learning model is trained on the acquired sample data set and the text of the specified theme is generated using the language logic expression relations in the model, the readability of the text is improved.
The following describes the method provided in the embodiment of the present application:
the LSTM (Long Short Term Memory, long and short term memory) algorithm used in the invention belongs to one of RNN cyclic neural networks, is good at processing and analyzing events with longer intervals and delays in a time sequence, and is generally used for the problems of speech recognition, language translation, stock prediction and the like. The method comprises the steps of automatically crawling text data containing specified contents, performing processes such as cleaning and integration on the data to obtain a data set, training on the data set to obtain an LSTM model, and generating a target text with specified length under the given content condition by the model to complete text generation based on the LSTM.
The key steps of the method are data acquisition, data processing, model training, and text generation; after the trained LSTM model is obtained, the complete generation method is finalized in combination with the specific business conditions. The method specifically comprises the following steps:
1. Data acquisition step: automatically acquire a data set with a Python crawler, crawling text data from a given website.
Acquiring data with Python crawlers makes it possible to crawl the comments on a preset website. This can be implemented with Python's concurrent module, invoking 10 threads to crawl the data, which is then stored in csv file format.
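A minimal sketch of this acquisition step, using `concurrent.futures.ThreadPoolExecutor` with 10 workers and the `csv` module. The URLs are hypothetical, and `fetch` is a stand-in for a real HTTP request and page-parsing step, so the sketch runs without network access:

```python
import csv
from concurrent.futures import ThreadPoolExecutor

# Hypothetical comment-page URLs; in practice these would point at the
# actual target website.
urls = [f"https://example.com/comments?page={i}" for i in range(20)]

def fetch(url):
    # Stand-in for an HTTP request (e.g. urllib/requests) followed by
    # extraction of the comment text from the page.
    return f"comment text from {url}"

# 10 worker threads crawl the pages concurrently, as described above.
with ThreadPoolExecutor(max_workers=10) as pool:
    comments = list(pool.map(fetch, urls))

# Store the crawled text in csv file format.
with open("comments.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([c] for c in comments)

print(len(comments))  # 20
```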
2. Data processing step: clean the acquired data set, remove irregular text content such as special characters, punctuation, and line breaks, unify the text format, and split the data into a training set and a validation set.
From the crawled csv file, the special characters in the text must be removed so that they do not affect the generated model. Special characters such as punctuation, line breaks, and spaces in the text can be removed with regular expressions, and existing incorrectly written characters can be corrected. After processing, the characters in the resulting text set serve as the sample data set for establishing the LSTM model in the subsequent steps.
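The regular-expression cleaning can be sketched as follows. The exact pattern is an assumption (the description only says punctuation, line breaks, and spaces are removed); this version keeps word characters, including CJK characters, and strips everything else:

```python
import re

def clean_text(raw):
    # Drop punctuation, line breaks, spaces and other special characters,
    # keeping only word characters (letters, digits, CJK characters).
    return re.sub(r"[^\w]|_", "", raw)

print(clean_text("Great place!\nWould visit again...  "))  # GreatplaceWouldvisitagain
```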
3. Model training: constructing an LSTM model based on the obtained data set, and completing training of the model;
the deep learning model is created by adopting Keras, wherein Keras is an advanced neural network API, supports TensorFlow, theano and CNTK, can easily build some complex neural network models by Keras, and is a deep learning framework which is relatively common and has relatively good performance.
In order to automatically generate a text of comment content with high readability, the method builds an LSTM model for the comment data set obtained in the last step.
In constructing the LSTM model, each character must first be mapped to a number. The model's input feature vector consists of the numbers of the preceding 10 characters, and the target variable is the number of the character that follows those 10. The txt file contains 1949 characters in total (including Chinese characters and punctuation marks); processing it in this way yields 41402 samples, which are fed into the LSTM model.
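The character-to-number mapping and windowing can be sketched in plain Python. The toy corpus below stands in for the crawled comment text; on the real corpus this same construction yields the sample counts mentioned above:

```python
# Toy stand-in for the crawled and cleaned comment text.
corpus = "the quick brown fox jumps over the lazy dog"

# Map each character to an integer.
chars = sorted(set(corpus))
char2idx = {c: i for i, c in enumerate(chars)}

SEQ_LEN = 10  # the input feature vector covers the preceding 10 characters
X = [[char2idx[c] for c in corpus[i:i + SEQ_LEN]]
     for i in range(len(corpus) - SEQ_LEN)]
y = [char2idx[corpus[i + SEQ_LEN]] for i in range(len(corpus) - SEQ_LEN)]

print(len(X))     # len(corpus) - 10 samples
print(len(X[0]))  # each feature vector holds 10 indices
```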
The LSTM model is established as follows:
fig. 2 is a schematic diagram of a two-layer LSTM framework provided in an embodiment of the present application. As shown in fig. 2, each layer of the 2-layer LSTM framework has 128 hidden-layer nodes, and batch_size is set to 64 (i.e., 64 samples are taken for each training step). A Dropout layer is created, which effectively mitigates overfitting when the model has many parameters but little sample data, achieving a degree of regularization. The last layer is a Softmax layer, which casts the task as a multi-class classification problem; cross entropy is used as the loss function here, and the model parameters are updated by backpropagation.
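A Keras sketch of the architecture just described: two stacked 128-unit LSTM layers, a Dropout layer, and a Softmax output over the 1949-character vocabulary, trained with cross entropy. The embedding size (64) and dropout rate (0.2) are assumptions; the description does not fix them:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 1949  # distinct characters, per the description
SEQ_LEN = 10       # input: indices of the preceding 10 characters

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),        # embedding size is an assumption
    layers.LSTM(128, return_sequences=True), # layer 1: 128 hidden nodes
    layers.LSTM(128),                        # layer 2: 128 hidden nodes
    layers.Dropout(0.2),                     # regularization against overfitting
    layers.Dense(VOCAB_SIZE, activation="softmax"),  # multi-class output
])
# Cross-entropy loss; parameters updated by backpropagation.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# One training step on a batch of 64 random samples (batch_size = 64).
dummy_X = np.random.randint(0, VOCAB_SIZE, size=(64, SEQ_LEN))
dummy_y = np.random.randint(0, VOCAB_SIZE, size=(64,))
model.train_on_batch(dummy_X, dummy_y)

preds = model.predict(dummy_X, verbose=0)
print(preds.shape)  # (64, 1949): one probability distribution per sample
```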
LSTM brings the following advantages to the analysis of text data:
(1) high accuracy for classification;
(2) strong parallel processing capability, supporting distributed storage and learning;
(3) good fault tolerance and robustness to noise in the data, and the ability to closely approximate complex nonlinear relationships;
(4) a long- and short-term memory function that captures the internal connections between texts instead of treating them as isolated units.
4. Text generation step: generate the target text with the model obtained through training.
After the LSTM model is obtained through training, it is used to generate the result text. Since the feature vector input during training corresponds to the preceding 10 characters, the seed input at the generation stage is also 10 characters long.
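The generation loop can be sketched independently of the trained model: slide a 10-character window over the output, asking the model for the next character at each step. Here `dummy_predict` is a stand-in for the trained LSTM's next-character choice, so the sketch is self-contained:

```python
import random

def generate(seed, length, predict_next):
    # The seed must match the training window: 10 characters.
    assert len(seed) == 10
    out = seed
    for _ in range(length):
        window = out[-10:]           # last 10 characters as the feature vector
        out += predict_next(window)  # the model supplies the next character
    return out[len(seed):]

# Stand-in predictor; in the real method this is the trained LSTM picking
# the next character from its Softmax output over the vocabulary.
random.seed(0)
def dummy_predict(window):
    return random.choice("abcdefg ")

text = generate("nice hotel", 50, dummy_predict)
print(len(text))  # 50
```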
First, the language logic expression relations in the LSTM model keep the output consistent with normal human language usage, improving readability. Second, comparing the generated text with the texts in the sample data set, the generation is not a wholesale copy or edit of the samples but a selective organization of logically linked text from the data set, which reduces the degree of duplication.
When the data set is limited, the generated texts obtained across multiple training runs are quite similar to one another; but as the number of automatically acquired samples keeps increasing, the similarity between generated result texts decreases and a more satisfactory training result is obtained.
From the above, by automatically crawling texts and expanding the sample data set, the method and apparatus can be applied to business scenarios of automatic text generation with various themes; training an LSTM model on the automatically acquired data set yields a model that generates target texts of the specified theme with better readability.
Fig. 3 is a block diagram of a text data generating apparatus according to an embodiment of the present application. As shown in fig. 3, the apparatus shown in fig. 3 includes:
the acquisition module is used for acquiring text data corresponding to a preset theme and obtaining a sample data set of the theme;
the establishing module is used for establishing a deep learning model for generating text data of the theme by using the sample data set of the theme, wherein the deep learning model records a language logic expression relation;
and the generation module is used for generating corresponding text data by utilizing the deep learning model of the theme after receiving the text generation request of the theme.
In one exemplary embodiment, the acquisition module includes:
the acquisition unit is used for acquiring text data on a website by using a preset text acquisition tool;
and the classification unit is used for classifying the acquired text data according to the subject of the text data to obtain the text data corresponding to the subject.
In one exemplary embodiment, the establishing module includes:
the identifying unit is used for identifying the language logic expression relation among words and sentences in each text data in the sample data set;
the combination unit is used for carrying out cross combination on words and sentences in at least two text data according to the language logic expression relationship to obtain new words and sentences;
and the establishing unit is used for establishing the deep learning model according to the language logic expression relation and the new words and sentences.
In one exemplary embodiment, the generating module includes:
a first obtaining unit, configured to obtain keyword information of the subject;
the query unit is used for querying target words and sentences conforming to the preset description content of the keywords from words and sentences stored in advance in the deep learning model of the theme;
and the control unit is used for controlling the deep learning model to arrange and combine the target words and sentences according to the language logic expression relation acquired in advance to obtain text data corresponding to the keywords.
In an exemplary embodiment, the generating module further includes:
the second acquisition unit is used for acquiring target text data comprising the keywords before the target words and sentences are arranged and combined;
and the processing unit is used for identifying the target text data by using the deep learning model to obtain the language logic expression relation of the target text data, and performing cross combination on words and sentences in at least two pieces of target text data according to that relation to obtain new target words and sentences.
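The second acquisition unit's selection step can be sketched in a few lines (an illustrative assumption, not the claimed implementation): from the available texts, keep only those that mention the requested keyword before deriving new target phrases from them.

```python
def select_target_texts(texts, keyword):
    """Keep only the texts that contain the keyword (case-insensitive)."""
    return [t for t in texts if keyword.lower() in t.lower()]

samples = [
    "The museum opens at nine.",
    "Visitors love the museum garden.",
    "The library closes early on Sundays.",
]
targets = select_target_texts(samples, "museum")
```

The resulting target texts would then be fed to the processing unit for identification and cross combination.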
According to the device provided by the embodiment of the application, text data corresponding to a preset theme is acquired to obtain a sample data set of the theme. A deep learning model for generating text data of the theme is established by using this sample data set, and the language logic expression relation is recorded in the deep learning model. After a text generation request for the theme is received, the corresponding text data is generated by using the deep learning model of the theme. Because the deep learning model is trained on the acquired sample data set and the text of the appointed theme is generated by using the language logic expression relation in the model, the readability of the generated text is improved.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and functional modules/units in the apparatus and methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Discs (DVDs) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
Claims (6)
1. A method for generating text data, comprising:
acquiring text data corresponding to a preset theme, and obtaining a sample data set of the theme;
establishing a deep learning model for generating text data of the theme by using the sample data set of the theme, wherein the deep learning model records language logic expression relations;
after receiving a text generation request for the theme, generating corresponding text data by using a deep learning model of the theme;
wherein the establishing a deep learning model for generating text data of the theme by using the sample data set of the theme comprises:
identifying language logic expression relations among words and sentences in each text data in the sample data set;
according to the language logic expression relationship, performing cross combination on words and sentences in at least two text data to obtain new words and sentences;
establishing the deep learning model according to the language logic expression relation and the new words and sentences;
the generating corresponding text data by using the deep learning model of the theme comprises the following steps:
acquiring keywords of the theme;
inquiring target words and sentences conforming to the preset description content of the keywords from words and sentences stored in advance in the deep learning model of the theme;
and controlling the deep learning model to arrange and combine the target words and sentences according to the language logic expression relation obtained in advance to obtain text data corresponding to the keywords.
2. The method according to claim 1, wherein the obtaining text data corresponding to a preset theme includes:
acquiring text data on a website by using a preset text acquisition tool;
classifying the acquired text data according to the theme of the text data to obtain the text data corresponding to the theme.
3. The method of claim 1, wherein before controlling the deep learning model to arrange and combine the target words and sentences according to the pre-acquired language logic expression relation, the method further comprises:
acquiring target text data comprising the keywords;
and identifying the target text data by using the deep learning model to obtain the language logic expression relation of the target text data, and carrying out cross combination on words and sentences in at least two target text data according to the language logic expression relation of the target text data to obtain new target words and sentences.
4. A text data generating apparatus, comprising:
the acquisition module is used for acquiring text data corresponding to a preset theme and obtaining a sample data set of the theme;
the establishing module is used for establishing a deep learning model for generating text data of the theme by using the sample data set of the theme, wherein the deep learning model records language logic expression relations;
the generation module is used for generating corresponding text data by utilizing the deep learning model of the theme after receiving the text generation request of the theme;
wherein, the establishment module includes:
the identifying unit is used for identifying the language logic expression relation among words and sentences in each text data in the sample data set;
the combination unit is used for carrying out cross combination on words and sentences in at least two text data according to the language logic expression relationship to obtain new words and sentences;
the establishing unit is used for establishing the deep learning model according to the language logic expression relation and the new words and sentences;
wherein, the generating module includes:
a first obtaining unit, configured to obtain keywords of the theme;
the query unit is used for querying target words and sentences that match the preset description content of the keywords from the words and sentences stored in advance in the deep learning model of the theme;
and the control unit is used for controlling the deep learning model to arrange and combine the target words and sentences according to the language logic expression relation acquired in advance to obtain text data corresponding to the keywords.
5. The apparatus of claim 4, wherein the acquisition module comprises:
the acquisition unit is used for acquiring text data on a website by using a preset text acquisition tool;
and the classification unit is used for classifying the acquired text data according to the theme of the text data to obtain the text data corresponding to the theme.
6. The apparatus of claim 4, wherein the generating module further comprises:
the second acquisition unit is used for acquiring target text data comprising the keywords before the target words and sentences are arranged and combined;
and the processing unit is used for identifying the target text data by utilizing the deep learning model to obtain the language logic expression relation of the target text data, and carrying out cross combination on words and sentences in at least two target text data according to the language logic expression relation of the target text data to obtain new target words and sentences.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010166957.9A CN111414735B (en) | 2020-03-11 | 2020-03-11 | Text data generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111414735A CN111414735A (en) | 2020-07-14 |
CN111414735B true CN111414735B (en) | 2024-03-22 |
Family
ID=71491096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010166957.9A Active CN111414735B (en) | 2020-03-11 | 2020-03-11 | Text data generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111414735B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111984845B (en) * | 2020-08-17 | 2023-10-31 | 江苏百达智慧网络科技有限公司 | Website wrongly written word recognition method and system |
CN112699643B (en) * | 2020-12-23 | 2024-04-19 | 车智互联(北京)科技有限公司 | Method for generating language model and automatic article generation method |
CN117033934B (en) * | 2023-08-02 | 2024-04-19 | 中信联合云科技有限责任公司 | Content generation method and device based on artificial intelligence |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644085A (en) * | 2017-09-22 | 2018-01-30 | 百度在线网络技术(北京)有限公司 | The generation method and device of competitive sports news |
CN107797982A (en) * | 2016-08-31 | 2018-03-13 | 百度在线网络技术(北京)有限公司 | For identifying the method, apparatus and equipment of text type |
DE102019000433A1 (en) * | 2018-04-23 | 2019-10-24 | Adobe Inc. | Generate a topic-based summary of a text content |
CN110750975A (en) * | 2019-10-21 | 2020-02-04 | 北京明略软件***有限公司 | Introduction text generation method and device |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107797982A (en) * | 2016-08-31 | 2018-03-13 | 百度在线网络技术(北京)有限公司 | For identifying the method, apparatus and equipment of text type |
CN107644085A (en) * | 2017-09-22 | 2018-01-30 | 百度在线网络技术(北京)有限公司 | The generation method and device of competitive sports news |
DE102019000433A1 (en) * | 2018-04-23 | 2019-10-24 | Adobe Inc. | Generate a topic-based summary of a text content |
CN110390009A (en) * | 2018-04-23 | 2019-10-29 | 奥多比公司 | Generate the summary based on theme of content of text |
CN110750975A (en) * | 2019-10-21 | 2020-02-04 | 北京明略软件***有限公司 | Introduction text generation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111414735A (en) | 2020-07-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111859960B (en) | Semantic matching method, device, computer equipment and medium based on knowledge distillation | |
CN111414735B (en) | Text data generation method and device | |
CN111831790B (en) | False news identification method based on low threshold integration and text content matching | |
US20200057807A1 (en) | Systems and methods providing a cognitive augmented memory network | |
CN112579733B (en) | Rule matching method, rule matching device, storage medium and electronic equipment | |
CN111339250A (en) | Mining method of new category label, electronic equipment and computer readable medium | |
CN109508448A (en) | Short information method, medium, device are generated based on long article and calculate equipment | |
Balli et al. | Sentimental analysis of Twitter users from Turkish content with natural language processing | |
US10595098B2 (en) | Derivative media content systems and methods | |
CN115221294A (en) | Dialogue processing method, dialogue processing device, electronic equipment and storage medium | |
US10499121B2 (en) | Derivative media content systems and methods | |
CN113934834A (en) | Question matching method, device, equipment and storage medium | |
CN117332789A (en) | Semantic analysis method and system for dialogue scene | |
CN110377706B (en) | Search sentence mining method and device based on deep learning | |
CN111736804A (en) | Method and device for identifying App key function based on user comment | |
CN112199954A (en) | Disease entity matching method and device based on voice semantics and computer equipment | |
CN115858776B (en) | Variant text classification recognition method, system, storage medium and electronic equipment | |
CN116909435A (en) | Data processing method and device, electronic equipment and storage medium | |
US20230035641A1 (en) | Multi-hop evidence pursuit | |
CN116055825A (en) | Method and device for generating video title | |
CN115964997A (en) | Confusion option generation method and device for choice questions, electronic equipment and storage medium | |
CN116976341A (en) | Entity identification method, entity identification device, electronic equipment, storage medium and program product | |
CN113946668A (en) | Semantic processing method, system and device based on edge node and storage medium | |
CN114741088A (en) | App source code linking method based on user comments and developer intelligence | |
CN110276001B (en) | Checking page identification method and device, computing equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||