CN111859950A - Method for automatically generating lecture notes - Google Patents

Method for automatically generating lecture notes

Info

Publication number
CN111859950A
Authority
CN
China
Prior art keywords
lecture
sentence
paragraph
candidate
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010559615.3A
Other languages
Chinese (zh)
Inventor
王子奕
王文广
陈运文
贺梦洁
王忠萌
纪达麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datagrand Tech Inc
Original Assignee
Datagrand Tech Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datagrand Tech Inc filed Critical Datagrand Tech Inc
Priority to CN202010559615.3A priority Critical patent/CN111859950A/en
Publication of CN111859950A publication Critical patent/CN111859950A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically generating lecture notes, which comprises: obtaining related texts from the Internet according to the subject words of a lecture script, processing the texts to generate candidate documents classified by keyword, finding a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph, and sampling sentences from the candidate documents to generate the lecture paragraph content. The method can quickly generate natural language text of high quality, good readability and a certain length.

Description

Method for automatically generating lecture notes
Technical Field
The invention belongs to the field of text processing, and particularly relates to a method for automatically generating a lecture manuscript.
Background
With the rapid development of natural language understanding, text generation, a core area of natural language processing, is attracting increasing attention from researchers. From the perspective of task input, text generation can be roughly divided into four broad categories: text-to-text generation, meaning-to-text generation, data-to-text generation, and image-to-text generation. In most scenarios the generation of lecture scripts is closer to the second and third categories: the user only needs to provide a small amount of input information, and the system automatically generates text that satisfies the constraints.
Unlike the sentence compression task, which filters out redundant components and retains important information, the input for lecture generation often contains only a few semantic fragments, while the output is required to be a natural language text of high quality, good readability and a certain length, which makes the task technically challenging. Extracting semantic representations from the user input alone is largely ineffective for this type of problem, so a large amount of external information is typically required.
The template method is commonly used in lecture generation: templates reserve fragments for the user to fill in, so text can be generated quickly, but maintaining a large number of templates requires considerable manpower, and because topics vary widely the diversity of the generated content is still hard to guarantee. Generation models based on deep learning suffer from low decoding efficiency, uncontrollable results and other defects, and in practice only a small amount of labeled data is available in any specific field, so the effect of supervised learning is limited.
Disclosure of Invention
The invention provides a method for automatically generating a lecture script that addresses the problems in the prior art. Some embodiments of the invention extract sentences that fit a given subject and keywords from a large corpus and organize them into complete chapters, which to a certain extent overcomes the lack of diversity of the traditional template method and avoids the uncontrollable output of generative models. The invention comprises the following steps: S1, start a crawler module to request the specified URLs and download the original corpus; S2, extract the several highest-scoring sentences from each text, map them to preset subject keywords through certain rules, and store them in a database as a candidate document; S3, parse parameters such as the lecture theme, the paragraph keywords and the paragraph word counts from the user configuration, randomly sample candidate documents and candidate sentences under certain constraints to form paragraphs, and splice the paragraphs into the final lecture script for output.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for automatically generating a lecture script, the method comprising the following steps:
obtaining related texts from the Internet according to the subject words of the lecture script; processing the texts to generate candidate documents classified by keyword; finding a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph; and sampling sentences from the candidate documents to generate the lecture paragraph content.
The processing the text comprises: combining Word2Vec and TextRank algorithm to keep a plurality of sentences with the highest importance scores in the text as a candidate document
The processing of the text comprises the following steps:
1) splitting the text into sentences;
2) performing word segmentation on each sentence and taking the mean of the Word2Vec word vectors as the semantic representation of the sentence;
3) calculating the sentence similarity matrix;
4) computing the importance score of each sentence according to the TextRank iterative formula and sorting the sentences from high to low;
5) taking the several sentences with the highest importance scores to form a candidate document.
Sampling sentences from the candidate documents to generate the lecture paragraph content comprises the following steps:
1) initializing the paragraph sentence set to be empty;
2) querying the database for all candidate documents corresponding to the keywords of this paragraph;
3) randomly sampling a candidate document from the candidate document set and removing it from the set;
4) randomly sampling a sentence from the candidate document selected in step 3) and adding it to the paragraph sentence set;
5) checking whether the number of words in the paragraph sentence set has reached the maximum limit; if so, ending the process and splicing the sentences in the paragraph sentence set in order to output the paragraph content; otherwise, performing step 6);
6) checking whether the previously sampled sentence is the last sentence of its candidate document; if so, returning to step 3); otherwise, performing step 7);
7) with a preset probability p, continuing to sample the next sentence from the content after the previously selected sentence in the current candidate document and adding it to the paragraph sentence set; with probability 1-p, jumping out of the current candidate document and returning to step 3).
An electronic device, comprising:
a processor; and
a memory having stored therein processor-executable instructions;
wherein the processor implements any of the methods by executing the executable instructions.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the methods.
A system for automatically generating a lecture script, the system comprising:
a web crawler module that acquires related texts from the Internet according to the subject words of the lecture script;
an information extraction module that processes the texts to generate candidate documents classified by keyword; and
a lecture generation module that finds a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph and samples sentences from the candidate documents to generate the lecture paragraph content.
Compared with the prior art, the invention has the beneficial effect that it can quickly generate natural language text of high quality, good readability and a certain length.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic overall flow chart of the embodiment of the present invention.
Fig. 2 is a schematic flow chart of the information extraction module.
Fig. 3 is a schematic flow chart of the lecture generation module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in Figs. 1 to 3, in the embodiment of the present invention, important information under each topic is first extracted from a large number of lecture script templates and news corpora on the network and stored in a database; then, according to the user configuration, suitable sentences are selected from the candidate documents expressing the given topics and keywords by a combination of rules and sampling and combined into paragraphs, and the paragraphs are spliced into a lecture script. The system includes:
1) the web crawler module downloads the original corpus according to the subject terms and the URLs;
2) the information extraction module is used for extracting key information from the unstructured text;
3) the lecture generation module, which samples documents and sentences to generate the article.
The user-provided inputs mainly include the theme of the lecture script, the keywords of each paragraph and the word count of each paragraph. The system starts the crawler module (optional), the extraction module (optional) and the generation module in sequence, and finally returns the generated lecture script.
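As an illustration, such a configuration could be represented by a simple structure like the one below; the field names and values are hypothetical, since the invention does not prescribe any particular format:

# Hypothetical user configuration; the field names are illustrative only.
user_config = {
    "theme": "artificial intelligence",                 # subject words of the lecture script
    "paragraphs": [
        {"keywords": ["deep learning", "applications"], "max_words": 300},
        {"keywords": ["future trends"], "max_words": 200},
    ],
    "run_crawler": False,       # optional web crawler module
    "run_extraction": False,    # optional information extraction module
}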
Web crawler module
Users can add subject words and related web page URLs themselves and start the crawler module to update the corpus periodically. The downloaded plain texts are stored in XML format, and each text is assigned a specific keyword through rule matching; this step can also be performed by an annotator.
The crawler module receives as input the subject words, the URLs and the keyword matching rules, and outputs texts annotated with the subject word and keyword information.
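A minimal sketch of this crawling and labeling step is shown below; it assumes the requests and BeautifulSoup libraries, the standard-library XML module, and a simple trigger-word rule for keyword assignment, none of which are mandated by the invention:

import requests
from bs4 import BeautifulSoup
from xml.etree import ElementTree as ET

def crawl_to_xml(subject_word, urls, keyword_rules, out_path):
    # Download the pages for one subject word and store their plain text as XML.
    # keyword_rules maps a keyword to a list of trigger words; the first rule whose
    # trigger appears in the text decides the keyword (a stand-in for the rule
    # matching described above).
    root = ET.Element("corpus", subject=subject_word)
    for url in urls:
        html = requests.get(url, timeout=10).text
        text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
        keyword = next((kw for kw, triggers in keyword_rules.items()
                        if any(t in text for t in triggers)), "unlabeled")
        doc = ET.SubElement(root, "document", url=url, keyword=keyword)
        doc.text = text
    ET.ElementTree(root).write(out_path, encoding="utf-8")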
Information extraction module
The information extraction module is mainly used for extracting important information from the corpus obtained in the previous step. For each text, the Word2Vec + TextRank algorithm is used to keep the several sentences with the highest importance scores as a candidate document expressing the topic keyword.
TextRank is a graph algorithm: each sentence in the text is regarded as a node of the graph, and if two sentences are similar an undirected weighted edge is considered to exist between the two corresponding nodes, with the sentence similarity as its weight. Message passing is performed on this graph until a steady state is reached (a threshold is set, and when the absolute change of every node score in the current iteration relative to the previous one falls below the threshold, the importance scores are considered stable). The iterative formula for the sentence score is:
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)
where d is the damping factor, In(V_i) is the set of nodes with edges into V_i, and Out(V_j) is the set of nodes that V_j points to (for the undirected sentence graph these are simply the neighbours of the node).
The sentence similarity w_{ji} above is, in the original TextRank algorithm, based on the number of co-occurring words. For example, the similarity of nodes V_i and V_j is calculated as:
Similarity(S_i, S_j) = \frac{|\{ w_k \mid w_k \in S_i \wedge w_k \in S_j \}|}{\log|S_i| + \log|S_j|}
The extraction module of this system instead introduces pre-trained Word2Vec word vectors and uses the cosine of the angle between the two sentence feature representations as the similarity, which gives better results because prior knowledge is incorporated. The representation s_i of sentence S_i is taken as the mean of its word vectors,
s_i = \frac{1}{|S_i|} \sum_{w \in S_i} \mathrm{vec}(w)
and the edge weight is the cosine similarity
w_{ji} = \cos(s_i, s_j) = \frac{s_i \cdot s_j}{\lVert s_i \rVert \, \lVert s_j \rVert}
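In code, the sentence representation and similarity above can be sketched as follows; word_vectors stands for any pre-trained Word2Vec lookup (for example a gensim KeyedVectors object), and tokenization is assumed to have already been done:

import numpy as np

def sentence_vector(tokens, word_vectors, dim=100):
    # Semantic representation of a sentence: mean of its Word2Vec word vectors.
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(a, b):
    # w_ji: cosine of the angle between two sentence representations.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0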
The information extraction module receives as input a text annotated with subject word and keyword information and outputs a compressed text. The main steps are as follows (a code sketch follows the list):
split the original text into sentences;
perform word segmentation on each sentence and take the mean of the Word2Vec word vectors as the semantic representation of the sentence;
calculate the sentence similarity matrix;
compute the importance score of each sentence according to the TextRank iterative formula and sort the sentences from high to low;
take the Top-N sentences to form the candidate document.
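Putting these steps together, the extraction module could be sketched roughly as below. The damping factor d = 0.85 and the convergence threshold are conventional TextRank defaults rather than values prescribed here, and sentence_vector / cosine_similarity are the helpers sketched earlier:

import numpy as np

def extract_candidate_document(sentences, word_vectors, top_n=5,
                               d=0.85, tol=1e-4, max_iter=100):
    # sentences: list of token lists, already sentence-split and word-segmented.
    # Returns the top_n sentences by TextRank importance as one candidate document.
    n = len(sentences)
    reps = [sentence_vector(toks, word_vectors) for toks in sentences]

    # Sentence similarity matrix: undirected weighted edges between sentence nodes.
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = cosine_similarity(reps[i], reps[j])

    # TextRank iteration until every score changes by less than tol.
    scores = np.ones(n)
    out_weight = sim.sum(axis=1)
    for _ in range(max_iter):
        new_scores = np.zeros(n)
        for i in range(n):
            incoming = sum(sim[j, i] / out_weight[j] * scores[j]
                           for j in range(n) if j != i and out_weight[j] > 0)
            new_scores[i] = (1 - d) + d * incoming
        converged = np.abs(new_scores - scores).max() < tol
        scores = new_scores
        if converged:
            break

    top_idx = sorted(np.argsort(scores)[::-1][:top_n])  # keep original sentence order
    return [sentences[i] for i in top_idx]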
Lecture generation module
The generation module parses the user configuration, including the selected theme, the keywords chosen for each paragraph, the word count of each paragraph and so on, and outputs a lecture script composed of the generated paragraphs. Each paragraph is generated according to the following steps (a sampling sketch follows the list):
1) initialize the paragraph sentence set to be empty;
2) query the database for all candidate documents corresponding to the keywords of this paragraph;
3) randomly sample a document from the candidate document set and remove it from the set;
4) randomly sample a sentence from the document selected in step 3) and add it to the paragraph sentence set;
5) check whether the number of words in the paragraph sentence set has reached the maximum limit; if so, end the process and splice the sentences in the set in order to output the paragraph content; otherwise, perform step 6);
6) check whether the previously sampled sentence is the last sentence of its document; if so, return to step 3); otherwise, perform step 7);
7) with probability p, continue sampling the next sentence from the content after the previously selected sentence in the current document and add it to the paragraph sentence set; with probability 1-p, jump out of the current document and return to step 3).
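A minimal sketch of this paragraph-sampling procedure is given below; the database query is abstracted as an in-memory list of candidate documents (each a list of sentence strings), the word count is approximated by character length, and the continuation probability p is an illustrative default, since none of these details are fixed above:

import random

def generate_paragraph(candidate_docs, max_words, p=0.7, word_count=len):
    # candidate_docs: candidate documents for this paragraph's keyword, each a list of
    #                 sentence strings (normally queried from the database).
    # p: probability of continuing inside the current document (illustrative default).
    # word_count: length function for a sentence; len counts characters here, a rough
    #             stand-in for the word count used above.
    docs = list(candidate_docs)                       # 2) candidate document set
    paragraph = []                                    # 1) paragraph sentence set
    total = 0
    while docs:
        doc = docs.pop(random.randrange(len(docs)))   # 3) sample a document and remove it
        idx = random.randrange(len(doc))              # 4) sample a sentence from it
        while True:
            paragraph.append(doc[idx])
            total += word_count(doc[idx])
            if total >= max_words:                    # 5) word limit reached: splice and return
                return "".join(paragraph)
            if idx == len(doc) - 1:                   # 6) last sentence of this document
                break
            if random.random() < p:                   # 7) with probability p keep sampling from
                idx = random.randrange(idx + 1, len(doc))  # the content after the current sentence
            else:                                     #    with probability 1-p jump out
                break
    return "".join(paragraph)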
Further, suppose the user inputs the theme of a lecture script consisting of one natural paragraph, specifies the keywords of that paragraph, and sets the word count to 300. The following operations are performed:
1) choose whether to start the crawler module to obtain corpora related to the subject; the subject words can be searched in search engines and the URLs in the returned results captured, or the URLs can be designated by the user. After the corpora expressing the theme are downloaded, they are mapped to specific keywords according to certain rules;
2) choose whether to start the information extraction module to extract key information from the corpora of step 1); the extracted fragments are taken as candidate documents for generating the lecture script, and the candidate documents together with their subject word and keyword information are written into the database;
3) start the lecture generation module to generate text paragraph by paragraph. Because this embodiment has only one paragraph, all documents in the database whose subject word and keywords match the user input are queried, and the sampled documents and sentences are spliced into the paragraph, which is output as the generated lecture script (as illustrated in the sketch below).
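Under the same assumptions as the sketches above, the single-paragraph embodiment could be exercised roughly as follows, with candidate documents given inline instead of being queried from the database:

# Illustrative candidate documents for the user's theme and keyword; in the real
# flow these would be queried from the database written by the extraction step.
candidates = [
    ["Artificial intelligence is reshaping many industries. ",
     "Deep learning has driven most of the recent breakthroughs. ",
     "Its applications range from speech recognition to computer vision. "],
    ["Education also benefits from intelligent tutoring systems. ",
     "Automatic text generation can help teachers prepare material quickly. "],
]

paragraph = generate_paragraph(candidates, max_words=300)
print(paragraph)   # the single-paragraph lecture script of this embodiment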
Although the present invention has been described in detail with respect to the above embodiments, it will be understood by those skilled in the art that modifications or improvements based on the disclosure of the present invention may be made without departing from the spirit and scope of the invention, and these modifications and improvements are within the spirit and scope of the invention.

Claims (7)

1. A method for automatically generating a lecture script, characterized by comprising the following steps: obtaining related texts from the Internet according to the subject words of the lecture script; processing the texts to generate candidate documents classified by keyword; finding a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph; and sampling sentences from the candidate documents to generate the lecture paragraph content.
2. The method for automatically generating a lecture script according to claim 1, wherein the processing of the text comprises: combining the Word2Vec and TextRank algorithms to retain the several sentences with the highest importance scores in the text as a candidate document.
3. The method for automatically generating a lecture script according to claim 2, wherein the processing of the text comprises the following steps:
1) splitting the text into sentences;
2) performing word segmentation on each sentence and taking the mean of the Word2Vec word vectors as the semantic representation of the sentence;
3) calculating the sentence similarity matrix;
4) computing the importance score of each sentence according to the TextRank iterative formula and sorting the sentences from high to low;
5) taking the several sentences with the highest importance scores to form a candidate document.
4. The method for automatically generating a lecture script according to claim 1, wherein sampling sentences from the candidate documents to generate the lecture paragraph content comprises the following steps:
1) initializing the paragraph sentence set to be empty;
2) querying the database for all candidate documents corresponding to the keywords of this paragraph;
3) randomly sampling a candidate document from the candidate document set and removing it from the set;
4) randomly sampling a sentence from the candidate document selected in step 3) and adding it to the paragraph sentence set;
5) checking whether the number of words in the paragraph sentence set has reached the maximum limit; if so, ending the process and splicing the sentences in the paragraph sentence set in order to output the paragraph content; otherwise, performing step 6);
6) checking whether the previously sampled sentence is the last sentence of its candidate document; if so, returning to step 3); otherwise, performing step 7);
7) with a preset probability p, continuing to sample the next sentence from the content after the previously selected sentence in the current candidate document and adding it to the paragraph sentence set; with probability 1-p, jumping out of the current candidate document and returning to step 3).
5. An electronic device, comprising:
a processor; and
a memory having stored therein processor-executable instructions;
wherein the processor implements the method of any of claims 1-4 by executing the executable instructions.
6. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-4.
7. A system for automatically generating lecture scripts, the system comprising:
a web crawler module that acquires related texts from the Internet according to the subject words of the lecture script;
an information extraction module that processes the texts to generate candidate documents classified by keyword; and
a lecture generation module that finds a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph and samples sentences from the candidate documents to generate the lecture paragraph content.
CN202010559615.3A 2020-06-18 2020-06-18 Method for automatically generating lecture notes Pending CN111859950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010559615.3A CN111859950A (en) 2020-06-18 2020-06-18 Method for automatically generating lecture notes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010559615.3A CN111859950A (en) 2020-06-18 2020-06-18 Method for automatically generating lecture notes

Publications (1)

Publication Number Publication Date
CN111859950A true CN111859950A (en) 2020-10-30

Family

ID=72987421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010559615.3A Pending CN111859950A (en) 2020-06-18 2020-06-18 Method for automatically generating lecture notes

Country Status (1)

Country Link
CN (1) CN111859950A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
KR101508260B1 (en) * 2014-02-04 2015-04-07 성균관대학교산학협력단 Summary generation apparatus and method reflecting document feature
CN107077460A (en) * 2014-09-30 2017-08-18 微软技术许可有限责任公司 Structuring sample author content
CN105183710A (en) * 2015-06-23 2015-12-23 武汉传神信息技术有限公司 Method for automatically generating document summary
US20190354595A1 (en) * 2018-05-21 2019-11-21 Hcl Technologies Limited System and method for automatically summarizing documents pertaining to a predefined domain
CN110188349A (en) * 2019-05-21 2019-08-30 清华大学深圳研究生院 A kind of automation writing method based on extraction-type multiple file summarization method
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics
CN111177365A (en) * 2019-12-20 2020-05-19 山东科技大学 Unsupervised automatic abstract extraction method based on graph model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733515A (en) * 2020-12-31 2021-04-30 贝壳技术有限公司 Text generation method and device, electronic equipment and readable storage medium
CN116069936A (en) * 2023-02-28 2023-05-05 北京朗知网络传媒科技股份有限公司 Method and device for generating digital media article
CN116611417A (en) * 2023-05-26 2023-08-18 浙江兴旺宝明通网络有限公司 Automatic article generating method, system, computer equipment and storage medium
CN116611417B (en) * 2023-05-26 2023-11-21 浙江兴旺宝明通网络有限公司 Automatic article generating method, system, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106776544B (en) Character relation recognition method and device and word segmentation method
CN107423282B (en) Method for concurrently extracting semantic consistency subject and word vector in text based on mixed features
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN112214593A (en) Question and answer processing method and device, electronic equipment and storage medium
CN108538294B (en) Voice interaction method and device
CN111859950A (en) Method for automatically generating lecture notes
CN111930929A (en) Article title generation method and device and computing equipment
CN112101041A (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN115048944A (en) Open domain dialogue reply method and system based on theme enhancement
CN112528654A (en) Natural language processing method and device and electronic equipment
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111159405A (en) Irony detection method based on background knowledge
CN112528653B (en) Short text entity recognition method and system
CN112445862B (en) Internet of things equipment data set construction method and device, electronic equipment and storage medium
CN110750632B (en) Improved Chinese ALICE intelligent question-answering method and system
CN112633007A (en) Semantic understanding model construction method and device and semantic understanding method and device
CN110888976B (en) Text abstract generation method and device
CN113609287A (en) Text abstract generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination