CN111859950A - Method for automatically generating lecture notes - Google Patents

Method for automatically generating lecture notes

Info

Publication number
CN111859950A
Authority
CN
China
Prior art keywords
lecture
sentence
paragraph
candidate
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010559615.3A
Other languages
Chinese (zh)
Inventor
王子奕
王文广
陈运文
贺梦洁
王忠萌
纪达麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datagrand Tech Inc
Original Assignee
Datagrand Tech Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datagrand Tech Inc filed Critical Datagrand Tech Inc
Priority to CN202010559615.3A priority Critical patent/CN111859950A/en
Publication of CN111859950A publication Critical patent/CN111859950A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically generating lecture notes, which comprises: obtaining related texts from the Internet according to the subject words of a lecture script, processing the texts to generate candidate documents classified by keyword, finding a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph, and sampling sentences from the candidate documents to generate the lecture paragraph content. The method can quickly generate natural language text of high quality, good readability and a certain length.

Description

Method for automatically generating lecture notes
Technical Field
The invention belongs to the field of text processing, and particularly relates to a method for automatically generating a lecture manuscript.
Background
With the rapid development of natural language understanding, text generation, a core area of natural language processing, is attracting increasing attention from researchers. From the perspective of task input, text generation can be roughly divided into four broad categories: text-to-text generation, meaning-to-text generation, data-to-text generation, and image-to-text generation. In most scenarios the generation of lecture scripts is closer to the second and third categories: the user only needs to provide a small amount of input information, and the system automatically generates text that satisfies the constraints.
Unlike the sentence compression task, which filters out redundant components and retains important information, the input for lecture generation often contains only a few semantic fragments, while the output is required to be a natural language text of high quality, good readability and a certain length, which makes the task technically challenging. Extracting semantic representations from the user input alone is largely ineffective for this type of problem, so a large amount of external information is typically required.
The template method is commonly used in lecture generation: templates reserve fragments for the user to fill in, so text can be generated quickly, but maintaining a large number of templates requires considerable manpower, and because topics vary widely the diversity of the generated content is still hard to guarantee. Generation models based on deep learning suffer from low decoding efficiency, uncontrollable results and other defects, and in practice only a small amount of labeled data is available in any specific field, so the effect of supervised learning is limited.
Disclosure of Invention
The invention provides a method for automatically generating a lecture script that addresses the problems in the prior art. Some embodiments of the invention extract sentences that fit a given subject and keywords from a large corpus and organize them into complete chapters, which to a certain extent overcomes the lack of diversity of the traditional template method and avoids the uncontrollable output of generative models. The invention comprises the following steps: S1, start a crawler module to request the specified URLs and download the original corpus; S2, extract the several highest-scoring sentences from each text, map them to preset subject keywords through certain rules, and store them in a database as a candidate document; S3, parse parameters such as the lecture theme, the paragraph keywords and the paragraph word counts from the user configuration, randomly sample candidate documents and candidate sentences under certain constraints to form paragraphs, and splice the paragraphs into the final lecture script for output.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for automatically generating a lecture script, the method comprising the following steps:
obtaining related texts from the Internet according to the subject words of the lecture script; processing the texts to generate candidate documents classified by keyword; finding a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph; and sampling sentences from the candidate documents to generate the lecture paragraph content.
The processing the text comprises: combining Word2Vec and TextRank algorithm to keep a plurality of sentences with the highest importance scores in the text as a candidate document
The processing of the text comprises the following steps:
1) splitting the text into sentences;
2) performing word segmentation on each sentence and taking the mean of the Word2Vec word vectors as the semantic representation of the sentence;
3) calculating the sentence similarity matrix;
4) computing the importance score of each sentence according to the TextRank iterative formula and sorting the sentences from high to low;
5) taking the several sentences with the highest importance scores to form a candidate document.
Sampling sentences from the candidate documents to generate the lecture paragraph content comprises the following steps:
1) initializing the paragraph sentence set to be empty;
2) querying the database for all candidate documents corresponding to the keywords of this paragraph;
3) randomly sampling a candidate document from the candidate document set and removing it from the set;
4) randomly sampling a sentence from the candidate document selected in step 3) and adding it to the paragraph sentence set;
5) checking whether the number of words in the paragraph sentence set has reached the maximum limit; if so, ending the process and splicing the sentences in the paragraph sentence set in order to output the paragraph content; otherwise, performing step 6);
6) checking whether the previously sampled sentence is the last sentence of its candidate document; if so, returning to step 3); otherwise, performing step 7);
7) with a preset probability p, continuing to sample the next sentence from the content after the previously selected sentence in the current candidate document and adding it to the paragraph sentence set; with probability 1-p, jumping out of the current candidate document and returning to step 3).
An electronic device, comprising:
a processor; and
a memory having stored therein processor-executable instructions;
wherein the processor implements any of the methods by executing the executable instructions.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the methods.
A system for automatically generating a lecture script, the system comprising:
a web crawler module that acquires related texts from the Internet according to the subject words of the lecture script;
an information extraction module that processes the texts to generate candidate documents classified by keyword; and
a lecture generation module that finds a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph and samples sentences from the candidate documents to generate the lecture paragraph content.
Compared with the prior art, the invention has the beneficial effect that it can quickly generate natural language text of high quality, good readability and a certain length.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic overall flow chart of the embodiment of the present invention.
Fig. 2 is a schematic flow chart of the information extraction module.
Fig. 3 is a schematic flow chart of the lecture generation module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in Figs. 1 to 3, in the embodiment of the present invention, important information under each topic is first extracted from a large number of lecture script templates and news corpora on the network and stored in a database; then, according to the user configuration, suitable sentences are selected from the candidate documents expressing the given topics and keywords by a combination of rules and sampling and combined into paragraphs, and the paragraphs are spliced into a lecture script. The system includes:
1) the web crawler module downloads the original corpus according to the subject terms and the URLs;
2) the information extraction module is used for extracting key information from the unstructured text;
3) the lecture generation module, which samples documents and sentences to generate the article.
The user-provided inputs mainly include the theme of the lecture script, the keywords of each paragraph and the word count of each paragraph. The system starts the crawler module (optional), the extraction module (optional) and the generation module in sequence, and finally returns the generated lecture script.
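As an illustration, such a configuration could be represented by a simple structure like the one below; the field names and values are hypothetical, since the invention does not prescribe any particular format:

# Hypothetical user configuration; the field names are illustrative only.
user_config = {
    "theme": "artificial intelligence",                 # subject words of the lecture script
    "paragraphs": [
        {"keywords": ["deep learning", "applications"], "max_words": 300},
        {"keywords": ["future trends"], "max_words": 200},
    ],
    "run_crawler": False,       # optional web crawler module
    "run_extraction": False,    # optional information extraction module
}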
Web crawler module
Users can add subject words and related web page URLs themselves and start the crawler module to update the corpus periodically. The downloaded plain texts are stored in XML format, and each text is assigned a specific keyword through rule matching; this step can also be performed by an annotator.
The crawler module receives as input the subject words, the URLs and the keyword matching rules, and outputs texts annotated with the subject word and keyword information.
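A minimal sketch of this crawling and labeling step is shown below; it assumes the requests and BeautifulSoup libraries, the standard-library XML module, and a simple trigger-word rule for keyword assignment, none of which are mandated by the invention:

import requests
from bs4 import BeautifulSoup
from xml.etree import ElementTree as ET

def crawl_to_xml(subject_word, urls, keyword_rules, out_path):
    # Download the pages for one subject word and store their plain text as XML.
    # keyword_rules maps a keyword to a list of trigger words; the first rule whose
    # trigger appears in the text decides the keyword (a stand-in for the rule
    # matching described above).
    root = ET.Element("corpus", subject=subject_word)
    for url in urls:
        html = requests.get(url, timeout=10).text
        text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
        keyword = next((kw for kw, triggers in keyword_rules.items()
                        if any(t in text for t in triggers)), "unlabeled")
        doc = ET.SubElement(root, "document", url=url, keyword=keyword)
        doc.text = text
    ET.ElementTree(root).write(out_path, encoding="utf-8")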
Information extraction module
The information extraction module is mainly used for extracting important information from the corpus obtained in the previous step. For each text, the Word2Vec + TextRank algorithm is used to keep the several sentences with the highest importance scores as a candidate document expressing the topic keyword.
TextRank is a graph algorithm: each sentence in the text is regarded as a node of the graph, and if two sentences are similar an undirected weighted edge is considered to exist between the two corresponding nodes, with the sentence similarity as its weight. Message passing is performed on this graph until a steady state is reached (a threshold is set, and when the absolute change of every node score in the current iteration relative to the previous one falls below the threshold, the importance scores are considered stable). The iterative formula for the sentence score is:
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)
where d is the damping factor, In(V_i) is the set of nodes with edges into V_i, and Out(V_j) is the set of nodes that V_j points to (for the undirected sentence graph these are simply the neighbours of the node).
The sentence similarity w_{ji} above is, in the original TextRank algorithm, based on the number of co-occurring words. For example, the similarity of nodes V_i and V_j is calculated as:
Similarity(S_i, S_j) = \frac{|\{ w_k \mid w_k \in S_i \wedge w_k \in S_j \}|}{\log|S_i| + \log|S_j|}
The extraction module of this system instead introduces pre-trained Word2Vec word vectors and uses the cosine of the angle between the two sentence feature representations as the similarity, which gives better results because prior knowledge is incorporated. The representation s_i of sentence S_i is taken as the mean of its word vectors,
s_i = \frac{1}{|S_i|} \sum_{w \in S_i} \mathrm{vec}(w)
and the edge weight is the cosine similarity
w_{ji} = \cos(s_i, s_j) = \frac{s_i \cdot s_j}{\lVert s_i \rVert \, \lVert s_j \rVert}
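In code, the sentence representation and similarity above can be sketched as follows; word_vectors stands for any pre-trained Word2Vec lookup (for example a gensim KeyedVectors object), and tokenization is assumed to have already been done:

import numpy as np

def sentence_vector(tokens, word_vectors, dim=100):
    # Semantic representation of a sentence: mean of its Word2Vec word vectors.
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(a, b):
    # w_ji: cosine of the angle between two sentence representations.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0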
The information extraction module receives as input a text annotated with subject word and keyword information and outputs a compressed text. The main steps are as follows (a code sketch follows the list):
split the original text into sentences;
perform word segmentation on each sentence and take the mean of the Word2Vec word vectors as the semantic representation of the sentence;
calculate the sentence similarity matrix;
compute the importance score of each sentence according to the TextRank iterative formula and sort the sentences from high to low;
take the Top-N sentences to form the candidate document.
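Putting these steps together, the extraction module could be sketched roughly as below. The damping factor d = 0.85 and the convergence threshold are conventional TextRank defaults rather than values prescribed here, and sentence_vector / cosine_similarity are the helpers sketched earlier:

import numpy as np

def extract_candidate_document(sentences, word_vectors, top_n=5,
                               d=0.85, tol=1e-4, max_iter=100):
    # sentences: list of token lists, already sentence-split and word-segmented.
    # Returns the top_n sentences by TextRank importance as one candidate document.
    n = len(sentences)
    reps = [sentence_vector(toks, word_vectors) for toks in sentences]

    # Sentence similarity matrix: undirected weighted edges between sentence nodes.
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = cosine_similarity(reps[i], reps[j])

    # TextRank iteration until every score changes by less than tol.
    scores = np.ones(n)
    out_weight = sim.sum(axis=1)
    for _ in range(max_iter):
        new_scores = np.zeros(n)
        for i in range(n):
            incoming = sum(sim[j, i] / out_weight[j] * scores[j]
                           for j in range(n) if j != i and out_weight[j] > 0)
            new_scores[i] = (1 - d) + d * incoming
        converged = np.abs(new_scores - scores).max() < tol
        scores = new_scores
        if converged:
            break

    top_idx = sorted(np.argsort(scores)[::-1][:top_n])  # keep original sentence order
    return [sentences[i] for i in top_idx]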
Lecture generation module
The generation module parses the user configuration, including the selected theme, the keywords chosen for each paragraph, the word count of each paragraph and so on, and outputs a lecture script composed of the generated paragraphs. Each paragraph is generated according to the following steps (a sampling sketch follows the list):
1) initialize the paragraph sentence set to be empty;
2) query the database for all candidate documents corresponding to the keywords of this paragraph;
3) randomly sample a document from the candidate document set and remove it from the set;
4) randomly sample a sentence from the document selected in step 3) and add it to the paragraph sentence set;
5) check whether the number of words in the paragraph sentence set has reached the maximum limit; if so, end the process and splice the sentences in the set in order to output the paragraph content; otherwise, perform step 6);
6) check whether the previously sampled sentence is the last sentence of its document; if so, return to step 3); otherwise, perform step 7);
7) with probability p, continue sampling the next sentence from the content after the previously selected sentence in the current document and add it to the paragraph sentence set; with probability 1-p, jump out of the current document and return to step 3).
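A minimal sketch of this paragraph-sampling procedure is given below; the database query is abstracted as an in-memory list of candidate documents (each a list of sentence strings), the word count is approximated by character length, and the continuation probability p is an illustrative default, since none of these details are fixed above:

import random

def generate_paragraph(candidate_docs, max_words, p=0.7, word_count=len):
    # candidate_docs: candidate documents for this paragraph's keyword, each a list of
    #                 sentence strings (normally queried from the database).
    # p: probability of continuing inside the current document (illustrative default).
    # word_count: length function for a sentence; len counts characters here, a rough
    #             stand-in for the word count used above.
    docs = list(candidate_docs)                       # 2) candidate document set
    paragraph = []                                    # 1) paragraph sentence set
    total = 0
    while docs:
        doc = docs.pop(random.randrange(len(docs)))   # 3) sample a document and remove it
        idx = random.randrange(len(doc))              # 4) sample a sentence from it
        while True:
            paragraph.append(doc[idx])
            total += word_count(doc[idx])
            if total >= max_words:                    # 5) word limit reached: splice and return
                return "".join(paragraph)
            if idx == len(doc) - 1:                   # 6) last sentence of this document
                break
            if random.random() < p:                   # 7) with probability p keep sampling from
                idx = random.randrange(idx + 1, len(doc))  # the content after the current sentence
            else:                                     #    with probability 1-p jump out
                break
    return "".join(paragraph)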
Further, suppose the user inputs the theme of a lecture script consisting of one natural paragraph, specifies the keywords of that paragraph, and sets the word count to 300. The following operations are performed:
1) choose whether to start the crawler module to obtain corpora related to the subject; the subject words can be searched in search engines and the URLs in the returned results captured, or the URLs can be designated by the user. After the corpora expressing the theme are downloaded, they are mapped to specific keywords according to certain rules;
2) choose whether to start the information extraction module to extract key information from the corpora of step 1); the extracted fragments are taken as candidate documents for generating the lecture script, and the candidate documents together with their subject word and keyword information are written into the database;
3) start the lecture generation module to generate text paragraph by paragraph. Because this embodiment has only one paragraph, all documents in the database whose subject word and keywords match the user input are queried, and the sampled documents and sentences are spliced into the paragraph, which is output as the generated lecture script (as illustrated in the sketch below).
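Under the same assumptions as the sketches above, the single-paragraph embodiment could be exercised roughly as follows, with candidate documents given inline instead of being queried from the database:

# Illustrative candidate documents for the user's theme and keyword; in the real
# flow these would be queried from the database written by the extraction step.
candidates = [
    ["Artificial intelligence is reshaping many industries. ",
     "Deep learning has driven most of the recent breakthroughs. ",
     "Its applications range from speech recognition to computer vision. "],
    ["Education also benefits from intelligent tutoring systems. ",
     "Automatic text generation can help teachers prepare material quickly. "],
]

paragraph = generate_paragraph(candidates, max_words=300)
print(paragraph)   # the single-paragraph lecture script of this embodiment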
Although the present invention has been described in detail with respect to the above embodiments, it will be understood by those skilled in the art that modifications or improvements based on the disclosure of the present invention may be made without departing from the spirit and scope of the invention, and these modifications and improvements are within the spirit and scope of the invention.

Claims (7)

1. A method for automatically generating a lecture script, characterized by comprising the following steps: obtaining related texts from the Internet according to the subject words of the lecture script; processing the texts to generate candidate documents classified by keyword; finding a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph; and sampling sentences from the candidate documents to generate the lecture paragraph content.
2. The method for automatically generating a lecture script according to claim 1, wherein the processing of the text comprises: combining the Word2Vec and TextRank algorithms to retain the several sentences with the highest importance scores in the text as a candidate document.
3. The method for automatically generating a lecture script according to claim 2, wherein the processing of the text comprises the following steps:
1) splitting the text into sentences;
2) performing word segmentation on each sentence and taking the mean of the Word2Vec word vectors as the semantic representation of the sentence;
3) calculating the sentence similarity matrix;
4) computing the importance score of each sentence according to the TextRank iterative formula and sorting the sentences from high to low;
5) taking the several sentences with the highest importance scores to form a candidate document.
4. The method for automatically generating a lecture script according to claim 1, wherein sampling sentences from the candidate documents to generate the lecture paragraph content comprises the following steps:
1) initializing the paragraph sentence set to be empty;
2) querying the database for all candidate documents corresponding to the keywords of this paragraph;
3) randomly sampling a candidate document from the candidate document set and removing it from the set;
4) randomly sampling a sentence from the candidate document selected in step 3) and adding it to the paragraph sentence set;
5) checking whether the number of words in the paragraph sentence set has reached the maximum limit; if so, ending the process and splicing the sentences in the paragraph sentence set in order to output the paragraph content; otherwise, performing step 6);
6) checking whether the previously sampled sentence is the last sentence of its candidate document; if so, returning to step 3); otherwise, performing step 7);
7) with a preset probability p, continuing to sample the next sentence from the content after the previously selected sentence in the current candidate document and adding it to the paragraph sentence set; with probability 1-p, jumping out of the current candidate document and returning to step 3).
5. An electronic device, comprising:
a processor; and
a memory having stored therein processor-executable instructions;
wherein the processor implements the method of any of claims 1-4 by executing the executable instructions.
6. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-4.
7. A system for automatically generating lecture scripts, the system comprising:
a web crawler module that acquires related texts from the Internet according to the subject words of the lecture script;
an information extraction module that processes the texts to generate candidate documents classified by keyword; and
a lecture generation module that finds a plurality of corresponding candidate documents according to preset keywords of a lecture paragraph and samples sentences from the candidate documents to generate the lecture paragraph content.
CN202010559615.3A 2020-06-18 2020-06-18 Method for automatically generating lecture notes Pending CN111859950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010559615.3A CN111859950A (en) 2020-06-18 2020-06-18 Method for automatically generating lecture notes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010559615.3A CN111859950A (en) 2020-06-18 2020-06-18 Method for automatically generating lecture notes

Publications (1)

Publication Number Publication Date
CN111859950A true CN111859950A (en) 2020-10-30

Family

ID=72987421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010559615.3A Pending CN111859950A (en) 2020-06-18 2020-06-18 Method for automatically generating lecture notes

Country Status (1)

Country Link
CN (1) CN111859950A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
KR101508260B1 (en) * 2014-02-04 2015-04-07 성균관대학교산학협력단 Summary generation apparatus and method reflecting document feature
CN107077460A (en) * 2014-09-30 2017-08-18 微软技术许可有限责任公司 Structuring sample author content
CN105183710A (en) * 2015-06-23 2015-12-23 武汉传神信息技术有限公司 Method for automatically generating document summary
US20190354595A1 (en) * 2018-05-21 2019-11-21 Hcl Technologies Limited System and method for automatically summarizing documents pertaining to a predefined domain
CN110188349A (en) * 2019-05-21 2019-08-30 清华大学深圳研究生院 A kind of automation writing method based on extraction-type multiple file summarization method
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics
CN111177365A (en) * 2019-12-20 2020-05-19 山东科技大学 Unsupervised automatic abstract extraction method based on graph model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733515A (en) * 2020-12-31 2021-04-30 贝壳技术有限公司 Text generation method and device, electronic equipment and readable storage medium
CN116069936A (en) * 2023-02-28 2023-05-05 北京朗知网络传媒科技股份有限公司 Method and device for generating digital media article
CN116611417A (en) * 2023-05-26 2023-08-18 浙江兴旺宝明通网络有限公司 Automatic article generating method, system, computer equipment and storage medium
CN116611417B (en) * 2023-05-26 2023-11-21 浙江兴旺宝明通网络有限公司 Automatic article generating method, system, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106776544B (en) Character relation recognition method and device and word segmentation method
CN107423282B (en) Method for concurrently extracting semantic consistency subject and word vector in text based on mixed features
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN112214593A (en) Question and answer processing method and device, electronic equipment and storage medium
CN108538294B (en) Voice interaction method and device
CN111859950A (en) Method for automatically generating lecture notes
CN111930929A (en) Article title generation method and device and computing equipment
CN112101041A (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN115048944A (en) Open domain dialogue reply method and system based on theme enhancement
CN112528654A (en) Natural language processing method and device and electronic equipment
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111159405A (en) Irony detection method based on background knowledge
CN112528653B (en) Short text entity recognition method and system
CN112445862B (en) Internet of things equipment data set construction method and device, electronic equipment and storage medium
CN110750632B (en) Improved Chinese ALICE intelligent question-answering method and system
CN112633007A (en) Semantic understanding model construction method and device and semantic understanding method and device
CN110888976B (en) Text abstract generation method and device
CN113609287A (en) Text abstract generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination