CN112035646A

CN112035646A - Key content extraction method

Info

Publication number: CN112035646A
Application number: CN202010905863.9A
Authority: CN
Inventors: 王鑫
Original assignee: Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Current assignee: Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Priority date: 2020-09-01
Filing date: 2020-09-01
Publication date: 2020-12-04

Abstract

The invention discloses a key content extraction method comprising: acquiring subject information for which key content is to be extracted, and generating a corresponding subject knowledge base according to the subject information; extracting an original text from the subject knowledge base, and performing data processing on the original text to obtain a corresponding target text; and performing word segmentation and cluster analysis on the target text, and obtaining the key content in the target text according to a preset analysis method, wherein the key content comprises knowledge points and/or keywords. The technical scheme automatically extracts key content from the text corresponding to a subject and improves both extraction efficiency and extraction accuracy; compared with manually labeling exercises, it improves working efficiency and saves a large amount of manpower.

Description

Key content extraction method

Technical Field

The invention relates to the technical field of data processing, in particular to a key content extraction method.

Background

With the continuous development of computer and internet technology and the gradual popularization of intelligent electronic products, students increasingly complete their learning on electronic products because of their intelligence and convenience. As a result, a large number of electronic exercises are used in teaching. At present, the knowledge points and corresponding keywords of these exercises are mostly determined by manual labeling, which is inefficient and labor-intensive.

Disclosure of Invention

The invention provides a key content extraction method, and aims to realize automatic extraction of key content corresponding to an electronic exercise.

The invention provides a method for extracting key content, which comprises the following steps:

acquiring subject information of key content to be extracted, and generating a corresponding subject knowledge base according to the subject information;

extracting an original text from the subject knowledge base, and performing data processing on the original text to obtain a corresponding target text;

performing word segmentation processing and clustering analysis on the target text, and obtaining key contents in the target text according to a preset analysis method, wherein the key contents comprise knowledge points and/or keywords.

Further, the acquiring subject information of the key content to be extracted and generating a corresponding subject knowledge base according to the subject information includes:

analyzing the subject information to acquire known subject knowledge points and known subject keywords corresponding to the subject information;

and generating a discipline knowledge base containing the known discipline knowledge points and the known discipline keywords according to the acquired known discipline knowledge points and the known discipline keywords.

Further, the generating a discipline knowledge base containing the known discipline knowledge points and the known discipline keywords according to the acquired known discipline knowledge points and the known discipline keywords comprises:

and labeling the known subject knowledge points and the known subject keywords according to the acquired known subject knowledge points and the known subject keywords, taking the labeled known subject knowledge points and the labeled known subject keywords as label samples, and generating a subject knowledge base containing the label samples.

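Further, the acquiring subject information of the key content to be extracted and generating a corresponding subject knowledge base according to the subject information includes:

acquiring subject information of key content to be extracted, analyzing the subject information, and acquiring a subject type and subject characteristics corresponding to the subject information;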

acquiring professional subject vocabularies and high-frequency vocabularies corresponding to the subject types and the subject characteristics according to the subject types and the subject characteristics;

and marking the subject vocabulary and the high-frequency vocabulary, taking the marked subject vocabulary and the high-frequency vocabulary as label samples, and generating a subject knowledge base containing the label samples.

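Further, the acquiring subject information of the key content to be extracted and generating a corresponding subject knowledge base according to the subject information includes:

acquiring subject information of key content to be extracted, and collecting known subject knowledge points and known subject keywords from the subject information;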

and generating a discipline knowledge base corresponding to the discipline knowledge graph according to the collected known discipline knowledge points and the known discipline keywords.

Further, the performing data processing on the original text to obtain a corresponding target text includes:

performing data preprocessing on the original text according to the subject knowledge base, and removing irrelevant characters, including spaces, from the original text to obtain a corresponding target text.

Further, the performing word segmentation processing and cluster analysis on the target text, and obtaining key content in the target text according to a preset analysis method includes:

performing word segmentation on the target text to obtain a plurality of segmented words, and calculating the current heat value of each segmented word;

performing cluster analysis on the segmented words to obtain the segmented word sets to which the segmented words belong;

extracting target words from each segmented word set according to N preset word extraction modes to obtain a plurality of extracted vocabulary sets corresponding to each segmented word set, wherein each extracted vocabulary set comprises the corresponding target words;

determining the comprehensive effective value corresponding to each extracted vocabulary set according to the current heat values of the target words;

sorting the comprehensive effective values in descending order to obtain the first n1 extracted vocabulary sets;

and extracting the key content from each of the first n1 extracted vocabulary sets to obtain the key content in the target text.
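The final ranking steps above can be sketched in Python as follows. This is a minimal illustration, not the patented computation: the names `top_extracted_sets`, `mean_heat`, and `top_count` are hypothetical, and `mean_heat` is only a placeholder score, because the comprehensive effective value is defined by formulas (3) and (4), which are not reproduced in this text.

```python
from typing import Callable, Dict, List, Sequence


def top_extracted_sets(
    extracted_sets: Sequence[Sequence[str]],
    heat: Dict[str, float],
    score: Callable[[Sequence[float]], float],
    top_count: int,
) -> List[List[str]]:
    """Score every extracted vocabulary set, sort in descending order, keep the first few."""
    scored = []
    for words in extracted_sets:
        heats = [heat[word] for word in words]
        scored.append((score(heats), list(words)))
    # Largest score first, as in the sorting step of the method.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [words for _, words in scored[:top_count]]


def mean_heat(heats: Sequence[float]) -> float:
    # Placeholder score (assumption): NOT formula (3)/(4) of the patent.
    return sum(heats) / len(heats) if heats else 0.0


# Illustrative heat values and extracted sets (made-up data).
heat = {"force": 4.0, "newton": 3.0, "the": 0.5, "torque": 2.0, "lever": 1.0}
sets = [["force", "newton"], ["the"], ["torque", "lever"]]
best = top_extracted_sets(sets, heat, mean_heat, top_count=2)
```

Passing the score as a function keeps the ranking logic independent of how the comprehensive effective value is actually computed.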

Further, the calculating the current heat value of each segmented word comprises:

calculating the current heat value of each segmented word by using formula (1):

in formula (1), $S_k$ represents the current heat value of the k-th segmented word; $\beta_k$ represents the vocabulary attribute value of the k-th segmented word, which is a preset value in the range [1, 5]; $n$ represents the number of unit time periods included in a preset total time period; $\chi_{ki}$ represents the attention degree of the k-th segmented word in the i-th unit time period; $\chi_k'$ represents the average attention degree of the k-th segmented word over the total time period; $\chi_{k\max}$ represents the maximum attention degree of the k-th segmented word among all unit time periods in the total time period;

wherein $\chi_{ki}$ is calculated according to formula (2):

wherein $p_{ki}$ represents the search frequency of the k-th segmented word in the i-th unit time period; $P1_i$ represents the total search frequency of the different segmented words in the i-th unit time period.
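As a hedged illustration of the quantities defined above, the sketch below computes a per-period attention degree together with its average and maximum over the total time period. It assumes that the attention degree is the ratio of a word's search frequency to the total search frequency in the same unit time period; that ratio is an assumption suggested by the definitions, since formula (2) is not reproduced in this text. Formula (1) is also not reproduced, so the heat value itself is not computed. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def attention_series(search_counts: Sequence[int], total_counts: Sequence[int]) -> List[float]:
    """Attention degree per unit time period, ASSUMED to be p_ki / P1_i."""
    if len(search_counts) != len(total_counts):
        raise ValueError("both sequences must cover the same unit time periods")
    return [p / total if total else 0.0 for p, total in zip(search_counts, total_counts)]


def attention_summary(series: Sequence[float]) -> Tuple[float, float]:
    """Average and maximum attention degree over the total time period."""
    if not series:
        raise ValueError("at least one unit time period is required")
    return sum(series) / len(series), max(series)


# Made-up search counts for one word over n = 3 unit time periods.
p_k = [10, 30, 20]       # searches for the k-th word in each period
P1 = [100, 100, 100]     # total searches for all words in each period
chi = attention_series(p_k, P1)
average, peak = attention_summary(chi)
```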

Further, the determining, according to the current heat value of the target vocabulary, a respective corresponding comprehensive effective value of each extracted vocabulary set includes:

calculating the comprehensive effective value corresponding to each extracted vocabulary set by using formula (3) and formula (4):

wherein $Z_a$ represents the comprehensive effective value of the a-th extracted vocabulary set; $M$ represents the total number of words finally extracted when the target words in the a-th extracted vocabulary set are extracted by each of the N word extraction modes; $S_{aj}$ represents the current heat value of the j-th extracted word; $p_{a\max}$ represents the extraction probability of the word that is extracted the greatest number of times when the target words in the a-th extracted vocabulary set are extracted by the N word extraction modes; $p_{a\min}$ represents the extraction probability of the word that is extracted the least number of times when the target words in the a-th extracted vocabulary set are extracted by the N word extraction modes;

$d_{aj}$ represents the total number of times the j-th extracted word occurs during extraction by the N word extraction modes; $k_{ad}$ represents the number of words extracted when the d-th word extraction mode is applied to the a-th extracted vocabulary set.
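The counting quantities defined above can be gathered as in the following sketch. It is illustrative only: the function name `extraction_stats` is hypothetical, each extraction mode is assumed to return a given word at most once, and the extraction probability is assumed to be a word's occurrence count divided by the total number of words extracted across all modes. Formulas (3) and (4) are not reproduced in this text, so the comprehensive effective value is not computed here.

```python
from collections import Counter
from typing import Dict, List, Sequence


def extraction_stats(mode_outputs: Sequence[Sequence[str]]) -> Dict[str, object]:
    """Collect per-set counts from the outputs of the N word extraction modes."""
    per_mode = [len(words) for words in mode_outputs]                   # words per mode
    occurrences = Counter(w for words in mode_outputs for w in words)   # count per word
    total = sum(per_mode)
    # ASSUMPTION: extraction probability = occurrences of a word / all extractions.
    probs = {w: c / total for w, c in occurrences.items()} if total else {}
    return {
        "distinct_words": len(occurrences),
        "per_mode": per_mode,
        "occurrences": dict(occurrences),
        "p_max": max(probs.values(), default=0.0),
        "p_min": min(probs.values(), default=0.0),
    }


# Made-up outputs of N = 3 extraction modes applied to one vocabulary set.
modes: List[List[str]] = [["force", "newton"], ["force", "torque"], ["force"]]
stats = extraction_stats(modes)
```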

The key content extraction method of the invention comprises: acquiring subject information for which key content is to be extracted, and generating a corresponding subject knowledge base according to the subject information; extracting an original text from the subject knowledge base, and performing data processing on the original text to obtain a corresponding target text; and performing word segmentation and cluster analysis on the target text, and obtaining the key content in the target text according to a preset analysis method, wherein the key content comprises knowledge points and/or keywords. The method automatically extracts the knowledge points and keywords corresponding to electronic exercises and improves both the efficiency and the accuracy of their extraction; compared with manually labeling exercises, it improves working efficiency, reduces the error rate, and saves a large amount of manpower.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described below by means of the accompanying drawings and examples.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

Fig. 1 is a schematic workflow diagram of an embodiment of the key content extraction method of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

The invention provides a key content extraction method that addresses the low working efficiency and heavy workload of manually labeling exercises, and automatically extracts the knowledge points and keywords corresponding to electronic exercises.

Fig. 1 is a schematic workflow diagram of an embodiment of the key content extraction method of the present invention. As shown in Fig. 1, the method may be implemented as steps S10 to S30 described below.

Step S10: acquiring subject information of the key content to be extracted, and generating a corresponding subject knowledge base according to the subject information.

In the embodiment of the invention, a system acquires the subject information for which key content is to be extracted. The subject information comprises the subject itself, such as mathematics, language, physics, or chemistry, and all electronic exercises corresponding to that subject. A corresponding subject knowledge base is generated according to the subject information; to facilitate the extraction of knowledge points and keywords, the subject knowledge base may include subject information for only one subject.

Step S20: extracting an original text from the subject knowledge base, and performing data processing on the original text to obtain a corresponding target text.

The original text may be an electronic exercise or other subject-related text.

Based on the generated subject knowledge base, an original text is extracted from the knowledge base. When the original text is preprocessed, content irrelevant to the words in the original text, such as spaces and other characters that carry no symbolic meaning, is removed to obtain the corresponding target text.
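The preprocessing described above might look like the following sketch. Which characters count as "irrelevant" is not specified in the text, so the choices here are assumptions: all whitespace is dropped (reasonable for Chinese text, where spaces do not separate words), invisible control and format characters are dropped, and a small hypothetical list of decorative symbols, `_DECORATIVE`, is removed.

```python
import re
import unicodedata

# Hypothetical list of decorative symbols that carry no meaning in an exercise.
_DECORATIVE = "★☆◆◇■□●○※"
_DROP_TABLE = str.maketrans("", "", _DECORATIVE)


def preprocess(original_text: str) -> str:
    """Return the target text: the original text without irrelevant characters."""
    # Drop invisible control (Cc) and format (Cf) characters such as zero-width spaces.
    visible = "".join(
        ch for ch in original_text if unicodedata.category(ch) not in {"Cc", "Cf"}
    )
    # Drop all whitespace, including the full-width space U+3000.
    without_spaces = re.sub(r"\s+", "", visible)
    # Drop the decorative symbols.
    return without_spaces.translate(_DROP_TABLE)


cleaned = preprocess("求 力矩\u3000★ F = 10 N\u200b")
```

For text in a language that separates words with spaces, collapsing runs of whitespace to a single space would be the more natural choice than deleting them.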

Step S30: performing word segmentation and cluster analysis on the target text, and obtaining the key content in the target text according to a preset analysis method, wherein the key content comprises knowledge points and/or keywords.

In the embodiment of the invention, word segmentation is performed on the target text to obtain the corresponding segmented words, and cluster analysis is then performed on the segmented words to obtain the corresponding segmented word sets. The preset analysis method includes, but is not limited to: extracting target words from the segmented word sets according to the corresponding extraction modes to obtain the corresponding extraction sets, and extracting and outputting the corresponding knowledge points and/or keywords based on the extraction sets.
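A toy version of the segmentation and clustering stage is sketched below. The text does not name a segmentation tool or a clustering algorithm, so both are stand-ins chosen to keep the example self-contained: `segment` is a simple regular-expression tokenizer for English letters (real Chinese text would need a dedicated word segmenter), and `cluster` is a greedy grouping by character-bigram Jaccard similarity with a hypothetical threshold of 0.4.

```python
import re
from typing import List, Sequence, Set


def segment(text: str) -> List[str]:
    """Stand-in word segmentation: lower-cased runs of ASCII letters."""
    return [word.lower() for word in re.findall(r"[A-Za-z]+", text)]


def bigrams(word: str) -> Set[str]:
    """Character bigrams of a word; a one-letter word is its own single feature."""
    return {word[i:i + 2] for i in range(len(word) - 1)} or {word}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the bigram sets of two words."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def cluster(words: Sequence[str], threshold: float = 0.4) -> List[List[str]]:
    """Greedy clustering: join the first group whose first word is similar enough."""
    groups: List[List[str]] = []
    for word in dict.fromkeys(words):  # de-duplicate while keeping order
        for group in groups:
            if similarity(word, group[0]) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


word_sets = cluster(segment("Force forces torque torques lever"))
```

Grouping by surface similarity is only a convenient demonstration; a clustering based on meaning or on the subject knowledge base would serve the method better.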

In an embodiment, the acquiring subject information of the key content to be extracted, and generating a corresponding subject knowledge base according to the subject information may be implemented as follows:

In one embodiment, the generating a discipline knowledge base containing the known discipline knowledge points and the known discipline keywords according to the acquired known discipline knowledge points and the known discipline keywords may be implemented as follows:

In an embodiment, the acquiring subject information of the key content to be extracted, and generating a corresponding subject knowledge base according to the subject information may also be implemented as follows:

In the embodiment of the invention, known professional subject vocabulary and high-frequency vocabulary are labeled and then stored in the corresponding subject knowledge base as label samples. For example, terms that appear in physics, such as moment of force and newton (the international unit for measuring force), are labeled and stored as label samples in the physics knowledge base.
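One way to picture a knowledge base of label samples is the small data structure below. It is a sketch under assumptions: the class names `LabelSample` and `SubjectKnowledgeBase` and the label strings `"professional"` and `"high_frequency"` are hypothetical, and the text does not prescribe any storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class LabelSample:
    """A labeled term stored in the knowledge base."""
    term: str
    label: str  # hypothetical labels: "professional" or "high_frequency"


@dataclass
class SubjectKnowledgeBase:
    """Label samples for a single subject, keyed by term."""
    subject: str
    samples: Dict[str, LabelSample] = field(default_factory=dict)

    def add(self, term: str, label: str) -> None:
        self.samples[term] = LabelSample(term, label)

    def terms_with(self, label: str) -> List[str]:
        return [s.term for s in self.samples.values() if s.label == label]


# The physics example from the text, with one made-up high-frequency word.
physics = SubjectKnowledgeBase("physics")
physics.add("moment of force", "professional")
physics.add("newton", "professional")
physics.add("force", "high_frequency")
```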

In an embodiment, the data processing on the original text to obtain the corresponding target text may be implemented as follows:

In an embodiment, the performing word segmentation and clustering analysis on the target text, and obtaining the key content in the target text according to a preset analysis method may be implemented as follows:

In one embodiment, the calculating the current heat value of each participle word may be performed as follows:

in formula (1), $S_k$ represents the current heat value of the k-th segmented word; $\beta_k$ represents the vocabulary attribute value of the k-th segmented word, which is a preset value in the range [1, 5]; $n$ represents the number of unit time periods included in a preset total time period; $\chi_{ki}$ represents the attention degree of the k-th segmented word in the i-th unit time period; $\chi_k'$ represents the average attention degree of the k-th segmented word over the total time period; $\chi_{k\max}$ represents the maximum attention degree of the k-th segmented word among all unit time periods in the total time period;

wherein $\chi_{ki}$ is calculated according to formula (2):

In one embodiment, the determining the respective comprehensive valid value for each extracted vocabulary set according to the current heat value of the target vocabulary may be implemented as follows:

In the embodiment of the present invention, the current heat value of a target word is determined from the current heat value of the corresponding segmented word, and the number of distinct words in an extraction set is less than or equal to the number of distinct words in the corresponding segmented word set.

The N word extraction modes may extract words according to an attribute related to the popularity of a word, or according to an attribute related to the difficulty level of the word.
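Two such extraction modes could be sketched as follows, one driven by popularity and one by difficulty. Everything concrete here is an assumption made for illustration: the attribute table, the cut-offs `keep` and `minimum`, and the function names are hypothetical and do not come from the text.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# word -> (heat value, difficulty level); illustrative attributes only
Attributes = Dict[str, Tuple[float, float]]
Mode = Callable[[Sequence[str], Attributes], List[str]]


def by_popularity(words: Sequence[str], attrs: Attributes, keep: int = 2) -> List[str]:
    """Popularity mode: keep the `keep` words with the highest heat value."""
    return sorted(words, key=lambda w: attrs[w][0], reverse=True)[:keep]


def by_difficulty(words: Sequence[str], attrs: Attributes, minimum: float = 2.0) -> List[str]:
    """Difficulty mode: keep words whose difficulty level is at least `minimum`."""
    return [w for w in words if attrs[w][1] >= minimum]


def apply_modes(words: Sequence[str], attrs: Attributes, modes: Sequence[Mode]) -> List[List[str]]:
    """Apply each of the N extraction modes to the same segmented word set."""
    return [mode(words, attrs) for mode in modes]


attrs: Attributes = {"force": (4.0, 1.0), "newton": (3.0, 2.0), "torque": (2.0, 3.0)}
outputs = apply_modes(list(attrs), attrs, [by_popularity, by_difficulty])
```

Because every mode shares one signature, further modes can be added to the list without changing `apply_modes`.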

In the embodiment of the invention, obtaining the target text through preprocessing improves the efficiency of the subsequent word segmentation; calculating the current heat value of each word provides a data basis for acquiring knowledge points; performing cluster analysis on the words and applying the preset extraction modes makes it possible to build accurate extraction sets and to screen key words effectively and comprehensively; calculating the comprehensive effective value of each extraction set makes it possible to determine the validity of that set; and screening the first n1 extraction sets, with reference to the subject knowledge base, makes it possible to acquire effective key content.

In an embodiment, step S10 of the embodiment shown in Fig. 1, "acquiring subject information of the key content to be extracted, and generating a corresponding subject knowledge base according to the subject information", may also be implemented by the following technical means:

acquiring subject information of key content to be extracted, and collecting known subject knowledge points and known subject keywords from the subject information; and generating a discipline knowledge base corresponding to the discipline knowledge graph according to the collected known discipline knowledge points and the known discipline keywords.

In the embodiment of the invention, the concept of a knowledge graph is introduced, and the association relations between different words in the subject information are displayed through the knowledge graph. This approach suits subjects in which the association relations between different words are naturally represented as a graph, such as biology, for example when describing chromosome-related information among multiple individuals with genetic relationships. In this embodiment, the known subject knowledge points and known subject keywords can also be labeled as label samples and stored in the subject knowledge base.
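A knowledge graph of this kind can be held in a simple adjacency structure, as in the sketch below. The class `KnowledgeGraph`, its methods, and the relation names are hypothetical illustrations; the text does not specify how the graph is stored or queried.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class KnowledgeGraph:
    """Directed graph of terms connected by named relations."""

    def __init__(self) -> None:
        self._edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_relation(self, source: str, relation: str, target: str) -> None:
        self._edges[source].append((relation, target))
        self._edges.setdefault(target, [])  # make sure the target is a known term

    def related(self, term: str) -> List[str]:
        """Terms directly linked from `term`."""
        return [target for _, target in self._edges.get(term, [])]

    def terms(self) -> List[str]:
        """All terms in the graph, usable as known keywords for a knowledge base."""
        return sorted(self._edges)


# Small biology example with made-up relation names.
graph = KnowledgeGraph()
graph.add_relation("chromosome", "carries", "gene")
graph.add_relation("gene", "has variant", "allele")
```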

The key content extraction method of the invention comprises: acquiring subject information for which key content is to be extracted, and generating a corresponding subject knowledge base according to the subject information; extracting an original text from the subject knowledge base, and performing data processing on the original text to obtain a corresponding target text; and performing word segmentation and cluster analysis on the target text, and obtaining the key content in the target text according to a preset analysis method, wherein the key content comprises knowledge points and/or keywords. The technical scheme automatically extracts the key content of the text corresponding to a subject and improves both extraction efficiency and extraction accuracy; compared with manually labeling exercises, it improves working efficiency, reduces the error rate, and saves a large amount of manpower.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for extracting key content, the method comprising:

2. The method for extracting key content according to claim 1, wherein the obtaining subject information of the key content to be extracted and generating a corresponding subject knowledge base according to the subject information comprises:

3. The method for extracting key content according to claim 2, wherein the generating a discipline knowledge base containing the known discipline knowledge points and known discipline keywords according to the acquired known discipline knowledge points and known discipline keywords comprises:

4. The method for extracting key content according to claim 1, wherein the obtaining subject information of the key content to be extracted and generating a corresponding subject knowledge base according to the subject information comprises:

5. The method for extracting key content according to claim 1, wherein the obtaining subject information of the key content to be extracted and generating a corresponding subject knowledge base according to the subject information comprises:

6. The method for extracting key content according to any one of claims 1 to 5, wherein the performing data processing on the original text to obtain a corresponding target text comprises:

7. The method for extracting key content according to any one of claims 1 to 5, wherein the performing word segmentation and cluster analysis on the target text and obtaining the key content in the target text according to a preset analysis method comprises:

8. The method of claim 7, wherein the calculating the current heat value of each segmented word comprises:

in formula (1), $S_k$ represents the current heat value of the k-th segmented word; $\beta_k$ represents the vocabulary attribute value of the k-th segmented word, which is a preset value in the range [1, 5]; $n$ represents the number of unit time periods included in a preset total time period; $\chi_{ki}$ represents the attention degree of the k-th segmented word in the i-th unit time period; $\chi_k'$ represents the average attention degree of the k-th segmented word over the total time period; $\chi_{k\max}$ represents the maximum attention degree of the k-th segmented word among all unit time periods in the total time period;

wherein $\chi_{ki}$ is calculated according to formula (2):

9. The method for extracting key content according to claim 7, wherein said determining a respective comprehensive valid value for each set of extracted words according to the current heat value of the target words comprises:

wherein $Z_a$ represents the comprehensive effective value of the a-th extracted vocabulary set; $M$ represents the total number of words finally extracted when the target words in the a-th extracted vocabulary set are extracted by each of the N word extraction modes; $S_{aj}$ represents the current heat value of the j-th extracted word; $p_{a\max}$ represents the extraction probability of the word that is extracted the greatest number of times when the target words in the a-th extracted vocabulary set are extracted by the N word extraction modes; $p_{a\min}$ represents the extraction probability of the word that is extracted the least number of times when the target words in the a-th extracted vocabulary set are extracted by the N word extraction modes;