CN111078822A - Reader information extraction method and system based on Chinese novel text


Info

Publication number
CN111078822A
Authority
CN
China
Prior art keywords
text
layer
word
output
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911201695.9A
Other languages
Chinese (zh)
Inventor
陈海滨
崔泽鹏
吴岳辛
杨丹浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hongfujin Precision Industry Shenzhen Co Ltd
Beijing University of Posts and Telecommunications
Original Assignee
Hongfujin Precision Industry Shenzhen Co Ltd
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hongfujin Precision Industry Shenzhen Co Ltd, Beijing University of Posts and Telecommunications filed Critical Hongfujin Precision Industry Shenzhen Co Ltd
Priority to CN201911201695.9A priority Critical patent/CN111078822A/en
Publication of CN111078822A publication Critical patent/CN111078822A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/31: Indexing; Data structures therefor; Storage structures
    • G06F 16/313: Selection or weighting of terms for indexing
    • G06F 16/33: Querying
    • G06F 16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a reader information extraction method and system based on Chinese novel texts, which solve the prior-art problem that information extraction from plot-heavy Chinese e-book texts is not accurate enough. The reader information extraction method splits the text from which keywords are to be extracted into a word-granularity text and a Chinese-character-granularity text, converts them into word vectors and character vectors respectively, trains a neural network on these vectors, extracts text features in combination with an attention mechanism, and completes information extraction according to the predicted labels of the text features. By representing the text at the two granularities of word vectors and character vectors, applying both representations in the attention-mechanism model, and combining their prediction results, the method improves the information extraction accuracy of related tasks, helps readers understand text content, uses a natural language processing algorithm to satisfy a reader's need to review earlier chapters, and improves the user experience.

Description

Reader information extraction method and system based on Chinese novel text
Technical Field
The invention belongs to the field of intelligent electronic books, and particularly relates to a reader information extraction method and system based on Chinese novel texts.
Background
With the advent of the network age, the number of people using electronic books keeps increasing. Compared with traditional paper books, electronic books offer many conveniences: their capacity is large, and a single e-book application can hold many novels, giving people more choice. Although electronic books generally provide functions such as bookmarking and reading content aloud, there is still great room for improvement in intelligence. For example, an electronic book displays less content per page than a paper book and therefore has more pages; when a reader has read far into a book and wants to refer back to content on an earlier page, it is difficult to reach the corresponding position quickly.
In order to solve the problem of insufficient intelligence of electronic books, semantic recognition is generally adopted to locate e-book content. For example, Chinese patent application No. 201810746982.7 proposes a semantic-understanding-based voice interaction method and system for an operating system, which can make more intelligent replies to a user's voice instructions.
In the prior art, information extraction for e-book reading is usually performed through an attention mechanism to complete semantic recognition. Chinese patent application No. 201810611199.x provides a reading-comprehension method, apparatus and electronic device based on an attention mechanism, using a word-vector-based attention mechanism to improve short-text answer extraction; Chinese patent application No. 201810601421.8 proposes an answer selection method, apparatus and electronic device based on an improved attention mechanism, in which two layers of word-vector-based attention are stacked and a machine reading comprehension system is optimized through a specific model architecture including the improved attention mechanism, again to improve short-text answer extraction. However, for Chinese text, the accuracy of existing semantic-recognition-based information extraction is not high, and on e-book reading texts such as novels the extraction effect is neither universal nor mature. In addition, for literary works such as novels with long texts and many chapters, the story line is complicated and winding; a user often cannot read a novel in one sitting and will have partially forgotten the earlier plot and character relationships by the next reading, so the previously read content must be reviewed, a need that the existing information extraction technology cannot meet.
Disclosure of Invention
In order to improve the intelligence of electronic books and overcome the problem that information extraction from plot-heavy Chinese e-book texts is not accurate enough, the invention provides a reader information extraction method and a reader information extraction system based on Chinese novel texts.
To achieve this purpose, the invention adopts the following technical scheme.
In a first aspect, an embodiment of the present invention provides a reader information extraction method based on Chinese novel texts, which jointly extracts text features with an attention mechanism at multiple vector granularities to obtain the key information for a predetermined task.
In the above scheme, the information extraction method includes the following steps:
step S1, splitting the text from which keywords are to be extracted into a word-granularity text and a Chinese-character-granularity text, and converting the two texts into word-vector and character-vector representations respectively;
step S2, training a bidirectional long short-term memory neural network toward a predetermined target using the word vectors and character vectors, and extracting the first text features for the corresponding target;
step S3, processing the first text features with an attention mechanism to obtain the second text features;
and step S4, filling the second text features with prediction labels, taking the label-filled second text features as the prediction result, and completing information extraction according to the prediction result.
In the foregoing solution, the information extraction method further includes: step S5, filtering the prediction result by introducing the character-vector prediction result into the word-vector prediction result to correct it.
In the above scheme, in step S1, the two texts are converted into word-vector and character-vector representations respectively, and a Word2Vec model completes the conversion of the text information into vectors of a predetermined dimension.
In the above scheme, the bidirectional long short-term memory neural network (BiLSTM) in step S2 is built from three "gate" structures, namely an input gate, a forget gate and an output gate.
In the above scheme, the output of the input gate at time t is computed as:

i_t = f(W_i x_t + W_i h_{t-1} + b_i)    (1)

the output of the forget gate as:

f_t = σ(W_f x_t + W_f h_{t-1} + b_f)    (2)

the candidate state cell at the current time as:

C̃_t = tanh(W_c x_t + W_c h_{t-1} + b_c)    (3)

the state cell, updated from the state at the previous time, as:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (4)

the output of the output gate as:

o_t = f(W_o x_t + W_o h_{t-1} + b_o)    (5)

and the output of the current cell as:

h_t = o_t ⊙ tanh(C_t)    (6)

The BiLSTM output in the two directions can then be expressed as:

h_t = {h_ti, h_tj}    (7).
In the above scheme, in step S3, the attention mechanism is as follows. First, an attention matrix is defined; taking the output of the BiLSTM neural network as input, the implicit score e_ij of node i with respect to node j is obtained through a nonlinear transformation:

e_ij = V tanh(W h_i + U h_j + b)    (8)

where h_i and h_j represent the outputs of the forward and reverse LSTM neural networks, respectively, and V, W, U are weight matrices. Over the n time nodes, the attention probability weight of the ith node on the jth node is:

α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik)    (9)

From these weights, the new output feature value of the ith word can be calculated:

h_ai = Σ_{j=1}^{n} α_ij h_j    (10)
In the above solution, in step S4, the labels of the words are: T, denoting that the current word belongs to the keywords, and F, denoting that it does not.
In the above scheme, the distribution probability over the corresponding categories is calculated from the labels by a softmax classifier, expressed as:

y_i = softmax(W_c H + b_c)    (12)

where H is the output of the attention layer, W_c is the weight matrix and b_c is the bias vector. During training, the objective is to minimize the loss function, the cross entropy between the softmax output vector and the correct label of the sample:

H_{y'}(y) = -Σ_i y'_i log(y_i)    (13)

where y'_i represents the value of the ith correct label and y_i represents the value of the ith label in the softmax output vector.
In a second aspect, an embodiment of the present invention further provides a reader information extraction system based on Chinese novel texts, the system including: a text word-vector representation layer, a text character-vector representation layer, a bidirectional long short-term memory (BiLSTM) layer, an attention mechanism layer, a label classification layer, a prediction result filtering layer and a result output layer; wherein:
the text word-vector representation layer and the text character-vector representation layer are both connected to the BiLSTM layer, and are used for acquiring the text from which keywords are to be extracted, splitting it into a word-granularity text and a Chinese-character-granularity text, converting the two texts into word-vector and character-vector representations respectively, and sending the word vectors and character vectors to the BiLSTM layer;
the BiLSTM layer is connected to the attention mechanism layer, and is used for training a bidirectional long short-term memory neural network toward a predetermined target using the word vectors and character vectors, extracting the first text features for the corresponding target, and sending the first text features to the attention mechanism layer;
the attention mechanism layer is connected to the label classification layer, and is used for processing the first text features with an attention mechanism to obtain the second text features and sending the second text features to the label classification layer;
the label classification layer is connected to the prediction result filtering layer, and is used for filling the second text features with prediction labels, taking the label-filled second text features as the prediction result, and sending the prediction result to the prediction result filtering layer;
the prediction result filtering layer is connected to the result output layer, and is used for filtering the prediction result, introducing the character-vector prediction result into the word-vector prediction result to correct it, and sending the corrected prediction result to the result output layer;
and the result output layer is used for outputting the filtered keywords.
According to the technical scheme provided by the embodiments of the invention, the reader information extraction method and system based on Chinese novel texts split the text from which keywords are to be extracted into a word-granularity text and a Chinese-character-granularity text, convert them into word vectors and character vectors respectively, train a neural network on these vectors, extract text features in combination with an attention mechanism, and complete information extraction according to the predicted labels of the text features. By applying an attention mechanism at multiple vector granularities, representing the text at the two granularities of word vectors and character vectors, feeding both representations into the attention-mechanism model, and combining their prediction results, the method improves the information extraction accuracy of related tasks, helps readers understand text content, quickly summarizes the main characters, character relationships and plot of the current chapter with a natural language processing algorithm, satisfies a reader's need to review earlier chapters, and improves the user experience.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a reader information extraction method based on Chinese novel texts in an embodiment of the present invention;
FIG. 2 is a diagram illustrating a filtering result of an information extraction method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a reader information extraction system based on Chinese novel texts according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The embodiments of the invention are directed at Chinese novel texts and improve the semantic-recognition-based reader information extraction method: an attention mechanism is adopted at multiple vector granularities to extract key information from the text, recognize its semantic content, and extract the main content, core ideas and key information of a passage, providing a convenient aid to understanding article content for suitable users, especially people with reading difficulties, and improving the reading experience. The embodiments use text representations at the two granularities of word vectors and character vectors, apply both representations in the attention-mechanism model, and combine the two prediction results to improve the information extraction accuracy of related tasks. Specifically, the reader information extraction method first uses an automatic filtering method to filter out non-Chinese samples and related characters, and then uses word segmentation to generate a word-vector-based text representation and a character-vector-based text representation.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
First embodiment
This embodiment provides a reader information extraction method based on Chinese novel texts, which jointly extracts text features with an attention mechanism at multiple vector granularities to obtain the key information for a predetermined task. Fig. 1 is a schematic flow chart of the reader information extraction method. As shown in Fig. 1, the information extraction method specifically includes the following steps:
step S1, splitting the text from which keywords are to be extracted into a word-granularity text and a Chinese-character-granularity text, and converting the two texts into word-vector and character-vector representations respectively;
step S2, training a bidirectional long short-term memory neural network toward a predetermined target using the word vectors and character vectors, and extracting the first text features for the corresponding target;
step S3, processing the first text features with an attention mechanism to obtain the second text features;
and step S4, filling the second text features with prediction labels, taking the label-filled second text features as the prediction result, and completing information extraction according to the prediction result.
Further, the information extraction method may also include:
step S5, filtering the prediction result by introducing the character-vector prediction result into the word-vector prediction result, correcting the word-vector prediction model and improving prediction accuracy and recall.
In this embodiment, in step S1, the two texts are converted into word-vector and character-vector representations respectively, and a Word2Vec model completes the conversion of the text information into vectors of a predetermined dimension, which allows the neural network to perform efficient feature extraction. The Word2Vec model offers two training modes, Skip-gram and CBOW: the CBOW model predicts the center word Wt from its context words, while Skip-gram predicts the context words given the center word Wt. Preferably, this embodiment trains the word vectors in the Skip-gram mode.
To make up for the deficiencies of the word-vector model in training, the vectors at character granularity are used to further extract text features. A character vector is obtained by training on the Chinese text character by character, generating a vector of a given dimension for each Chinese character. Character vectors express the meaning of each Chinese character well, can likewise serve as input to the neural network for text feature extraction, and become an effective supplement to the word vectors.
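As an illustration of step S1, the following is a minimal sketch of the two-granularity vectorization. The jieba segmenter, the gensim Word2Vec implementation, and all parameter values are assumptions made for illustration; the patent itself only specifies Word2Vec with Skip-gram training and a predetermined vector dimension.

    import jieba
    from gensim.models import Word2Vec

    corpus = ["第一章的小说文本。", "第二章的小说文本。"]  # placeholder chapter text

    word_corpus = [jieba.lcut(s) for s in corpus]  # word-granularity tokens
    char_corpus = [list(s) for s in corpus]        # character-granularity tokens

    # sg=1 selects Skip-gram, the training mode this embodiment prefers
    word_model = Word2Vec(word_corpus, vector_size=100, window=5, min_count=1, sg=1)
    char_model = Word2Vec(char_corpus, vector_size=100, window=5, min_count=1, sg=1)

    word_vec = word_model.wv[word_corpus[0][0]]  # 100-dimensional word vector
    char_vec = char_model.wv[char_corpus[0][0]]  # 100-dimensional character vector

The two resulting vector sequences feed the same downstream network, so each text is processed twice, once per granularity.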
In addition, the text from which keywords are to be extracted can also be provided by voice input: the user's speech is converted into text, and the relevant information is extracted according to the textual instruction. Voice commands quickly trigger the corresponding functions and improve the experience for users who find typing inconvenient. For example, a user can select a chapter and speak "subject of this chapter" or "main characters" into the microphone; the speech is converted into text and the information is extracted.
In this embodiment, in step S2, the bidirectional long short-term memory network (BiLSTM) is an improved structure of the long short-term memory network (LSTM). Each cell is built from three "gate" structures, called the input gate, the forget gate and the output gate, and the BiLSTM combines the results of the two directions as its output.
The output of the input gate at time t is computed as:

i_t = f(W_i x_t + W_i h_{t-1} + b_i)    (1)

the output of the forget gate as:

f_t = σ(W_f x_t + W_f h_{t-1} + b_f)    (2)

the candidate state cell at the current time as:

C̃_t = tanh(W_c x_t + W_c h_{t-1} + b_c)    (3)

the state cell, updated from the state at the previous time, as:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (4)

the output of the output gate as:

o_t = f(W_o x_t + W_o h_{t-1} + b_o)    (5)

and the output of the current cell as:

h_t = o_t ⊙ tanh(C_t)    (6)

The BiLSTM output in the two directions can then be expressed as:

h_t = {h_ti, h_tj}    (7)

In formulas (1) to (7), i_t, f_t, C_t and o_t respectively denote the output functions of the input gate, forget gate, current state cell and output gate of the LSTM neural network at time t; W denotes the weight matrix of the current neuron, b denotes the bias vector of the current neuron, and x_t is the current input variable.
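To make the gate computations concrete, the following is a minimal numpy sketch of one step of equations (1) to (6) and of the bidirectional combination of equation (7). Taking f to be the sigmoid function σ for all gates, and splitting the input and recurrent weights into separate matrices W and U, are notational assumptions; the patent reuses the symbol W for both.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One LSTM step per equations (1)-(6); W, U, b are dicts keyed by
        gate: 'i' (input), 'f' (forget), 'c' (candidate), 'o' (output)."""
        i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])    # eq. (1)
        f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])    # eq. (2)
        c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # eq. (3)
        c_t = f_t * c_prev + i_t * c_hat                          # eq. (4)
        o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])    # eq. (5)
        h_t = o_t * np.tanh(c_t)                                  # eq. (6)
        return h_t, c_t

    def bilstm(xs, params_fwd, params_bwd, hidden):
        """Run the sequence in both directions and concatenate, per eq. (7)."""
        def run(seq, p):
            h, c, outs = np.zeros(hidden), np.zeros(hidden), []
            for x in seq:
                h, c = lstm_step(x, h, c, *p)
                outs.append(h)
            return outs
        fwd = run(xs, params_fwd)
        bwd = run(xs[::-1], params_bwd)[::-1]
        return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]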
The predetermined training targets include characters, character relationships, story plots and main content. The information extraction method of this embodiment combines strong extraction capability with a degree of functional extensibility: the key information extraction can extract different types of information for different training targets. For example, if the training target is the character relationships of the text, the training target of the model can be set to the names of the persons appearing in the text and their relationships, and sequence labeling of the model's features completes the relation extraction task. If the training target is the main plot of a passage, the words related to the main plot can be used as the training target to complete the extraction of plot-related key information.
In step S3, the attention mechanism is as follows:
First, an attention matrix is defined; taking the output of the BiLSTM neural network as input, the implicit score e_ij of node i with respect to node j is obtained through a nonlinear transformation:

e_ij = V tanh(W h_i + U h_j + b)    (8)

where h_i and h_j represent the outputs of the forward and reverse LSTM neural networks, respectively, and V, W, U are weight matrices. Among the n time nodes, the attention probability weight of the ith node on the jth node can be expressed as:

α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik)    (9)

According to these weights, the new output feature value of the ith word can be calculated:

h_ai = Σ_{j=1}^{n} α_ij h_j    (10)

By the same method, the new reverse LSTM feature value h_aj of the ith word can be calculated, so the output features with attention corresponding to the tth word are:

h_t = {h_ai, h_aj}    (11)
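A minimal numpy sketch of equations (8) to (11) follows. Treating V as a weight vector so that each score e_ij is a scalar, and applying the same weights to both directions, are assumptions made for illustration.

    import numpy as np

    def additive_attention(H_f, H_b, V, W, U, b):
        """Attention over BiLSTM outputs per eqs. (8)-(11).
        H_f, H_b: (n, d) forward/backward hidden states."""
        n = H_f.shape[0]
        # eq. (8): implicit score of node i with respect to node j
        e = np.array([[V @ np.tanh(W @ H_f[i] + U @ H_b[j] + b)
                       for j in range(n)] for i in range(n)])
        e -= e.max(axis=1, keepdims=True)  # numerical stability
        # eq. (9): row-wise softmax gives the attention weights alpha[i, j]
        alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
        # eq. (10): new forward feature of word i as a weighted sum of states
        H_fwd_att = alpha @ H_f
        # the same computation against the reverse states gives h_aj
        H_bwd_att = alpha @ H_b
        # eq. (11): concatenate the two attention features per word
        return np.concatenate([H_fwd_att, H_bwd_att], axis=1)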
in step S4, the labels of the words are T and F, which respectively indicate that the current word belongs to the keyword and does not belong to the keyword. The layer can map the previously output high-dimensional features to low-dimensional categories, and calculates the distribution probability of the corresponding categories through a softmax classifier, wherein the expression is as follows:
yi=softmax(WcH+bc) (12)
h is the output of the attention layer, W is the weight matrix, and b is the bias vector. During training, the objective is to minimize the loss function, which is the cross-entropy loss of the output vector of softmax and the correct label of the sample:
Hy′(y)=-∑iy′ilog(yi) (13)
wherein, y'iValue, y, representing the ith correct tagiThe value of the i-th tag in the output vector representing softmax. And obtaining a prediction label corresponding to the ith word through the output layer.
In addition, in step S5, based on the output of the label classification layer, the prediction result filtering layer combines the prediction results at the two granularities, taking the word-vector prediction model as the primary model and using the character-vector prediction model to correct it, thereby improving prediction accuracy and recall. In order to predict the keywords better and reduce overfitting, the training result of the character-vector model is introduced, and by combining the predictions of the word vectors and character vectors, the words predicted by both are taken as the final prediction result.
Fig. 2 is a schematic diagram of the filtering result of the information extraction method. As shown in Fig. 2, keyword extraction is illustrated on the text "natural language processing teaching algorithm": the keywords obtained by the information extraction of steps S2, S3 and S4 become "natural, language, processing, algorithm" after the filtering of this step.
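Since the patent does not spell out the exact combination rule, the following sketch illustrates one plausible reading of step S5: a word-level keyword is kept only if all of its characters are also tagged T by the character-granularity model. The function name and data layout are hypothetical.

    def filter_predictions(word_preds, char_preds):
        """word_preds: [(word, tag)], char_preds: [(char, tag)], tag in {'T', 'F'}.
        Keeps words tagged T whose characters are all tagged T as well."""
        keyword_chars = {ch for ch, tag in char_preds if tag == 'T'}
        return [w for w, tag in word_preds
                if tag == 'T' and all(ch in keyword_chars for ch in w)]

    # Hypothetical usage mirroring the Fig. 2 example "自然语言处理教学算法"
    words = [("自然", "T"), ("语言", "T"), ("处理", "T"), ("教学", "T"), ("算法", "T")]
    chars = [(c, "T") for c in "自然语言处理算法"] + [("教", "F"), ("学", "F")]
    print(filter_predictions(words, chars))  # ['自然', '语言', '处理', '算法']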
According to the above technical scheme, the reader information extraction method based on Chinese novel texts helps readers understand text content with an artificial intelligence algorithm, quickly summarizes the main characters, character relationships and main plot of the current chapter with a natural language processing algorithm, and satisfies a reader's need to review earlier chapters. The query history can be recorded, so that the user can conveniently review earlier content without repeatedly issuing the same instruction; and the corresponding functions can be triggered quickly through voice instructions, providing voice input for users who find typing inconvenient and improving the user experience.
Second embodiment
This embodiment provides a reader information extraction system based on Chinese novel texts; Fig. 3 is a schematic structural diagram of the reader information extraction system. As shown in Fig. 3, the system includes: a text word-vector representation layer, a text character-vector representation layer, a bidirectional long short-term memory (BiLSTM) layer, an attention mechanism layer, a label classification layer, a prediction result filtering layer and a result output layer.
The text word-vector representation layer and the text character-vector representation layer are both connected to the BiLSTM layer, and are used for acquiring the text from which keywords are to be extracted, splitting it into a word-granularity text and a Chinese-character-granularity text, converting the two texts into word-vector and character-vector representations respectively, and sending the word vectors and character vectors to the BiLSTM layer;
the BiLSTM layer is connected to the attention mechanism layer, and is used for training a bidirectional long short-term memory neural network toward a predetermined target using the word vectors and character vectors, extracting the first text features for the corresponding target, and sending the first text features to the attention mechanism layer;
the attention mechanism layer is connected to the label classification layer, and is used for processing the first text features with an attention mechanism to obtain the second text features and sending the second text features to the label classification layer;
the label classification layer is connected to the prediction result filtering layer, and is used for filling the second text features with prediction labels, taking the label-filled second text features as the prediction result, and sending the prediction result to the prediction result filtering layer;
the prediction result filtering layer is connected to the result output layer, and is used for filtering the prediction result, introducing the character-vector prediction result into the word-vector prediction result to correct it, and sending the corrected prediction result to the result output layer, improving prediction accuracy and recall; in order to predict the keywords better and reduce overfitting, the training result of the character-vector model is introduced, and by combining the predictions at the two granularities, the words predicted by both the word vectors and the character vectors are taken as the final prediction result;
and the result output layer is used for outputting the filtered keywords.
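A speculative PyTorch sketch of how these layers might stack is given below. The module and layer names mirror the system description; nn.MultiheadAttention stands in for the additive attention of equations (8) to (11), and all sizes are assumptions.

    import torch
    import torch.nn as nn

    class ReaderInfoExtractor(nn.Module):
        """Vector representation -> BiLSTM -> attention -> label classification."""
        def __init__(self, vocab_size, emb_dim=100, hidden=128, n_tags=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)       # representation layer
            self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                                  bidirectional=True)            # BiLSTM layer
            self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                              batch_first=True)  # attention layer
            self.classifier = nn.Linear(2 * hidden, n_tags)      # label classification layer

        def forward(self, token_ids):
            x = self.embed(token_ids)
            h, _ = self.bilstm(x)        # first text features
            a, _ = self.attn(h, h, h)    # second text features
            return self.classifier(a)    # per-token T/F logits

    # One such model would run at word granularity and one at character
    # granularity; the prediction result filtering layer then intersects
    # their T-tagged outputs as sketched in the first embodiment.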
In particular, the reader information extraction system of this embodiment may further include a voice acquisition layer, configured to record the user's voice instruction, identify the corresponding command from it, and determine whether the instruction can be effectively recognized. Users who find operation inconvenient can trigger the corresponding functions by speaking simple instructions.
The layers in this embodiment are physical structures: each layer is realized by a CPU or a programmable logic controller, while information storage and the concrete calculation process are completed by a server.
The reader information extraction system based on Chinese novel texts in this embodiment corresponds to the reader information extraction method based on Chinese novel texts of the first embodiment; the feature descriptions of the method also apply to the system of this embodiment and are not repeated here.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
The embodiments in this specification are described in a progressive manner; identical or similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively simply since they are substantially similar to the method embodiments; for the relevant parts, refer to the descriptions of the method embodiments. The apparatus and system embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment's solution. Those of ordinary skill in the art can understand and implement this without inventive effort.
Those of ordinary skill in the art will understand that: the components in the devices in the embodiments may be distributed in the devices in the embodiments according to the description of the embodiments, or may be correspondingly changed in one or more devices different from the embodiments. The components of the above embodiments may be combined into one component, or may be further divided into a plurality of sub-components.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A reader information extraction method based on Chinese novel texts, characterized in that an attention mechanism is adopted at multiple vector granularities to jointly extract text features and obtain the key information for a predetermined task.
2. The reader information extraction method according to claim 1, characterized by comprising the steps of:
step S1, splitting the text from which keywords are to be extracted into a word-granularity text and a Chinese-character-granularity text, and converting the two texts into word-vector and character-vector representations respectively;
step S2, training a bidirectional long short-term memory neural network toward a predetermined target using the word vectors and character vectors, and extracting the first text features for the corresponding target;
step S3, processing the first text features with an attention mechanism to obtain the second text features;
and step S4, filling the second text features with prediction labels, taking the label-filled second text features as the prediction result, and completing information extraction according to the prediction result.
3. The reader information extraction method according to claim 2, characterized by further comprising: step S5, filtering the prediction result by introducing the character-vector prediction result into the word-vector prediction result and correcting the word-vector prediction model.
4. The reader information extraction method according to claim 2 or 3, characterized in that in step S1 the two texts are converted into word-vector and character-vector representations respectively, and a Word2Vec model completes the conversion of the text information into vectors of a predetermined dimension.
5. The reader information extraction method according to claim 2 or 3, characterized in that the bidirectional long short-term memory neural network (BiLSTM) in step S2 is built from three "gate" structures, namely an input gate, a forget gate and an output gate.
6. The reader information extraction method according to claim 5,
the output of the input gate at time t is computed as:

i_t = f(W_i x_t + W_i h_{t-1} + b_i)    (1)

the output of the forget gate as:

f_t = σ(W_f x_t + W_f h_{t-1} + b_f)    (2)

the candidate state cell at the current time as:

C̃_t = tanh(W_c x_t + W_c h_{t-1} + b_c)    (3)

the state cell, updated from the state at the previous time, as:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (4)

the output of the output gate as:

o_t = f(W_o x_t + W_o h_{t-1} + b_o)    (5)

and the output of the current cell as:

h_t = o_t ⊙ tanh(C_t)    (6)

the BiLSTM output in the two directions being expressed as:

h_t = {h_ti, h_tj}    (7);

wherein i_t, f_t, C_t and o_t respectively represent the output functions of the input gate, forget gate, current state cell and output gate of the LSTM neural network at time t, W represents the weight matrix of the current neuron, b represents the bias vector of the current neuron, and x_t is the current input variable.
7. The reader information extraction method according to claim 2 or 3, characterized in that in step S3 the attention mechanism is: first, an attention matrix is defined; taking the output of the BiLSTM neural network as input, the implicit score e_ij of node i with respect to node j is obtained through a nonlinear transformation:

e_ij = V tanh(W h_i + U h_j + b)    (8)

wherein h_i and h_j represent the outputs of the forward and reverse LSTM neural networks, respectively, and V, W, U are weight matrices; over the n time nodes, the attention probability weight of the ith node on the jth node is expressed as:

α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik)    (9)

and the new output feature value of the ith word is calculated from these weights:

h_ai = Σ_{j=1}^{n} α_ij h_j    (10)

the new reverse LSTM feature value h_aj of the ith word being calculated in the same way, so that the output features with attention corresponding to the tth word are:

h_t = {h_ai, h_aj}    (11).
8. The reader information extraction method according to claim 2 or 3, characterized in that in step S4 the labels of the words are: T, denoting that the current word belongs to the keywords, and F, denoting that it does not.
9. The reader information extraction method according to claim 8, characterized in that the distribution probability over the corresponding categories is calculated from the labels by a softmax classifier, expressed as:

y_i = softmax(W_c H + b_c)    (12)

wherein H is the output of the attention layer, W_c is the weight matrix and b_c is the bias vector; during training, the objective is to minimize the loss function, the cross entropy between the softmax output vector and the correct label of the sample:

H_{y'}(y) = -Σ_i y'_i log(y_i)    (13)

wherein y'_i represents the value of the ith correct label and y_i represents the value of the ith label in the softmax output vector.
10. A reader information extraction system based on Chinese novel texts, characterized by comprising: a text word-vector representation layer, a text character-vector representation layer, a bidirectional long short-term memory (BiLSTM) layer, an attention mechanism layer, a label classification layer, a prediction result filtering layer and a result output layer; wherein:
the text word-vector representation layer and the text character-vector representation layer are both connected to the BiLSTM layer, and are used for acquiring the text from which keywords are to be extracted, splitting it into a word-granularity text and a Chinese-character-granularity text, converting the two texts into word-vector and character-vector representations respectively, and sending the word vectors and character vectors to the BiLSTM layer;
the BiLSTM layer is connected to the attention mechanism layer, and is used for training a bidirectional long short-term memory neural network toward a predetermined target using the word vectors and character vectors, extracting the first text features for the corresponding target, and sending the first text features to the attention mechanism layer;
the attention mechanism layer is connected to the label classification layer, and is used for processing the first text features with an attention mechanism to obtain the second text features and sending the second text features to the label classification layer;
the label classification layer is connected to the prediction result filtering layer, and is used for filling the second text features with prediction labels, taking the label-filled second text features as the prediction result, and sending the prediction result to the prediction result filtering layer;
the prediction result filtering layer is connected to the result output layer, and is used for filtering the prediction result, introducing the character-vector prediction result into the word-vector prediction result to correct it, and sending the corrected prediction result to the result output layer;
and the result output layer is used for outputting the filtered keywords.
CN201911201695.9A 2019-11-29 2019-11-29 Reader information extraction method and system based on Chinese novel text Pending CN111078822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911201695.9A CN111078822A (en) 2019-11-29 2019-11-29 Reader information extraction method and system based on Chinese novel text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911201695.9A CN111078822A (en) 2019-11-29 2019-11-29 Reader information extraction method and system based on Chinese novel text

Publications (1)

Publication Number Publication Date
CN111078822A (en) 2020-04-28

Family

ID=70312095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911201695.9A Pending CN111078822A (en) 2019-11-29 2019-11-29 Reader information extraction method and system based on Chinese novel text

Country Status (1)

Country Link
CN (1) CN111078822A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019210820A1 (en) * 2018-05-03 2019-11-07 华为技术有限公司 Information output method and apparatus
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN109299268A (en) * 2018-10-24 2019-02-01 河南理工大学 A kind of text emotion analysis method based on dual channel model
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨丹浩 等: "一种基于注意力机制的中文短文本关键词提取模型" (A Chinese short-text keyword extraction model based on an attention mechanism) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089728A (en) * 2023-03-23 2023-05-09 深圳市人马互动科技有限公司 Method and related device for generating voice interaction novel for children
CN116089728B (en) * 2023-03-23 2023-06-20 深圳市人马互动科技有限公司 Method and related device for generating voice interaction novel for children


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200428)