CN113505207A - Machine reading comprehension method and system for financial public opinion research reports

Info

Publication number: CN113505207A
Application number: CN202110748656.1A
Authority: CN
Inventors: 成昊; 龚慧敏; 敖翔
Original assignee: Zhongke Suzhou Intelligent Computing Technology Research Institute
Current assignee: Zhongke Suzhou Intelligent Computing Technology Research Institute
Priority date: 2021-07-02
Filing date: 2021-07-02
Publication date: 2021-10-15
Anticipated expiration: 2041-07-02
Also published as: CN113505207B

Abstract

The invention discloses a machine reading comprehension method and system for financial public opinion research reports. The method mainly comprises data formulation and collection, training data labeling, deep learning model construction, and answer organization. Specifically, a set of user questions is predefined according to the requirements of the financial vertical field, and public opinion data associated with the question set is collected; data relevant to the questions in the predefined question set is found in the public opinion data through keyword matching, the sentences containing answers to the questions are screened out of the data by a supervised model, and the data is labeled; vector representations of characters are obtained with a BERT model pre-trained in the financial field, and the data and the questions are made to interact through an attention mechanism from natural language processing to obtain a fused vector representation that a computer can process; and more than two answers returned by the deep learning model are logically combined. By using a supervised model trained on labeled data, the technical solution improves the accuracy and processing efficiency of machine reading comprehension.

Description

Machine reading comprehension method and system for financial public opinion research reports

Technical Field

The invention relates to technology by which a computer understands the semantics of an article and answers related questions, and in particular to a machine reading comprehension method and system for the financial field based on supervised learning and deep learning algorithms.

Background

Machine Reading Comprehension (MRC) is a technique that uses algorithms to make a computer understand the semantics of an article and answer related questions. Since both articles and questions take the form of human language, machine reading comprehension falls within Natural Language Processing (NLP) and is one of its most active topics. In recent years, with the development of machine learning, and deep learning in particular, machine reading comprehension research has advanced considerably and has begun to stand out in practical applications.

Before 2016, statistical learning methods were mainly used; they involved a large amount of feature engineering, which was time-consuming and labor-intensive. After the SQuAD dataset was released in 2016, attention-based matching models such as BiDAF and LSTM-based models appeared. These were followed by relatively complex models with various network structures, and related work tried to capture the matching relationship between questions and passages through such structures. After 2018, with the emergence of various pre-trained language models, reading comprehension models improved substantially again: because the representation layer became very powerful, the task-specific network structure could become simple.

Machine reading comprehension applications involve four common tasks, described below:

First, cloze test: given an article C in which one word or entity a (a ∈ C) is hidden as a blank to be filled, the cloze task requires filling in the correct word or entity a by maximizing the conditional probability P(a | C − {a}).

Second, multiple choice: given an article C, a question Q, and a set of candidate answers A, the multiple choice task picks the correct answer to question Q from the candidate answer set A by maximizing the conditional probability.

Third, span extraction: given an article C (containing n words) and a question Q, the span extraction task extracts a contiguous subsequence a of the article as the correct answer to the question by maximizing the conditional probability P(a | C, Q).

Fourth, free answering: given an article C and a question Q, the correct answer a in free answering may sometimes not be a subsequence of article C, that is, a ⊆ C or a ⊄ C. The free answering task predicts the correct answer a to question Q by maximizing the conditional probability P(a | C, Q).

Free answering is the most difficult of the four tasks and the one that attracts the most interest and attention in industry. Its answer form is very flexible, it tests natural language understanding well, and it is closest to practical application. However, constructing datasets for this task is relatively difficult, and how to evaluate model performance effectively still requires in-depth research.

As shown in Fig. 1, a typical machine reading comprehension system generally includes four modules: embedding, feature extraction, article-question interaction, and answer prediction. They are described as follows:

Embedding: this module converts the input article and question from natural language into fixed-dimension vectors for subsequent processing by the machine. Commonly used early methods were traditional word representations such as one-hot representations and distributed word vectors; in the last two years, context-based word representations pre-trained on large-scale corpora, such as ELMo, GPT, and BERT, have also been widely used. In addition, to better capture semantic and syntactic information, word vectors may be combined with linguistic features such as part-of-speech tags, named entities, and question types for a finer-grained representation.

Feature extraction: the word vector representations of the article and question produced by the embedding layer are passed to the feature extraction module to extract more contextual information. Common neural network models used in this module are Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformer architectures based on multi-head self-attention.

Article-question interaction: to identify which parts of the article matter most for answering the question, the article-question interaction module often uses a one-way or two-way attention mechanism to emphasize the portions of the original text that are more relevant to the question. To mine the relationship between article and question more deeply, the interaction may be performed multiple times, simulating the way humans reread a text during reading comprehension.

Answer prediction: this module makes the final answer prediction based on the information accumulated by the preceding three modules. Because common machine reading comprehension tasks are categorized by answer type, the implementation of this module is highly task-dependent.
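The article-question interaction step can be illustrated with a minimal one-way attention sketch. It is not taken from the patent: the dot-product scoring, the function name `question_aware_article`, and the toy two-dimensional embeddings are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def question_aware_article(article_vecs, question_vecs):
    """For each article token, attend over the question tokens.

    Returns a list of (attention_weights, summary_vector) pairs, where the
    summary vector is the attention-weighted average of the question vectors.
    Dot-product scoring is an assumption made for this sketch.
    """
    dim = len(question_vecs[0])
    result = []
    for a in article_vecs:
        weights = softmax([dot(a, q) for q in question_vecs])
        summary = [
            sum(w * q[j] for w, q in zip(weights, question_vecs))
            for j in range(dim)
        ]
        result.append((weights, summary))
    return result

# Toy embeddings (made up): two article tokens, two question tokens.
article = [[1.0, 0.0], [0.0, 1.0]]
question = [[2.0, 0.0], [0.0, 2.0]]
attended = question_aware_article(article, question)
```

Each article token ends up with a weight distribution over the question tokens, so tokens that align with the question receive a summary vector dominated by the matching question token.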

However, the accuracy of existing machine reading comprehension models cannot meet the relatively complex requirements of the financial industry, their response speed cannot meet the requirements of real-time question answering, and they cannot recognize unanswerable questions. As a result, under certain conditions the answer given does not match the question or is far off the mark, and so has little reference value.

Disclosure of Invention

In view of the deficiencies of the prior art, the invention aims to provide a machine reading comprehension method and system for financial public opinion research reports, solving the problems of insufficient accuracy and practicality and of low efficiency of machine reading comprehension in the financial field.

One technical solution by which the invention achieves the above object is a machine reading comprehension method for financial public opinion research reports, characterized by comprising the following steps:

data formulation and collection: predefining a set of user questions according to the requirements of the financial vertical field, and collecting public opinion data associated with the question set;

training data labeling: finding, through keyword matching, data in the public opinion data that is relevant to the questions in the predefined question set, screening out the sentences in the data that contain answers to the questions by using a supervised model, and labeling the data;

deep learning model construction: obtaining vector representations of characters with a BERT model pre-trained in the financial field, and making the data and the questions interact through an attention mechanism from natural language processing to obtain a fused vector representation that a computer can process;

answer organization: logically combining more than two answers returned by the deep learning model.

Another technical solution by which the invention achieves the above object is a machine reading comprehension system for financial public opinion research reports, characterized by comprising:

a data formulation and collection unit, used for predefining a set of user questions according to the requirements of the financial vertical field and collecting public opinion data associated with the question set;

a training data labeling unit, used for finding, through keyword matching, data in the public opinion data that is relevant to the questions in the predefined question set, screening out the sentences in the data that contain answers to the questions by using a supervised model, and labeling the data;

a deep learning model building unit, used for obtaining vector representations of characters with a BERT model pre-trained in the financial field, and then making the data and the questions interact through an attention mechanism from natural language processing to obtain a fused vector representation that a computer can process;

an answer organization unit, used for logically combining more than two answers returned by the deep learning model.

The technical solution provided by the invention represents notable progress: the method and system use a supervised model trained on high-quality labeled data, which improves the accuracy of machine reading comprehension; for input data of upwards of a thousand characters, the processing time is reduced to 500 ms per pass; greater emphasis is placed on judging whether the collected data contains content that can be used to answer the question; and the effect of expert-rule-based question answering can be achieved at lower cost.

Drawings

Fig. 1 is a topological schematic of a typical machine reading comprehension system.

Fig. 2 is a schematic diagram of the main steps of the machine reading comprehension method of the present invention.

Fig. 3 is a detailed flow chart of the machine reading comprehension method of the present invention.

Detailed Description

The following detailed description of the embodiments of the present invention is provided in conjunction with the accompanying drawings to make the technical solution of the present invention easier to understand and grasp, so as to define the protection scope of the present invention more clearly.

Given the current state of machine reading comprehension technology and its inability to meet the requirements of the financial field, the invention proposes a machine reading comprehension method and system for the financial field based on a supervised deep learning algorithm, so as to solve the problems of insufficient accuracy and practicality and of low efficiency of machine reading comprehension in the financial field.

The machine reading comprehension method for the financial field is shown in Fig. 2 and mainly comprises four steps: data formulation and collection, training data labeling, deep learning model construction, and answer organization. The detailed implementation flow is shown in Fig. 3.

To summarize each step: data formulation and collection means defining in advance, according to the requirements of the financial vertical field, the questions a user may ask; dividing them into key questions and common questions by setting a screening threshold related to the number of questions; and searching for public opinion data related to the questions, such as news and research reports, with a web crawler.
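The threshold-based screening of key and common questions could look like the following sketch. The patent does not say exactly what the threshold measures; here it is assumed to be the number of times a question appears in a log of user queries, and the name `split_questions` is hypothetical.

```python
from collections import Counter

def split_questions(question_log, threshold):
    """Split questions into key and common ones by frequency.

    Assumption: a question asked at least `threshold` times is a key
    question; all others are common questions.
    """
    counts = Counter(question_log)
    key = sorted(q for q, n in counts.items() if n >= threshold)
    common = sorted(q for q, n in counts.items() if n < threshold)
    return key, common

# Hypothetical query log.
log = [
    "broad market trend",
    "broad market trend",
    "broad market trend",
    "tech sector movement",
]
key_questions, common_questions = split_questions(log, threshold=2)
```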

Training data labeling means finding, through keyword matching, the data in the collected public opinion data that is relevant to the predefined key questions, and submitting it for manual labeling.
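A minimal sketch of the keyword matching step follows, assuming each key question comes with a hand-written keyword list and that a case-insensitive substring match is sufficient. The function name and the sample data are illustrative only.

```python
def match_documents(documents, keywords_by_question):
    """Map each question to the indices of documents containing any of its keywords.

    Matching is a plain case-insensitive substring test, which is an
    assumption of this sketch rather than the patent's stated method.
    """
    matches = {}
    for question, keywords in keywords_by_question.items():
        matches[question] = [
            i
            for i, text in enumerate(documents)
            if any(kw.lower() in text.lower() for kw in keywords)
        ]
    return matches

# Hypothetical public opinion snippets and keyword lists.
docs = [
    "Broad market rallies on rate cut",
    "Tech sector slides",
    "Bond yields steady",
]
keywords = {
    "How did the broad market move?": ["broad market", "index"],
    "How is the tech sector?": ["tech"],
}
matched = match_documents(docs, keywords)
```

Documents matched this way would then be handed to annotators, as the paragraph above describes.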

Deep learning model construction means building, for the prepared training data, a suitable model that can solve the above problems. Conventional machine learning models do not handle such document data well; deep learning models with large-scale parameters and structures are required. In this solution, a BERT (Bidirectional Encoder Representations from Transformers) model pre-trained in the financial field is first used to obtain vector representations of characters; for the financial field this model processes text well, is small, and is efficient. Second, the data and the key questions are made to interact through an attention mechanism (Attention) from natural language processing to obtain a fused vector representation that a computer can process.

Sentences containing the answers to all key questions can be screened out of the data by using the stability of the (supervised) deep learning model. Note that when a piece of data contains no answer related to a key question, the corresponding article is labeled with a zero answer set, that is, left as unlabeled data; this is the key to being able to recognize unanswerable questions. Because this step strongly influences the deep learning model, the labeling results must be manually screened to avoid errors.
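One way to represent the zero answer set in training data is sketched below. The record layout is loosely modeled on SQuAD 2.0-style annotations; the field names and the helper `label_example` are assumptions, not the patent's format.

```python
def label_example(article, question, answer_text=None):
    """Build one training record; an empty answer list marks an unanswerable pair."""
    if not answer_text:
        # Zero answer set: the article holds no answer to this question.
        return {
            "question": question,
            "context": article,
            "answers": [],
            "is_impossible": True,
        }
    start = article.find(answer_text)
    if start == -1:
        raise ValueError("answer text does not occur in the article")
    return {
        "question": question,
        "context": article,
        "answers": [
            {"text": answer_text, "start": start, "end": start + len(answer_text)}
        ],
        "is_impossible": False,
    }

# Hypothetical examples.
article = "The index rose 2% on Monday."
answered = label_example(article, "How did the index move?", "rose 2%")
unanswered = label_example(article, "What happened to bond yields?")
```

Keeping unanswerable pairs in the training set, rather than dropping them, is what lets a model learn to return no answer.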

Answer organization works on the built public opinion database and the trained deep learning model. Because the model's task is reading comprehension, each input (a piece of data and a question) returns a single answer. This form is not intuitive for human review or summarization, so an answer organization strategy is needed that logically combines multiple answers. More specifically, the answer organization process is as follows: I. select one of several keyword text similarity matching algorithms to recall the top ten pieces of data for any question; II. through the built deep learning model, query the top ten pieces of data one by one with every sub-question or keyword of the corresponding question, obtaining the best answer of each piece of data for every sub-question; III. sort the answers to each sub-question from best to worst and compare this order with the ranking of the recalled data; IV. take the concatenation of the first two non-empty answers to a sub-question as that sub-question's component of the final answer. Answers organized logically in this way better suit human reading habits.
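Steps I to IV can be sketched as follows. This is an illustration under several assumptions: the reader is a callable returning an `(answer, score)` pair, step III is interpreted as sorting by reader score with ties broken by recall rank, and answers are joined with "; ". The toy similarity and reader functions stand in for the real similarity algorithm and trained model.

```python
import re

def organize_answers(question, sub_questions, documents, similarity, read, top_k=10):
    """Combine per-document answers into one answer component per sub-question."""
    # Step I: recall the top_k documents most similar to the question.
    ranked = sorted(documents, key=lambda d: similarity(question, d), reverse=True)
    ranked = ranked[:top_k]
    final = {}
    for sub in sub_questions:
        # Step II: ask the reader for its best answer in each recalled document.
        candidates = [(rank, *read(sub, doc)) for rank, doc in enumerate(ranked)]
        # Step III (assumed reading): best score first, recall rank breaks ties.
        candidates.sort(key=lambda c: (-c[2], c[0]))
        # Step IV: concatenate the first two non-empty answers.
        non_empty = [answer for _, answer, _ in candidates if answer]
        final[sub] = "; ".join(non_empty[:2])
    return final

def toy_similarity(question, document):
    """Stand-in similarity: number of shared lowercase word tokens."""
    q = set(re.findall(r"\w+", question.lower()))
    d = set(re.findall(r"\w+", document.lower()))
    return len(q & d)

def toy_reader(sub_question, document):
    """Stand-in reader: first sentence containing the sub-question text."""
    for sentence in document.split(". "):
        if sub_question.lower() in sentence.lower():
            return sentence.rstrip("."), 1.0
    return "", 0.0

docs = [
    "The broad market rose 1.2% today. Turnover expanded.",
    "Tech stocks fell. The broad market trend stays upward.",
    "Weather was sunny.",
]
result = organize_answers(
    "broad market trend",
    ["broad market", "turnover", "bonds"],
    docs,
    toy_similarity,
    toy_reader,
)
```

A sub-question with no answer in any recalled document yields an empty component, which mirrors the zero answer set used during labeling.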

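Several keyword text similarity matching algorithms may be chosen. They operate on the word vector of the question asked by the user, $\mathbf{q} \in \mathbb{R}^{k}$, and on the set of article word vectors contained in the public opinion data, $\{\mathbf{p}_1, \dots, \mathbf{p}_d\}$ with $\mathbf{p}_i \in \mathbb{R}^{k}$, where d represents the number of articles recalled and k represents the word vector dimension.

Alternative keyword text similarity matching algorithms include:

1. Euclidean distance:

$$\mathrm{dist}(\mathbf{q}, \mathbf{p}_i) = \sqrt{\sum_{j=1}^{k} \left(q_j - p_{ij}\right)^2}$$

2. Cosine distance, based on the cosine of the angle between the two vectors:

$$\cos(\mathbf{q}, \mathbf{p}_i) = \frac{\sum_{j=1}^{k} q_j \, p_{ij}}{\sqrt{\sum_{j=1}^{k} q_j^2} \, \sqrt{\sum_{j=1}^{k} p_{ij}^2}}$$

3. Jaccard similarity coefficient:

$$J(Q, P) = \frac{|Q \cap P|}{|Q \cup P|}$$

where Q represents the original text of the question and P represents the original text of the article;

4. Pearson correlation coefficient:

$$\rho(\mathbf{q}, \mathbf{p}_i) = \frac{\sum_{j=1}^{k} \left(q_j - \bar{q}\right)\left(p_{ij} - \bar{p}_i\right)}{\sqrt{\sum_{j=1}^{k} \left(q_j - \bar{q}\right)^2} \, \sqrt{\sum_{j=1}^{k} \left(p_{ij} - \bar{p}_i\right)^2}}$$

where $\bar{q}$ and $\bar{p}_i$ are the means of the components of $\mathbf{q}$ and $\mathbf{p}_i$.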
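The four similarity measures named above have standard definitions, sketched here with the Python standard library only. The function names are illustrative, and the whitespace tokenization in `jaccard_similarity` is an assumption; Chinese text would need a proper word or character segmenter.

```python
import math

def euclidean_distance(q, p):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))

def cosine_similarity(q, p):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in p))
    if norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(q, p)) / norm

def jaccard_similarity(question_text, article_text):
    """Overlap of the two token sets divided by their union."""
    q = set(question_text.lower().split())
    p = set(article_text.lower().split())
    union = q | p
    return len(q & p) / len(union) if union else 0.0

def pearson_correlation(q, p):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    mean_q = sum(q) / len(q)
    mean_p = sum(p) / len(p)
    cov = sum((a - mean_q) * (b - mean_p) for a, b in zip(q, p))
    var_q = sum((a - mean_q) ** 2 for a in q)
    var_p = sum((b - mean_p) ** 2 for b in p)
    denom = math.sqrt(var_q) * math.sqrt(var_p)
    return cov / denom if denom else 0.0
```

Note that Euclidean distance ranks smaller values as more similar, while the other three rank larger values as more similar, so the recall step has to sort in the matching direction.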

The system corresponds to the machine reading comprehension method and is implemented by programming a computer. The system architecture formed by this programming comprises the following four parts.

The data formulation and collection unit is used for predefining a set of user questions according to the requirements of the financial vertical field and collecting public opinion data associated with the question set. Through a manual input interface of the computer, the user enters questions related to the financial field into a background database, where they are formatted and stored; a screening threshold can be set to divide the predefined question set into key questions and common questions. Through a network input interface, the unit accesses internet cloud data, collects the news and research reports associated with the question set, and stores them piece by piece (in varying lengths) in a separate database.

The training data labeling unit is used for finding, through keyword matching, data in the public opinion data that is relevant to the key questions in the predefined question set, screening out the sentences in the data that contain answers to the questions by using a supervised model, and labeling the data. The large volume of data processed by this unit is labeled and classified, providing finer-grained support for the training of the subsequent deep learning model.

The deep learning model building unit implements the interaction between data and questions as follows.

The first module of the unit obtains character vector representations through a BERT model pre-trained in the financial field.

Input: the question asked by the user; the related articles, given as a set of articles.

Output: the question word vector representation Q; the article word vector representation P, where P is the set of word vectors of the articles.

Process: initialize the identifiers [CLS] and [SEP], then execute the program flow (the flow listing is not reproduced in this text).
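Since the program flow itself is not available here, the following sketch only shows the standard BERT layout for a question-article pair, `[CLS] question [SEP] article [SEP]`, with segment ids 0 and 1. Character-level tokenization (suited to Chinese text) and the helper name `build_bert_input` are assumptions; a real system would use the tokenizer shipped with the pre-trained model.

```python
def build_bert_input(question, article, max_len=512):
    """Lay out a question-article pair in BERT's two-segment input format."""
    q_tokens = list(question)  # assumption: one token per character
    # Reserve room for [CLS] and two [SEP] markers.
    budget = max_len - len(q_tokens) - 3
    if budget < 0:
        raise ValueError("question is too long for max_len")
    a_tokens = list(article)[:budget]  # truncate the article to fit
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + a_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the article and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(a_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_bert_input("大盘", "今日大盘上涨")
```

The hidden states at the question positions and at the article positions would then serve as the question and article representations passed to the next module.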

The second module of the unit lets the data and the question interact through an attention mechanism from natural language processing.

Input: the hidden layer output of BERT.

Output: the start and end positions of the answer to the question within the article.

Process: take the outputs Q and P of the first module and execute the computation (the formula is not reproduced in this text).
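How start and end positions might be chosen from per-token scores, including the no-answer case, is sketched below. It follows the common practice for BERT-style extractive readers of comparing the best span score against the score at the `[CLS]` position; this convention, the function name `best_span`, and the parameters `max_answer_len` and `null_threshold` are assumptions and are not taken from the patent.

```python
def best_span(start_scores, end_scores, max_answer_len=30, null_threshold=0.0):
    """Pick the best (start, end) token span, or None when no answer is preferred.

    Index 0 is assumed to be the [CLS] token, whose combined score stands
    for "this article does not answer the question".
    """
    null_score = start_scores[0] + end_scores[0]
    best, best_score = None, float("-inf")
    n = len(start_scores)
    for i in range(1, n):
        # The end position may not precede the start or exceed the length cap.
        for j in range(i, min(i + max_answer_len, n)):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    if best is None or null_score - best_score > null_threshold:
        return None
    return best

# Made-up scores: the first pair has a clear answer span, the second does not.
answer_span = best_span([0.1, 0.2, 3.0, 0.1], [0.1, 0.1, 0.5, 2.5])
no_answer = best_span([5.0, 0.1, 0.2], [5.0, 0.1, 0.3])
```

Raising `null_threshold` makes the reader more willing to answer; lowering it makes it more conservative about unanswerable questions.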

The answer organization unit is used for logically combining more than two answers returned by the deep learning model; the specific logical organization process is described above and is not repeated here. The result of the answer organization is presented through an output interface of the computer.

A more concrete example: suppose a question about "the rise and fall of the broad market" is entered into the question input program of a computer system applying the machine reading comprehension method for financial public opinion research reports. Because the public opinion data that can be collected over the internet is very large, the ten most relevant pieces of data in the database are first recalled by a keyword matching algorithm using keywords such as "broad market", "trend", and "rise and fall". Each of the ten pieces of data is then combined with the question and fed as input to the built deep learning model for machine reading comprehension, yielding an answer for each piece of data. Finally, the answers are combined through the answer organization interface to obtain a final answer suited to human reading habits.

Similarly, questions about financial network security, technology sector stock movements, and the like can all be handled by the machine reading comprehension method exemplified in the preceding paragraph.

In summary, as the detailed description of the illustrated embodiments shows, the machine reading comprehension method and system for financial public opinion research reports of the invention have outstanding substantive features and represent notable progress. The method and system use a supervised model trained on high-quality labeled data, which improves the accuracy of machine reading comprehension; for input data of upwards of a thousand characters, the processing time is reduced to 500 ms per pass; greater emphasis is placed on judging whether the collected data contains content that can be used to answer the question; and the effect of expert-rule-based question answering can be achieved at lower cost.

In addition to the above embodiments, the present invention may have other embodiments, and any technical solutions formed by equivalent substitutions or equivalent transformations are within the scope of the present invention as claimed.

Claims

1. A machine reading comprehension method for financial public opinion research reports, characterized by comprising the following steps:

2. The machine reading comprehension method for financial public opinion research reports according to claim 1, characterized in that: a screening threshold is set in the data formulation and collection, and key questions and common questions are screened from the predefined question set.

3. The machine reading comprehension method for financial public opinion research reports according to claim 1, characterized in that: in the training data labeling, the part of the data in which no content relevant to the questions in the predefined question set is found is labeled as a zero answer set.

4. The machine reading comprehension method for financial public opinion research reports according to claim 1 or 3, characterized in that: in the training data labeling, the labeled data is manually screened.

5. The machine reading comprehension method for financial public opinion research reports according to claim 1, characterized in that the answer organization process comprises the following steps:

I. selecting one of several keyword text similarity matching algorithms to recall the top ten pieces of data for any question;

II. through the built deep learning model, querying the top ten pieces of data one by one with every sub-question or keyword of the corresponding question, and obtaining the best answer of each piece of data for every sub-question;

III. sorting the answers to each sub-question from best to worst and comparing this order with the ranking of the recalled data;

IV. taking the concatenation of the first two non-empty answers to a sub-question as that sub-question's component of the final answer.

6. A machine reading comprehension system for financial public opinion research reports, characterized by comprising:

7. The machine reading comprehension system for financial public opinion research reports according to claim 6, characterized in that: a screening threshold is set in the data formulation and collection unit and is used for screening key questions and common questions from the predefined question set.

8. The machine reading comprehension system for financial public opinion research reports according to claim 6, characterized in that: the training data labeling unit further comprises a labeling module used for labeling as a zero answer set the part of the data in which no content relevant to the questions in the predefined question set is found.