CN113128238A

CN113128238A - Financial information semantic analysis method and system based on natural language processing technology

Info

Publication number: CN113128238A
Application number: CN202110469467.0A
Authority: CN
Inventors: 方正平
Original assignee: Anhui Zhiyuxin Information Technology Co ltd
Current assignee: Anhui Zhiyuxin Information Technology Co ltd
Priority date: 2021-04-28
Filing date: 2021-04-28
Publication date: 2021-07-16
Anticipated expiration: 2041-04-28
Also published as: CN113128238B

Abstract

The invention discloses a financial information semantic analysis method and system based on natural language processing technology, and relates to the technical field of natural language processing. In the method and system, the BERT model parameters in a BERT + CRF module are first fixed while the CRF-related model parameters are trained; after a good effect is obtained, the BERT model and the CRF model are combined, which gives a high recognition rate. An adding module adds KW-E special characters before and after the keywords transmitted by a splitting abstract module, and a splicing module then splices the label name and the abstract together with SEP characters. A BERT model in a network connection module feeds the output word vectors into a two-layer fully-connected neural network, which is followed by 195 sigmoid binary classification tasks, and a docking module associates the correct financial-event labels with companies, so that the system efficiency is improved.

Description

Financial information semantic analysis method and system based on natural language processing technology

Technical Field

The invention relates to the technical field of natural language processing, in particular to a financial information semantic analysis method and a financial information semantic analysis system based on natural language processing technology.

Background

In today's society, tens of thousands of items of company financial public-opinion data are generated every day, and it is difficult for people to extract and digest this information in a short time. Natural language processing technology can automatically structure such data in a short time to facilitate human analysis. Natural language processing is an important direction in the fields of computer science and artificial intelligence; it studies the theories and methods that enable effective communication between humans and computers in natural language, and it is a science integrating linguistics, computer science and mathematics.

Current financial information semantic analysis is mainly based on the BiLSTM + CRF model, and the recognition rate of BiLSTM is low. In terms of label classification, most systems classify labels without depending on an object, and the number of labels is generally between 10 and 30; such object-independent label classification cannot match labels to companies. For this reason, those skilled in the art provide a financial information semantic analysis method and system based on natural language processing technology.

Disclosure of Invention

(I) Technical problem to be solved

In view of the defects of the prior art, the invention provides a financial information semantic analysis method and system based on natural language processing technology, which solve the problems that the recognition rate of BiLSTM is low, that label classification without object dependence cannot match companies with labels, and that too few labels may not satisfy business requirements.

(II) Technical scheme

In order to achieve the purpose, the invention is realized by the following technical scheme: a financial information semantic analysis method based on natural language processing technology specifically comprises the following steps:

S1, first, a batch of news data is collected from the network through a data acquisition module; a deduplication module then deduplicates the collected news data by simhash, ensuring that the deduplicated data fall within the interval from 9000 to 10001; a splitting abstract module then splits each news item into individual sentences that serve as abstract sentences, the positions of the company-name characters in the abstract sentences are marked, and the data are transmitted to both a BIO labeling module and an adding module;

S2, the positions of the company-name characters are converted into BIO labels through the BIO labeling module: the first character of each company name is labeled B, the remaining characters of the company name are labeled I, and all other characters in the sentence are labeled O; the BERT model parameters in a BERT + CRF module are first fixed while the CRF-related model parameters are trained; after a good effect is obtained, the BERT model and the CRF model are fine-tuned together, finally giving a better F1 score, where the F1 score is a metric that jointly considers the precision and recall of the model, and a larger F1 score indicates a higher-quality model; the company names in the data (which may be full names, short names or brand names) are then extracted through an extraction module;

S3, the extracted company name is then matched to the company's full name through a database screening module backed by an existing company database; when the database screening module cannot find the full name corresponding to the company name, the screening range is expanded through a network screening module until a full name corresponding to the company name is found on the network; the full name is then extracted through a result corresponding module and transmitted to a splicing module; meanwhile, the adding module adds KW-E special characters before and after the keywords transmitted from the splitting abstract module, and the splicing module splices the label name and the abstract together with SEP characters; a BERT model in a network connection module feeds the output word vectors into a two-layer fully-connected neural network, which is followed by 195 sigmoid binary classification tasks; the sigmoid function is an S-shaped function common in biology, and because both it and its inverse are monotonically increasing, it is often used as a threshold function in neural networks, mapping a variable to a value between 0 and 1; a docking module associates the correct financial-event labels with the company; finally, a data collection module counts all labels of the company over a period of time, and a risk calculation module calculates the marketing risk index of the company according to the weights of the labels.
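As an illustration of the simhash deduplication in step S1, the following is a minimal Python sketch. The source names simhash but gives no parameters, so the character 2-gram features, the 64-bit MD5-based hashing, and the Hamming-distance threshold of 3 are all assumptions made for this example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a simhash fingerprint built from character 2-gram features."""
    compact = re.sub(r"\s+", "", text)  # ignore whitespace differences
    features = [compact[i:i + 2] for i in range(max(len(compact) - 1, 1))]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(articles: list[str], max_distance: int = 3) -> list[str]:
    """Keep an article only if no kept article is within max_distance bits."""
    kept, fingerprints = [], []
    for article in articles:
        fp = simhash(article)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(article)
            fingerprints.append(fp)
    return kept
```

Calling `deduplicate(news_items)` returns the first occurrence of each group of near-identical articles. The pairwise comparison is quadratic in the number of articles; a production system would typically index fingerprints by bit blocks instead.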

The invention also provides a financial information semantic analysis system based on natural language processing technology, comprising a data preprocessing unit, wherein a first output end of the data preprocessing unit is connected with an input end of an entity identification unit, a second output end of the data preprocessing unit is connected with a first input end of a label classification unit, an output end of the entity identification unit is connected with an input end of an entity linking unit, an output end of the entity linking unit is connected with a second input end of the label classification unit, and an output end of the label classification unit is connected with an input end of a risk calculation unit.

Preferably, the data preprocessing unit comprises a data acquisition module, a deduplication module and a splitting abstract module, wherein the output end of the data acquisition module is connected with the input end of the deduplication module, and the output end of the deduplication module is connected with the input end of the splitting abstract module.

Preferably, the entity identification unit includes a BIO labeling module, a BERT + CRF module, and an extraction module, an output end of the BIO labeling module is connected to an input end of the BERT + CRF module, and an output end of the BERT + CRF module is connected to an input end of the extraction module.

Preferably, the entity linking unit includes a database screening module, a network screening module and a result corresponding module, a first output end of the database screening module is connected with an input end of the network screening module, a second output end of the database screening module is connected with a first input end of the result corresponding module, and an output end of the network screening module is connected with a second input end of the result corresponding module.

Preferably, the label classification unit comprises an adding module, a splicing module, a network connection module and a docking module, wherein the output end of the adding module is connected with the first input end of the splicing module, the output end of the splicing module is connected with the input end of the network connection module, and the output end of the network connection module is connected with the input end of the docking module.

Preferably, the risk calculation unit comprises a data collection module and a risk calculation module.

Preferably, the output end of the splitting abstract module is connected with the input end of the adding module.

Preferably, the output end of the result corresponding module is connected with the second input end of the splicing module.

Preferably, the output end of the data collection module is connected with the input end of the risk calculation module.

(III) Advantageous effects

The invention provides a financial information semantic analysis method and system based on natural language processing technology. The method has the following beneficial effects:

(1) In the financial information semantic analysis method and system based on natural language processing technology, the BERT model parameters in the BERT + CRF module are first fixed while the CRF-related model parameters are trained; after a good effect is obtained, the BERT model and the CRF model are fine-tuned together, finally giving a better F1 score, and the company names in the data are extracted through the extraction module. Combining the BERT model with the CRF model thus yields a higher recognition rate and improves the recognition effect.

(2) In the financial information semantic analysis method and system based on natural language processing technology, the adding module adds KW-E special characters before and after the keywords transmitted by the splitting abstract module, the splicing module then splices the label name and the abstract together with SEP characters, and the correct financial-event labels are associated with companies through the network connection module and the docking module. By adding special symbols and classifying labels in this way, financial-event labels are quickly associated with companies and the system efficiency is improved.

(3) In the financial information semantic analysis method and system based on natural language processing technology, a BERT model in the network connection module feeds the output word vectors into a two-layer fully-connected neural network, which is followed by 195 sigmoid binary classification tasks; these 195 sigmoid binary classification tasks improve the accuracy with which financial-event labels are associated with companies.
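The classification head described in effect (3) can be illustrated with a small pure-Python sketch: a two-layer fully-connected network whose 195 outputs each pass through a sigmoid and are thresholded independently. The input vector stands in for the BERT output; its size, the hidden width, the ReLU activation, the random weights and the 0.5 threshold are assumptions for illustration, not values given in the source.

```python
import math
import random

NUM_LABELS = 195  # number of binary label decisions named in the source


def sigmoid(x: float) -> float:
    """Map a real value to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(vector, weights, biases):
    """One fully-connected layer: weights is a list of rows, one per output."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def init_layer(rng, n_in, n_out):
    """Random small weights and zero biases (stand-in for trained parameters)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(word_vector, layer1, layer2, threshold=0.5):
    """Return per-label probabilities and the indices of labels that fire."""
    hidden = [max(0.0, h) for h in dense(word_vector, *layer1)]  # ReLU: assumed
    logits = dense(hidden, *layer2)
    probs = [sigmoid(z) for z in logits]
    return probs, [i for i, p in enumerate(probs) if p >= threshold]


rng = random.Random(0)
layer1 = init_layer(rng, 8, 16)            # 8-dim input, 16 hidden units: assumed
layer2 = init_layer(rng, 16, NUM_LABELS)
probs, active = classify([0.5] * 8, layer1, layer2)
```

Because each output has its own sigmoid rather than sharing a softmax, any number of the 195 labels can be active for the same input.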

Drawings

FIG. 1 is a system schematic block diagram of the system of the present invention;

FIG. 2 is a system schematic block diagram of a data preprocessing unit of the present invention;

FIG. 3 is a system schematic block diagram of an entity identification unit of the present invention;

FIG. 4 is a system schematic block diagram of a label classification unit of the present invention;

FIG. 5 is a system schematic block diagram of the entity linking unit of the present invention;

FIG. 6 is a system schematic block diagram of a risk calculation unit of the present invention;

In the figures: 1. data preprocessing unit; 2. entity identification unit; 3. label classification unit; 4. entity linking unit; 5. risk calculation unit; 6. data acquisition module; 7. deduplication module; 8. splitting abstract module; 9. BIO labeling module; 10. BERT + CRF module; 11. extraction module; 12. database screening module; 13. network screening module; 14. adding module; 15. splicing module; 16. network connection module; 17. docking module; 18. data collection module; 19. risk calculation module; 20. result corresponding module.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1-6, the embodiment of the present invention provides two technical solutions:

Example I,

The financial information semantic analysis method based on the natural language processing technology specifically comprises the following steps:

S1, first, a batch of news data is collected from the network through the data acquisition module 6; the deduplication module 7 then deduplicates the collected news data by simhash, ensuring that the deduplicated data fall within the interval from 9000 to 10001; the splitting abstract module 8 then splits each news item into individual sentences that serve as abstract sentences, the positions of the company-name characters in the abstract sentences are marked, and the data are transmitted to both the BIO labeling module 9 and the adding module 14;

S2, the positions of the company-name characters are converted into BIO labels through the BIO labeling module 9: the first character of each company name is labeled B, the remaining characters of the company name are labeled I, and all other characters in the sentence are labeled O; the BERT model parameters in the BERT + CRF module 10 are first fixed while the CRF-related model parameters are trained; after a good effect is obtained, the BERT model and the CRF model are fine-tuned together, finally giving an F1 score of 93.15; the company names in the data (which may be full names, short names or brand names) are then extracted through the extraction module 11;

S3, the extracted company name is then matched to the company's full name through the database screening module 12 backed by an existing company database; when the database screening module 12 cannot find the full name corresponding to the company name, the screening range is expanded through the network screening module 13 until a full name corresponding to the company name is found on the network; the full name is then extracted through the result corresponding module 20 and transmitted to the splicing module 15; meanwhile, the adding module 14 adds KW-E special characters before and after the keywords transmitted from the splitting abstract module 8, and the splicing module 15 splices the label name and the abstract together with SEP characters; a BERT model in the network connection module 16 feeds the output word vectors into a two-layer fully-connected neural network, which is followed by 195 sigmoid binary classification tasks; the docking module 17 associates the correct financial-event labels with the company; finally, the data collection module 18 counts all labels of the company over a period of time, and the risk calculation module 19 calculates the marketing risk index of the company according to the weights of the labels.
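The BIO conversion in step S2 and the reverse step of reading company names back out of a tag sequence can be sketched as follows. This is a character-level illustration that assumes company-name positions are given as (start, end) offsets with an exclusive end; the function names and the example sentence are hypothetical.

```python
def to_bio(sentence: str, spans: list[tuple[int, int]]) -> list[str]:
    """Convert company-name character spans into one B/I/O tag per character."""
    tags = ["O"] * len(sentence)
    for start, end in spans:
        tags[start] = "B"                  # first character of the name
        for i in range(start + 1, end):
            tags[i] = "I"                  # remaining characters of the name
    return tags


def extract_entities(sentence: str, tags: list[str]) -> list[str]:
    """Read company names back out of a B/I/O tag sequence."""
    entities, current = [], ""
    for char, tag in zip(sentence, tags):
        if tag == "B":
            if current:
                entities.append(current)
            current = char
        elif tag == "I" and current:
            current += char
        else:
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)
    return entities


sentence = "Acme Corp sues Globex"
tags = to_bio(sentence, [(0, 9), (15, 21)])
names = extract_entities(sentence, tags)   # ["Acme Corp", "Globex"]
```

In the described system the tags at inference time would come from the BERT + CRF module rather than from known spans; `extract_entities` then plays the role of the extraction module.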

Example II,

As a modification of the previous embodiment,

As a preferred scheme, the financial information semantic analysis system based on the natural language processing technology comprises a data preprocessing unit 1, wherein a first output end of the data preprocessing unit 1 is connected with an input end of an entity identification unit 2, a second output end of the data preprocessing unit 1 is connected with a first input end of a label classification unit 3, an output end of the entity identification unit 2 is connected with an input end of an entity linking unit 4, an output end of the entity linking unit 4 is connected with a second input end of the label classification unit 3, and an output end of the label classification unit 3 is connected with an input end of a risk calculation unit 5.
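The connections between the five units can be read as a dataflow, sketched below with each unit passed in as a plain function. Only the wiring follows the text; the function names and the toy stand-in units in the usage example are hypothetical.

```python
def run_pipeline(news, preprocess, recognize, link, classify, score):
    """Wire the five units together in the order the connections describe."""
    summaries = preprocess(news)               # data preprocessing unit 1
    mentions = recognize(summaries)            # first output of 1 -> entity identification unit 2
    full_names = link(mentions)                # unit 2 -> entity linking unit 4
    labels = classify(summaries, full_names)   # second output of 1 and unit 4 -> label classification unit 3
    return score(labels)                       # unit 3 -> risk calculation unit 5


# Toy stand-ins for the units, only to show the call shape.
result = run_pipeline(
    ["Acme was sued. Acme denies it."],
    preprocess=lambda news: [s.strip() for item in news
                             for s in item.split(".") if s.strip()],
    recognize=lambda summaries: ["Acme" for s in summaries if "Acme" in s],
    link=lambda mentions: ["Acme Corporation" for _ in mentions],
    classify=lambda summaries, names: list(zip(names, summaries)),
    score=len,
)
```

The label classification unit is the only unit with two inputs, which is why `classify` receives both the abstract sentences and the linked full names.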

As a preferred scheme, the data preprocessing unit 1 includes a data acquisition module 6, a deduplication module 7 and a splitting abstract module 8, an output end of the data acquisition module 6 is connected with an input end of the deduplication module 7, and an output end of the deduplication module 7 is connected with an input end of the splitting abstract module 8.

Preferably, the entity identification unit 2 includes a BIO labeling module 9, a BERT + CRF module 10, and an extraction module 11; an output end of the BIO labeling module 9 is connected to an input end of the BERT + CRF module 10, and an output end of the BERT + CRF module 10 is connected to an input end of the extraction module 11. The BERT model parameters in the BERT + CRF module 10 are first fixed while the CRF-related model parameters are trained; after a good effect is obtained, the BERT model and the CRF model are fine-tuned together, finally giving an F1 score of 93.15, and the company names in the data (which may be full names, short names or brand names) are extracted through the extraction module 11.
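For reference, the F1 score used to evaluate the entity identification unit is the harmonic mean of precision and recall. The sketch below computes it at entity level over sets of (sentence id, company name) pairs; that granularity and the example data are assumptions, since the source does not say how its score was computed.

```python
def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over predicted and gold entities."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two of three predictions are correct,
# and two of three gold entities are found.
predicted = {("s1", "Acme"), ("s1", "Globex"), ("s2", "Initech")}
gold = {("s1", "Acme"), ("s2", "Initech"), ("s2", "Umbrella")}
score = f1_score(predicted, gold)  # precision = recall = 2/3, so F1 = 2/3
```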

Preferably, the entity linking unit 4 includes a database screening module 12, a network screening module 13, and a result corresponding module 20, a first output end of the database screening module 12 is connected to an input end of the network screening module 13, a second output end of the database screening module 12 is connected to a first input end of the result corresponding module 20, and an output end of the network screening module 13 is connected to a second input end of the result corresponding module 20.
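The two paths into the result corresponding module 20 amount to a lookup with a fallback: try the existing database first and widen the search to the network only on a miss. A minimal sketch, in which the dictionary database, the `network_lookup` callable and the company names are hypothetical stand-ins:

```python
from typing import Callable, Optional


def link_company(
    mention: str,
    database: dict[str, str],
    network_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Resolve a mention (full name, short name or brand) to a full company name."""
    full_name = database.get(mention)          # database screening module 12
    if full_name is None:
        full_name = network_lookup(mention)    # network screening module 13
    return full_name                           # passed on by result corresponding module 20


database = {"Acme": "Acme Corporation Ltd"}
web_results = {"Globex": "Globex Holdings Inc"}

linked_from_db = link_company("Acme", database, web_results.get)
linked_from_web = link_company("Globex", database, web_results.get)
```

A mention that neither source can resolve comes back as `None`, leaving the caller to decide whether to drop or queue it.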

As a preferred scheme, the label classification unit 3 includes an adding module 14, a splicing module 15, a network connection module 16 and a docking module 17; an output end of the adding module 14 is connected to a first input end of the splicing module 15, an output end of the splicing module 15 is connected to an input end of the network connection module 16, an output end of the network connection module 16 is connected to an input end of the docking module 17, an output end of the splitting abstract module 8 is connected to an input end of the adding module 14, and an output end of the result corresponding module 20 is connected to a second input end of the splicing module 15.
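The work of the adding module 14 and the splicing module 15 can be sketched as simple string construction. The source does not spell out the marker strings, so `[KW]` and `[E]` below are hypothetical placeholders for the KW-E special characters, `[SEP]` is written in the usual BERT form, and treating the company name as the keyword is also an assumption.

```python
KW_OPEN, KW_CLOSE = "[KW]", "[E]"   # hypothetical marker tokens
SEP = "[SEP]"                       # BERT-style separator token


def mark_keyword(summary: str, keyword: str) -> str:
    """Adding module: wrap every occurrence of the keyword in marker tokens."""
    return summary.replace(keyword, f"{KW_OPEN}{keyword}{KW_CLOSE}")


def build_input(label_name: str, summary: str, keyword: str) -> str:
    """Splicing module: join the label name and the marked summary with SEP."""
    return f"{label_name}{SEP}{mark_keyword(summary, keyword)}"


text = build_input("lawsuit", "Acme Corp was sued on Monday.", "Acme Corp")
```

The resulting string is what would be tokenized and passed to the BERT model in the network connection module 16; the markers tell the model which company the label decision refers to.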

Preferably, the risk calculation unit 5 comprises a data collection module 18 and a risk calculation module 19, and an output end of the data collection module 18 is connected with an input end of the risk calculation module 19.
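The source states only that the risk index is calculated from the weights of the labels counted over a period of time. One plausible reading, sketched below, is a weighted sum of label counts within a date window; the formula, the label names and the weights are assumptions for illustration.

```python
from collections import Counter
from datetime import date


def risk_index(
    events: list[tuple[date, str]],
    weights: dict[str, float],
    start: date,
    end: date,
) -> float:
    """Weighted sum of a company's label counts within [start, end]."""
    # Data collection module 18: count labels that fall inside the period.
    counts = Counter(label for day, label in events if start <= day <= end)
    # Risk calculation module 19: weight each label count and sum.
    return sum(weights.get(label, 0.0) * n for label, n in counts.items())


events = [
    (date(2021, 4, 1), "lawsuit"),
    (date(2021, 4, 10), "lawsuit"),
    (date(2021, 4, 15), "executive departure"),
    (date(2021, 6, 1), "lawsuit"),          # outside the window below
]
weights = {"lawsuit": 2.0, "executive departure": 1.5}
index = risk_index(events, weights, date(2021, 4, 1), date(2021, 4, 30))
```

Here two lawsuits and one executive departure fall in April, giving 2 × 2.0 + 1 × 1.5 = 5.5.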

The advantages of the second embodiment over the first embodiment are: the BIO labeling module 9 and the BERT + CRF module 10 give a high recognition rate; a BERT model in the network connection module 16 feeds the output word vectors into a two-layer fully-connected neural network, which is followed by 195 sigmoid binary classification tasks; and the docking module 17 associates the correct financial-event labels with companies, improving the system efficiency.

It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprises a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A financial information semantic analysis method based on natural language processing technology, characterized in that the method specifically comprises the following steps:

S1, first, a batch of news data is collected from the network through a data acquisition module (6); a deduplication module (7) then deduplicates the collected news data by simhash, ensuring that the deduplicated data fall within the interval from 9000 to 10001; a splitting abstract module (8) then splits each news item into individual sentences that serve as abstract sentences, the positions of the company-name characters in the abstract sentences are marked, and the data are transmitted to both a BIO labeling module (9) and an adding module (14);

S2, the positions of the company-name characters are converted into BIO labels through the BIO labeling module (9), wherein the first character of each company name is labeled B, the remaining characters of the company name are labeled I, and all other characters in the sentence are labeled O; the BERT model parameters in a BERT + CRF module (10) are first fixed while the CRF-related model parameters are trained; after a good effect is obtained, the BERT model and the CRF model are fine-tuned together, finally giving a better F1 score, and the company names in the data are extracted through an extraction module (11);

S3, the extracted company name is then matched to the company's full name through a database screening module (12) backed by an existing company database; when the database screening module (12) cannot find the full name corresponding to the company name, the screening range is expanded through a network screening module (13) until a full name corresponding to the company name is found on the network; the full name is then extracted through a result corresponding module (20) and transmitted to a splicing module (15); meanwhile, the adding module (14) adds KW-E special characters before and after the keywords transmitted from the splitting abstract module (8), and the splicing module (15) splices the label name and the abstract together with SEP characters; a BERT model in a network connection module (16) feeds the output word vectors into a two-layer fully-connected neural network, which is followed by 195 sigmoid binary classification tasks; a docking module (17) associates the correct financial-event labels with the company; a data collection module (18) counts all labels of the company over a period of time, and a risk calculation module (19) calculates the marketing risk index of the company according to the weights of the labels.

2. A financial intelligence semantic analysis system based on natural language processing technology according to claim 1, comprising a data preprocessing unit (1), characterized in that: a first output end of the data preprocessing unit (1) is connected with an input end of an entity identification unit (2), a second output end of the data preprocessing unit (1) is connected with a first input end of a label classification unit (3), an output end of the entity identification unit (2) is connected with an input end of an entity linking unit (4), an output end of the entity linking unit (4) is connected with a second input end of the label classification unit (3), and an output end of the label classification unit (3) is connected with an input end of a risk calculation unit (5).

3. The natural language processing technology based financial intelligence semantic analysis system of claim 2, wherein: the data preprocessing unit (1) comprises a data acquisition module (6), a deduplication module (7) and a splitting abstract module (8), the output end of the data acquisition module (6) is connected with the input end of the deduplication module (7), and the output end of the deduplication module (7) is connected with the input end of the splitting abstract module (8).

4. The natural language processing technology based financial intelligence semantic analysis system of claim 2, wherein: the entity identification unit (2) comprises a BIO labeling module (9), a BERT + CRF module (10) and an extraction module (11), wherein the output end of the BIO labeling module (9) is connected with the input end of the BERT + CRF module (10), and the output end of the BERT + CRF module (10) is connected with the input end of the extraction module (11).

5. The natural language processing technology based financial intelligence semantic analysis system of claim 2, wherein: the entity linking unit (4) comprises a database screening module (12), a network screening module (13) and a result corresponding module (20), wherein a first output end of the database screening module (12) is connected with an input end of the network screening module (13), a second output end of the database screening module (12) is connected with a first input end of the result corresponding module (20), and an output end of the network screening module (13) is connected with a second input end of the result corresponding module (20).

6. The natural language processing technology based financial intelligence semantic analysis system of claim 5, wherein: the label classification unit (3) comprises an adding module (14), a splicing module (15), a network connection module (16) and a docking module (17), an output end of the adding module (14) is connected with a first input end of the splicing module (15), an output end of the splicing module (15) is connected with an input end of the network connection module (16), and an output end of the network connection module (16) is connected with an input end of the docking module (17).

7. The natural language processing technology based financial intelligence semantic analysis system of claim 2, wherein: the risk calculation unit (5) comprises a data collection module (18) and a risk calculation module (19).

8. The natural language processing technology based financial intelligence semantic analysis system of claim 6, wherein: the output end of the splitting abstract module (8) is connected with the input end of the adding module (14).

9. The natural language processing technology based financial intelligence semantic analysis system of claim 6, wherein: the output end of the result corresponding module (20) is connected with the second input end of the splicing module (15).

10. The natural language processing technology based financial intelligence semantic analysis system of claim 7, wherein: the output end of the data collection module (18) is connected with the input end of the risk calculation module (19).