CN110134781A

CN110134781A - A method for automatically extracting financial text summaries

Info

Publication number: CN110134781A
Application number: CN201910281459.6A
Authority: CN
Inventors: 蔡青林
Original assignee: Golden State Yongfu Asset Management Ltd
Current assignee: Golden State Yongfu Asset Management Ltd
Priority date: 2019-04-09
Filing date: 2019-04-09
Publication date: 2019-08-16

Abstract

The invention discloses a method for automatically extracting financial text summaries. The method first extracts the keyword attribute of each sentence using the TF-ISF method, then extracts the sentiment attribute of the sentence and computes its topic relevance, evaluates the importance of the sentence to the sentiment summary through weighted scoring, and finally filters the candidate set of summary sentences with a similarity measure to generate the final sentiment summary. The invention can automatically extract sentiment summaries of financial text and has considerable application value in financial technology fields such as robo-advisory. For example, automatically extracting and summarizing the analyst views of financial institutions contained in large volumes of research reports provides important guidance for broad asset class allocation.

Description

A method for automatically extracting financial text summaries

Technical field

The present invention relates to fields such as financial technology, data mining and information retrieval, and more particularly to a method for automatically extracting financial text summaries.

Background

With the rapid development of information technology and the arrival of the big data era, the automatic processing of financial information urgently needs to be solved, and traditional manual information extraction is far from meeting the needs of investors. Financial information comes mainly from unstructured text data, such as corporate annual reports, announcements, news, policies and regulations, and market research reports; mining this information effectively is of great value to the development of financial business. Against this background, text summarization technology has begun to attract widespread attention.

Automatic text summarization methods fall mainly into two classes: automatic summarization based on semantic understanding and automatic summarization based on word frequency statistics. The former relies on domain corpora and on semantic analysis from natural language processing to understand the original text and generate a summary; it is subject to considerable limitations and the technology is not yet mature. The latter relies mainly on the structural information of the text; however, the text features it extracts are incomplete, and problems such as redundant and incoherent sentences are prominent.

Research on automatic summarization started earlier abroad and has produced more results to date. Earlier methods include: methods that score sentences by high-frequency words; methods based on sentence position and cue-word features; methods that rank and select summary sentences with integer linear programming based on sentence informativeness, coherence and similarity; methods that combine semantic relations using probability statistics; and methods that score sentences by combining information such as sentence length and similarity. More recent methods include the LexRank-based method, which represents sentence weights as a graph, computes the sentence similarity of adjacent nodes with the vector space model, and extracts the sentences with the greatest similarity to adjacent nodes as candidate sentences for generating the summary. In addition, there are graph-based automatic summarization methods that represent sentences as nodes and sentence relations as edges to construct a complex network.

Compared with research abroad, domestic research on automatic summarization started later. Typical current methods include methods based on topic-word weights and feature weights, a Chinese multi-document sentiment summarization method based on PageRank, and supervised learning methods based on the LDA model. Relatively mature systems at present include the OA Chinese document automatic summarization system of Shanghai Jiao Tong University, based on a human-imitating algorithm, and the HIT-971 English automatic summarization system of Harbin Institute of Technology, based on semantic analysis and understanding.

Summary of the invention

The problem to be solved by the present invention is how to automatically extract sentiment summaries of financial text. To solve this problem, the present invention proposes a method for automatically extracting financial text summaries.

The object of the present invention is achieved through the following technical solution: a method for automatically extracting financial text summaries, comprising the following steps:

(1) Data preprocessing, specifically comprising the following sub-steps:

(1.1) read each text d_i of the financial text corpus in turn;

(1.2) read the stop word dictionary and delete all stop words in text d_i;

(1.3) read the financial vocabulary ontology, segment each sentence of the content of d_i into words to generate segmented sentences, and segment the title of d_i to generate a segmented title;

(2) Sentiment key sentence extraction, specifically comprising the following sub-steps:

(2.1) for each word w_i, count in turn the number of sentences in text d_i that contain w_i;

(2.2) compute in turn the keyword attribute score key(s_i) of each sentence s_i in d_i;

(2.3) read the sentiment dictionary, match each sentiment word in sentence s_i in turn, obtain its sentiment polarity and sentiment intensity value, and compute the sentiment attribute score sent(s_i) of s_i;

(2.4) read the synonym dictionary, compute in turn the number of identical words and the number of synonyms shared by sentence s_i and title t, and compute the topic relevance score corr(s_i, t) of sentence s_i;

(2.5) compute the sentiment score score(s_i) of s_i from its keyword attribute score key(s_i), sentiment attribute score sent(s_i) and topic relevance score corr(s_i, t);

(3) Automatic summary extraction, specifically comprising the following sub-steps:

(3.1) sort all sentences of d_i from high to low by sentiment score, and take the top K sentences to form the candidate summary cand_abs;

(3.2) compute the similarity of every pair of sentences in cand_abs; if it is greater than a threshold, delete the sentence with the lower sentiment score from cand_abs;

(3.3) sort the remaining sentences of cand_abs by their order of appearance in the original text d_i, then generate and output the final summary cand.
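
As an illustration of the preprocessing in step (1), the following minimal Python sketch segments each sentence by forward maximum matching against a small lexicon and then drops stop words. The lexicon, the stop word set and all function names are hypothetical stand-ins for the financial vocabulary ontology and the stop word dictionary. The source does not name a segmentation algorithm, and for simplicity the sketch removes stop words after segmentation rather than before it.

```python
import re

def split_sentences(text):
    """Split text into sentences on Chinese/English sentence-ending marks.
    The ASCII full stop is left out so that decimals such as 3.5 stay intact."""
    parts = re.split(r"[。！？!?]+", text)
    return [p.strip() for p in parts if p.strip()]

def segment(sentence, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon
    word, falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def preprocess(text, lexicon, stop_words):
    """Return one list of words per sentence, with stop words removed."""
    return [[w for w in segment(s, lexicon) if w not in stop_words]
            for s in split_sentences(text)]

# Hypothetical toy resources, not taken from the source.
lexicon = {"央行", "下调", "利率", "股市", "上涨"}
stop_words = {"的", "了"}
doc = preprocess("央行下调了利率。股市上涨了！", lexicon, stop_words)
```

Here `doc` holds two segmented sentences; the title of a text could be passed through `segment` in the same way.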

Further, step 2.2 comprises the following sub-steps:

(2.2.1) count in turn the term frequency of each word w_i in s_i, compute the TF-ISF score of w_i, and compute the cumulative TF-ISF score TFISF(s_i) of sentence s_i;

(2.2.2) read the indicator word lists, count the number ind(s_i) of indicator words in sentence s_i, and compute the keyword attribute score of sentence s_i as key(s_i) = TFISF(s_i)·ind(s_i).
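
The sub-steps of step 2.2 can be sketched in Python as follows. The text available here does not give the exact TF-ISF formula, so the sketch assumes the commonly used definition W(w) = TF(w)·log(N / n_w), where N is the number of sentences in the text and n_w is the number of sentences containing w, and it assumes that the cumulative score sums W over the distinct words of the sentence. Only the product key(s_i) = TFISF(s_i)·ind(s_i) is taken directly from the text; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def sentence_frequency(sentences):
    """n_w: number of sentences that contain each word (step 2.1)."""
    counts = Counter()
    for words in sentences:
        counts.update(set(words))
    return counts

def tfisf(words, n_w, n_sentences):
    """Cumulative TF-ISF of one sentence.
    ASSUMPTION: W(w) = TF(w) * log(N / n_w), summed over distinct words."""
    tf = Counter(words)
    return sum(tf[w] * math.log(n_sentences / n_w[w]) for w in tf)

def key_score(words, n_w, n_sentences, indicator_words):
    """key(s_i) = TFISF(s_i) * ind(s_i), the product stated in step 2.2.2."""
    ind = sum(1 for w in words if w in indicator_words)
    return tfisf(words, n_w, n_sentences) * ind

# Hypothetical toy data: two segmented sentences and one indicator word list.
sentences = [["profit", "rose", "but", "debt", "rose"],
             ["profit", "fell"]]
indicators = {"but", "however"}
n_w = sentence_frequency(sentences)
k0 = key_score(sentences[0], n_w, len(sentences), indicators)
k1 = key_score(sentences[1], n_w, len(sentences), indicators)
```

Taken literally, the product means that a sentence containing no indicator word (the second one here) receives a keyword attribute score of zero.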

Further, in step 2.3, the sentiment attribute score sent(s_i) of s_i is computed from the following quantities: ori(ew_{i,k}) is the sentiment polarity of the k-th sentiment word in sentence s_i, cont(ew_{i,k}) is the sentiment intensity value of the k-th sentiment word in sentence s_i, and n is the number of sentiment words in sentence s_i.
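
A minimal Python sketch of a sentiment attribute score built from these quantities follows. The exact formula is not available in this text, so the sketch assumes the mean of polarity times intensity over the n matched sentiment words, with polarity encoded as +1 or -1. That form, the dictionary layout and the names are assumptions for illustration only.

```python
def sentiment_score(words, sentiment_dict):
    """ASSUMPTION: sent(s_i) = (1/n) * sum_k ori(ew_k) * cont(ew_k).
    sentiment_dict maps a word to (ori, cont): ori is +1 or -1, cont >= 0."""
    hits = [sentiment_dict[w] for w in words if w in sentiment_dict]
    if not hits:  # n = 0: no sentiment word matched
        return 0.0
    return sum(ori * cont for ori, cont in hits) / len(hits)

# Hypothetical sentiment dictionary entries.
sentiment_dict = {"surge": (1, 0.9), "gain": (1, 0.5), "slump": (-1, 0.8)}
pos = sentiment_score(["shares", "surge", "gain"], sentiment_dict)
neg = sentiment_score(["bonds", "slump"], sentiment_dict)
neutral = sentiment_score(["no", "change"], sentiment_dict)
```

Under this assumed form a negative sentence gets a negative score; if strongly negative sentences should rank as high as strongly positive ones, the absolute value could be used instead.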

Further, in step 2.4, the topic relevance score corr(s_i, t) of sentence s_i is computed from the following quantities: sam(s_i, t) is the number of identical words shared by sentence s_i and title t, and syn(s_i, t) is the number of synonyms shared by sentence s_i and title t.
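
The counts sam(s_i, t) and syn(s_i, t) can be sketched in Python as below. How the two counts are combined is not available in this text; the sketch assumes a weighted sum normalized by the number of distinct title words, with a hypothetical weight `syn_weight`. The synonym dictionary is modelled as a plain mapping from a word to a set of synonyms.

```python
def count_same_and_syn(sentence, title, synonyms):
    """Return (sam, syn): distinct title words found in the sentence, and
    distinct title words absent from it that have a synonym in the sentence."""
    s, t = set(sentence), set(title)
    sam = len(s & t)
    syn = sum(1 for w in t - s if synonyms.get(w, set()) & s)
    return sam, syn

def topic_relevance(sentence, title, synonyms, syn_weight=0.5):
    """ASSUMPTION: corr(s_i, t) = (sam + syn_weight * syn) / |distinct title words|."""
    if not title:
        return 0.0
    sam, syn = count_same_and_syn(sentence, title, synonyms)
    return (sam + syn_weight * syn) / len(set(title))

# Hypothetical synonym dictionary and segmented title.
synonyms = {"cut": {"lower", "reduce"}, "rate": {"interest"}}
title = ["bank", "cut", "rate"]
corr = topic_relevance(["bank", "lower", "rate", "today"], title, synonyms)
```

In this example two title words appear verbatim and one is matched through a synonym, so the assumed score is (2 + 0.5·1) / 3.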

Further, in step 2.5, the sentiment score of sentence s_i is score(s_i) = key(s_i)·sent(s_i)·corr(s_i, t).
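
Because the score in step 2.5 is a plain product, a sentence scores zero whenever any one of the three factors is zero. A worked example with made-up values (not taken from the source) for two sentences s_1 and s_2:

```latex
\begin{aligned}
\mathrm{score}(s_1) &= \mathrm{key}(s_1)\cdot\mathrm{sent}(s_1)\cdot\mathrm{corr}(s_1,t) = 2.0 \times 0.7 \times 0.5 = 0.7,\\
\mathrm{score}(s_2) &= \mathrm{key}(s_2)\cdot\mathrm{sent}(s_2)\cdot\mathrm{corr}(s_2,t) = 3.0 \times 0.7 \times 0 = 0.
\end{aligned}
```

Here s_2 has the higher keyword attribute score but shares no words with the title, so it is ranked below s_1.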

Further, in step 3.2, the similarity sim(s_i, s_j) of every pair of sentences s_i and s_j is computed from the following quantities: sam(s_i, s_j) is the number of identical words shared by sentence s_i and sentence s_j, and syn(s_i, s_j) is the number of synonyms shared by sentence s_i and sentence s_j.
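
A minimal Python sketch of a sentence similarity built from sam and syn follows. The exact similarity formula is not available in this text; the sketch assumes (sam + syn_weight·syn) / max(|s_i|, |s_j|) over distinct words, which stays between 0 and 1 for syn_weight ≤ 1 and can therefore be compared with a threshold in that range. The weight, the symmetric synonym count and the names are assumptions.

```python
def are_synonyms(a, b, synonyms):
    """True if the (hypothetical) synonym dictionary links a and b in either direction."""
    return b in synonyms.get(a, ()) or a in synonyms.get(b, ())

def similarity(words_i, words_j, synonyms, syn_weight=0.5):
    """ASSUMPTION: sim = (sam + syn_weight * syn) / max(|A|, |B|) over distinct words."""
    a, b = set(words_i), set(words_j)
    if not a or not b:
        return 0.0
    sam = len(a & b)
    only_a, only_b = a - b, b - a
    # Count synonym matches from both sides and keep the smaller count,
    # so that similarity(x, y) == similarity(y, x).
    syn_ab = sum(1 for x in only_a if any(are_synonyms(x, y, synonyms) for y in only_b))
    syn_ba = sum(1 for y in only_b if any(are_synonyms(x, y, synonyms) for x in only_a))
    syn = min(syn_ab, syn_ba)
    return (sam + syn_weight * syn) / max(len(a), len(b))

# Hypothetical synonym dictionary and segmented sentences.
synonyms = {"rise": {"climb"}}
s1 = ["stocks", "rise", "sharply"]
s2 = ["stocks", "climb", "sharply"]
s3 = ["bonds", "fall"]
sim_12 = similarity(s1, s2, synonyms)
sim_13 = similarity(s1, s3, synonyms)
```

With these toy inputs, s1 and s2 share two words and one synonym pair, while s1 and s3 share nothing.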

The beneficial effects of the present invention are:

1. The keyword attribute, sentiment attribute and topic relevance are extracted as the scoring factors of the sentiment summary, which ensures the informativeness of the features from three different aspects and improves the accuracy of the sentiment score and the representativeness of the summary sentences;

2. TF-ISF is used as the measure for keywords, which effectively distinguishes the importance of different words to a sentence and plays an important role in filtering out noise words and improving the accuracy of the algorithm;

3. Weighting is used in several scoring functions, so the importance of different features can be adjusted flexibly, which ensures that the algorithm can be configured flexibly for various practical application scenarios.

Brief description of the drawings

Fig. 1 is a flow chart of the method for automatically extracting financial text summaries.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawing.

As shown in Fig. 1, the present invention provides a method for automatically extracting financial text summaries, comprising the following steps:

(1) Data preprocessing, specifically comprising the following sub-steps:

(1.1) read each text d_i = {s₁, s₂, ..., s_N} of the financial text corpus Corp in turn;

(1.2) read the stop word dictionary and delete all stop words in text d_i;

(1.3) read the financial vocabulary ontology, segment each sentence s_i of the content of d_i to generate the segmented sentence s_i = <w₁, w₂, ..., w_m>, and segment the title t of d_i to generate the segmented title t = <w_t1, w_t2, ..., w_tm>;

(2) Sentiment key sentence extraction, specifically comprising the following sub-steps:

(2.1) for each word w_i, count in turn the number n_wi of sentences in text d_i that contain w_i;

(2.2) compute in turn the keyword attribute score key(s_i) of each sentence s_i in d_i, specifically:

(2.2.1) count in turn the term frequency TF(w_i) of each word w_i in s_i, compute the TF-ISF score W(w_i) of w_i according to formula (1), and compute the cumulative TF-ISF score TFISF(s_i) of sentence s_i according to formula (2);

where W(w_{i,k}) is the TF-ISF score of the k-th word in sentence s_i;

(2.2.2) read the indicator word lists, count the number ind(s_i) of indicator words in sentence s_i, and compute the keyword attribute score key(s_i) of sentence s_i according to formula (3);

key(s_i) = TFISF(s_i)·ind(s_i)    (3)

The indicator word lists include, among others, a list of adversative words and a list of conjunctions;

(2.3) read the sentiment dictionary, match each sentiment word ew_i in sentence s_i in turn, obtain its sentiment polarity ori(ew_i) and sentiment intensity value cont(ew_i), and compute the sentiment attribute score sent(s_i) of s_i according to formula (4);

where ori(ew_{i,k}) is the sentiment polarity of the k-th sentiment word in sentence s_i, cont(ew_{i,k}) is the sentiment intensity value of the k-th sentiment word in sentence s_i, and n is the number of sentiment words in sentence s_i;

(2.4) read the synonym dictionary, compute in turn the number of identical words sam(s_i, t) and the number of synonyms syn(s_i, t) shared by sentence s_i and title t, and compute the topic relevance score corr(s_i, t) of sentence s_i according to formula (5);

(2.5) compute the sentiment score score(s_i) of sentence s_i according to formula (6);

score(s_i) = key(s_i)·sent(s_i)·corr(s_i, t)    (6)

(3) Automatic summary extraction, specifically comprising the following sub-steps:

(3.1) sort all sentences s₁ to s_N of d_i from high to low by sentiment score score(s_i), and take the top K sentences as the candidate summary cand_abs = <s₁, s₂, ..., s_K>; the parameter K is determined by the needs of the application scenario, for example the top 5 or top 10 sentences may be selected as the candidate summary;

(3.2) compute the similarity sim(s_i, s_j) of every pair of sentences s_i and s_j in cand_abs according to formula (7); if it is greater than the threshold σ, delete the sentence with the lower sentiment score from cand_abs;

The threshold σ is adjusted according to the application scenario and takes values in the range 0 to 1; the larger the value of σ, the more discrete the summary content;

(3.3) sort the remaining sentences s₁ to s_r of cand_abs by their order of appearance in the original text d_i, then generate and output the final summary cand.
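
Step (3) can be sketched in Python as follows, assuming the sentiment scores have already been computed. The sketch reads step (3.2) as a greedy pass from the highest to the lowest score, which is one possible interpretation of the pairwise deletion rule. The `jaccard` function is a stand-in similarity, not formula (7), and the function names and toy data are hypothetical.

```python
def extract_summary(sentences, scores, sim, k=5, sigma=0.6):
    """sentences: word lists in document order; scores: score(s_i) per sentence;
    sim: callable returning a similarity in [0, 1] for two word lists.
    Returns the indices of the summary sentences in document order."""
    # (3.1) indices of the top-K sentences by score, highest first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # (3.2) drop a sentence whose similarity to an already kept
    # (hence higher-scoring) sentence exceeds the threshold sigma
    kept = []
    for i in ranked:
        if all(sim(sentences[i], sentences[j]) <= sigma for j in kept):
            kept.append(i)
    # (3.3) restore the original order of appearance
    return sorted(kept)

def jaccard(a, b):
    """Stand-in similarity (NOT formula (7)): overlap of the two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical segmented sentences and precomputed scores.
sentences = [["rates", "cut", "today"],
             ["stocks", "rally"],
             ["rates", "cut", "today", "again"],
             ["bonds", "flat"]]
scores = [0.9, 0.4, 0.8, 0.1]
summary_idx = extract_summary(sentences, scores, jaccard, k=3, sigma=0.6)
```

In this toy run the third sentence is in the top 3 but is removed as a near-duplicate of the first, so the summary consists of the first and second sentences in their original order.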

The present invention addresses the task of automatically extracting sentiment summaries from financial text and proposes a method for automatically extracting financial text summaries, which can effectively improve the efficiency with which automated decision systems process financial text information and can play an important role in financial technology fields such as robo-advisory.

The above embodiment is intended to illustrate the present invention rather than to limit it; any modification or change made to the present invention within its spirit and the scope of protection of the claims falls within the scope of protection of the present invention.

Claims

1. A method for automatically extracting financial text summaries, characterized by comprising the following steps:

(1) Data preprocessing, specifically comprising the following sub-steps:

(1.1) read each text d_i of the financial text corpus in turn;

(1.2) read the stop word dictionary and delete all stop words in text d_i;

(1.3) read the financial vocabulary ontology, segment each sentence of the content of d_i into words to generate segmented sentences, and segment the title of d_i to generate a segmented title;

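(2) Sentiment key sentence extraction, specifically comprising the following sub-steps:

(2.1) for each word w_i, count in turn the number of sentences in text d_i that contain w_i;

(2.2) compute in turn the keyword attribute score key(s_i) of each sentence s_i in d_i;

(2.3) read the sentiment dictionary, match each sentiment word in sentence s_i in turn, obtain its sentiment polarity and sentiment intensity value, and compute the sentiment attribute score sent(s_i) of s_i;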

(2.4) read the synonym dictionary, compute in turn the number of identical words and the number of synonyms shared by sentence s_i and title t, and compute the topic relevance score corr(s_i, t) of sentence s_i;

(2.5) compute the sentiment score score(s_i) of s_i from its keyword attribute score key(s_i), sentiment attribute score sent(s_i) and topic relevance score corr(s_i, t);

(3) Automatic summary extraction, specifically comprising the following sub-steps:

(3.1) sort all sentences of d_i from high to low by sentiment score, and take the top K sentences to form the candidate summary cand_abs;

(3.2) compute the similarity of every pair of sentences in cand_abs; if it is greater than a threshold, delete the sentence with the lower sentiment score from cand_abs;

(3.3) sort the remaining sentences of cand_abs by their order of appearance in the original text d_i, then generate and output the final summary cand.

2. The method for automatically extracting financial text summaries according to claim 1, characterized in that step 2.2 comprises the following sub-steps:

(2.2.1) count in turn the term frequency of each word w_i in s_i, compute the TF-ISF score of w_i, and compute the cumulative TF-ISF score TFISF(s_i) of sentence s_i;

(2.2.2) read the indicator word lists, count the number ind(s_i) of indicator words in sentence s_i, and compute the keyword attribute score of sentence s_i as key(s_i) = TFISF(s_i)·ind(s_i).

3. The method for automatically extracting financial text summaries according to claim 1, characterized in that, in step 2.3, the sentiment attribute score sent(s_i) of s_i is computed from the following quantities: ori(ew_{i,k}) is the sentiment polarity of the k-th sentiment word in sentence s_i, cont(ew_{i,k}) is the sentiment intensity value of the k-th sentiment word in sentence s_i, and n is the number of sentiment words in sentence s_i.

4. The method for automatically extracting financial text summaries according to claim 1, characterized in that, in step 2.4, the topic relevance score corr(s_i, t) of sentence s_i is computed from the following quantities: sam(s_i, t) is the number of identical words shared by sentence s_i and title t, and syn(s_i, t) is the number of synonyms shared by sentence s_i and title t.

5. The method for automatically extracting financial text summaries according to claim 1, characterized in that, in step 2.5, the sentiment score of sentence s_i is score(s_i) = key(s_i)·sent(s_i)·corr(s_i, t).

6. The method for automatically extracting financial text summaries according to claim 1, characterized in that, in step 3.2, the similarity sim(s_i, s_j) of every pair of sentences s_i and s_j is computed from the following quantities: sam(s_i, s_j) is the number of identical words shared by sentence s_i and sentence s_j, and syn(s_i, s_j) is the number of synonyms shared by sentence s_i and sentence s_j.