CN110134781A - Method for automatically extracting financial text summaries - Google Patents

Publication number
CN110134781A
Authority
CN
China
Prior art keywords
sentence
emotion
score value
financial
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910281459.6A
Other languages
Chinese (zh)
Inventor
蔡青林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Golden State Yongfu Asset Management Ltd
Original Assignee
Golden State Yongfu Asset Management Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-04-09
Filing date
2019-04-09
Publication date
2019-08-16
Application filed by Golden State Yongfu Asset Management Ltd
Priority to CN201910281459.6A
Publication of CN110134781A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G06F16/345 - Summarisation for human users
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically extracting financial text summaries. The keyword attribute of each sentence is first extracted with the TF-ISF method; the sentiment attribute of each sentence and its relevance to the topic are then computed; a weighted score evaluates how important each sentence is to the sentiment summary; finally, the candidate set of summary sentences is filtered with a similarity measure to generate the final sentiment summary. The invention can automatically extract sentiment summaries from financial text and has considerable application value in financial technology fields such as robo-advisory services. For example, it can automatically extract and summarize the analyst views contained in large volumes of research reports from financial institutions, which provides important guidance for allocation across major asset classes.

Description

Method for automatically extracting financial text summaries
Technical field
The present invention relates to fields such as financial technology, data mining, and information retrieval, and in particular to a method for automatically extracting summaries from financial text.
Background art
With the rapid development of information technology and the arrival of the big-data era, the automatic processing of financial information has become a pressing problem, and traditional manual information extraction falls far short of investors' needs. Financial information comes mainly from unstructured text data such as corporate annual reports, announcements, news, policies and regulations, and market research reports; mining this information effectively is of great value to the development of financial business. Against this background, text summarization technology has begun to attract wide attention.
Automatic text summarization methods fall into two broad classes: summarization based on semantic understanding and summarization based on word-frequency statistics. The former relies on domain corpora and on semantic analysis from natural language processing to understand the original text and generate a summary; it has significant limitations and the technology is not yet mature. The latter relies mainly on structural information in the text, but the text features it extracts are incomplete, and problems such as redundant and incoherent sentences are prominent.
Research on automatic summarization started earlier abroad and has produced more results to date. Earlier methods include: methods that score sentences by high-frequency words; methods based on sentence position and cue-word features; methods that rank and select summary sentences with integer linear programming based on sentence informativeness, coherence, and similarity; methods that combine semantic relations using probability statistics; and methods that score sentences by combining information such as sentence length and similarity. More recent methods include the LexRank-based method, which represents sentence weights as a graph, computes the similarity of sentences at adjacent nodes with a vector space model, and extracts the sentences most similar to their adjacent nodes as candidate sentences for generating the summary. There are also graph-based summarization methods that represent sentences as nodes and sentence relations as edges in order to build a complex network.
Compared with work abroad, domestic research on automatic summarization started late. Typical current methods include methods based on topic-word weights and feature weights, a PageRank-based method for multi-document sentiment summarization in Chinese, and a supervised learning method based on the LDA model. Relatively mature systems include Shanghai Jiao Tong University's OA Chinese document automatic summarization system, which is based on a human-imitating algorithm, and Harbin Institute of Technology's HIT-971 English automatic summarization system, which is based on semantic analysis and understanding.
Summary of the invention
The problem to be solved by the present invention is how to automatically extract a sentiment summary from financial text. To solve this problem, the present invention proposes a method for automatically extracting financial text summaries.
The object of the present invention is achieved through the following technical solution: a method for automatically extracting financial text summaries, comprising the following steps:
(1) Data preprocessing, comprising the following sub-steps:
(1.1) read each text d_i of the financial text corpus in turn;
(1.2) read the stop-word dictionary and delete all stop words from text d_i;
(1.3) read the financial vocabulary ontology, segment each sentence of the content of d_i into words to generate segmented sentences, and segment the title of d_i to generate a segmented title;
(2) Sentiment key-sentence extraction, comprising the following sub-steps:
(2.1) for each word w_i, count the number of sentences in text d_i that contain w_i;
(2.2) compute in turn the keyword attribute score key(s_i) of each sentence s_i in d_i;
(2.3) read the sentiment dictionary, match each sentiment word in sentence s_i in turn, obtain its sentiment orientation and sentiment intensity value, and compute the sentiment attribute score sent(s_i) of s_i;
(2.4) read the synonym dictionary, count in turn the identical words and the synonyms shared by sentence s_i and title t, and compute the topic relevance score corr(s_i, t) of sentence s_i;
(2.5) compute the sentiment score score(s_i) of s_i from the keyword attribute score key(s_i), the sentiment attribute score sent(s_i), and the topic relevance score corr(s_i, t) of sentence s_i;
(3) Automatic summary extraction, comprising the following sub-steps:
(3.1) sort all sentences of d_i by sentiment score from high to low and take the top K sentences as the candidate summary cand_abs;
(3.2) compute the similarity of every pair of sentences in cand_abs; if it exceeds a threshold, delete the sentence with the lower sentiment score from cand_abs;
(3.3) sort the remaining sentences of cand_abs by their order of appearance in the original text d_i, then generate and output the final summary cand.
Further, step 2.2 comprises the following sub-steps:
(2.2.1) count in turn the frequency of each word w_i in s_i, compute the TF-ISF score of w_i, and compute the accumulated TF-ISF score TFISF(s_i) of sentence s_i;
(2.2.2) read the indicator word lists, count the number ind(s_i) of indicator words in sentence s_i, and compute the keyword attribute score of sentence s_i as key(s_i) = TFISF(s_i) · ind(s_i).
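The keyword attribute score of step 2.2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the per-word TF-ISF formula is not reproduced in this text, so the sketch assumes the common form W(w) = TF(w) · log(N / n_w), summed over the distinct words of a sentence. Only the product key(s_i) = TFISF(s_i) · ind(s_i) is taken from the text; the function names `tf_isf_scores` and `keyword_scores` are hypothetical.

```python
import math
from collections import Counter


def tf_isf_scores(sentences):
    """Return an accumulated TF-ISF score for each tokenised sentence.

    Assumption (not from the source): W(w) = TF(w) * log(N / n_w), where
    TF(w) is the count of w in the sentence, N is the number of sentences
    and n_w is the number of sentences that contain w.
    """
    n = len(sentences)
    # n_w: number of sentences containing each word (counted once per sentence)
    sentence_freq = Counter(w for s in sentences for w in set(s))
    totals = []
    for s in sentences:
        tf = Counter(s)
        totals.append(sum(tf[w] * math.log(n / sentence_freq[w]) for w in tf))
    return totals


def keyword_scores(sentences, indicator_words):
    """key(s) = TFISF(s) * ind(s), the product stated in step 2.2.2."""
    tfisf = tf_isf_scores(sentences)
    return [
        score * sum(1 for w in s if w in indicator_words)
        for score, s in zip(tfisf, sentences)
    ]
```

Because ind(s_i) enters as a factor, a sentence that contains no indicator word receives a keyword score of zero under the stated product.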
Further, in step 2.3, the sentiment attribute score sent(s_i) of s_i is given by [formula not reproduced], where ori(ew_{i,k}) is the sentiment orientation of the k-th sentiment word in sentence s_i, cont(ew_{i,k}) is the sentiment intensity value of the k-th sentiment word in sentence s_i, and n is the number of sentiment words in sentence s_i.
Further, in step 2.4, the topic relevance score corr(s_i, t) of sentence s_i is given by [formula not reproduced], where sam(s_i, t) is the number of identical words shared by sentence s_i and title t, and syn(s_i, t) is the number of synonyms shared by sentence s_i and title t.
Further, in step 2.5, the sentiment score of sentence s_i is score(s_i) = key(s_i) · sent(s_i) · corr(s_i, t).
Further, in step 3.2, the similarity sim(s_i, s_j) of every pair of sentences s_i and s_j is given by [formula not reproduced], where sam(s_i, s_j) is the number of identical words shared by sentences s_i and s_j, and syn(s_i, s_j) is the number of synonyms shared by sentences s_i and s_j.
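The remaining scores can be sketched in the same spirit. The formulas for sent(s_i) and corr(s_i, t) are not reproduced in this text, so both forms below are assumptions chosen only to be consistent with the quantities the text names: the sentiment score is taken as the sum of orientation times intensity over matched sentiment words, and relevance as (sam + syn) divided by the size of the larger word set. Only the product score(s_i) = key(s_i) · sent(s_i) · corr(s_i, t) comes from the text. All function names and the dictionary layouts are hypothetical.

```python
def sentiment_score(sentence, lexicon):
    """Assumed form: sum of orientation * intensity over matched sentiment words.

    `lexicon` maps a word to a pair (orientation, intensity), e.g. (+1, 0.8).
    """
    total = 0.0
    for word in sentence:
        if word in lexicon:
            orientation, intensity = lexicon[word]
            total += orientation * intensity
    return total


def overlap(a, b, synonyms):
    """Return (sam, syn): identical words and synonym matches between a and b.

    `synonyms` maps a word to a set of its synonyms (looked up one way only).
    """
    set_a, set_b = set(a), set(b)
    sam = len(set_a & set_b)
    syn = sum(
        1
        for x in set_a - set_b
        if any(y in synonyms.get(x, ()) for y in set_b - set_a)
    )
    return sam, syn


def relevance(a, b, synonyms):
    """Assumed normalisation: (sam + syn) / max(|a|, |b|), a value in [0, 1]."""
    sam, syn = overlap(a, b, synonyms)
    denom = max(len(set(a)), len(set(b)))
    return (sam + syn) / denom if denom else 0.0


def composite_score(key, sent, corr):
    """score(s) = key(s) * sent(s) * corr(s, t), the product stated in step 2.5."""
    return key * sent * corr
```

The same `relevance` function could stand in for the sentence-to-sentence similarity of step 3.2, since the text defines both in terms of identical-word and synonym counts; whether the two share a normalisation is not stated.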
The beneficial effects of the present invention are:
1. The keyword attribute, the sentiment attribute, and topic relevance are extracted as the scoring elements of the sentiment summary; drawing on three different aspects secures the information content of the features and improves the accuracy of sentiment scoring and the representativeness of the summary sentences;
2. TF-ISF is used as the keyword measure, which effectively distinguishes how important different words are to a sentence and plays an important role in filtering out noise words and improving the accuracy of the algorithm;
3. Weighting is used in several of the scoring functions, so the importance of different features can be adjusted flexibly, which ensures that the algorithm can be configured for various practical application scenarios.
Brief description of the drawings
Fig. 1 is a flowchart of the method for automatically extracting financial text summaries.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawing.
As shown in Fig. 1, the present invention provides a method for automatically extracting financial text summaries, comprising the following steps:
(1) Data preprocessing, comprising the following sub-steps:
(1.1) read each text d_i = {s_1, s_2, ..., s_N} of the financial text corpus Corp in turn;
(1.2) read the stop-word dictionary and delete all stop words from text d_i;
(1.3) read the financial vocabulary ontology, segment each sentence s_i of the content of d_i into words to generate the segmented sentence s_i = <w_1, w_2, ..., w_m>, and segment the title t of d_i to generate the segmented title t = <wt_1, wt_2, ..., wt_m>;
(2) Sentiment key-sentence extraction, comprising the following sub-steps:
(2.1) for each word w_i, count the number n_{w_i} of sentences in text d_i that contain w_i;
(2.2) compute in turn the keyword attribute score key(s_i) of each sentence s_i in d_i, specifically:
(2.2.1) count in turn the frequency TF(w_i) of each word w_i in s_i, compute the TF-ISF score W(w_i) of w_i according to formula (1), and compute the accumulated TF-ISF score TFISF(s_i) of sentence s_i according to formula (2);
[formulas (1) and (2) not reproduced]
where W(w_{i,k}) is the TF-ISF score of the k-th word in sentence s_i;
(2.2.2) read the indicator word lists, count the number ind(s_i) of indicator words in sentence s_i, and compute the keyword attribute score key(s_i) of sentence s_i according to formula (3);
key(s_i) = TFISF(s_i) · ind(s_i)   (3)
The indicator word lists include, for example, a list of adversative words and a list of conjunctions;
(2.3) read the sentiment dictionary, match each sentiment word ew_i in sentence s_i in turn, obtain its sentiment orientation ori(ew_i) and sentiment intensity value cont(ew_i), and compute the sentiment attribute score sent(s_i) of s_i according to formula (4);
[formula (4) not reproduced]
where ori(ew_{i,k}) is the sentiment orientation of the k-th sentiment word in sentence s_i, cont(ew_{i,k}) is the sentiment intensity value of the k-th sentiment word in sentence s_i, and n is the number of sentiment words in sentence s_i;
(2.4) read the synonym dictionary, count in turn the number sam(s_i, t) of identical words and the number syn(s_i, t) of synonyms shared by sentence s_i and title t, and compute the topic relevance score corr(s_i, t) of sentence s_i according to formula (5);
[formula (5) not reproduced]
(2.5) compute the sentiment score score(s_i) of sentence s_i according to formula (6);
score(s_i) = key(s_i) · sent(s_i) · corr(s_i, t)   (6)
(3) Automatic summary extraction, comprising the following sub-steps:
(3.1) sort all sentences s_1 to s_N of d_i by sentiment score score(s_i) from high to low and take the top K sentences as the candidate summary cand_abs = <s_1, s_2, ..., s_K>; the parameter K is determined by the needs of the application scenario, for example the top 5 or top 10 sentences may be selected as the candidate summary;
(3.2) compute the similarity sim(s_i, s_j) of every pair of sentences s_i and s_j in cand_abs according to formula (7); if it is greater than the threshold σ, delete the sentence with the lower sentiment score from cand_abs;
[formula (7) not reproduced]
The threshold σ is adjusted according to the application scenario and takes values from 0 to 1; the larger the value of σ, the more dispersed the summary content;
(3.3) sort the remaining sentences s_1 to s_r of cand_abs by their order of appearance in the original text d_i, then generate and output the final summary cand.
The present invention addresses the task of automatically extracting sentiment summaries from financial text and proposes a method for automatically extracting financial text summaries. It can effectively improve the efficiency with which automated decision systems process financial text information and can play an important role in financial technology fields such as robo-advisory services.
The above embodiment is intended to illustrate the present invention, not to limit it. Any modification or change made to the present invention within its spirit and the scope of protection of the claims falls within the scope of protection of the present invention.

Claims (6)

1. A method for automatically extracting financial text summaries, characterized by comprising the following steps:
(1) Data preprocessing, comprising the following sub-steps:
(1.1) read each text d_i of the financial text corpus in turn;
(1.2) read the stop-word dictionary and delete all stop words from text d_i;
(1.3) read the financial vocabulary ontology, segment each sentence of the content of d_i into words to generate segmented sentences, and segment the title of d_i to generate a segmented title;
(2) Sentiment key-sentence extraction, comprising the following sub-steps:
(2.1) for each word w_i, count the number of sentences in text d_i that contain w_i;
(2.2) compute in turn the keyword attribute score key(s_i) of each sentence s_i in d_i;
(2.3) read the sentiment dictionary, match each sentiment word in sentence s_i in turn, obtain its sentiment orientation and sentiment intensity value, and compute the sentiment attribute score sent(s_i) of s_i;
(2.4) read the synonym dictionary, count in turn the identical words and the synonyms shared by sentence s_i and title t, and compute the topic relevance score corr(s_i, t) of sentence s_i;
(2.5) compute the sentiment score score(s_i) of s_i from the keyword attribute score key(s_i), the sentiment attribute score sent(s_i), and the topic relevance score corr(s_i, t) of sentence s_i;
(3) Automatic summary extraction, comprising the following sub-steps:
(3.1) sort all sentences of d_i by sentiment score from high to low and take the top K sentences as the candidate summary cand_abs;
(3.2) compute the similarity of every pair of sentences in cand_abs; if it exceeds a threshold, delete the sentence with the lower sentiment score from cand_abs;
(3.3) sort the remaining sentences of cand_abs by their order of appearance in the original text d_i, then generate and output the final summary cand.
2. The method for automatically extracting financial text summaries according to claim 1, characterized in that step 2.2 comprises the following sub-steps:
(2.2.1) count in turn the frequency of each word w_i in s_i, compute the TF-ISF score of w_i, and compute the accumulated TF-ISF score TFISF(s_i) of sentence s_i;
(2.2.2) read the indicator word lists, count the number ind(s_i) of indicator words in sentence s_i, and compute the keyword attribute score of sentence s_i as key(s_i) = TFISF(s_i) · ind(s_i).
3. The method for automatically extracting financial text summaries according to claim 1, characterized in that, in step 2.3, the sentiment attribute score sent(s_i) of s_i is given by [formula not reproduced], where ori(ew_{i,k}) is the sentiment orientation of the k-th sentiment word in sentence s_i, cont(ew_{i,k}) is the sentiment intensity value of the k-th sentiment word in sentence s_i, and n is the number of sentiment words in sentence s_i.
4. The method for automatically extracting financial text summaries according to claim 1, characterized in that, in step 2.4, the topic relevance score corr(s_i, t) of sentence s_i is given by [formula not reproduced], where sam(s_i, t) is the number of identical words shared by sentence s_i and title t, and syn(s_i, t) is the number of synonyms shared by sentence s_i and title t.
5. The method for automatically extracting financial text summaries according to claim 1, characterized in that, in step 2.5, the sentiment score of sentence s_i is score(s_i) = key(s_i) · sent(s_i) · corr(s_i, t).
6. The method for automatically extracting financial text summaries according to claim 1, characterized in that, in step 3.2, the similarity sim(s_i, s_j) of every pair of sentences s_i and s_j is given by [formula not reproduced], where sam(s_i, s_j) is the number of identical words shared by sentences s_i and s_j, and syn(s_i, s_j) is the number of synonyms shared by sentences s_i and s_j.
CN201910281459.6A, priority date 2019-04-09, filing date 2019-04-09: Method for automatically extracting financial text summaries. Status: Pending. Published as CN110134781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910281459.6A CN110134781A (en) 2019-04-09 2019-04-09 A kind of automatic abstracting method of finance text snippet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910281459.6A CN110134781A (en) 2019-04-09 2019-04-09 A kind of automatic abstracting method of finance text snippet

Publications (1)

Publication Number Publication Date
CN110134781A (en) 2019-08-16

Family

ID=67569516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910281459.6A Pending CN110134781A (en) 2019-04-09 2019-04-09 A kind of automatic abstracting method of finance text snippet

Country Status (1)

Country Link
CN (1) CN110134781A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130103623A1 (en) * 2011-10-21 2013-04-25 Educational Testing Service Computer-Implemented Systems and Methods for Detection of Sentiment in Writing
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105022725A (en) * 2015-07-10 2015-11-04 河海大学 Text emotional tendency analysis method applied to field of financial Web

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李宪毅: "面向评论文本的多文档情感摘要研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401045A (en) * 2020-03-16 2020-07-10 腾讯科技(深圳)有限公司 Text generation method and device, storage medium and electronic equipment
CN111401045B (en) * 2020-03-16 2022-05-10 腾讯科技(深圳)有限公司 Text generation method and device, storage medium and electronic equipment
CN112784585A (en) * 2021-02-07 2021-05-11 新华智云科技有限公司 Abstract extraction method and terminal for financial bulletin
CN114417821A (en) * 2022-03-29 2022-04-29 南昌华梦达航空科技发展有限公司 Financial text checking and analyzing system based on cloud platform
IT202200007820A1 (en) 2022-04-20 2022-07-20 Orma Lab Srl SYSTEM AND METHOD FOR THE AUTOMATIC SUGGESTION OF FACILITATED FINANCE INSTRUMENTS WITH IMPROVEMENT OF REPUTATIONAL PERFORMANCE

Similar Documents

Publication Publication Date Title
AU2020103654A4 (en) Method for intelligent construction of place name annotated corpus based on interactive and iterative learning
CN110134781A (en) A kind of automatic abstracting method of finance text snippet
Wang et al. Integrating extractive and abstractive models for long text summarization
Han et al. Lexical normalization for social media text
CN103049435B (en) Text fine granularity sentiment analysis method and device
CN105243129A (en) Commodity property characteristic word clustering method
CN104462378A (en) Data processing method and device for text recognition
CN103914494A (en) Method and system for identifying identity of microblog user
CN110807326B (en) Short text keyword extraction method combining GPU-DMM and text features
CN110738033B (en) Report template generation method, device and storage medium
CN110309400A (en) A kind of method and system that intelligent Understanding user query are intended to
CN111626042B (en) Reference digestion method and device
CN102955771A (en) Technology and system for automatically recognizing Chinese new words in single-word-string mode and affix mode
CN103324626A (en) Method for setting multi-granularity dictionary and segmenting words and device thereof
CN103514213A (en) Term extraction method and device
CN109086355A (en) Hot spot association relationship analysis method and system based on theme of news word
CN109815401A (en) A kind of name disambiguation method applied to Web people search
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
Ao et al. News keywords extraction algorithm based on TextRank and classified TF-IDF
Laboreiro et al. Determining language variant in microblog messages
CN112528640A (en) Automatic domain term extraction method based on abnormal subgraph detection
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
CN115617965A (en) Rapid retrieval method for language structure big data
CN108595434B (en) Syntax dependence method based on conditional random field and rule adjustment
CN111767730A (en) Event type identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190816)