CN105930375B

CN105930375B - A data mining method based on XBRL files

Info

Publication number: CN105930375B
Application number: CN201610228600.2A
Authority: CN
Inventors: 冯涛
Original assignee: Yunnan University of Finance and Economics
Current assignee: Yunnan University of Finance and Economics
Priority date: 2016-04-13
Filing date: 2016-04-13
Publication date: 2019-04-02
Anticipated expiration: 2036-04-13
Also published as: CN105930375A

Abstract

The invention discloses a data mining method based on XBRL instance documents, comprising the following steps: obtaining XBRL instance documents and storing them in the HDFS file system of a Hadoop platform; splitting the XBRL instance documents stored on the Hadoop platform into shards and, using MapReduce, parsing the XBRL instance documents on each shard and generating a corresponding Boolean matrix; partitioning the Boolean matrix into blocks, counting by an iterative algorithm the quantity of the different elements in the Boolean matrices of all blocks, obtaining the frequent items of the XBRL instance documents from these quantities, and obtaining the data of the XBRL instance documents corresponding to the frequent items. The invention uses the Hadoop platform to store massive numbers of XBRL instance documents, uses the Map/Reduce functions of the Hadoop platform to parse the XBRL instance documents and generate the corresponding Boolean matrix, and then uses Map/Reduce again to process the Boolean matrix in blocks, which reduces the amount of computation in the data mining process and increases computing speed.

Description

A data mining method based on XBRL files

Technical field

The invention belongs to the field of computer technology, and in particular relates to a data mining method based on XBRL files.

Background art

XBRL (eXtensible Business Reporting Language) is an XML-based markup language for defining and exchanging business and financial information. XBRL facilitates the preparation, analysis, and exchange of business information, and gives everyone who provides or uses financial data a low-cost, efficient service along with reliable and accurate business information. XBRL is now used ever more widely around the world: securities regulators and stock exchanges in many countries, such as the U.S. Securities and Exchange Commission (SEC), the Canadian Securities Administrators (CSA), the Toronto Stock Exchange, the South Korean stock exchange, the Tokyo Stock Exchange, the Shanghai Stock Exchange, and the Shenzhen Stock Exchange, have all adopted XBRL technology. The XBRL technical framework consists of four parts: the specification (Specification), the schema (Schema), the taxonomy (Taxonomy), and the instance document (XBRL Instance). The XBRL specification describes the structure of XBRL files and specifies in detail the syntax and semantics of XBRL taxonomies and XBRL instance documents. A taxonomy generally comprises two parts: schema definition documents (*.xsd) and linkbases (linkbase). The taxonomy files are used to provide schema validation for instance documents; an instance document must satisfy all of the rules they define. However, the sheer volume of information disclosed by listed companies greatly increases the cost for investors and analysts of searching for and processing information. How to obtain useful data effectively from disclosed information under a network-based information disclosure mechanism is therefore a problem that urgently needs to be solved in this field.

Summary of the invention

To overcome the technical defects of existing data mining on XBRL instance documents, namely heavy computation and low speed, the present invention stores massive data on the Hadoop platform, uses MapReduce functions to parse the XBRL instance documents and generate a Boolean matrix, and processes the Boolean matrix in blocks, thereby improving the efficiency of financial data mining.

The present invention provides a data mining method based on XBRL instance documents, comprising the following steps:

obtaining XBRL instance documents and storing them in the HDFS file system of a Hadoop platform;

splitting the XBRL instance documents stored on the Hadoop platform into shards, and using MapReduce to parse the XBRL instance documents on each shard and generate a corresponding Boolean matrix;

partitioning the Boolean matrix into blocks, counting by an iterative algorithm the quantity of the different elements in the Boolean matrices of all blocks, obtaining the frequent items of the XBRL instance documents from these quantities, and obtaining the data of the XBRL instance documents corresponding to the frequent items.

Further, the method also includes:

querying a preset rating database to obtain the grade and evaluation information corresponding to the frequent items in the frequent itemset, and assessing the XBRL instance document, the rating database containing the association between frequent items and evaluation information.

Further, the Boolean matrix is generated by discretizing the result of parsing the XBRL instance documents.

Further, the discretization method includes:

obtaining the range [a, b] of the financial data parsed from the XBRL instance documents, where a and b are respectively the minimum and maximum values of the same financial data item;

dividing the financial data range into intervals using the minimum entropy splitting algorithm, and recursively dividing the resulting intervals again until the range is divided into the following intervals:

$\pi_A = \{[a_0, a_1], [a_1, a_2], \ldots, [a_{k-1}, a_k]\}$, where $a_0 = a$ and $a_k = b$.

Further, the formula of the minimum entropy splitting algorithm is as follows:

where m is the number of different element sets in the financial data D obtained by parsing the XBRL instance documents, and $P_i$ is the probability of element i in D.

The interval D is split into D1 and D2 such that the information entropy after splitting D is minimized, calculated by the following formula:

In summary, the present invention uses the Hadoop platform to store massive numbers of XBRL instance documents, uses the Map/Reduce functions of the Hadoop platform to parse the XBRL instance documents and generate the corresponding Boolean matrix, and then uses Map/Reduce again to process the Boolean matrix in blocks. This reduces the amount of computation in the data mining process and increases computing speed, so that the financial data in the XBRL instance documents can be mined quickly and assessed accordingly.

Brief description of the drawings

Fig. 1 is a flow diagram of the data mining method based on XBRL instance documents according to the present invention;

Fig. 2 is a block diagram of the parsing of XBRL instance documents according to the present invention;

Fig. 3 is a flow diagram of the iterative algorithm in one embodiment of the data mining method based on XBRL instance documents according to the present invention.

Detailed description of the embodiments

The present invention is described in further detail below through specific embodiments and with reference to the drawings.

The present invention provides a data mining method based on XBRL instance documents.

The method includes the following steps:

S101: obtain XBRL instance documents and store them in the HDFS file system of the Hadoop platform.

In a specific implementation, all XBRL instance documents are optionally obtained from the internet and then stored. Because the number of XBRL instance documents is very large, storing them on a single server would put a very heavy load on that server. The present invention therefore operates on the Hadoop platform, that is, the documents are stored in the HDFS distributed file system. This approach preserves all the information in all the XBRL instance documents in full and also solves the problem of storing massive data, preparing for the subsequent extraction and analysis of massive XBRL data.
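As an illustration of the storage step, the sketch below builds the command lines that would copy downloaded instance documents into HDFS with the standard `hdfs dfs` shell. The file names and the target directory `/xbrl/instances` are hypothetical, and nothing is executed here; each command would have to be run on a machine with a configured Hadoop client.

```python
import shlex
from pathlib import PurePosixPath


def build_hdfs_upload_commands(local_files, hdfs_dir):
    """Return the `hdfs dfs` command lines needed to upload local files.

    The commands are only constructed, not executed. On a host with a
    Hadoop client, each one could be passed to subprocess.run().
    """
    # Create the target directory (and parents) if it does not exist yet.
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    for path in local_files:
        target = str(PurePosixPath(hdfs_dir) / PurePosixPath(path).name)
        # -f overwrites an existing file of the same name.
        commands.append(["hdfs", "dfs", "-put", "-f", path, target])
    return commands


# Hypothetical file names, used only for illustration.
commands = build_hdfs_upload_commands(
    ["reports/600000_2015_annual.xml", "reports/000001_2015_annual.xml"],
    "/xbrl/instances",
)
for cmd in commands:
    print(shlex.join(cmd))
```

HDFS then splits and replicates the stored files across the cluster, which is what makes the later sharded processing possible.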

S102: split the XBRL instance documents stored on the Hadoop platform into shards, and use MapReduce to parse the XBRL instance documents on each shard and generate a corresponding Boolean matrix.

The XBRL instance documents are parsed according to the taxonomy.

Fig. 2 is a schematic diagram of the parsing of an XBRL instance document. Parsing XBRL requires performing XML parsing on the XBRL instance data according to the XBRL schema file (Taxonomy.xsd) and its accompanying files such as References.xml and Presentations.xml. The taxonomy files are used to provide schema validation for the instance document, which must satisfy all of the rules they define, including: every element that appears in the instance document must be declared in the schema definition document or in another schema definition document that it references; and the content of every element that appears in the instance document must be consistent with the element type specified when the element was declared in the schema definition document. The linkbases further describe the schema definition documents, and the instance document must likewise satisfy all of the rules they define. When the Map function extracts data from the series of elements in the instance document, the linkbases provide the relationships among those data.
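The extraction that a Map function performs on one instance document can be sketched as follows. This is a minimal illustration using only the Python standard library: the sample document, the `demo` namespace, and the element names are hypothetical, and no schema or linkbase validation is performed.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical XBRL instance; real documents reference a taxonomy.
SAMPLE_INSTANCE = """
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:demo="http://example.com/demo-taxonomy">
  <xbrli:context id="FY2015">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com/id">600000</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2015-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:unit id="pure"><xbrli:measure>xbrli:pure</xbrli:measure></xbrli:unit>
  <demo:AssetLiabilityRatio contextRef="FY2015" unitRef="pure" decimals="4">0.6132</demo:AssetLiabilityRatio>
  <demo:CashRatio contextRef="FY2015" unitRef="pure" decimals="4">0.2875</demo:CashRatio>
</xbrli:xbrl>
"""


def extract_facts(xml_text):
    """Return (item name, context id, numeric value) for every fact."""
    root = ET.fromstring(xml_text.strip())
    facts = []
    for child in root:
        context = child.get("contextRef")
        if context is None:
            continue  # contexts and units carry no contextRef, so they are skipped
        local_name = child.tag.rsplit("}", 1)[-1]  # drop the namespace part
        facts.append((local_name, context, float(child.text)))
    return facts


facts = extract_facts(SAMPLE_INSTANCE)
print(facts)
```

Each tuple corresponds to one "item, value" pair of the kind the method later discretizes.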

In a specific implementation, for example, the XBRL file format currently used by the Shanghai Stock Exchange and the Shenzhen Stock Exchange covers the full content of the summaries of listed companies' periodic reports (annual, semi-annual, and quarterly reports). The "Specification for Electronic Information Disclosure by Listed Companies" was drawn up in strict accordance with the "Accounting Standards for Business Enterprises" (2006 revision) and the requirements of the China Securities Regulatory Commission's series of standards on information disclosure by listed companies, and was written in strict compliance with the XBRL 2.1 specification and the relevant provisions of FRTA (Financial Reporting Taxonomies Architecture) 1.0. The latest version of the taxonomy comprises a total of 12 schema definition files (Schema), 2679 defined elements (Element), and 36 linkbase files (Linkbase).

The XBRL instance document is as shown in table 1.

Table 1

As Table 1 shows, the data items obtained by parsing the financial data in an XBRL instance document consist of an item and a value. These data items (for example net assets per share, asset-liability ratio, net assets ratio, cash ratio, stock ratio, current assets ratio, fixed assets ratio, and total asset turnover, whose meanings are given in Table 1) are almost all continuous data and cannot directly form a transaction database, so they must first be discretized. The Reduce function is mainly used to generate the Boolean matrix from the parsed XBRL instance documents.
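How discretized records become a Boolean matrix can be sketched as below. The sketch assumes each parsed record has already been discretized into item labels of the hypothetical form `Indicator=level`; one row is produced per record and one column per distinct item.

```python
def build_boolean_matrix(transactions):
    """Turn a list of item sets into (column labels, 0/1 matrix)."""
    # Fix a stable column order over all distinct items.
    items = sorted({item for record in transactions for item in record})
    column = {item: j for j, item in enumerate(items)}
    matrix = []
    for record in transactions:
        row = [0] * len(items)
        for item in record:
            row[column[item]] = 1  # 1 means the record contains the item
        matrix.append(row)
    return items, matrix


# Hypothetical discretized records, one per instance document.
transactions = [
    {"AssetLiabilityRatio=high", "CashRatio=low"},
    {"AssetLiabilityRatio=high", "CashRatio=high"},
    {"AssetLiabilityRatio=low", "CashRatio=low"},
]
items, matrix = build_boolean_matrix(transactions)
print(items)
for row in matrix:
    print(row)
```

A column sum in this matrix is the number of records that contain the corresponding item, which is the quantity counted in the later mining step.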

MapReduce passes all the values that share the same key among the intermediate key-value pairs generated by the map function to one reduce function. In a specific implementation, the code of the method for generating the Boolean matrix is as follows:

class Mapper
    method map(String input_key, String input_value):
        // input_key: text document name
        // input_value: document contents
        for each word w in input_value:
            EmitIntermediate(w, "1");

The reduce function receives a key and the set of values associated with it, and merges these values into a smaller set of values (usually one value or none).

class Reducer
    method reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
            result += ParseInt(v);
        Emit(AsString(result));

In the word-frequency example, the key received by the map function is the file name and the value is the file's contents. The map function traverses the words one by one and, each time it encounters a word w, emits an intermediate key-value pair <w, "1">, indicating that one more occurrence of w has been found. MapReduce passes the key-value pairs that share the same key (all those whose key is the word w) to the reduce function, so the key received by the reduce function is the word w and the value is a sequence of "1"s (this is the most basic implementation and can be optimized) whose length equals the number of key-value pairs with key w. Adding up these "1"s gives the number of occurrences of w. Finally, the occurrence counts of the words can be written to a user-defined location in the underlying distributed storage system (GFS or HDFS).
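The word-frequency flow described above can be simulated on a single machine as follows. This is only a sketch of the map, shuffle, and reduce phases in plain Python; it does not use the Hadoop API, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit an intermediate (word, "1") pair for every word in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_fn(word, values):
    """Add up the "1"s received for one word."""
    return word, str(sum(int(v) for v in values))


def run_mapreduce(documents):
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


counts = run_mapreduce({"a.txt": "cash ratio cash", "b.txt": "ratio cash"})
print(counts)
```

In a real cluster the grouping step is done by the framework, and the map and reduce calls run in parallel on different nodes.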

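Further, the discretization method divides the financial data range [a, b] into the following intervals:

$\pi_A = \{[a_0, a_1], [a_1, a_2], \ldots, [a_{k-1}, a_k]\}$, where $a_0 = a$ and $a_k = b$.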
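Once the boundaries $a_0, \ldots, a_k$ are known, assigning a continuous value to its interval is a simple lookup. The sketch below uses binary search; the boundary values are hypothetical, and because adjacent intervals share an endpoint, it adopts the assumption that a shared boundary belongs to the upper interval.

```python
from bisect import bisect_right


def interval_index(value, boundaries):
    """Return i such that value falls in the i-th interval [a_i, a_(i+1)].

    `boundaries` is the sorted list [a0, a1, ..., ak]. A value equal to an
    inner boundary is assigned to the upper interval (an assumption).
    """
    if not boundaries[0] <= value <= boundaries[-1]:
        raise ValueError("value lies outside [a, b]")
    i = bisect_right(boundaries, value) - 1
    return min(i, len(boundaries) - 2)  # a_k itself belongs to the last interval


boundaries = [0.0, 0.3, 0.6, 1.0]  # hypothetical cut points for a ratio
print([interval_index(v, boundaries) for v in (0.1, 0.3, 0.59, 1.0)])
```

The resulting index can be turned into an item label, so each continuous indicator becomes one of k discrete items.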

In a specific implementation, the present invention uses a supervised minimum-entropy discretization algorithm to discretize the continuous XBRL financial data.

Entropy-based discretization is a supervised, top-down splitting technique. It uses class distribution information when computing and determining split points. To discretize a numerical attribute A, the method selects the value of A with the minimum entropy as the split point and recursively partitions the resulting intervals.

The formula of the minimum entropy splitting algorithm is as follows:

where m is the number of different element sets in the financial data D obtained by parsing the XBRL instance documents, and $P_i$ is the probability of element i in D. The goal of discretization is to choose a numerical point that splits the interval D into D1 and D2 such that the information entropy after the split is minimized, calculated by the following formula:
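For reference, in the usual textbook form of entropy-based discretization the two quantities are written as shown below. This is an assumption that the formulas intended here follow that standard form, using the symbols m and $P_i$ defined above; T, the candidate split point, is a symbol introduced only for this illustration.

```latex
H(D) = -\sum_{i=1}^{m} P_i \log_2 P_i

H(D, T) = \frac{|D_1|}{|D|}\, H(D_1) + \frac{|D_2|}{|D|}\, H(D_2)
```

Under this form, the split point chosen is the T that minimizes $H(D, T)$, and the same procedure is then applied recursively to $D_1$ and $D_2$.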

For example, in a specific implementation, for the k distinct attribute values, the probabilities are calculated, the division points are saved, and the entropy $H(p_k)$ is calculated. In the calculation, two adjacent intervals are optionally selected and merged such that the difference in entropy before and after merging is minimal; the division points are then reset and the entropy after merging is saved:

Calculate $C_{k-1} = (k_0 - 1) \cdot H(p_{k-1}) - H(p_{k_0}) \cdot (k - 2)$;

if $C_{k-1} > C_k$, decrement k and recalculate the interval probabilities.

It should be noted that, in a specific implementation, other discretization algorithms may optionally be used, such as the FFT algorithm. The advantage of the supervised minimum-entropy discretization algorithm is its simplicity, which suits the uncomplicated application scenario of the present invention, in which a single continuous financial attribute is discretized.
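A top-down version of supervised minimum-entropy splitting can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of the embodiment: the class labels (for example a known rating per company), the depth limit `max_depth`, and the use of midpoints between neighbouring values as candidate split points are all hypothetical choices.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(pairs):
    """Return the split point with minimum weighted entropy, or None.

    `pairs` is a list of (value, label) tuples sorted by value.
    """
    n = len(pairs)
    best_point, best_h = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        h = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if h < best_h:
            best_point, best_h = (pairs[i - 1][0] + pairs[i][0]) / 2, h
    return best_point


def discretize(values, labels, max_depth=2):
    """Recursively split [min, max] and return the boundaries [a0, ..., ak]."""
    cuts = []

    def recurse(pairs, depth):
        # Stop at the depth limit or when the interval holds a single class.
        if depth == 0 or len({label for _, label in pairs}) < 2:
            return
        point = best_split(pairs)
        if point is None:
            return
        cuts.append(point)
        recurse([p for p in pairs if p[0] < point], depth - 1)
        recurse([p for p in pairs if p[0] >= point], depth - 1)

    recurse(sorted(zip(values, labels)), max_depth)
    return [min(values)] + sorted(cuts) + [max(values)]


# Hypothetical asset-liability ratios in percent, with hypothetical classes.
values = [10, 15, 20, 55, 60, 65, 90, 95]
labels = ["low", "low", "low", "mid", "mid", "mid", "high", "high"]
print(discretize(values, labels))
```

With these sample values the first split separates the "low" group from the rest, and the second separates "mid" from "high", giving three intervals.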

S103: partition the Boolean matrix into blocks, count by an iterative algorithm the quantity of the different elements in the Boolean matrices of all blocks, and obtain the frequent itemsets in the XBRL instance documents from these quantities.

In a specific implementation, the present invention uses MapReduce to distribute the small matrices to the individual compute nodes for counting; that is, each node counts the elements in its own small matrix.

In a specific implementation, the block partitioning of the Boolean matrix is optionally performed using the following technique:

http://www.programgo.com/article/1901458937/

Mining frequent items requires scanning the transaction data. During a scan of the data, every frequent-itemset algorithm maintains many different count values in memory. If there is not enough memory to hold these counts, a random update to one of them may require loading a page from disk into memory. The algorithm then thrashes, which greatly reduces its running speed. The present invention therefore uses a distributed algorithm to mine the frequent items of the XBRL transaction set (that is, the Boolean matrix generated by the present invention). The XBRL transaction data set is partitioned into blocks through HDFS. A matrix can be partitioned into blocks by horizontal and vertical lines; the algorithm of the invention mines frequent itemsets from the Boolean matrix, and vertical partitioning is used here as the example.
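Vertical partitioning and per-block counting can be sketched as below. The sketch runs the "nodes" sequentially in one process and only counts single items; the function names and the sample matrix are hypothetical. Because vertical blocks cover disjoint column ranges, the per-block results can simply be merged.

```python
def split_columns(matrix, n_blocks):
    """Cut the matrix with vertical lines into at most n_blocks column blocks."""
    n_cols = len(matrix[0])
    size = -(-n_cols // n_blocks)  # ceiling division
    return [
        (start, [row[start:start + size] for row in matrix])
        for start in range(0, n_cols, size)
    ]


def count_block(start, block):
    """Work done on one node: count the 1s in each column of its block.

    Keys are global column indices, so results from different blocks
    never collide.
    """
    return {start + j: sum(row[j] for row in block) for j in range(len(block[0]))}


def count_all(matrix, n_blocks):
    totals = {}
    for start, block in split_columns(matrix, n_blocks):
        totals.update(count_block(start, block))  # merge step
    return totals


matrix = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
totals = count_all(matrix, 2)
print(totals)
```

Each node only needs counters for its own columns, which keeps the count values within memory and avoids the thrashing described above.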

An optional algorithm by which the iterative algorithm counts the quantity of the different elements in the Boolean matrices of all blocks is shown in Fig. 3. The present invention first scans the Boolean matrix to obtain the elements it contains. The elements in a matrix differ, and each may be a frequent item in the financial data, so the invention counts the elements in the Boolean matrices to obtain the total quantity of each different element and judges from that total whether the element is a frequent item; if it is, the item is added to the frequent k-itemset currently being computed. In a specific implementation, whether an element is a frequent item is decided by comparison with a preset threshold, which is specified manually as needed.
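The iteration from frequent 1-itemsets to frequent k-itemsets can be illustrated with an Apriori-style loop over the Boolean matrix. This is an assumption about the general shape of such an algorithm, not a reproduction of Fig. 3; the threshold `min_count` and the sample matrix are hypothetical.

```python
from itertools import combinations


def support(matrix, columns):
    """Number of rows that contain every item (column) in `columns`."""
    return sum(all(row[c] for c in columns) for row in matrix)


def frequent_itemsets(matrix, min_count):
    """Return {itemset (tuple of column indices): count} for all frequent itemsets."""
    n_cols = len(matrix[0])
    current = [(c,) for c in range(n_cols) if support(matrix, (c,)) >= min_count]
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(matrix, itemset)
        k += 1
        items = sorted({c for itemset in current for c in itemset})
        previous = set(current)
        # Keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = [
            cand for cand in combinations(items, k)
            if all(sub in previous for sub in combinations(cand, k - 1))
        ]
        current = [c for c in candidates if support(matrix, c) >= min_count]
    return result


matrix = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
]
result = frequent_itemsets(matrix, min_count=3)
print(result)
```

Raising or lowering `min_count` corresponds to changing the manually specified threshold mentioned above.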

Further, the method also includes:

S104: query a preset rating database to obtain the grade and evaluation information corresponding to the frequent items in the frequent itemset, and assess the XBRL instance document; the rating database contains the association between frequent items and evaluation information.
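The lookup in step S104 can be pictured as a simple key-value query. In the sketch below the rating database is modelled as an in-memory dictionary; its item labels, grades, and evaluation texts are entirely hypothetical.

```python
# Hypothetical rating database: frequent item -> (grade, evaluation information).
RATING_DB = {
    "AssetLiabilityRatio=high": ("C", "heavy debt burden"),
    "CashRatio=high": ("A", "strong short-term liquidity"),
}


def assess(frequent_items, rating_db):
    """Return the grade and evaluation for every frequent item that has a rating."""
    return {item: rating_db[item] for item in frequent_items if item in rating_db}


assessment = assess(
    ["AssetLiabilityRatio=high", "FixedAssetsRatio=low"], RATING_DB
)
print(assessment)
```

Items without an entry in the database are simply left unrated in this sketch; a production system would more likely query a persistent store.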

By mining the frequent items in the financial data with the data mining method based on XBRL instance documents, the present invention may find, in the frequent itemsets finally obtained, that certain discrete values of indicators such as a listed company's undistributed profit per share, asset-liability ratio, net assets ratio, cash ratio, stock turnover rate, net assets growth rate, and total asset turnover appear repeatedly. Non-financial indicators can then be introduced to evaluate the financial data further, for example to rate a company's financial condition or to judge whether a company has committed financial fraud. Corresponding association rules can also be constructed, for example: a high asset-liability ratio, a high share of accounts receivable, and a low growth rate of earnings per share indicate a high likelihood of financial fraud.

The foregoing is only a preferred embodiment of the present invention and is not intended to limit it; for those skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data mining method based on XBRL instance documents, characterized by comprising the following steps:

obtaining XBRL instance documents and storing them in the HDFS file system of a Hadoop platform;

splitting the XBRL instance documents stored on the Hadoop platform into shards, and using MapReduce to parse the XBRL instance documents on each shard and generate a corresponding Boolean matrix, the parsed XBRL instance documents including financial data; wherein the Boolean matrix is generated by applying a discretization algorithm to the financial data parsed from the XBRL instance documents, the discretization algorithm comprising:

obtaining the range [a, b] of the financial data parsed from the XBRL instance documents, where a and b are respectively the minimum and maximum values of the same financial data item;

obtaining a split point using the minimum entropy splitting algorithm, dividing the financial data range into intervals at the split point, and recursively dividing the resulting intervals again until the range is divided into the following intervals:

$\pi_A = \{[a_0, a_1], [a_1, a_2], \ldots, [a_{k-1}, a_k]\}$, where $a_0 = a$ and $a_k = b$;

partitioning the Boolean matrix into blocks, counting by an iterative algorithm the quantity of the different elements in the Boolean matrices of all blocks, and obtaining from these quantities the frequent itemsets in the parsed XBRL instance documents.

2. The data mining method based on XBRL instance documents according to claim 1, characterized in that the method further comprises:

querying a preset rating database to obtain the grade and evaluation information corresponding to the frequent items in the frequent itemset, and assessing the financial data corresponding to the XBRL instance documents, the rating database containing the association between frequent items and evaluation information.

3. The data mining method based on XBRL instance documents according to claim 1, characterized in that the formula of the minimum entropy splitting algorithm is as follows:

where m is the number of different element sets in the financial data D obtained by parsing the XBRL instance documents, and $P_i$ is the probability of element i in D,