CN105930375B - Data mining method based on XBRL files - Google Patents

Data mining method based on XBRL files

Info

Publication number
CN105930375B
Authority
CN
China
Prior art keywords
instance document
xbrl instance
xbrl
data
financial data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610228600.2A
Other languages
Chinese (zh)
Other versions
CN105930375A (en)
Inventor
冯涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University of Finance and Economics
Original Assignee
Yunnan University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University of Finance and Economics
Priority to CN201610228600.2A
Publication of CN105930375A
Application granted
Publication of CN105930375B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention discloses a data mining method based on XBRL instance documents, comprising the following steps: obtaining XBRL instance documents and storing them in the HDFS file system of a Hadoop platform; splitting the XBRL instance documents stored on the Hadoop platform into shards, and using MapReduce to parse the XBRL instance documents on each shard and generate the corresponding Boolean matrix; partitioning the Boolean matrix into blocks, counting with an iterative algorithm the number of occurrences of the different elements in the Boolean matrices of all blocks, obtaining the frequent items of the XBRL instance documents from those counts, and obtaining the data of the XBRL instance documents that correspond to the frequent items. The invention stores massive numbers of XBRL instance documents on the Hadoop platform, uses the platform's Map/Reduce functions to parse the XBRL instance documents and generate the corresponding Boolean matrix, and then uses Map/Reduce again to process the Boolean matrix block by block, which reduces the amount of computation in the data mining process and increases its speed.

Description

Data mining method based on XBRL files
Technical field
The invention belongs to the field of computer technology, and in particular relates to a data mining method based on XBRL files.
Background art
XBRL (eXtensible Business Reporting Language) is an XML-based markup language for defining and exchanging business and financial information. XBRL facilitates the preparation, analysis, and exchange of business information, and gives everyone who provides or uses financial data low-cost, efficient service and reliable, accurate business information. XBRL is now in increasingly wide use around the world: securities regulators and stock exchanges in many countries use XBRL technology, including the U.S. Securities and Exchange Commission (SEC), the Canadian Securities Administrators (CSA), the Toronto Stock Exchange, the stock exchange of South Korea, the Tokyo Stock Exchange, the Shanghai Stock Exchange, and the Shenzhen Stock Exchange. The XBRL technical framework consists of four parts: the specification (Specification), the schema (Schema), the taxonomy (Taxonomy), and the instance document (XBRL Instance). The XBRL specification describes the structure of XBRL files and defines in detail the syntax and semantics of XBRL taxonomies and XBRL instance documents. A taxonomy generally comprises two parts: schema definition documents (*.xsd) and linkbases. The taxonomy files are used for schema validation of instance documents, and an instance document must satisfy every rule they define. However, the large volume of detailed information disclosed by listed companies greatly increases the cost to investors and analysts of searching for and processing information. How to obtain useful data efficiently from disclosed information under an online disclosure mechanism is therefore a problem in urgent need of a solution in this field.
Summary of the invention
To overcome the technical defects of existing data mining on XBRL instance documents, namely heavy computation and low speed, the present invention stores massive data on a Hadoop platform, uses MapReduce functions to parse the XBRL instance documents and generate a Boolean matrix, and processes the Boolean matrix block by block, thereby improving the efficiency of financial data mining.
The present invention provides a data mining method based on XBRL instance documents, comprising the following steps:
obtaining XBRL instance documents, and storing the XBRL instance documents in the HDFS file system of a Hadoop platform;
splitting the XBRL instance documents stored on the Hadoop platform into shards, and using MapReduce to parse the XBRL instance documents on each shard and generate the corresponding Boolean matrix;
partitioning the Boolean matrix into blocks, counting with an iterative algorithm the number of occurrences of the different elements in the Boolean matrices of all blocks, obtaining the frequent items of the XBRL instance documents from those counts, and obtaining the data of the XBRL instance documents corresponding to the frequent items.
Further, the method also comprises:
querying a preset rating database to obtain the grade and evaluation information corresponding to the frequent items in the frequent itemset, and assessing the XBRL instance documents, where the rating database contains the associations between frequent items and evaluation information.
Further, the Boolean matrix is generated after discretizing the result of parsing the XBRL instance documents.
Further, the discretization method comprises:
obtaining the range [a, b] of the financial data parsed from the XBRL instance documents, where a and b are respectively the minimum and maximum values of the same financial data item;
dividing the financial data range into intervals with the minimum-entropy splitting algorithm, and recursively dividing the resulting sub-ranges again, until the range is divided into the following intervals:
$\pi_A = \{[a_0, a_1], [a_1, a_2], \ldots, [a_{k-1}, a_k]\}$, where $a_0 = a$ and $a_k = b$.
Further, the formula of the minimum-entropy splitting algorithm is the information entropy:
$$H(D) = -\sum_{i=1}^{m} P_i \log_2 P_i$$
where m is the number of distinct element classes in the financial data D obtained by parsing the XBRL instance documents, and $P_i$ is the probability of element i in D.
The interval D is split into D1 and D2 so that the information entropy of D after the split is minimized; it is computed as:
$$H(D_1, D_2) = \frac{|D_1|}{|D|} H(D_1) + \frac{|D_2|}{|D|} H(D_2)$$
In summary, the present invention stores massive numbers of XBRL instance documents on a Hadoop platform, uses the platform's Map/Reduce functions to parse the XBRL instance documents and generate the corresponding Boolean matrix, and then uses Map/Reduce again to process the Boolean matrix block by block. This reduces the amount of computation in the data mining process and increases its speed, so that the financial data in the XBRL instance documents can be mined quickly and assessed accordingly.
Brief description of the drawings
Fig. 1 is a flow diagram of the data mining method based on XBRL instance documents according to the present invention;
Fig. 2 is a block diagram of the parsing of XBRL instance documents according to the present invention;
Fig. 3 is a flow diagram of the iterative algorithm in one embodiment of the data mining method based on XBRL instance documents according to the present invention.
Detailed description of the embodiments
The present invention is described in further detail below through specific embodiments and with reference to the drawings.
The present invention provides a data mining method based on XBRL instance documents.
The method comprises the following steps:
S101: obtain XBRL instance documents, and store the XBRL instance documents in the HDFS file system of a Hadoop platform.
In a specific implementation, all XBRL instance documents may optionally be obtained from the internet and then stored. Because the number of XBRL instance documents is very large, storing them on a single server would put a heavy load on that server. The present invention therefore uses a Hadoop platform and stores the documents in the HDFS distributed file system. This approach preserves all the information in all the XBRL instance documents, solves the problem of storing massive data, and prepares for the next step of extracting and analyzing massive XBRL data.
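As a minimal sketch of the storage step, assuming the Hadoop command-line client is available, the following Python builds the `hdfs dfs` commands that would copy local XBRL files into an HDFS directory. The function name `build_hdfs_upload_commands`, the file names, and the target directory are hypothetical; the patent does not say how the files are uploaded.

```python
from pathlib import PurePosixPath


def build_hdfs_upload_commands(local_files, hdfs_dir):
    """Return argv lists that create hdfs_dir and copy each local file into it.

    Nothing is executed here; the caller decides how to run the commands.
    """
    # `hdfs dfs -mkdir -p` creates the target directory if it is missing.
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    for path in local_files:
        # Keep only the file name so every document lands directly in hdfs_dir.
        target = str(PurePosixPath(hdfs_dir) / PurePosixPath(path).name)
        # `-f` overwrites an existing copy of the same document.
        commands.append(["hdfs", "dfs", "-put", "-f", path, target])
    return commands
```

On a machine with a configured Hadoop client, each returned list could be passed to `subprocess.run(argv, check=True)`.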
S102: split the XBRL instance documents stored on the Hadoop platform into shards, and use MapReduce to parse the XBRL instance documents on each shard and generate the corresponding Boolean matrix.
The XBRL instance documents are parsed according to the taxonomy.
Fig. 2 shows a schematic of the result of parsing an XBRL instance document. Parsing XBRL requires performing XML parsing on the XBRL instance data according to the XBRL schema file (Taxonomy.xsd) and its companion files such as References.xml and Presentations.xml. The taxonomy files are used for schema validation of the instance document, and the instance document must satisfy every rule they define, including: each element that appears in the instance document must be declared in the schema definition document or in another schema definition document that it references; and the content of each element that appears in the instance document must be consistent with the element type specified when that element was declared in the schema definition document. The linkbases further describe the schema definition document, and the instance document must likewise satisfy every rule they define. When the Map function extracts data from the elements of an instance document, the linkbases provide the relationships among those data.
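The extraction that a Map function performs on one instance document can be sketched as follows. This is an illustration only: it reads numeric facts with Python's standard XML parser and does no taxonomy or linkbase validation. The sample document, the `ex` namespace, and the concept names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature instance document used only for illustration.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:ex="http://example.com/taxonomy">
  <xbrli:context id="FY2015">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com">600000</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2015-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <ex:AssetLiabilityRatio contextRef="FY2015" unitRef="pure" decimals="4">0.6231</ex:AssetLiabilityRatio>
  <ex:CompanyName contextRef="FY2015">Example Co.</ex:CompanyName>
</xbrli:xbrl>"""


def extract_facts(xbrl_text):
    """Return (concept, contextRef, value) for every numeric fact."""
    root = ET.fromstring(xbrl_text)
    facts = []
    for child in root:
        context = child.get("contextRef")
        if context is None or child.text is None:
            continue  # contexts, units and schema references are not facts
        tag = child.tag
        # ElementTree writes namespaced tags as "{namespace}local".
        local = tag.partition("}")[2] if tag.startswith("{") else tag
        try:
            value = float(child.text.strip())
        except ValueError:
            continue  # skip non-numeric (text) facts
        facts.append((local, context, value))
    return facts
```

Calling `extract_facts(SAMPLE)` yields the single numeric fact in the sample; a real pipeline would also check each element against the taxonomy before using it.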
Further, the Boolean matrix is generated after discretizing the result of parsing the XBRL instance documents.
In a specific implementation, for example, the XBRL file format currently used by the Shanghai Stock Exchange and the Shenzhen Stock Exchange covers the full content of the summaries of listed companies' periodic reports (annual, semi-annual, and quarterly reports). The "Electronic Specification for Information Disclosure by Listed Companies" was drawn up in strict accordance with the "Accounting Standards for Business Enterprises" (2006 revision) and the requirements of the China Securities Regulatory Commission's series of rules on information disclosure by publicly listed companies, and was written in strict compliance with the XBRL 2.1 specification and FRTA (Financial Reporting Taxonomies Architecture) 1.0. The latest version of the taxonomy's common schema has 12 schema definition files (Schema), defines 2,679 elements (Element), and has 36 linkbase files (Linkbase).
The XBRL instance document is as shown in Table 1.
Table 1
As Table 1 shows, the data items obtained by parsing the financial data in an XBRL instance document consist of an item and a numerical value. For financial data, the items include, for example, net assets per share, asset-liability ratio, net assets ratio, cash ratio, inventory ratio, current assets ratio, fixed assets ratio, and total asset turnover (their meanings are given in Table 1). Nearly all of these are continuous data and cannot form a transaction database directly, so they must first be discretized. The Reduce function is mainly used to generate the Boolean matrix from the parsed XBRL instance documents.
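To make the step from continuous values to a Boolean matrix concrete, the sketch below assumes that each indicator already has a list of interior cut points (however they were obtained) and encodes each document as one row, with one column per (indicator, interval) pair. The indicator names, cut points, and function names are hypothetical.

```python
from bisect import bisect_right


def interval_index(value, cut_points):
    """Index of the interval containing value, given sorted interior cut points."""
    return bisect_right(cut_points, value)


def to_boolean_matrix(records, cut_points):
    """Encode records as a 0/1 matrix.

    records: list of dicts {indicator: value}, one dict per instance document.
    cut_points: dict {indicator: sorted list of interior cut points}.
    Returns (items, matrix); items[j] is the (indicator, interval) of column j.
    """
    items = [(name, i)
             for name in sorted(cut_points)
             for i in range(len(cut_points[name]) + 1)]
    column = {item: j for j, item in enumerate(items)}
    matrix = []
    for record in records:
        row = [0] * len(items)
        for name, value in record.items():
            if name in cut_points:
                row[column[(name, interval_index(value, cut_points[name]))]] = 1
        matrix.append(row)
    return items, matrix
```

Each row then plays the role of one transaction, and each column that is set to 1 is an item of that transaction.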
MapReduce passes the values that share the same key among the intermediate key-value pairs produced by the map function to one reduce function. In a specific implementation, the code pattern for generating the Boolean matrix follows the word-count example below:
Class Mapper
  method map(String input_key, String input_value):
    // input_key: text document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");
The reduce function receives a key and the set of values associated with it, and merges those values into a smaller set of values (usually one value or none).
Class Reducer
  method reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
In this word-count example, the key received by the map function is a file name and the value is the file's content. The map function traverses the words one by one, and each time it encounters a word w it emits an intermediate key-value pair <w, "1">, meaning that one more occurrence of w has been found. MapReduce passes the key-value pairs with the same key (all pairs whose key is the word w) to the reduce function, so the key received by the reduce function is the word w and the value is a sequence of "1"s (this is the most basic implementation and can be optimized) whose length equals the number of key-value pairs with key w. Summing these "1"s gives the number of occurrences of w. Finally, the occurrence counts are written to a user-defined location and stored in the underlying distributed storage system (GFS or HDFS).
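The map, shuffle, and reduce phases described above can be simulated in a single process. The sketch below is not Hadoop code; it only mirrors the data flow of the word-count example, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    """Map phase: emit (word, 1) for every word in one document."""
    for word in contents.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Run map, group the intermediate pairs by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_words(name, contents):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())
```

For instance, `run_word_count({"a.txt": "xbrl data xbrl"})` groups two pairs under the key `xbrl` before the reduce step sums them.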
Further, the discretization method comprises:
obtaining the range [a, b] of the financial data parsed from the XBRL instance documents, where a and b are respectively the minimum and maximum values of the same financial data item;
dividing the financial data range into intervals with the minimum-entropy splitting algorithm, and recursively dividing the resulting sub-ranges again, until the range is divided into the following intervals:
$\pi_A = \{[a_0, a_1], [a_1, a_2], \ldots, [a_{k-1}, a_k]\}$, where $a_0 = a$ and $a_k = b$.
In a specific implementation, the present invention uses a minimum-entropy supervised discretization algorithm to discretize the continuous XBRL financial data.
Entropy-based discretization is a supervised, top-down splitting technique. It uses class distribution information when computing and determining split points. To discretize a numeric attribute A, the method selects the value of A with the minimum entropy as the split point and recursively partitions the resulting intervals.
The formula of the minimum-entropy splitting algorithm is the information entropy:
$$H(D) = -\sum_{i=1}^{m} P_i \log_2 P_i$$
where m is the number of distinct element classes in the financial data D obtained by parsing the XBRL instance documents, and $P_i$ is the probability of element i in D. The goal of discretization is to choose a numerical point that splits the interval D into D1 and D2 so that the information entropy of D after the split is minimized; it is computed as:
$$H(D_1, D_2) = \frac{|D_1|}{|D|} H(D_1) + \frac{|D_2|}{|D|} H(D_2)$$
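A minimal sketch of entropy-based splitting is given below, assuming that every value carries a class label, as a supervised method requires. The patent does not say where the labels come from, so the labels, the `max_depth` stopping rule, and the function names are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split(values, labels):
    """Return (cut, weighted_entropy) with the lowest post-split entropy, or None."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or info < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, info)
    return best


def discretize(values, labels, max_depth=2):
    """Recursively split and return the sorted interior cut points."""
    if max_depth == 0 or len(set(labels)) < 2:
        return []
    found = best_split(values, labels)
    if found is None:
        return []
    cut = found[0]
    left = [(v, lab) for v, lab in zip(values, labels) if v <= cut]
    right = [(v, lab) for v, lab in zip(values, labels) if v > cut]
    return sorted(
        discretize([v for v, _ in left], [lab for _, lab in left], max_depth - 1)
        + [cut]
        + discretize([v for v, _ in right], [lab for _, lab in right], max_depth - 1)
    )
```

The recursion stops when an interval contains a single class or the depth limit is reached; other stopping rules, such as a minimum information gain, are equally possible.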
For example, in a specific implementation, for the k distinct attribute values, the probabilities are computed, the division points are saved, and the entropy $H(p_k)$ is computed. During the calculation, two adjacent intervals may be selected and merged such that the difference in entropy before and after the merge is minimal; the division points are then reset and the entropy after merging is saved:
compute $C_{k-1} = (k_0 - 1) \cdot H(p_{k-1}) - H(p_{k_0}) \cdot (k - 2)$;
if ($C_{k-1} > C_k$) { k--; recompute the interval probabilities. }
It should be noted that other discretization algorithms, such as the FFT algorithm, may optionally be used in a specific implementation. The advantage of the minimum-entropy supervised discretization algorithm is its simplicity, which suits the uncomplicated application scenario of the present invention: discretizing a single continuous financial attribute at a time.
S103: partition the Boolean matrix into blocks, count with an iterative algorithm the number of occurrences of the different elements in the Boolean matrices of all blocks, and obtain the frequent itemsets in the XBRL instance documents from those counts.
In a specific implementation, the present invention uses MapReduce to distribute the small matrices to the compute nodes for counting; that is, each node counts the elements in its own small matrix.
In a specific implementation, the block partitioning of the Boolean matrix may optionally use the technique described at:
http://www.programgo.com/article/1901458937/
Mining frequent items requires scanning the transaction data. While scanning the data, every frequent-itemset algorithm maintains many different counts in memory. If there is not enough memory to hold these counts, a random update to any count may require loading a page from disk into memory. The algorithm then thrashes, which greatly reduces its running speed. The present invention therefore uses a distributed algorithm to mine the frequent items of the XBRL transaction set (that is, the Boolean matrix generated by the present invention). The XBRL transaction data set is partitioned into blocks through HDFS. A block matrix is partitioned along horizontal and vertical lines; the algorithm of the invention mines frequent itemsets from the Boolean matrix, and vertical partitioning is used here as the example.
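Vertical partitioning and per-block counting can be sketched in one process as follows. Each call to `count_block` stands in for the work of one compute node, and the final merge stands in for the reduce step; the block width and function names are hypothetical.

```python
from collections import Counter


def vertical_blocks(matrix, block_width):
    """Yield (start_column, block), each block holding at most block_width columns."""
    n_cols = len(matrix[0]) if matrix else 0
    for start in range(0, n_cols, block_width):
        yield start, [row[start:start + block_width] for row in matrix]


def count_block(start, block):
    """Count the 1s in every column of one block, keyed by global column index."""
    counts = Counter()
    for row in block:
        for offset, bit in enumerate(row):
            if bit:
                counts[start + offset] += 1
    return counts


def count_all(matrix, block_width):
    """Merge the per-block counters into one total per column."""
    total = Counter()
    for start, block in vertical_blocks(matrix, block_width):
        total.update(count_block(start, block))
    return total
```

Because each block touches only its own columns, a node needs to keep in memory only the counters for that block, which is the point of partitioning.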
The algorithm by which the present invention counts the occurrences of the different elements in the Boolean matrices of all blocks may optionally be as shown in Fig. 3. The invention first scans the Boolean matrix to obtain its elements. The matrix contains different elements, and each element may be a frequent item in the financial data. The invention therefore counts the elements in the Boolean matrices to obtain the total count of each distinct element, and judges from that total whether the element is a frequent item; if it is, the item is added to the frequent k-itemset currently being computed. In a specific implementation, whether an element is a frequent item is decided by comparison with a preset threshold, which is specified manually as needed.
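Fig. 3 is not reproduced in this text, so the following is only an Apriori-style sketch of an iterative, threshold-based search for frequent k-itemsets over a Boolean matrix. It is one common way to realize such a loop, not necessarily the algorithm of Fig. 3, and `min_support` plays the role of the manually specified threshold.

```python
from itertools import combinations


def support(matrix, columns):
    """Number of rows in which every column in `columns` is 1."""
    return sum(all(row[c] for c in columns) for row in matrix)


def frequent_itemsets(matrix, min_support):
    """Level-wise search; returns {tuple_of_columns: support}."""
    n_cols = len(matrix[0]) if matrix else 0
    current = [(c,) for c in range(n_cols) if support(matrix, (c,)) >= min_support]
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(matrix, itemset)
        k += 1
        items = sorted({c for itemset in current for c in itemset})
        previous = set(current)
        # A k-itemset is a candidate only if all of its (k-1)-subsets are frequent.
        current = [
            cand for cand in combinations(items, k)
            if all(sub in previous for sub in combinations(cand, k - 1))
            and support(matrix, cand) >= min_support
        ]
    return result
```

The loop ends when no candidate of the next size reaches the threshold, so the number of iterations is bounded by the size of the largest frequent itemset plus one.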
Further, the method also comprises:
S104: query a preset rating database to obtain the grade and evaluation information corresponding to the frequent items in the frequent itemset, and assess the XBRL instance documents, where the rating database contains the associations between frequent items and evaluation information.
By mining the frequent items in financial data with the data mining method based on XBRL instance documents, the present invention may find, from the frequent itemsets finally obtained, that certain discrete values of indicators such as a listed company's undistributed profit per share, asset-liability ratio, net assets ratio, cash ratio, inventory turnover, net asset growth rate, and total asset turnover appear repeatedly in the frequent itemsets. Non-financial indicators can then be introduced to evaluate the financial data further, for example to rate a company's financial condition or to judge whether a company has committed financial fraud. Corresponding association rules can also be constructed, for example: a high asset-liability ratio, a high share of accounts receivable, and a low growth rate of earnings per share indicate a high likelihood of financial fraud.
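The rating lookup can be pictured as a table from sets of frequent items to a grade and a comment. The sketch below uses an in-memory dictionary in place of a real database; the item labels, grades, and comments are invented for illustration and carry no financial meaning.

```python
# Hypothetical rating table: antecedent items -> (grade, evaluation).
RATING_DB = {
    frozenset({"asset_liability_ratio=high"}): ("C", "high leverage"),
    frozenset({
        "asset_liability_ratio=high",
        "receivables_share=high",
        "eps_growth=low",
    }): ("D", "elevated likelihood of financial fraud"),
}


def assess(frequent_itemset, rating_db):
    """Return every (grade, evaluation) whose antecedent is contained in the itemset."""
    items = set(frequent_itemset)
    return sorted(
        rating for antecedent, rating in rating_db.items() if antecedent <= items
    )
```

An entry matches when all of its antecedent items appear in the mined itemset, which mirrors the association rule given as an example above.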
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes to the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (3)

1. A data mining method based on XBRL instance documents, characterized by comprising the following steps:
obtaining XBRL instance documents, and storing the XBRL instance documents in the HDFS file system of a Hadoop platform;
splitting the XBRL instance documents stored on the Hadoop platform into shards, and using MapReduce to parse the XBRL instance documents on each shard and generate the corresponding Boolean matrix, the parsed XBRL instance documents containing financial data; wherein the Boolean matrix is generated by applying a discretization algorithm to the financial data parsed from the XBRL instance documents; the discretization algorithm comprises:
obtaining the range [a, b] of the financial data parsed from the XBRL instance documents, where a and b are respectively the minimum and maximum values of the same financial data item;
obtaining a split point with the minimum-entropy splitting algorithm, dividing the financial data range into intervals at the split point, and recursively dividing the resulting sub-ranges again, until the range is divided into the following intervals:
$\pi_A = \{[a_0, a_1], [a_1, a_2], \ldots, [a_{k-1}, a_k]\}$, where $a_0 = a$ and $a_k = b$;
partitioning the Boolean matrix into blocks, counting with an iterative algorithm the number of occurrences of the different elements in the Boolean matrices of all blocks, and obtaining the frequent itemsets in the parsed XBRL instance documents from those counts.
2. The data mining method based on XBRL instance documents according to claim 1, characterized in that the method further comprises:
querying a preset rating database to obtain the grade and evaluation information corresponding to the frequent items in the frequent itemset, and assessing the financial data corresponding to the XBRL instance documents, where the rating database contains the associations between frequent items and evaluation information.
3. The data mining method based on XBRL instance documents according to claim 1, characterized in that the formula of the minimum-entropy splitting algorithm is the information entropy:
$$H(D) = -\sum_{i=1}^{m} P_i \log_2 P_i$$
where m is the number of distinct element classes in the financial data D obtained by parsing the XBRL instance documents, and $P_i$ is the probability of element i in D;
the interval D is split into D1 and D2 so that the information entropy of D after the split is minimized, computed as:
$$H(D_1, D_2) = \frac{|D_1|}{|D|} H(D_1) + \frac{|D_2|}{|D|} H(D_2)$$
CN201610228600.2A 2016-04-13 2016-04-13 Data mining method based on XBRL files Expired - Fee Related CN105930375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610228600.2A CN105930375B (en) 2016-04-13 2016-04-13 Data mining method based on XBRL files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610228600.2A CN105930375B (en) 2016-04-13 2016-04-13 Data mining method based on XBRL files

Publications (2)

Publication Number Publication Date
CN105930375A CN105930375A (en) 2016-09-07
CN105930375B true CN105930375B (en) 2019-04-02

Family

ID=56838159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610228600.2A Expired - Fee Related CN105930375B (en) 2016-04-13 2016-04-13 Data mining method based on XBRL files

Country Status (1)

Country Link
CN (1) CN105930375B (en)

Also Published As

Publication number Publication date
CN105930375A (en) 2016-09-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190402

Termination date: 20200413
