CN108073678A - Document parsing method, system and device for big data analysis - Google Patents

Document parsing method, system and device for big data analysis

Info

Publication number
CN108073678A
Authority
CN
China
Prior art keywords
index
financial
data
financial statement
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711080717.1A
Other languages
Chinese (zh)
Other versions
CN108073678B (en)
Inventor
陈贤耿
纪晓阳
伍紫莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Industry Kaiyuan Science And Technology Co Ltd
Original Assignee
Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Industry Kaiyuan Science And Technology Co Ltd filed Critical Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority to CN201711080717.1A priority Critical patent/CN108073678B/en
Publication of CN108073678A publication Critical patent/CN108073678A/en
Application granted granted Critical
Publication of CN108073678B publication Critical patent/CN108073678B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12Accounting
    • G06Q40/125Finance or payroll

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • Databases & Information Systems (AREA)
  • Accounting & Taxation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention discloses a document parsing method, system and device. The method includes: using regular expression rules for financial indicators, start feature indicators and end feature indicators to locate the financial statements in documents of different formats; locating the data within the financial statements and recording the financial data together with the corresponding indicator names and times; and recording numeric data after unit conversion. The system includes a construction unit, an acquisition unit, a first positioning unit, a second positioning unit and a conversion unit. The device includes a memory and a processor; when the program is executed by the processor, the processor implements the document parsing method. The invention can parse financial data in documents of various formats quickly and accurately, improving the compatibility, comprehensiveness, accuracy and processing efficiency of the parsing scheme. The invention is broadly applicable, as a document parsing method, system and device, in the field of big data analysis.

Description

Document parsing method, system and device for big data analysis
Technical field
The present invention relates to big data analysis technology, and in particular to a document parsing method, system and device for use in big data analysis.
Background
Explanation of technical terms:
Regular expression: a single string that describes, and is used to match, a set of strings conforming to some syntactic rule.
Balance sheet: the main accounting statement showing an enterprise's financial position (that is, its assets, liabilities and owners' equity) on a given date, usually the end of an accounting period.
Income statement (profit statement): a statement reflecting an enterprise's operating results during an accounting period.
Cash flow statement: a statement reflecting the inflows and outflows of an enterprise's cash and cash equivalents during an accounting period.
In the field of corporate finance big data analysis, much financial data has to be extracted from documents such as the annual reports a company discloses or the prospectuses it issues when raising debt, and the accuracy requirements for the extracted data are very high. These documents are usually saved in PDF format, so existing schemes for parsing documents to extract data have been studied only for PDF documents; that is, existing document parsing schemes work only on PDF documents. In practice, however, financial data documents may also be saved in other formats, such as WORD or EXCEL, and existing document parsing schemes cannot be applied to formats other than PDF. This restricts the data sources available for document parsing and reduces the compatibility and comprehensiveness of its application. A scheme that can parse documents in a variety of formats quickly and accurately is therefore an urgent need.
Summary of the invention
To solve the above technical problem, the object of the present invention is to provide a document parsing method, system and device for use in big data analysis that can parse financial documents in a variety of formats quickly and accurately.
The first technical solution adopted by the present invention is a document parsing method for use in big data analysis, comprising the following steps:
Building the regular expression rules for financial indicators;
Obtaining the start feature indicators and end feature indicators of the financial statements;
Using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to locate the financial statements in documents of different formats;
Locating the data in the financial statements, then recording the financial data and the indicator name and time corresponding to the financial data;
Performing unit conversion on the numeric data, then recording the converted data.
The second technical solution adopted by the present invention is a document parsing system for use in big data analysis, comprising:
A construction unit, for building the regular expression rules for financial indicators;
An acquisition unit, for obtaining the start feature indicators and end feature indicators of the financial statements;
A first positioning unit, for using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to locate the financial statements in documents of different formats;
A second positioning unit, for locating the data in the financial statements and then recording the financial data and the indicator name and time corresponding to the financial data;
A conversion unit, for performing unit conversion on the numeric data and then recording the converted data.
The third technical solution adopted by the present invention is a document parsing device for use in big data analysis, comprising:
At least one processor;
At least one memory, for storing at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the document parsing method for big data analysis described in the first technical solution above.
The beneficial effects of the method, system and device of the present invention are as follows. The invention uses the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to locate the financial statements in documents of different formats; it then locates the data in the financial statements and records the financial data together with the corresponding indicator names and times; and it records numeric data after unit conversion. The invention can therefore parse the financial data in documents of various formats quickly and accurately, yielding highly accurate financial data with the corresponding indicator names and times, and greatly improving the compatibility, comprehensiveness, accuracy and processing efficiency of schemes that parse financial data from documents.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of a document parsing method for big data analysis according to the present invention;
Fig. 2 is a structural block diagram of a document parsing system for big data analysis according to the present invention;
Fig. 3 is a flow chart of the steps of a specific embodiment of a document parsing method for big data analysis according to the present invention.
Detailed description of the embodiments
Embodiment 1
As shown in Fig. 1, this embodiment provides a document parsing method for use in big data analysis, comprising the following steps:
Building the regular expression rules for financial indicators;
Obtaining the start feature indicators and end feature indicators of the financial statements;
Using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to locate the financial statements in documents of different formats;
Locating the data in the financial statements, then recording the financial data and the indicator name and time corresponding to the financial data;
Performing unit conversion on the numeric data, then recording the converted data.
As a further preferred implementation of this embodiment, the step of building the regular expression rules for financial indicators specifically includes:
Obtaining a standard name library, in which the standard names of the financial indicators are stored;
Building an actual name library, in which the actual names of the financial indicators extracted from documents are stored;
Establishing the name mapping between the actual names in the actual name library and the standard names in the standard name library;
Building the regular expression rules for financial indicators according to the name mapping.
As a further preferred implementation of this embodiment, the step of obtaining the start feature indicators and end feature indicators of the financial statements specifically includes:
Sampling multiple documents that contain financial statements;
Extracting indicators from the starting content of the financial statement in each document, then sorting the indicators in descending order of their frequency of occurrence and selecting the first m1 indicators to build the start feature indicator list;
Extracting indicators from the ending content of the financial statement in each document, then sorting the indicators in descending order of their frequency of occurrence and selecting the first m2 indicators to build the end feature indicator list.
As a further preferred implementation of this embodiment, the step of using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to locate the financial statements in documents of different formats specifically includes:
Assigning a corresponding ID to each financial statement table in the document, according to the order in which the tables appear;
Using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to analyze each financial statement table in the document and thereby locate the start table and the end table, where the start table is the financial statement table at the starting position of the financial statement and the end table is the financial statement table at the end position of the financial statement;
According to the ID of the start table and the ID of the end table, marking the data in the financial statement tables corresponding to all IDs between the two as financial data of that financial statement type.
As a further preferred implementation of this embodiment, the step of using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to analyze each financial statement table in the document and thereby locate the start table and the end table specifically includes:
When the start flag of the table is false, using the regular expression rules for financial indicators to identify and extract n1 indicators from the starting content of the current financial statement table, matching the n1 extracted indicators against the indicators in the start feature indicator list to obtain a first matching rate, and, when the first matching rate exceeds a first threshold, determining that the current financial statement table is the start table;
When the start flag of the table is true, using the regular expression rules for financial indicators to identify and extract n2 indicators from the ending content of the current financial statement table, matching the n2 extracted indicators against the indicators in the end feature indicator list to obtain a second matching rate, and, when the second matching rate exceeds a second threshold, determining that the current financial statement table is the end table;
where n1 < m1 and n2 < m2.
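The start-flag logic above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the feature lists, the indicator pattern, the thresholds and the names `match_rate` and `locate_statement` are all hypothetical, and testing a freshly found start table for the end condition as well is an added assumption so that a statement contained in a single table can still be closed.

```python
import re

# Hypothetical feature lists (m1 = 3, m2 = 2); the patent derives them from sampled documents.
START_FEATURES = {"monetary funds", "accounts receivable", "inventories"}
END_FEATURES = {"total liabilities", "total owners' equity"}
# Hypothetical stand-in for the regular expression rules of the financial indicators.
INDICATOR_RE = re.compile(
    r"monetary funds|accounts receivable|inventories|total liabilities|total owners' equity"
)


def match_rate(text, features, n):
    """Extract up to n indicators from text and return the share found in features."""
    found = INDICATOR_RE.findall(text)[:n]
    if not found:
        return 0.0
    return sum(1 for name in found if name in features) / len(found)


def locate_statement(tables, n1=2, n2=1, first_threshold=0.5, second_threshold=0.5):
    """tables is a list of (starting_content, ending_content); the list index is the table ID.

    Returns (start_id, end_id), or None when no complete statement is found.
    """
    started = False  # the start flag described in the text
    start_id = None
    for table_id, (head, tail) in enumerate(tables):
        if not started and match_rate(head, START_FEATURES, n1) > first_threshold:
            started, start_id = True, table_id
        # Assumption: the end check also runs on the table that just set the flag.
        if started and match_rate(tail, END_FEATURES, n2) > second_threshold:
            return start_id, table_id
    return None


tables = [
    ("cover page text", "notes"),
    ("monetary funds accounts receivable", "inventories"),
    ("fixed assets", "total liabilities total owners' equity"),
    ("revenue", "net profit"),
]
span = locate_statement(tables)
```

Every table whose ID lies between the two returned IDs would then be marked as belonging to the located financial statement.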
As a further preferred implementation of this embodiment, the step of locating the data in the financial statements and then recording the financial data and the indicator name and time corresponding to the financial data specifically includes:
Establishing a first mapping between the indicator names corresponding to the data in a financial statement table and the row numbers in which they appear;
Establishing a second mapping between the time information corresponding to the data in a financial statement table and the column numbers in which it appears;
Using the row and column numbers of each data item in the financial statement table, together with the first mapping and the second mapping, to record the financial data and the indicator name and time corresponding to the financial data.
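A minimal Python sketch of the two mappings follows. It assumes, purely for illustration, that the first row of a table holds the time labels and the first column holds the indicator names; the function name `extract_records` and the sample values are hypothetical.

```python
def extract_records(table):
    """Return (indicator_name, time, value) triples from a table given as a list of rows."""
    header, *body = table
    # Second mapping: column number -> time information (column 0 holds indicator names).
    col_to_time = {c: label for c, label in enumerate(header) if c > 0}
    # First mapping: row number -> indicator name.
    row_to_name = {r: row[0] for r, row in enumerate(body, start=1)}

    records = []
    for r, row in enumerate(body, start=1):
        for c, value in enumerate(row):
            # Skip the name column, columns without a time label, and empty cells.
            if c in col_to_time and value != "":
                records.append((row_to_name[r], col_to_time[c], value))
    return records


sample = [
    ["Item", "2016-12-31", "2015-12-31"],
    ["Monetary funds", "100", "90"],
    ["Accounts receivable", "", "40"],
]
records = extract_records(sample)
```

Each cell is resolved through its row number and column number alone, so the same routine applies regardless of which document format the table was read from.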
As a further preferred implementation of this embodiment, the step of performing unit conversion on the numeric data and then recording the converted data specifically includes:
Using the regular expression rules for unit information to identify the unit information of the financial statement;
According to the identified unit information, performing unit conversion on the numeric data so that it is converted into financial data expressed in yuan, and recording it.
As a further preferred implementation of this embodiment, the step of using the regular expression rules for unit information to identify the unit information of the financial statement specifically includes:
Traversing the data in the financial statement table and using the regular expression rules for unit information to determine whether the data in the table contains unit information. If so, the unit information identified in the table is taken as the unit information of the financial statement. Otherwise, the regular expression rules for unit information are used to identify the table title of the financial statement, unit information is searched for in the n3 character strings following the identified title, and the match nearest to the title is taken as the unit information of the financial statement.
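The unit lookup and conversion can be sketched in Python as below. The pattern `UNIT_RE`, the factor table and the function names are illustrative assumptions; the patent does not disclose its concrete rules. The sketch also assumes the caller has already located the table title and passes in the text that follows it, and it treats n3 as a character count.

```python
import re
from decimal import Decimal

# Hypothetical conversion factors from common Chinese reporting units to yuan.
UNIT_FACTORS = {"亿元": 100_000_000, "百万元": 1_000_000, "万元": 10_000, "千元": 1_000, "元": 1}
# Hypothetical unit rule: "单位" followed by a colon, an optional currency name and a unit.
UNIT_RE = re.compile(r"单位\s*[::]\s*(?:人民币)?\s*(亿元|百万元|万元|千元|元)")


def find_unit(cells, text_after_title, n3=50):
    """Look for unit information in the table cells first, then in the text after the title."""
    for cell in cells:
        match = UNIT_RE.search(cell)
        if match:
            return match.group(1)
    # search() returns the leftmost match, i.e. the one nearest to the title.
    match = UNIT_RE.search(text_after_title[:n3])
    return match.group(1) if match else None


def to_yuan(raw, unit):
    """Convert a numeric string such as '1,234.50' in the given unit to yuan."""
    return Decimal(raw.replace(",", "")) * UNIT_FACTORS[unit]


unit_in_table = find_unit(["项目", "单位:万元", "1,234.50"], "")
unit_after_title = find_unit(["项目", "100"], " 2016年12月31日 单位:元")
amount = to_yuan("1,234.50", unit_in_table)
```

`Decimal` is used instead of `float` so that the scaled amounts stay exact, which matters when the extracted figures are later compared or summed.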
Embodiment 2
As shown in Fig. 2, this embodiment provides a document parsing system for use in big data analysis, comprising:
A construction unit, for building the regular expression rules for financial indicators;
An acquisition unit, for obtaining the start feature indicators and end feature indicators of the financial statements;
A first positioning unit, for using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to locate the financial statements in documents of different formats;
A second positioning unit, for locating the data in the financial statements and then recording the financial data and the indicator name and time corresponding to the financial data;
A conversion unit, for performing unit conversion on the numeric data and then recording the converted data.
As a further preferred implementation of this embodiment, the construction unit includes:
A first acquisition module, for obtaining the standard name library, in which the standard names of the financial indicators are stored;
A first building module, for building the actual name library, in which the actual names of the financial indicators extracted from documents are stored;
A first establishing module, for establishing the name mapping between the actual names in the actual name library and the standard names in the standard name library;
A second establishing module, for building the regular expression rules for financial indicators according to the name mapping.
As a further preferred implementation of this embodiment, the acquisition unit includes:
A first extraction module, for sampling multiple documents that contain financial statements;
A first building module, for extracting indicators from the starting content of the financial statement in each document, then sorting the indicators in descending order of their frequency of occurrence and selecting the first m1 indicators to build the start feature indicator list;
A second building module, for extracting indicators from the ending content of the financial statement in each document, then sorting the indicators in descending order of their frequency of occurrence and selecting the first m2 indicators to build the end feature indicator list.
As a further preferred implementation of this embodiment, the first positioning unit includes:
A first configuration module, for assigning a corresponding ID to each financial statement table in the document according to the order in which the tables appear;
A first locating module, for using the regular expression rules for financial indicators, the start feature indicators and the end feature indicators to analyze each financial statement table in the document and thereby locate the start table and the end table, where the start table is the financial statement table at the starting position of the financial statement and the end table is the financial statement table at the end position of the financial statement;
A first marking module, for marking, according to the ID of the start table and the ID of the end table, the data in the financial statement tables corresponding to all IDs between the two as financial data of that financial statement type.
As a further preferred implementation of this embodiment, the first locating module includes:
A first processing module, for, when the start flag of the table is false, using the regular expression rules for financial indicators to identify and extract n1 indicators from the starting content of the current financial statement table, matching the n1 extracted indicators against the indicators in the start feature indicator list to obtain a first matching rate, and, when the first matching rate exceeds a first threshold, determining that the current financial statement table is the start table;
A second processing module, for, when the start flag of the table is true, using the regular expression rules for financial indicators to identify and extract n2 indicators from the ending content of the current financial statement table, matching the n2 extracted indicators against the indicators in the end feature indicator list to obtain a second matching rate, and, when the second matching rate exceeds a second threshold, determining that the current financial statement table is the end table;
where n1 < m1 and n2 < m2.
As a further preferred implementation of this embodiment, the second positioning unit includes:
A third establishing module, for establishing a first mapping between the indicator names corresponding to the data in a financial statement table and the row numbers in which they appear;
A fourth establishing module, for establishing a second mapping between the time information corresponding to the data in a financial statement table and the column numbers in which it appears;
A first recording module, for using the row and column numbers of each data item in the financial statement table, together with the first mapping and the second mapping, to record the financial data and the indicator name and time corresponding to the financial data.
As a further preferred implementation of this embodiment, the conversion unit includes:
A first identification module, for using the regular expression rules for unit information to identify the unit information of the financial statement;
A second recording module, for performing, according to the identified unit information, unit conversion on the numeric data so that it is converted into financial data expressed in yuan, and recording it.
As a further preferred implementation of this embodiment, the first identification module is specifically used for traversing the data in the financial statement table and using the regular expression rules for unit information to determine whether the data in the table contains unit information. If so, the unit information identified in the table is taken as the unit information of the financial statement. Otherwise, the regular expression rules for unit information are used to identify the table title of the financial statement, unit information is searched for in the n3 character strings following the identified title, and the match nearest to the title is taken as the unit information of the financial statement.
Embodiment 3
This embodiment provides a document parsing device for use in big data analysis, comprising:
At least one processor;
At least one memory, for storing at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the steps of the document parsing method for big data analysis described in Embodiment 1 above.
Embodiment 4
As shown in Fig. 3, a document parsing method for use in big data analysis specifically includes the following steps.
Step S1: Build the regular expression rules for financial indicators.
Specifically, step S1 preferably includes:
S101, obtaining a standard name library, in which the standard names of the financial indicators are stored;
Specifically, this step preferably uses the indicator names in the Ministry of Finance's Accounting Standards for Business Enterprises as the standard name library of the financial indicator lexicon; that is, the indicator names in the Accounting Standards for Business Enterprises serve as the standard names of the financial indicators, and these standard names are stored in the standard name library;
S102, building an actual name library, in which the actual names of the financial indicators extracted from documents are stored;
Specifically, this step is implemented as follows: first, several documents containing actual financial statements are randomly sampled; then the actual name library of the financial indicator lexicon is built from the financial indicator terms recorded in these documents. That is, the terms used to express financial indicators in documents containing actual financial statements are taken as the actual names of the financial indicators and stored in the actual name library;
S103, establishing the name mapping between the actual names in the actual name library and the standard names in the standard name library;
Specifically, any given financial indicator may have one or more different actual names in practice, so the standard name corresponding to every actual name in the actual name library must first be determined. During this determination, if the financial indicator corresponding to an actual name in the actual name library has a corresponding name in the standard name library, that name is the standard name of the financial indicator. If it does not, the frequency of occurrence of every actual name corresponding to the financial indicator is counted, the actual name with the highest frequency is taken as the standard name of the financial indicator, and this standard name is added to the standard name library. For example, suppose the actual names corresponding to a financial indicator are a1, a2 and a3, and in the randomly sampled documents a1 occurs 10 times, a2 occurs 8 times and a3 occurs 2 times; a1 is then taken as the standard name of the financial indicator and added to the standard name library. The name mapping between the actual names in the actual name library and the standard names in the standard name library can then be established from the correspondence between them. Step S103 therefore preferably includes the following steps:
S1031, determining whether the financial indicator corresponding to an actual name in the actual name library has a corresponding name in the standard name library; if so, taking that name as the standard name of the financial indicator; otherwise, counting the frequency of occurrence of each actual name corresponding to the financial indicator, taking the actual name with the highest frequency as the standard name of the financial indicator, and adding this standard name to the standard name library;
S1032, once the financial indicator corresponding to every actual name in the actual name library has a corresponding standard name in the standard name library, establishing the name mapping between the actual names in the actual name library and the standard names in the standard name library according to the financial indicator correspondence between them. For example, if the actual names corresponding to financial indicator A are a1, a2 and a3, and the standard name corresponding to financial indicator A is b1, then a mapping from the actual names a1, a2 and a3 to the standard name b1 is established for financial indicator A;
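Steps S1031 and S1032 can be illustrated with the short Python sketch below. It assumes, for illustration only, that the actual names have already been grouped by the financial indicator they denote; the function name `build_name_mapping` and the data layout are hypothetical and not taken from the patent.

```python
from collections import Counter


def build_name_mapping(observed, standard):
    """Map every actual name to a standard name.

    observed: indicator key -> list of actual names as they occurred in the sampled
              documents (one entry per occurrence).
    standard: indicator key -> standard name; updated in place when an indicator has
              no standard name yet (S1031).
    """
    mapping = {}
    for indicator, names in observed.items():
        if indicator not in standard:
            # S1031 fallback: promote the most frequent actual name to standard name.
            standard[indicator] = Counter(names).most_common(1)[0][0]
        # S1032: every actual name of the indicator maps to its standard name.
        for name in set(names):
            mapping[name] = standard[indicator]
    return mapping


observed = {
    "A": ["a1"] * 10 + ["a2"] * 8 + ["a3"] * 2,  # the 10 / 8 / 2 example from the text
    "B": ["b_x", "b_y"],
}
standard = {"B": "b1"}
mapping = build_name_mapping(observed, standard)
```

With the counts from the example, a1 becomes the standard name for indicator A, while both actual names of indicator B map to the existing standard name b1.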
S104, building the regular expression rules for financial indicators according to the name mapping;
Specifically, a regular expression rule is formulated for each financial indicator according to the name mapping between the actual names and the standard name of that indicator, where the regular expression rule of a financial indicator is a recognition rule, based on regular expressions, for identifying that financial indicator.
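One plausible way to turn the name mapping into per-indicator rules is sketched below in Python. The patent does not disclose the form of its rules, so the alternation pattern, the function names `build_indicator_rules` and `identify`, and the sample names are assumptions made for illustration.

```python
import re


def build_indicator_rules(name_mapping):
    """Build one compiled pattern per standard name from an actual -> standard mapping."""
    variants_by_standard = {}
    for actual, standard in name_mapping.items():
        variants_by_standard.setdefault(standard, set()).update({actual, standard})

    rules = {}
    for standard, variants in variants_by_standard.items():
        # Longer variants first, so the alternation prefers the most specific name.
        ordered = sorted(variants, key=len, reverse=True)
        rules[standard] = re.compile("|".join(re.escape(v) for v in ordered))
    return rules


def identify(text, rules):
    """Return the standard name of the first indicator whose rule matches, else None."""
    for standard, pattern in rules.items():
        if pattern.search(text):
            return standard
    return None


rules = build_indicator_rules({
    "货币资金": "货币资金",
    "现金及存放中央银行款项": "货币资金",
    "应收账款净额": "应收账款",
})
```

`re.escape` keeps punctuation inside indicator names from being read as regular expression syntax. In practice the rules would probably also need to tolerate whitespace and numbering prefixes, which this sketch leaves out.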
Step S2:It obtains the initiation feature index of financial statement and terminates characteristic index.
Specifically, the step S2 is comprised preferably:
S201, multiple documents for including financial statement are extracted;
Specifically, the type of the financial statement includes balance sheet, profit flow table, cash flow statement this three categories type Financial statement, therefore, it is necessary to be directed to the financial statement of each type, randomly select several texts for including actual financial statement Shelves, for example, for the financial statement of this type of balance sheet, randomly select several texts for including actual assets liability account Shelves;
S202, extract indexes from the starting content of the financial statement in each document, sort the extracted indexes in descending order of frequency of occurrence (i.e. number of occurrences), and select the top m1 indexes to build the initiation feature index list. For example, if the indexes extracted from these documents are q1, q2 and q3, with index q1 occurring 7 times, index q2 occurring 8 times and index q3 occurring 4 times, then selecting the top 2 indexes yields an initiation feature index list containing index q2 and index q1;
Specifically, step S202 is used to build an initiation feature index list for each type of financial statement. For example, several documents containing an actual balance sheet are randomly selected in step S201; indexes are then extracted from the starting content of the balance sheet in each of these documents, the extracted indexes are sorted in descending order of frequency of occurrence, and the top m1 indexes are selected to build the initiation feature index list corresponding to the balance sheet type. The initiation feature index lists corresponding to the income statement and the cash flow statement are built in the same way. The initiation feature index lists built in this way are the initiation feature indexes of the financial statements that need to be obtained;
S203, extract indexes from the ending content of the financial statement in each document, sort the extracted indexes in descending order of frequency of occurrence, and select the top m2 indexes to build the end feature index list;
Specifically, the end feature index list in this step is built in a manner similar to the initiation feature index list described above, so it is not described in detail here. Step S203 thus yields the end feature index lists corresponding to the three major types of financial statement, namely the balance sheet, the income statement and the cash flow statement. The end feature index lists built in this way are the end feature indexes of the financial statements that need to be obtained.
Preferably, m1 and m2 have the same value.
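As a non-limiting illustration, the frequency-based selection used for both feature index lists can be sketched in Python as follows. The function name `build_feature_list` and the sample data are assumptions made for the example; the counts reproduce the q1/q2/q3 example given for step S202.

```python
from collections import Counter


def build_feature_list(extracted_per_document, m):
    """Return the m most frequent indexes across all sampled documents."""
    counts = Counter()
    for indexes in extracted_per_document:
        counts.update(indexes)
    # most_common returns entries in descending order of count.
    return [name for name, _ in counts.most_common(m)]


# Hypothetical sample: q1 occurs 7 times, q2 occurs 8 times, q3 occurs 4 times.
sampled = [["q1", "q2", "q3"]] * 4 + [["q1", "q2"]] * 3 + [["q2"]]
initiation_features = build_feature_list(sampled, 2)
```

With m = 2, the resulting list holds q2 first and q1 second. The same function can be applied to the indexes extracted from the ending content to obtain an end feature index list.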
Step S3: Using the regular expression rules of the financial indexes, the initiation feature indexes and the end feature indexes, locate the financial statements in documents of different formats.
Specifically, step S3 preferably comprises:
S301, assign a corresponding ID to each financial statement table in the document according to the order in which the financial statement tables appear;
Specifically, for each document, incrementing IDs are assigned to the financial statement tables in the order in which they appear in the document, so that the ID of a financial statement table indicates its position in the order of appearance. For example, the tables are assigned the IDs ID1, ID2, ID3, ID4, ID5, ..., IDK from first to last; the financial statement table ID1 therefore appears before the financial statement table ID2, that is, it is located before ID2 in the document;
S302, analyze each financial statement table in the document using the regular expression rules of the financial indexes, the initiation feature indexes and the end feature indexes, so as to locate the start table and the end table, where the start table is the financial statement table at the starting position of a financial statement and the end table is the financial statement table at the ending position of a financial statement;
Specifically, step S302 comprises:
S3020, set start flags for the financial statement tables of the three major types (balance sheet, income statement and cash flow statement), namely asset_begin_sign, profit_begin_sign and cash_begin_sign respectively, each with the initial value False;
S3021, when the start flag is False, identify and extract n1 indexes from the starting content of the current financial statement table using the regular expression rules of the financial indexes, match the n1 extracted indexes against the indexes in the initiation feature index list to obtain a first matching rate, and, when the first matching rate is greater than a first threshold, determine that the current financial statement table is a start table;
Specifically, when the start flags begin_sign of all three types of financial statement table are False, n1 indexes (n1 < m1) are identified and extracted from the starting content of the current financial statement table using the regular expression rules of the financial indexes. The n1 extracted indexes are then matched against each of the initiation feature index lists obtained in step S2 for the three types of financial statement. If the matching rate between the n1 extracted indexes and one of the initiation feature index lists (for example, list a) is higher than the first threshold r1, the financial statement table is regarded as the start table of the financial statement type corresponding to list a, for example the start table of the balance sheet. The ID of the financial statement table is recorded as the start table ID of that financial statement type, and the start flag begin_sign of that financial statement type is set to True; for example, if the financial statement table is the start table of the balance sheet, asset_begin_sign is set to True;
S3022, when the start flag is True, identify and extract n2 indexes from the ending content of the current financial statement table using the regular expression rules of the financial indexes, match the n2 extracted indexes against the indexes in the end feature index list to obtain a second matching rate, and, when the second matching rate is greater than a second threshold, determine that the current financial statement table is an end table;
Specifically, when one of the start flags begin_sign of the three types of financial statement table is True, n2 indexes (n2 < m2) are identified and extracted from the ending content of the current financial statement table using the regular expression rules of the financial indexes. The n2 extracted indexes are then matched against each of the end feature index lists obtained in step S2 for the three types of financial statement. If the matching rate between the n2 extracted indexes and one of the end feature index lists (for example, list b) is higher than the second threshold r2, the financial statement table is regarded as the end table of the financial statement type corresponding to list b, for example the end table of the balance sheet. The ID of the financial statement table is recorded as the end table ID of that financial statement type, and the start flag begin_sign of that financial statement type is set to False;
S303, according to the ID of the start table and the ID of the end table of a financial statement, mark the data in the financial statement tables corresponding to all IDs from the ID of the start table to the ID of the end table (both inclusive) as financial data of that financial statement type;
Specifically, if the financial statement table ID1 is the start table of a financial statement and the financial statement table ID5 is the end table of that financial statement, the data in the 5 financial statement tables ID1 to ID5 is marked as financial data of that financial statement type, for example as financial data of the balance sheet.
Preferably, n1 and n2 have the same value.
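As a non-limiting illustration, the flag-driven location of start and end tables in steps S301 to S303 can be sketched in Python as follows. Several points are assumptions, since the source leaves them open: the matching rate is taken to be the fraction of extracted indexes found in a feature list, an end check is made only against the list of the statement type whose flag is set, a table that opens a statement may also close it, and the names `match_rate`, `locate_statements`, `head` and `tail` are hypothetical.

```python
def match_rate(extracted, feature_list):
    """Assumed definition: fraction of extracted indexes found in the feature list."""
    if not extracted:
        return 0.0
    hits = sum(1 for name in extracted if name in feature_list)
    return hits / len(extracted)


def locate_statements(tables, start_features, end_features, r1=0.5, r2=0.5):
    """Locate the start and end table of each statement type.

    tables: list of dicts in document order, each with the indexes extracted
            from its starting content ('head') and ending content ('tail').
    Returns {statement_type: (start_id, end_id)} using 1-based table IDs.
    """
    begin_sign = {stype: False for stype in start_features}  # all flags start False
    start_id = {}
    located = {}
    for table_id, table in enumerate(tables, start=1):  # IDs follow document order
        active = [stype for stype, flag in begin_sign.items() if flag]
        if not active:
            # No statement is open: test the table against each initiation list.
            for stype, features in start_features.items():
                if match_rate(table["head"], features) > r1:
                    begin_sign[stype] = True
                    start_id[stype] = table_id
                    active = [stype]
                    break
        # A statement is open: test the table against its end list.
        for stype in active:
            if match_rate(table["tail"], end_features[stype]) > r2:
                located[stype] = (start_id[stype], table_id)
                begin_sign[stype] = False
    return located


# Hypothetical feature lists and tables.
START = {"balance_sheet": ["cash", "inventory", "receivables"],
         "income_statement": ["revenue", "cost"]}
END = {"balance_sheet": ["total liabilities and equity"],
       "income_statement": ["net profit"]}
TABLES = [
    {"head": ["cash", "inventory"], "tail": ["inventory"]},
    {"head": ["fixed assets"], "tail": ["goodwill"]},
    {"head": ["borrowings"], "tail": ["total liabilities and equity"]},
    {"head": ["revenue", "cost"], "tail": ["net profit"]},
]
result = locate_statements(TABLES, START, END)
```

In this sample the balance sheet spans tables 1 to 3 and the income statement occupies table 4 alone; the data in every table within a located range would then be marked with the corresponding statement type.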
Step S4: After locating the data in the financial statement, record the financial data together with the index name and time corresponding to the financial data.
Specifically, each data item in a financial statement is uniquely determined by its index name and time. Preferably, step S4 comprises:
S401, establish a first mapping relation between the index names corresponding to the data in the financial statement table and the row numbers where they are located, i.e. the first mapping relation is the mapping between the index name corresponding to a data item and the row number where the data item is located;
Here, if the financial index corresponding to the index name of a data item has no corresponding standard name in the standard name storehouse, the index name is added to the actual name storehouse and the standard name storehouse, and a regular expression rule is added for that financial index;
S402, establish a second mapping relation between the time information corresponding to the data in the financial statement table and the column numbers where they are located, i.e. the second mapping relation is the mapping between the time information corresponding to a data item and the column number where the data item is located;
S403, record the financial data together with the index name and time corresponding to the financial data, using the row and column numbers of the data in the financial statement table, the first mapping relation and the second mapping relation;
Specifically, the index name and time corresponding to each data item can be determined from the row and column numbers of the data item through the two mapping relations "row number - index name" and "column number - time information". The row and column numbers of the data and these two mapping relations are therefore used to record each data item together with its corresponding index name and time.
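As a non-limiting illustration, the two mapping relations and the recording step can be sketched in Python as follows. The table layout is an assumption made for the example (time labels in the first row, index names in the first column), as is the function name `record_financial_data`; the source does not prescribe a layout.

```python
def record_financial_data(table):
    """Record (index_name, time, raw_value) for every data cell.

    table: list of rows. Assumption: row 0 holds the time labels and
    column 0 holds the index names.
    """
    # First mapping relation: row number -> index name.
    row_to_index = {r: row[0] for r, row in enumerate(table) if r > 0}
    # Second mapping relation: column number -> time information.
    col_to_time = {c: label for c, label in enumerate(table[0]) if c > 0}

    records = []
    for r, row in enumerate(table):
        for c, value in enumerate(row):
            # A cell is data only if both its row and its column are mapped.
            if r in row_to_index and c in col_to_time and value != "":
                records.append((row_to_index[r], col_to_time[c], value))
    return records


# Hypothetical table.
TABLE = [
    ["Item", "2016", "2015"],
    ["Total assets", "1,200", "1,000"],
    ["Net profit", "300", ""],
]
records = record_financial_data(TABLE)
```

Each record carries the raw value exactly as it appears in the document; unit conversion is a separate, later step.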
Step S5: Perform unit conversion on the data of numeric type, and record the data obtained after conversion.
Specifically, the financial data recorded in step S4 is the raw data (i.e. the data as presented in the document). The data of numeric type (numeric data for short) still needs unit conversion before its actual value is obtained. Therefore, step S5 preferably comprises:
S500, build the regular expression rules of the unit information;
Specifically, several financial documents are randomly selected, the ways in which unit information is expressed in them are analyzed, and regular expression rules of the unit information are established for these different expressions;
S501, identify the unit information of the financial statement using the regular expression rules of the unit information;
Specifically, this step preferably comprises:
S5011, traverse the data in the financial statement table and use the regular expression rules of the unit information to determine whether the data in the table contains unit information. If so, the unit information identified in the financial statement table is taken as the unit information of the financial statement. Otherwise, the table title of the financial statement is identified using the regular expression rules, a matching search for unit information is carried out within the n3 character strings following the identified table title, and the unit information found nearest to the table title is taken as the unit information of the financial statement;
S502, according to the unit information identified in the above step, perform unit conversion on the numeric data so that it is converted into financial data with yuan as the monetary unit, and record the converted data in place of the pre-conversion data.
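As a non-limiting illustration, unit recognition and conversion to yuan can be sketched in Python as follows. The English unit phrases, the multiplier table, the function names `find_unit` and `to_yuan`, the reading of the search range as a window of n3 characters after the table title, and the fallback to a multiplier of 1 when no unit is found are all assumptions made for the example.

```python
import re
from decimal import Decimal

# Hypothetical unit rule: "Unit: [RMB] [magnitude] yuan".
UNIT_PATTERN = re.compile(
    r"unit\s*[:：]\s*(?:RMB\s*)?(hundred million|ten thousand|million|thousand)?\s*yuan",
    re.IGNORECASE,
)
MULTIPLIERS = {
    None: Decimal(1),
    "thousand": Decimal(1_000),
    "ten thousand": Decimal(10_000),
    "million": Decimal(1_000_000),
    "hundred million": Decimal(100_000_000),
}


def find_unit(cells, text_after_title, n3=50):
    """Search the table cells first, then the text that follows the table title."""
    for cell in cells:
        match = UNIT_PATTERN.search(cell)
        if match:
            return match
    # search() returns the leftmost match, i.e. the one nearest to the title.
    return UNIT_PATTERN.search(text_after_title[:n3])


def to_yuan(raw_value, unit_match):
    """Convert a raw numeric string to yuan using the recognized unit."""
    word = None
    if unit_match is not None and unit_match.group(1):
        word = unit_match.group(1).lower()
    # Assumption: thousands separators are commas and are simply dropped.
    return Decimal(raw_value.replace(",", "")) * MULTIPLIERS[word]


unit = find_unit(["Item", "2016"], " Unit: RMB ten thousand yuan")
converted = to_yuan("1,234.5", unit)
```

`Decimal` is used instead of `float` so that monetary amounts are not subject to binary rounding error.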
From the above, the advantages of the present invention include: 1. The regular expression rules of the financial indexes are built from the mapping relations between the actual names and the standard names of the financial indexes, which enables fast and accurate recognition of financial index names in massive numbers of documents and improves the efficiency and accuracy of parsing financial data from documents; 2. Documents in a variety of formats are sampled and statistically analyzed to establish the initiation feature index lists and end feature index lists corresponding to the three major financial statement types, and the indexes in these lists are used to quickly and accurately identify the starting and ending features of tables in a document, thereby achieving automatic, fast and accurate location of the financial statement tables in the document; 3. The mapping relation between the financial indexes and the row numbers and the mapping relation between the time information and the column numbers are established for a financial statement table, so that the index name and time corresponding to each data item can be determined through the two mapping relations "row number - index" and "column number - time", achieving automatic and fast location and recording of the index and time information of the data in the financial statement; 4. Regular expression rules of the unit information are established and used to identify and parse the unit information of the financial statement, so that the parsed financial data can be quickly and accurately converted according to the parsed unit information, yielding highly accurate actual values of the data; 5. As the number of parsed documents increases, the name storehouses become more complete and accurate, further improving the efficiency and accuracy of locating and parsing document content. It can be seen that the present invention is an automatic scheme for parsing financial data from massive numbers of documents containing financial statements. By building the recognition rules of the financial indexes and of the unit information, and on the basis of these rules, it automatically locates the financial statement tables in a document and the index name and time information corresponding to the data in those tables, and performs the corresponding unit conversion on numeric data to obtain the actual values of the data. It thus parses financial statement data quickly and accurately from massive numbers of documents in different formats. It offers not only high processing efficiency and accuracy but also high extensibility, compatibility and comprehensiveness, and is applicable to documents in a variety of formats such as WORD, EXCEL and TXT.
In addition, the document parsing scheme of the present invention is applicable to the field of financial big data parsing, such as parsing financial data in enterprise annual report documents, in bond issuance prospectuses and in follow-up bond rating reports.
All technical content of this embodiment can be arbitrarily split or combined and applied in Embodiments 1 to 3 above.
The above is a description of preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or replacements without departing from the spirit of the invention, and all such equivalent variations or replacements fall within the scope defined by the claims of this application.

Claims (10)

1. A document analysis processing method applied to big data analysis, characterized in that the method comprises the following steps:
building regular expression rules of financial indexes;
obtaining initiation feature indexes and end feature indexes of financial statements;
locating the financial statements in documents of different formats using the regular expression rules of the financial indexes, the initiation feature indexes and the end feature indexes;
after locating the data in the financial statement, recording the financial data together with the index name and time corresponding to the financial data;
performing unit conversion on the data of numeric type, and recording the data obtained after conversion.
2. The document analysis processing method applied to big data analysis according to claim 1, characterized in that the step of building the regular expression rules of the financial indexes specifically comprises:
obtaining a standard name storehouse, in which the standard names of the financial indexes are stored;
building an actual name storehouse, in which the actual names of the financial indexes extracted from documents are stored;
establishing name mapping relations between the actual names in the actual name storehouse and the standard names in the standard name storehouse;
building the regular expression rules of the financial indexes according to the name mapping relations.
3. The document analysis processing method applied to big data analysis according to claim 1, characterized in that the step of obtaining the initiation feature indexes and the end feature indexes of the financial statements specifically comprises:
extracting multiple documents that contain financial statements;
extracting indexes from the starting content of the financial statement in each document, sorting the indexes in descending order of the frequency of occurrence of the extracted indexes, and selecting the top m1 indexes to build an initiation feature index list;
extracting indexes from the ending content of the financial statement in each document, sorting the indexes in descending order of the frequency of occurrence of the extracted indexes, and selecting the top m2 indexes to build an end feature index list.
4. The document analysis processing method applied to big data analysis according to claim 3, characterized in that the step of locating the financial statements in documents of different formats using the regular expression rules of the financial indexes, the initiation feature indexes and the end feature indexes specifically comprises:
assigning a corresponding ID to each financial statement table in the document according to the order in which the financial statement tables appear;
analyzing each financial statement table in the document using the regular expression rules of the financial indexes, the initiation feature indexes and the end feature indexes, so as to locate a start table and an end table, wherein the start table is the financial statement table at the starting position of a financial statement and the end table is the financial statement table at the ending position of a financial statement;
according to the ID of the start table and the ID of the end table, marking the data in the financial statement tables corresponding to all IDs between the ID of the start table and the ID of the end table as financial data of the financial statement type.
5. The document analysis processing method applied to big data analysis according to claim 4, characterized in that the step of analyzing each financial statement table in the document using the regular expression rules of the financial indexes, the initiation feature indexes and the end feature indexes, so as to locate the start table and the end table, specifically comprises:
when the start flag of the table is False, identifying and extracting n1 indexes from the starting content of the current financial statement table using the regular expression rules of the financial indexes, matching the n1 extracted indexes against the indexes in the initiation feature index list to obtain a first matching rate, and, when the first matching rate is greater than a first threshold, determining that the current financial statement table is the start table;
when the start flag of the table is True, identifying and extracting n2 indexes from the ending content of the current financial statement table using the regular expression rules of the financial indexes, matching the n2 extracted indexes against the indexes in the end feature index list to obtain a second matching rate, and, when the second matching rate is greater than a second threshold, determining that the current financial statement table is the end table;
wherein n1 < m1 and n2 < m2.
6. The document analysis processing method applied to big data analysis according to any one of claims 1 to 5, characterized in that the step of, after locating the data in the financial statement, recording the financial data together with the index name and time corresponding to the financial data specifically comprises:
establishing a first mapping relation between the index names corresponding to the data in the financial statement table and the row numbers where they are located;
establishing a second mapping relation between the time information corresponding to the data in the financial statement table and the column numbers where they are located;
recording the financial data together with the index name and time corresponding to the financial data, using the row and column numbers of the data in the financial statement table, the first mapping relation and the second mapping relation.
7. The document analysis processing method applied to big data analysis according to any one of claims 1 to 5, characterized in that the step of performing unit conversion on the data of numeric type and recording the data obtained after conversion specifically comprises:
identifying the unit information of the financial statement using regular expression rules of the unit information;
according to the identified unit information, performing unit conversion on the data of numeric type so that it is converted into financial data in units of yuan, and recording the converted data.
8. The document analysis processing method applied to big data analysis according to claim 7, characterized in that the step of identifying the unit information of the financial statement using the regular expression rules of the unit information specifically comprises:
traversing the data in the financial statement table and using the regular expression rules of the unit information to determine whether the data in the financial statement table contains unit information; if so, taking the unit information identified in the financial statement table as the unit information of the financial statement; otherwise, identifying the table title of the financial statement using the regular expression rules, carrying out a matching search for unit information within the n3 character strings following the identified table title, and taking the unit information found nearest to the table title as the unit information of the financial statement.
9. A document analysis processing system applied to big data analysis, characterized in that the system comprises:
a construction unit, configured to build regular expression rules of financial indexes;
an acquiring unit, configured to obtain initiation feature indexes and end feature indexes of financial statements;
a first positioning unit, configured to locate the financial statements in documents of different formats using the regular expression rules of the financial indexes, the initiation feature indexes and the end feature indexes;
a second positioning unit, configured to, after locating the data in the financial statement, record the financial data together with the index name and time corresponding to the financial data;
a conversion unit, configured to perform unit conversion on the data of numeric type and record the data obtained after conversion.
10. A document analysis processing device applied to big data analysis, characterized in that the device comprises:
at least one processor;
at least one memory, configured to store at least one program;
wherein, when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the document analysis processing method applied to big data analysis according to any one of claims 1 to 8.
CN201711080717.1A 2017-11-06 2017-11-06 Document analysis processing method, system and device applied to big data analysis Active CN108073678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711080717.1A CN108073678B (en) 2017-11-06 2017-11-06 Document analysis processing method, system and device applied to big data analysis

Publications (2)

Publication Number Publication Date
CN108073678A true CN108073678A (en) 2018-05-25
CN108073678B CN108073678B (en) 2020-08-28

Family

ID=62159707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711080717.1A Active CN108073678B (en) 2017-11-06 2017-11-06 Document analysis processing method, system and device applied to big data analysis

Country Status (1)

Country Link
CN (1) CN108073678B (en)

Also Published As

Publication number Publication date
CN108073678B (en) 2020-08-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant