WO2016060551A1 - A method for mining electronic documents and system thereof - Google Patents

A method for mining electronic documents and system thereof

Info

Publication number
WO2016060551A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
user
data
documents
processing module
Prior art date
Application number
PCT/MY2015/050126
Other languages
French (fr)
Inventor
Kim Seng Kee
Keong Hway CHHUA
Muhammad Hilmi YUSOFF
Original Assignee
Kim Seng Kee
Application filed by Kim Seng Kee
Publication of WO2016060551A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Definitions

  • the invention relates to a method and system of mining data. More particularly, the invention relates to a method and system for mining electronic documents from a plurality of database servers in real time without the need for an Extract, Transform, and Load (ETL) process.
  • ETL Extract, Transform, and Load
  • a legacy system that uses an RDBMS as its database cannot store documents and folios directly; the documents and folios must be split into tables with different primary keys and foreign keys before they can be stored in the RDBMS.
  • designing the thousands of tables, which must include all input tables, intermediate tables and output tables, is a very complicated undertaking.
  • the thousands of tables complicate the systems development life cycle (SDLC) and data management. Therefore, providing a single view of a customer folio, general ledger folio, stock folio, etc. is difficult, as the data must be transformed between each process (web, transmission, storage, processing (batch, online, BPM), data warehousing, data mining, output).
  • SDLC systems development life cycle
  • in a legacy system it is difficult to integrate data and systems, because legacy systems involve thousands of tables and are usually developed by different groups.
  • the Documents, Flows, Business Rules and Codes are often lost in the process, which makes it difficult for business personnel to communicate with IT personnel.
  • Emulating Manual System eMS
  • eDoc Electronic Document
  • the invention provides a method for mining data from documents stored in an Electronic Document (eDoc) format in real time without the need for an Extract, Transform, and Load (ETL) process, comprising the steps of: receiving, by a Question Formulation Module, search criteria and search filtering configurations from a user; establishing, by a Parallel Processing Module, one or more communication links between a user server and one or more database servers; transmitting, by the Question Formulation Module, the search criteria and search filtering configurations of the user to a Data Processing Module for generating filtering rules based on the received information and applying the generated rules to retrieve related documents from the one or more database servers; extracting, by the Data Processing Module, specific information requested by the user from the retrieved documents; and displaying, by a Display Output Module, the outputs of the extraction.
  • the Data Processing Module includes a Read Module and a Retrieval Module.
  • the method further comprises the steps of: cumulatively retrieving, by a Cumulative Summary Module, related documents and extracting specific information from the retrieved documents over a predetermined period of time; and generating, by the Cumulative Summary Module, a summary of the extracted information.
  • the Question Formulation Module is configured to execute the instructions of: determining, by the user, search criteria and configurations; and transmitting the selected criteria to the Data Processing Module upon the establishment of the communication link between the user server and the database servers.
  • the Parallel Processing Module is configured to execute the instructions of: determining the database server for data mining assignment; checking the availabilities of the selected database servers; establishing communication links between the user server and the database servers; activating the Data Processing Module to receive search criteria and configurations from the Question Formulation Module; and compiling and updating the outcomes of each server to a control ledger.
  • the Read Module is configured to execute the instructions of: receiving inputs from the Question Formulation Module; retrieving the documents based on the inputs of the user from the database servers; and activating the Retrieval Module to process the documents.
  • the Retrieval Module is configured to execute the instructions of extracting specified information from retrieved documents based on the inputs of the user.
  • the Cumulative Summary Module is configured to execute the instructions of: receiving predefined input consisting of a ledger identifier, column name, value and mode; checking whether the mode of the module is update mode; in the event that the module is in update mode: locating an eSummary ledger based on the ledger identifier from the predefined input; and updating a specified column with the latest values if the eSummary ledger is available; in the event that the module is not in update mode: locating the eSummary ledger; and extracting the specified information from the specified column.
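The two modes of the Cumulative Summary Module can be sketched as follows. This is only an illustration under stated assumptions: the in-memory dictionary standing in for the eSummary ledgers, the function name `cumulative_summary`, the ledger identifier `L010`, the column name `weekly_total`, and the choice to return `None` when no ledger is found are all hypothetical and do not come from the application.

```python
def cumulative_summary(ledgers, ledger_id, column, value=None, mode="read"):
    """Update or read one column of an eSummary ledger (illustrative only)."""
    ledger = ledgers.get(ledger_id)   # locate the eSummary ledger by identifier
    if ledger is None:                # ledger not available: nothing to do
        return None
    if mode == "update":
        ledger[column] = value        # store the latest value in the column
        return value
    return ledger.get(column)         # any other mode: extract the column value


# Hypothetical ledger store: ledger identifier -> {column name: value}
ledgers = {"L010": {"weekly_total": 40}}
cumulative_summary(ledgers, "L010", "weekly_total", 55, mode="update")
print(cumulative_summary(ledgers, "L010", "weekly_total"))
```

Keeping both modes behind one entry point mirrors the single predefined input (ledger identifier, column name, value and mode) described in the text.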
  • the embodiment of the invention discloses a system for inquiring or mining data from documents stored in an Electronic Document (eDoc) format in real time without the need for an Extract, Transform, and Load (ETL) process, comprising: a Question Formulation Module for receiving search criteria and search filtering configurations from a user; a Parallel Processing Module for establishing one or more communication links between a user server and one or more database servers; a Data Processing Module configured to receive the search criteria and search filtering configurations of the user from the Question Formulation Module, to generate filtering rules based on the received information, to apply the generated rules to retrieve related documents from the one or more database servers, and to extract specified information requested by the user from the retrieved documents; and a
  • the Data Processing Module includes a Read Module and a Retrieval Module.
  • the Question Formulation Module includes: a computer-executable instruction for generating a list of choices for the user to select, wherein the choices include information relating to department, repository, document, attribute, time, and parameter; and a computer-executable instruction for transmitting the selected criteria to the Data Processing Module upon the establishment of the communication link between the user server and the database servers.
  • the Parallel Processing Module includes: a computer-executable instruction for determining the database server for data mining assignment; a computer-executable instruction for checking the availabilities of the selected database servers; a computer-executable instruction for establishing communication links between the user server and the database servers; a computer-executable instruction for activating the Data Processing Module; and a computer-executable instruction for compiling and updating the outcomes of each server to a control ledger.
  • the Read Module includes: a computer-executable instruction for receiving inputs from the Question Formulation Module; a computer-executable instruction for retrieving the documents based on the inputs of the user from the database servers; and a computer-executable instruction for activating the Retrieval Module to process the documents.
  • the Retrieval Module includes a computer-executable instruction for extracting specified information from retrieved documents based on the inputs of the user.
  • the system further includes a Cumulative Summary Module, the module comprises: a computer-executable instruction for accumulatively retrieving related documents and extracting information from the retrieved documents over a predetermined period of time; and a computer-executable instruction for generating a summary of the extracted information.
  • the invention provides a platform where users can search and view any data available in multiple databases in its original form (a document). This is possible because of the uniqueness of the Emulating Manual System Database (eMS), an account-centric store that does not follow the legacy approach of separating data into segments. All data stored in the eMS Database is kept in its original form (a document), and all of it is stored in a single database.
  • eMS Emulating Manual System Database
  • In eMS data mining, no ETL process is required to build a data warehouse, because all available data represents a complete history. As a result, the eMS database itself can act as a Data Warehouse. Users can therefore mine data through the platform in real time, and the platform can request any information, whenever and wherever users ask for it, as long as the data is available. Furthermore, information retrieved using this platform is SOX compliant; in other words, the source of the retrieved data can be proven.
  • FIG. 1 illustrates a flow chart for creating an Electronic Document (eDoc) template string.
  • FIG. 1 illustrates a flow chart of the extracting process from an Electronic Document (eDoc) String.
  • Further figures illustrate, in order:
    • a flow chart of the retrieving process for data from the Column of an extracted Electronic Document Header (eDoc-Header);
    • a flow chart of the updating process for data at the retrieved row index and column index;
    • a flow chart of the uploading process for an Electronic Document (eDoc) String;
    • a flow chart of the mapping process of an Electronic Document (eDoc) for an Electronic File (eFile);
    • an example eLedger containing details of a customer profile and item details (FIG. 11);
    • an example creation of a subDoc based on the example in Figure 11;
    • an example of how eFiles are stored in an RDBMS table;
    • a Transaction Processing System;
    • the master flowchart of the Data Mining program;
    • the process of the Question Formulation module, where users can make data inquiries;
    • the process of parallel computing, which assigns the server location and handles result retrieval from different servers;
    • the Data Processing module, which retrieves eDoc(s) from the database and gets the data from the retrieved eDoc(s);
    • the process of data retrieval from, or data update to, the cumulative summary ledger;
    • the process of output display to the user's screen; and
    • a block diagram of the Data Mining system.
  • Data is stored in a format called Electronic Document (eDoc), which serves as the display, storage, processing, and transmission format throughout the systems development life cycle, without transformation at any stage.
  • Data can be imported from or exported to any format including PDF, XML, XLS and CSV.
  • An Electronic File (eFile) stores eDocs (with all data file types) on a database (RDBMS). The Filing System predominantly uses only the database read, write and index functions. It can therefore use almost all popular RDBMSs and, if necessary, can handle customised in-house database systems.
  • the system, which emulates a manual filing system for storing and processing documents and operates on a Relational Database Management System, comprises: a String Template (1) having at least one detail of document number, number of sections and number of rows defined based on at least one Input; a String Module (2) for generating an Electronic Document (eDoc) (11) having at least one Electronic Document Identifier (eDoc-Identifier), Section, Rowtype and Column by validating the document number, number of sections and number of rows based on the String Template (1); and an Extraction Module (3) for extracting the Electronic Document Identifier (eDoc-Identifier), Section, Rowtype and Column of the Electronic Document (eDoc) (11) generated by the String Module (2) for the retrieval process.
  • RDBMS Relational Database Management System
  • the system also includes a Retrieval Module (4) for retrieving at least one Retrieved Data from the data of the Electronic Document (eDoc) (11) stored in the database based on at least one Input of the Section, Rowtype and Column; an Updating Module (5) for updating the Retrieved Data of the Electronic Document (eDoc) (11) and storing at least one Updated Data to the database based on the Input of Section, Rowtype and Column defined; and a Formation Module (6) for forming the updated Electronic Document (eDoc) (11) by retrieving the Updated Data based on the Input of Section, Rowtype and Column.
  • the system has a Paging Module (7) for appending the Electronic Document (eDoc) (11) in the database into at least one Electronic File (eFile) (13) according to a predefined Page limit; an Indexing Module (8) for forming at least one Index to the Electronic File (eFile) (13) based on document identifier, date, end sequence number, document status, document offset and document length; and a Read Module (9) for retrieving the Index and at least one Data Relative Page (Page 0) of the Electronic File (eFile) (13) based on at least one Read
  • the system further includes a Mapping Module (10) for updating at least one Retrieved Data based on at least one Mapping Input by determining the Electronic File (eFile) (13) using the Read Module (9) to retrieve the Retrieved Data of the Electronic Document (eDoc) (11) using the Retrieval Module (4), in which the Updating Module (5) updates the Retrieved Data to the database and the Retrieved Data is formed into the Electronic Document (eDoc) (11) using the Formation Module (6) for updating into at least one Electronic File (eFile) (13) using the Paging Module (7), with at least one Index formed using the Indexing Module (8); and an Enquiry Module (14) for retrieving a plurality of Electronic Document (eDoc) (11) information using the Read Module (9) based on at least one Information for the Electronic Document Identifier (eDoc-Identifier), Section, Rowtype and Column of the Electronic Document (eDoc) (11), in which the retrieved Electronic Document
  • the system or module is initiated by creating a template: a new document is defined and the number of sections and number of rows required are indicated, as defined by an input from a user or a database (101). Then, the system or module creates an Electronic Document (eDoc) based on the document number defined in the input (102). Thereafter, the system or module interprets the eDoc Identifier for further processing, checking how many Sections or Rowtypes are defined in the input to form an eDoc template string (103). If a Section is defined in the input, the system or module creates a new Section and, if it is the first Section in the input, defines or classifies (labels) it as such (106).
  • the system or module checks whether any Rowtype is defined in the input (107). If a Rowtype is defined in the input, the system or module creates a new Rowtype based on a Predefined Dictionary (108). The system or module then checks whether any other Section or Rowtype is defined in the input for further processing (104, 107). If no other Section or Rowtype is defined in the input, the system or module proceeds to generate the eDoc template string with a document number (105).
  • the system or module is initiated by retrieving an Electronic Document (eDoc) String for the extraction process (201), where the system extracts an eDoc Identifier from the eDoc String having a document number (202). Then, the system determines the Sections of the eDoc Identifier predefined in the eDoc String (203). If a Section is defined in the eDoc Identifier (204), the system splits the Section into Rowtypes for further processing (205). Each Rowtype is then split into columns of data (207), which are stored in a Database (208). The system then checks whether any other Section or Rowtype is defined in the input for further processing (204, 206). If no other Section or Rowtype is defined in the input, the system proceeds to the retrieval of data in the Column.
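The nested splitting described above (eDoc String, then Sections, then Rowtypes, then columns) can be sketched as below. The actual terminators are defined in tables that are not reproduced in this text, so the separators `|`, `;` and `,` are purely hypothetical stand-ins, as are the function name and the sample string.

```python
# Hypothetical separators; the real eDoc terminators are not given in this text.
SECTION_SEP = "|"
ROW_SEP = ";"
COLUMN_SEP = ","


def extract_edoc(edoc_string):
    """Split an eDoc string into (identifier, sections -> rows -> columns)."""
    # Everything before the first section separator is treated as the identifier.
    identifier, _, body = edoc_string.partition(SECTION_SEP)
    sections = []
    for section in body.split(SECTION_SEP):
        if not section:
            continue  # skip empty sections
        # Each non-empty row in the section is split into its columns.
        rows = [row.split(COLUMN_SEP) for row in section.split(ROW_SEP) if row]
        sections.append(rows)
    return identifier, sections


identifier, sections = extract_edoc("D232|RNA6,Alice,KL;R320,3,5|R130,40")
print(identifier, sections)
```

In this sketch the parsed columns are simply returned; the application instead stores them in a database for the later retrieval step.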
  • the system or module is initiated by receiving input from a database or a user (301). Then, the system validates the Section and Rowtype based on the received input for the retrieval process (302, 303). If a valid Section is determined, the system validates the Rowtype. If a valid Rowtype is determined, the system locates the row index and column index (304). The data is then retrieved from the located row index and column index (305), and the results are output for further processing (306).
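A minimal sketch of this retrieval flow follows, assuming the extracted eDoc is held as a list of Sections, each a list of rows, each row a list of columns whose first column is the row code. That in-memory layout, the function name and the sample data are assumptions made for illustration; the application retrieves from data stored in a database.

```python
def retrieve_value(sections, section_index, row_code, column_index):
    """Return one column value, or None if the Section, Rowtype or column is invalid."""
    if not 0 <= section_index < len(sections):    # validate the Section
        return None
    for row in sections[section_index]:
        if row and row[0] == row_code:            # validate and locate the Rowtype
            if 0 <= column_index < len(row):      # locate the column index
                return row[column_index]          # retrieve the data
            return None
    return None


# Hypothetical extracted eDoc with one Section and two rows.
sections = [[["RNA6", "Alice", "KL"], ["R320", "3", "5"]]]
print(retrieve_value(sections, 0, "R320", 2))
```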
  • the system or module is initiated by receiving input from a database or a user (401). Then, the system validates the Section and Rowtype based on the received input for the updating process (402, 403). If a valid Section is determined, the system validates the Rowtype. If a valid Rowtype is determined, the system locates the row index and column index (404). The data at the located row index and column index is then updated based on the input received (405). Then, the updated
  • the system or module is initiated by retrieving the updated data stored in a Database for the uploading process (501), where the system appends the Electronic Document (eDoc) to the output (502). Then, the system determines the number of Sections to be assembled from the updated data stored in the Database. If a Section is defined in the updated data, the system retrieves all the Sections from the updated data stored in the Database (503). Then, the system retrieves all Rowtypes for further processing (506). The system then checks whether any other Section or Rowtype is defined in the stored updated data for further processing (504, 507). If no other Section or Rowtype is defined, the system proceeds to append the retrieved Section, Rowtype and values (508) to be uploaded into an eDoc String (505).
  • the system or module is initiated by receiving input from a database or a user, containing information such as: ledger identifier, document identifier, account 1, account 2 and eDoc (601). Then, the system checks with the database whether this account is a new account (602). If it is not a new account, the system retrieves the existing Page from the database for later processing (603) and appends the eDoc from the input to the eDoc from the Page (604). However, if it is a new account, the system validates whether the length of the combined eDoc is greater than the Page limit (605).
  • each Index is formed based on document identifier, date, end sequence number, document status, document offset and document length (606). The Page and Index are then stored in the database (607).
  • the system or module is initiated by receiving input from a database or a user, containing information such as: document identifier, date, end sequence number, document status, document offset and document length (701). An Index is then formed by combining all inputs into a string in which each input is separated by a colon (:) (702), and the formed Index is returned to the system that triggered this operation (703).
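The Index formation step can be illustrated with a short sketch. The six fields and the colon separator come from the description above; the function names, the field key names, the sample values, and the assumption that no field value itself contains a colon are hypothetical.

```python
# Field order as listed in the description; the key names are illustrative.
INDEX_FIELDS = (
    "document_identifier", "date", "end_sequence_no",
    "document_status", "document_offset", "document_length",
)


def form_index(document_identifier, date, end_sequence_no,
               document_status, document_offset, document_length):
    """Combine the six inputs into one colon-separated Index string."""
    parts = (document_identifier, date, end_sequence_no,
             document_status, document_offset, document_length)
    return ":".join(str(part) for part in parts)


def parse_index(index):
    """Split an Index string back into named fields (all values are strings)."""
    values = index.split(":")
    if len(values) != len(INDEX_FIELDS):
        raise ValueError("unexpected number of Index fields")
    return dict(zip(INDEX_FIELDS, values))


index = form_index("D232", "20140105", 7, "A", 0, 120)
print(index)
```

A date written without colons (here `20140105`) keeps the string unambiguous when it is later split on the separator.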
  • the system or module is initiated by receiving input from a database or a user, containing information such as: ledger identifier, document identifier, account 1 and account 2 (801). Then, the system retrieves the Index (indexes) and DATA of the Relative Page for a given or specified account from an eFile stored in the database (802). Thereafter, the Index is parsed into individual indexes for processing (803). The system then retrieves an index that contains the document identifier or information from the input received (804), and checks whether any index matches the input received (805). If a match is found, the system extracts the offset and the length of the target eDoc (806) and then extracts the eDoc from the DATA of the Relative Page (807). The extracted eDoc is then output to the database or user that requested it (808).
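The lookup performed by the Read Module can be sketched as follows, assuming each individual index is a colon-separated string whose first field is the document identifier and whose last two fields are the offset and length within the page data. The list-of-strings representation, the function name and the sample data are illustrative assumptions only.

```python
def read_edoc(indexes, page_data, document_identifier):
    """Find the index matching the identifier and slice the eDoc out of the page data."""
    for index in indexes:
        fields = index.split(":")                 # parse one individual index
        if fields[0] == document_identifier:      # matching index found
            offset = int(fields[4])               # document offset
            length = int(fields[5])               # document length
            return page_data[offset:offset + length]
    return None                                   # no matching index


# Hypothetical Relative Page holding two eDocs back to back.
page_data = "AAAAABBBBBBB"
indexes = ["D100:20140101:1:A:0:5", "D232:20140102:2:A:5:7"]
print(read_edoc(indexes, page_data, "D232"))
```

Because the offset and length are stored in the Index, the target eDoc can be cut out of the page without parsing the documents stored before it.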
  • the system or module is initiated by receiving input from a database or a user, such as the source or details of an eDoc (901). The source eDoc is then parsed for further processing (902), the destination eDoc is identified and loaded for later updating (903), and the predetermined mapping instructions are loaded (904). If there are more mapping instructions, the system validates whether the data of the predetermined source column fulfils the predetermined requirement (907), and then performs the computation on the data of the predetermined source column (908). However, if there are no further mapping instructions, the system stores the updated destination eDoc back into the database (906). Thereafter, the system sums the result of the computation with the data of the predetermined destination column and updates the final result to the predetermined destination column (909). This process will be carried on till there is no more
  • The eDoc Filing System is an account-centric system that acts as a display, transmission, storage and processing medium from end to end without requiring any other transformation or normalization.
  • An Electronic File (eFile) is an electronic folio (similar to a file in conventional manual filing systems) where all types of documents with different data types can be stored together in an account-centric manner.
  • the Filing system logically stores all data and information that relate to a single account in an Electronic File (eFile), in chronological order.
  • Electronic Documents (eDocs) are stored as sequential strings of data mapped to a data dictionary, and may include multiple data types in each string (e.g. image files, binary files, comma separated format, XML or any of the nearly 500 data formats in existence today). This allows the storage of any type of data within one record.
  • the way eDoc stores its data provides near real-time data mining without the need for data modeling.
  • eDoc is a data storage format comprising strings containing multiple rows, each preceded by a unique row code RxxV, where Rxx is the row number and V the version number. Multiple rows of data of various row types make up an eDoc.
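The RxxV row code can be decomposed with a small helper. This is a sketch only: it assumes `xx` is exactly two alphanumeric characters and `V` a single alphanumeric character, which fits row codes that appear elsewhere in this text (such as RNA6 and R320) but is not stated as a rule in the application. The function name is hypothetical.

```python
import re

# Assumed shape: "R", two characters for the row, one character for the version.
ROW_CODE_PATTERN = re.compile(r"R([A-Z0-9]{2})([A-Z0-9])")


def parse_row_code(code):
    """Split a row code of the form RxxV into (row, version)."""
    match = ROW_CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid row code: {code!r}")
    return match.group(1), match.group(2)


print(parse_row_code("R320"))
print(parse_row_code("RNA6"))
```

Carrying the version inside the code is what allows several versions of the same row type to coexist, since each row announces which layout it follows.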
  • eDoc is designed for change. Various versions of RxxV and DxxV can exist concurrently. eDoc can be converted to XML and vice versa. eDoc is similar to XML in that its data also has separators, identifiers and tags, but eDoc has additional system fields that provide new functionality. If required, XML is used as a universal transmission document and passed to other systems, where data can be normalized to tables.
  • Tables 1.0 and 2.0 further describe the terminators (separators), identifiers and tags.
  • there is only one Document Identifier (such as RID0) for the whole Document, and it is stored in the first Section.
  • the Document Identifier contains details such as creator details, document details, update history, attributes, etc.
  • the eDoc String data structure is also an Nth-dimension data structure where another eDoc String can be encapsulated within the u[ ... u] and stored in a Column.
  • the LDSRC Codes also represent the GIS of a stored eDoc String; they are used to locate the eDoc String when it is retrieved.
  • the Electronic Dictionary (eDict) or metadata is used to describe the attribute/behavior of each ledger (LxxV), document (DxxV) and Rowtype (RxxV).
  • at the LxxV level, the ledger identifier, the eDoc updating method (FIFO, LIFO, Update or Overwrite) and the number of eDocs to be kept in the eLedger are predefined in the Ledger type eDict.
  • at the DxxV level, the document types that can be stored are predefined in the Document type eDict.
  • the Rowtype type eDict is categorized into 3 parts: first, general attributes such as name, data type, data length and so forth; second, display attributes such as font type, size, color and so forth; third, computation attributes such as data validation and computation.
  • Tables 3.0, 4.0 and 5.0 show examples of the metadata or library predefined for Ledger, Document and Rowtype.
  • An Electronic Ledger (eLedger) is where summaries or derivatives of an eFile are kept, in variable-length or fixed-length format, allowing greater flexibility and fast retrieval.
  • the Information in eLedger can be deleted and modified.
  • Each eFile can have multiple eLedgers if required (for speedy reporting purposes).
  • the update method of each eDoc to the eLedger is predefined in eLedger dictionary.
  • the eLedger can contain n copies of an eDoc arranged in a FIFO or LIFO manner; or a new eDoc can overwrite the existing eDoc in the eLedger; or the update can manipulate only the data from certain column(s) in the eDoc against the predefined column(s) in the eLedger.
  • the system may further include Zero Balancing function where every transaction can be traced and no information is ever deleted, which means everything will be balanced (always balance to last cent). All transactions have a copy in the Transaction Ledger, so changes to any account are immediately verifiable and problems isolated.
  • this may also make the system naturally SOX compliant (Sarbanes-Oxley Act of 2002). The system may further include Reverse Processing, where a new eLedger can be generated or regenerated from the eFile based on a new or updated configuration.
  • the eLedger contains an example customer profile that includes customer details (RNA6 - Name and Address Rowtype) and a summary of the total items, such as apples, oranges and pears, bought daily (R320 - 32-day Rowtype) and monthly (R130 - 13-month Rowtype) for the year 2014.
  • the summaries in the eLedger are populated from the daily transactions in the eFile.
  • Every Rowtype contains a Header with 8 columns and a Footer with 4 columns, as shown in Table 6.0.
  • the row code (RWCD) of the Rowtype Header distinguishes a row from other rows of the same Rowtype that appear within a Section, while ledger (RWLG), account 1 (RWA1), account 2 (RWA2) and company & department (RWCO) indicate the location of the Rowtype in the database.
  • the security (RWSE) field of the Rowtype Footer is used to ensure that only the right user(s) can access the row, and the checksum (RWCS) ensures that the data within the row is not corrupted.
  • As illustrated in Figure 13, Subsequent Documents (subDocs) are created when the system splits a Doc so that it can be debited/credited to the relevant accounts; each subDoc is appended as a string, one after another.
  • the Main Doc and subDoc(s) will have the same document identifier.
  • for example, an invoice with document identifier D232 may have a subDoc to debit the customer account and subDocs to credit the Apple, Orange and Pear Stock (referring to the example in Figure 2).
  • the eFiles are stored in an RDBMS table, where the table comprises Control, Index and Data sections.
  • the Control section contains key and details about the Page.
  • the Index is used to locate each eDoc in a Page; the indexing is done in a horizontal manner to create a sub-filing system within a filing system.
  • the Data is where the eFile is stored.
  • Each account contains an eFile, and the eFile contains a number of eDocs.
  • the eFile is divided into Pages according to the Page size before being stored in the RDBMS.
  • Page numbering begins from the Relative Page (Page 0). When a new Page is added, the previous Relative Page becomes Page 1 and the newly added Page takes Page number 0, and so forth. The Relative Page is also the system's starting point: an enquiry always starts from the Relative Page.
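The page numbering rule can be illustrated with a list in which the position of a Page is its Page number, so the newest Page always sits at position 0. This is a sketch under assumptions: the function name and the sample values are hypothetical, Pages are modelled as plain strings, and starting a new Page whenever the combined length would exceed the Page limit is an assumed reading of the paging behaviour rather than something this text states in full.

```python
def append_edoc(pages, edoc, page_limit):
    """Append an eDoc to the Relative Page (pages[0]), or start a new Page 0.

    Returns a new list; the list position of each Page is its Page number.
    """
    if pages and len(pages[0]) + len(edoc) <= page_limit:
        # The eDoc still fits: extend the current Relative Page.
        return [pages[0] + edoc] + pages[1:]
    # Otherwise the new Page becomes Page 0 and every older Page shifts up by one.
    return [edoc] + pages


pages = []
for edoc in ("AAAA", "BBBB", "CCCC"):
    pages = append_edoc(pages, edoc, page_limit=10)
print(pages)
```

Since the most recent data is always on Page 0, an enquiry that starts from the Relative Page reaches the latest eDocs first.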
  • the Control section may also include the following:
  • Transaction Processing ensures that any eDocs stored in the database are Sarbanes-Oxley (SOX) compliant. This is achieved by making sure that the status of each storing and updating process is reported back to Transaction Processing; for this purpose, the eDoc sequence number is used. The process is considered complete when the storing and updating at the Transaction eFile and eLedger and the Master eFile and eLedger have executed successfully.
  • SOX Sarbanes-Oxley
  • the Transaction Processing System processes an eDoc Transaction by receiving an eDoc from a program (1001). It then stores the received eDoc in the Transaction eFile using the Paging and Indexing Modules (1002), and updates the received eDoc to the Transaction eLedger using the Paging and Indexing Modules (1003). The system verifies whether the Transaction eLedger was updated successfully (1004). If so, the system stores the received eDoc in the Master eFile using the Paging and Indexing Modules (1005), then updates the received eDoc to the Master eLedger using the Mapping Module (1006). The system verifies whether the Master eLedger was updated successfully: if so, go to step 1005; or else go to step 1008 to restart the process. Then, if the Master eLedger was updated successfully, the system returns the update status.
  • the master flow of the Data Mining Program is illustrated.
  • this module will trigger a Question Formulation Module.
  • in the Question Formulation Module, users can make enquiries on any data available in the database (1101). From there, the Question Formulation Module generates the enquiries based on the information the users have chosen (1102). To process the enquiries, the Question Formulation Module triggers a Parallel Processing Module to handle location assignment and the control ledger of the server (1103). Later, each assigned location triggers the Data Processing Module, which handles two tasks: (i) to search and retrieve the requested eDoc(s) from the database using the Read Module and (ii) to parse the retrieved eDoc(s) and get the value from the
  • eFile consists of all transaction data made by an account
  • eLedger consists of the latest transaction data made by an account
  • eSummary consists of a summary of data from either the eFile or the eLedger (1202).
  • the Parallel Processing module assigns the location of the server. Firstly, the module checks whether or not the server at the location has responded (1301). If the assigned server has not responded, the module prompts an error message and ends the process (1303). If there is a response from the server, the module triggers the Data Processing module to further process the inquiry (1304). After each server has finished processing the data, the module compiles and updates the results from each server to the control ledger (1305).
  • the module will check whether the account from the input is null (1402). If the account is null, the module will search the database on all accounts (1404). Otherwise, if a specific account is requested by the users, the program will search the database on the specified account (1403). The Read Module is then used to retrieve the requested document based on the ledger and repository specified by the users (1405). Lastly, the Retrieval Module is used to retrieve the requested value based on the specified column name and parameter (1406).
  • the process flowchart of the Cumulative Summary module is illustrated. Firstly, users will pass input either to get the data from the ledger or to update the value inside the ledger (1501). If the mode is to update, then the module will locate the ledger based on the ledger identifier from the input (1503). Further, if the ledger exists, then the module gets the value from the input and updates the value to the column specified in the input (1504). This update process is also known as pigeon-holing because it summarizes and accumulates data from eFile or eLedger and updates it to the specified column. It also marks the latest position of the file since the last time it processed the file.
  • this module will start accumulating the everyday spending of that person from the beginning of the week until the end of the week from this person's eFile or eLedger, and marks the last time this file has been processed (1505). If the mode is not update, then the module will locate the eSummary ledger based on the ledger identifier specified by users (1506). Later the module will retrieve the value from the column
  • the module will check whether the input passed is null (1602). If the input is not null, the module will print the input to the users' screen in the predefined output design (1603).
  • the predefined output design can be in the form of table, chart or graph. However, if the input is null, "No Result Retrieved" will be displayed on the users' screen (1604).
  • the system comprises a Question Formulation Module (1) for receiving search criteria and search filtering configurations from a user, a Parallel Processing Module (2) for establishing one or more communication links between a user server and one or more database servers, a Data Processing Module (3) configured to receive the search criteria and search filtering configurations of the user from the Question formulation Module, to generate filtering rules based on the received information, to apply the generated rules to retrieve related documents from the one or more database servers, and to extract specified information requested by the user from the retrieved documents; and a Display Output Module (6) for displaying the outputs of the extraction.
  • the Data Processing Module includes a Read Module (4) for retrieving the requested document based on the specified ledger and repository by users and a Retrieval Module (5) for retrieving the requested value based on the specified column name and parameter.
  • in the legacy system, the data must go through an ETL process and then be loaded into a Data Warehouse before it can be mined.
  • the eMS Data Mining is able to directly mine the data without going through the conventional process of mining data.
  • the Data Mining in eMS is simple, fast, and near real-time.
  • the advantages of eMS Data Mining over legacy system data mining can be summarised as follows: (i) it allows multiple users to mine data using the eMS Account-centric file or ledger; (ii) all Business Data and Files are in an Account-centric File, by customer, by Stock Code, by HR, by General Ledger (GL).
  • the Customer File will contain a complete chronological history of the documents e.g. their application, their invoice, payments etc. and can be used for Detail Analysis.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a method for mining data from documents stored in an Electronic Document (eDoc) format in real time without the need for going through an Extract, Transform, and Load (ETL) process, comprising the steps of: receiving, by a Question Formulation Module (1), search criteria and search filtering configurations from a user; establishing, by a Parallel Processing Module (2), one or more communication links between a user server and one or more database servers; transmitting, by the Question Formulation Module (1), the search criteria and search filtering configurations of the user to a Data Processing Module (3) for generating filtering rules based on the received information and applying the generated rules to retrieve related documents from the one or more database servers; extracting, by the Data Processing Module (3), specific information requested by the user from the retrieved documents; and displaying, by a Display Output Module (6), the outputs of the extraction.

Description

A METHOD FOR MINING ELECTRONIC DOCUMENTS AND SYSTEM THEREOF
FIELD OF INVENTION
The invention relates to a method and system of mining data. More particularly, the invention relates to a method and system for mining electronic documents from a plurality of database servers in real time without the need for going through an Extract, Transform, and Load (ETL) process.
BACKGROUND OF THE INVENTION
A legacy system that uses an RDBMS as its database cannot store documents and folios directly; the documents and folios must be split into tables with different primary keys and foreign keys before they can be stored in the RDBMS. Designing the thousands of tables involved, which must include all input tables, intermediate tables and output tables, is a very complicated undertaking. These thousands of tables complicate the systems development life cycle (SDLC) and data management. Therefore, providing a single view of a customer folio, general ledger folio, stock folio, etc. is difficult, as the data must be transformed between each process (web, transmission, storage, processing (batch, online, BPM), data-warehousing, data-mining, output).
In a legacy system, it is difficult to integrate data and systems because legacy systems involve thousands of tables and are usually developed by different groups. The Documents, Flows, Business Rules and Codes are often lost in the process, which makes it difficult for business personnel to communicate with the computer people.
Because of this complexity, software changes are slow, complex and expensive, and deadlines are difficult to meet. What is really needed is an efficient system which stores structured, semi-structured and unstructured data (and the schema which describes the data), and manages the relationships between data items.
Consequently, an Emulating Manual System (eMS) is proposed to emulate a manual filing system by storing and processing electronic documents on a Relational Database Management System (RDBMS), and the invention provides a method and system for mining data in an Electronic Document (eDoc) format.
SUMMARY OF INVENTION
The invention provides a method for mining data from documents stored in an Electronic Document (eDoc) format in real time without the need for going through an Extract, Transform, and Load (ETL) process, comprising the steps of: receiving, by a Question Formulation Module, search criteria and search filtering configurations from a user; establishing, by a Parallel Processing Module, one or more communication links between a user server and one or more database servers; transmitting, by the Question Formulation Module, the search criteria and search filtering configurations of the user to a Data Processing Module for generating filtering rules based on the received information and applying the generated rules to retrieve related documents from the one or more database servers; extracting, by the Data Processing Module, specific information requested by the user from the retrieved documents; and displaying, by a Display Output Module, the outputs of the extraction. Preferably, the Data Processing Module includes a Read Module and a Retrieval Module.
In one embodiment of the invention, the method further comprises the steps of: accumulatively retrieving, by a Cumulative Summary Module, related documents and extracting specific information from the retrieved documents over a predetermined period of time; and generating, by the Cumulative Summary Module, a summary of the extracted information.
Preferably, the Question Formulation Module is configured to execute the instructions of: determining, by the user, search criteria and configurations; and transmitting the selected criteria to the Data Processing Module upon the establishment of the communication link between the user server and the database servers.
The Parallel Processing Module is configured to execute the instructions of: determining the database server for data mining assignment; checking the availabilities of the selected database servers; establishing communication links between the user server and the database servers; activating the Data Processing Module to receive search criteria and configurations from the Question Formulation Module; and compiling and updating the outcomes of each server to a control ledger.
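The Parallel Processing Module steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: `run_parallel`, `is_available` and `process` are hypothetical names, the control ledger is modelled as a plain dictionary, and an unresponsive server is recorded in the ledger rather than ending the whole process.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(servers, is_available, process, criteria):
    """Dispatch one inquiry to each assigned server and compile a control ledger.

    servers      -- list of server names (hypothetical identifiers)
    is_available -- callable(server) -> bool; stands in for the availability check
    process      -- callable(server, criteria); stands in for the Data Processing Module
    """
    control_ledger = {}
    reachable = []
    for server in servers:
        if is_available(server):
            reachable.append(server)
        else:
            # Assumption: record the failure instead of aborting the whole run.
            control_ledger[server] = {"status": "no response", "result": None}

    # One worker per reachable server; each worker runs the data processing step.
    with ThreadPoolExecutor(max_workers=max(1, len(reachable))) as pool:
        futures = {s: pool.submit(process, s, criteria) for s in reachable}
        for server, future in futures.items():
            control_ledger[server] = {"status": "done", "result": future.result()}
    return control_ledger


if __name__ == "__main__":
    ledger = run_parallel(
        ["server-a", "server-b"],
        is_available=lambda s: s != "server-b",
        process=lambda s, c: f"{s} handled {c}",
        criteria="invoice totals",
    )
    print(ledger)
```

Collecting the per-server outcomes in a single dictionary mirrors the idea of compiling results into one control ledger; a real deployment would replace the callables with network calls.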
The Read Module is configured to execute the instructions of: receiving inputs from the Question Formulation Module; retrieving the documents based on the inputs of the user from the database servers; and activating the Retrieval Module to process the documents. The Retrieval Module, in turn, is configured to execute the instructions of extracting specified information from the retrieved documents based on the inputs of the user.
The Cumulative Summary Module is configured to execute the instructions of: receiving predefined input consisting of ledger identifier, column name, value and mode; checking whether the mode of the module is update mode; in the event that the module is in update mode: locating an eSummary ledger based on the ledger identifier from the predefined input; and updating a specified column with the latest values if the eSummary ledger is available; in the event that the module is not in update mode: locating the eSummary ledger; and extracting the specified information from the specified column.
At least one of the preceding objects is met, in whole or in part, by the invention, in which the embodiment of the invention discloses a system for inquiring or mining data from documents stored in an Electronic Document (eDoc) format in real time without the need for going through an Extract, Transform, and Load (ETL) process, comprising: a Question Formulation Module for receiving search criteria and search filtering configurations from a user; a Parallel Processing Module for establishing one or more communication links between a user server and one or more database servers; a Data Processing Module configured to receive the search criteria and search filtering configurations of the user from the Question Formulation Module, to generate filtering rules based on the received information, to apply the generated rules to retrieve related documents from the one or more database servers, and to extract specified information requested by the user from the retrieved documents; and a Display Output Module for displaying the outputs of the extraction. Preferably, the Data Processing Module includes a Read Module and a Retrieval Module.
Advantageously, the Question Formulation Module includes: a computer-executable instruction for generating a list of choices for the user to select, wherein the choices include information relating to department, repository, document, attribute, time, and parameter; and a computer-executable instruction for transmitting the selected criteria to the Data Processing Module upon the establishment of the communication link between the user server and the database servers.
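The two modes of the Cumulative Summary Module described above (update versus read) can be illustrated with a short Python sketch. It assumes, purely for illustration, that eSummary ledgers are held in a dictionary keyed by ledger identifier and that an update simply stores the latest value; the function name and the sample ledger are hypothetical.

```python
def cumulative_summary(ledgers, ledger_id, column, value=None, mode="read"):
    """Update or read one column of an eSummary ledger.

    ledgers -- dict mapping ledger identifier -> dict of column name -> value
               (an in-memory stand-in for the stored eSummary ledgers)
    Returns the stored value, or None when the ledger does not exist.
    """
    ledger = ledgers.get(ledger_id)      # locate the eSummary ledger
    if ledger is None:
        return None                      # ledger not available: nothing to do
    if mode == "update":
        ledger[column] = value           # store the latest value in the column
        return value
    return ledger.get(column)            # read mode: extract the requested value


if __name__ == "__main__":
    store = {"ES01": {"weekly_spend": 0}}
    cumulative_summary(store, "ES01", "weekly_spend", value=120, mode="update")
    print(cumulative_summary(store, "ES01", "weekly_spend"))
```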
The Parallel Processing Module includes: a computer-executable instruction for determining the database server for data mining assignment; a computer-executable instruction for checking the availabilities of the selected database servers; a computer-executable instruction for establishing communication links between the user server and the database servers; a computer-executable instruction for activating the Data Processing Module; and a computer-executable instruction for compiling and updating the outcomes of each server to a control ledger.
The Read Module includes: a computer-executable instruction for receiving inputs from the Question Formulation Module; a computer-executable instruction for retrieving the documents based on the inputs of the user from the database servers; and a computer-executable instruction for activating the Retrieval Module to process the documents. The Retrieval Module, in turn, includes a computer-executable instruction for extracting specified information from the retrieved documents based on the inputs of the user.
In one embodiment of the invention, the system further includes a Cumulative Summary Module, the module comprising: a computer-executable instruction for accumulatively retrieving related documents and extracting information from the retrieved documents over a predetermined period of time; and a computer-executable instruction for generating a summary of the extracted information.
Effectively, the invention provides a platform where users can search and view any data available in multiple databases in its original form (a document). This is possible because of the uniqueness of the Emulating Manual System (eMS) Database, an Account-Centric storage, which does not implement the legacy way of storing data (by separating data into segments). All data stored in the eMS Database are in their original form (a document) and all of them are stored in only one database. eMS data mining does not require an ETL process to build a data warehouse because all available data represent a complete history. As a result, the eMS database itself can act as a Data Warehouse. Since the eMS database can act as a Data Warehouse, users can mine data through the platform in real time and can request any information, anytime and anywhere, as long as the data is available. Furthermore, information retrieved using this platform is SOX compliant. In other words, the source of the retrieved data can be proven.
One skilled in the art will readily appreciate that the invention is well adapted to carry out the objects and obtain the ends and advantages mentioned, as well as those inherent therein. The embodiments described herein are not intended as limitations on the scope of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
For the purpose of facilitating an understanding of the invention, the preferred embodiments are illustrated in the accompanying drawings, from an inspection of which, when considered in connection with the following description, the invention, its construction and operation and many of its advantages would be readily understood and appreciated. The drawings illustrate:
a flow chart for creating an Electronic Document (eDoc) template string;
a flow chart of the extracting process from an Electronic Document (eDoc) String;
a flow chart of the retrieving process for data from the Column of an extracted Electronic Document Header (eDoc-Header);
a flow chart of the updating process for data from the retrieved row index and column index;
a flow chart of the uploading process for an Electronic Document (eDoc) String;
a flow chart of the paging process of an Electronic Document (eDoc) for an Electronic File (eFile);
a flow chart of the indexing process of an Electronic Document (eDoc) for an Electronic File (eFile);
a flow chart of the reading process of an Electronic Document (eDoc) for an Electronic File (eFile);
a flow chart of the Mapping process of an Electronic Document (eDoc) for an Electronic File (eFile);
an example of how an Electronic Dictionary (eDict) or metadata is used to describe the attribute/behavior in a string;
an example eLedger containing details of a customer profile and item details;
an example creation of a subDoc based on the example as in Figure 11;
an example of how eFiles are stored in an RDBMS Table;
a Transaction Processing System;
the master flowchart of the Data Mining program;
the process of the Question Formulation module, where users can make data inquiries;
the process of parallel computing, which assigns the server location and handles the result retrieval from different servers;
the Data Processing module, which retrieves eDoc(s) from the database and gets the data from the retrieved eDoc(s);
the process of data retrieval from, or data update to, the cumulative summary ledger;
the process of output display to the user's screen; and
a block diagram of the Data Mining system.
DETAILED DESCRIPTION OF THE INVENTION
The invention will now be described in greater detail, by way of example, with reference to the drawings.
Data is stored in a format called Electronic Document (eDoc), which serves as the display, storage, processing, and transmission format throughout the systems development life cycle, without transformation at any stage. Data can be imported from or exported to any format including PDF, XML, XLS and CSV.
An Electronic File (eFile) stores eDocs (with all data file types) on a database (RDBMS). The Filing System predominantly utilizes the database read, write and index functions only. Therefore it can utilise almost all popular RDBMSs and, if necessary, can handle any customised, in-house database systems.
As illustrated in Figure 1, the system to emulate a manual filing system for storing and processing documents, operating on a Relational Database Management System (RDBMS), comprises: a String Template (1) having at least one detail of document number, number of sections and number of rows defined based on at least one Input; a String Module (2) for generating an Electronic Document (eDoc) (11) having at least one Electronic Document Identifier (eDoc-Identifier), Section, Rowtype and Column by validating the document number, number of sections and number of rows based on the String Template (1); and an Extraction Module (3) for extracting the Electronic Document Identifier (eDoc-Identifier), Section, Rowtype and Column of the Electronic Document (eDoc) (11) generated by the String Module (2) for the retrieval process.
The system also includes a Retrieval Module (4) for retrieving at least one Retrieved Data from the data of the Electronic Document (eDoc) (11) stored in the database based on at least one Input of the Section, Rowtype and Column; an Updating Module (5) for updating the Retrieved Data of the Electronic Document (eDoc) (11) and storing at least one Updated Data to the database based on the Input of Section, Rowtype and Column defined; and a Formation Module (6) for forming the updated Electronic Document (eDoc) (11) by retrieving the Updated Data based on the Input of Section, Rowtype and Column.
Further, the system has a Paging Module (7) for appending the Electronic Document (eDoc) (11) in the database into at least one Electronic File (eFile) (13) according to a predefined Page limit; an Indexing Module (8) for forming at least one Index to the Electronic File (eFile) (13) based on document identifier, date, end sequence number, document status, document offset and document length; and a Read Module (9) for retrieving the Index and at least one Data Relative Page (Page 0) of the Electronic File (eFile) (13) based on at least one Read Input to at least one Output.
In addition, the system further includes a Mapping Module (10) for updating at least one Retrieved Data based on at least one Mapping Input by determining the Electronic File (eFile) (13) using the Read Module (9) to retrieve the Retrieved Data of the Electronic Document (eDoc) (11) using the Retrieval Module (4), in which the Updating Module (5) updates the Retrieved Data to the database and the Retrieved Data is formed into the Electronic Document (eDoc) (11) using the Formation Module (6) for updating into at least one Electronic File (eFile) (13) using the Paging Module (7) and forming at least one Index using the Indexing Module (8); and an Enquiry Module (14) for retrieving a plurality of Electronic Document (eDoc) (11) information using the Read Module (9) based on at least one Information for the Electronic Document Identifier (eDoc-Identifier), Section, Rowtype and Column of the Electronic Document (eDoc) (11), in which the retrieved Electronic Document (eDoc) (11) information, having at least one file history, is displayed in at least one list form.
As illustrated in Figure 2, the system or module is initiated by creating a template: a new document is defined, together with the number of sections and number of rows required, based on an input from a user or a database (101). Then, the system or module creates an Electronic Document (eDoc) based on the document number defined in the input (102). Thereafter, the system or module interprets the eDoc Identifier for further processing and validates how many Sections or Rowtypes are defined in the input to form an eDoc template string (103). If a Section is defined in the input, the system or module creates a new Section and defines or classifies (labels) the Section if it is the first Section in the input (106). Then the system or module validates whether any Rowtype is defined in the input (107). If a Rowtype is defined in the input, the system or module creates a new Rowtype based on a Predefined Dictionary (108). The system or module then validates whether any other Section or Rowtype is defined in the input for further processing (104, 107). If no other Section or Rowtype is defined in the input, the system or module proceeds to generate the eDoc template string with a document number (105).
As illustrated in Figure 3, the system or module is initiated by retrieving an Electronic Document (eDoc) String for the extraction process (201), where the system extracts an eDoc Identifier, having a document number, from the eDoc String (202). Then, the system determines the Section of the eDoc Identifier predefined in the eDoc String (203). If a Section is defined in the eDoc Identifier (204), the system splits the Section into Rowtypes for further processing (205). Then, each Rowtype is split into columns of data (207), and the columns of data are stored in a Database (208). The system then validates whether any other Section or Rowtype is defined in the input for further processing (204, 206). If no other Section or Rowtype is defined in the input, the system proceeds to the retrieval of data in the Column.
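The splitting of an eDoc String into Sections, Rowtypes and Columns can be sketched as below. The terminator characters used here (`~`, `|`, `^`) and the sample string are purely hypothetical placeholders chosen for the illustration; the actual terminator codes are those defined in the patent's Table 1.0.

```python
# Hypothetical terminators, NOT the codes from Table 1.0.
SECTION_END = "~"
ROW_END = "|"
COLUMN_END = "^"


def extract_edoc(edoc_string):
    """Split an eDoc-like string into {section_index: [(rowtype, [columns])]}."""
    store = {}
    sections = [s for s in edoc_string.split(SECTION_END) if s]
    for section_index, section in enumerate(sections):
        rows = []
        for row in (r for r in section.split(ROW_END) if r):
            # The first field is treated as the row code (e.g. RxxV).
            rowtype, *columns = row.split(COLUMN_END)
            rows.append((rowtype, columns))
        store[section_index] = rows
    return store


if __name__ == "__main__":
    sample = "RID0^D01^alice|~R01A^pen^2|R01A^ink^5|~"
    for section, rows in extract_edoc(sample).items():
        print(section, rows)
```

The returned dictionary plays the role of the database in step 208: each column of data is addressable by section, row and column position.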
As illustrated in Figure 4, the system or module is initiated by receiving input from a database or a user (301). Then, the system validates the Section and Rowtype based on the received input for the retrieval process (302, 303). If a valid Section is determined, the system validates the Rowtype. If a valid Rowtype is determined, the system locates the row index and column index (304). The data is then retrieved from the located row index and column index (305), and the results are output for further processing (306).
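A minimal Python sketch of this retrieval flow follows. It assumes a hypothetical in-memory layout in which an extracted eDoc is a dictionary mapping each Section to a list of `(rowtype, columns)` pairs; the function name and the sample data are illustrative only.

```python
def retrieve(doc, section, rowtype, column_index):
    """Validate Section and Rowtype, locate the indexes, and return the data.

    doc -- dict mapping section -> list of (rowtype, [column values])
    Returns (row_index, column_index, value).
    """
    if section not in doc:                       # validate the Section
        raise KeyError(f"unknown section: {section!r}")
    for row_index, (row_code, columns) in enumerate(doc[section]):
        if row_code == rowtype:                  # validate the Rowtype
            if not 0 <= column_index < len(columns):
                raise IndexError(f"column {column_index} out of range")
            return row_index, column_index, columns[column_index]
    raise KeyError(f"unknown rowtype: {rowtype!r}")


if __name__ == "__main__":
    doc = {
        0: [("RID0", ["D01", "alice"])],
        1: [("R01A", ["pen", "2"]), ("R02A", ["total", "7"])],
    }
    print(retrieve(doc, 1, "R02A", 1))
```

The updating flow of the following figure would use the same lookup and then assign to the located position instead of returning it.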
As illustrated in Figure 5, the system or module is initiated by receiving input from a database or a user (401). Then, the system validates the Section and Rowtype based on the received input for the updating process (402, 403). If a valid Section is determined, the system validates the Rowtype. If a valid Rowtype is determined, the system locates the row index and column index (404). Then, the data at the located row index and column index is updated based on the input received (405), and the updated data is stored in a Database (406).
As illustrated in Figure 6, the system or module is initiated by retrieving the updated data stored in a Database for the uploading process (501), where the system appends the Electronic Document (eDoc) to the output (502). Then, the system determines the number of Sections to be assembled from the updated data stored in the Database. If a Section is defined in the updated data, the system retrieves all the Sections from the updated data stored in the Database (503). Then, the system retrieves all Rowtypes for further processing (506). The system then validates whether any other Section or Rowtype is defined in the stored updated data for further processing (504, 507). If no other Section or Rowtype is defined in the input, the system proceeds to append the retrieved Section, Rowtype and values (508) to be uploaded into an eDoc String (505).
As illustrated in Figure 7, the system or module is initiated by receiving input from a database or a user, which contains information such as: ledger identifier, document identifier, account 1 and account 2, and eDoc (601). Then, the system validates with the database whether this account is a new account (602). If it is not a new account, the system retrieves the existing Page from the database for later processing (603) and appends the eDoc from the input to the eDoc from the Page (604). If it is a new account, the system proceeds directly to check whether the length of the combined eDoc is greater than the Page limit (605). If the length of the combined eDoc is greater than the Page limit, the system chops the combined eDoc into the desired Pages according to the predefined Page limit (608). Otherwise, each Index is formed based on document identifier, date, end sequence number, document status, document offset and document length (606). Then, the Page and Index are stored in the database (607).
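The paging step (appending an eDoc to an existing Page and chopping the result by the Page limit) can be sketched as follows. The function name `page_edoc` is hypothetical, and Pages are modelled as plain Python strings measured in characters, which is an assumption made for the illustration.

```python
def page_edoc(existing_page, edoc, page_limit):
    """Append an eDoc to the existing Page data and split it by the Page limit.

    existing_page -- data already stored for the account ("" for a new account)
    edoc          -- the incoming eDoc string
    page_limit    -- maximum number of characters per Page (must be positive)
    Returns the list of resulting Pages in order.
    """
    if page_limit <= 0:
        raise ValueError("page_limit must be positive")
    combined = existing_page + edoc
    if len(combined) <= page_limit:
        return [combined]                 # fits in a single Page
    # Chop the combined eDoc into consecutive Pages of at most page_limit.
    return [combined[i:i + page_limit] for i in range(0, len(combined), page_limit)]


if __name__ == "__main__":
    print(page_edoc("abc", "defgh", 4))   # existing account, overflow into two Pages
    print(page_edoc("", "ab", 4))         # new account, single Page
```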
As illustrated in Figure 8, the system or module is initiated by receiving input from a database or a user, which contains information such as: document identifier, date, end sequence number, document status, document offset and document length (701). Then, an Index is formed by combining all inputs into a string in which each input is separated by a colon (:) (702), and the formed Index is returned to the system that triggered this operation (703).
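The index formation described above amounts to joining the six inputs with colons. A minimal sketch, with a hypothetical function name and sample values, and assuming no field itself contains a colon:

```python
def form_index(document_id, date, end_sequence_no, status, offset, length):
    """Combine the six index inputs into one colon-separated Index string."""
    fields = (document_id, date, end_sequence_no, status, offset, length)
    return ":".join(str(field) for field in fields)


if __name__ == "__main__":
    print(form_index("D01", "20150101", 7, "A", 0, 42))
```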
As illustrated in Figure 9, the system or module is initiated by receiving input from a database or a user, which contains information such as: ledger identifier, document identifier, account 1 and account 2 (801). Then, the system retrieves the Index (indexes) and the DATA of the Relative Page for a given or specified account from an eFile stored in the database (802). Thereafter, the Index is parsed into individual indexes for processing (803). Then, the system retrieves an index that contains the document identifier or information from the input received (804) and validates whether there are indexes matching the input received (805). If a match is found, the system extracts the offset and the length of the target eDoc (806) and then extracts the eDoc from the DATA of the Relative Page (807). Finally, the extracted eDoc is output to the database or the requesting user (808).
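The read flow can be sketched as below: find the index whose document identifier matches, then slice the eDoc out of the Page data using the stored offset and length. The sketch assumes the colon-separated field order listed for the Index (document identifier, date, end sequence number, status, offset, length) and takes the indexes as an already-split list, since the separator between individual indexes is not stated here. All names and sample data are hypothetical.

```python
def read_edoc(indexes, page_data, document_id):
    """Return the eDoc for document_id from page_data, or None if not indexed.

    indexes   -- list of colon-separated Index strings
    page_data -- the DATA of the Relative Page, as one string
    """
    for index in indexes:
        fields = index.split(":")            # parse one individual index
        if fields[0] == document_id:         # matching document identifier
            offset, length = int(fields[4]), int(fields[5])
            return page_data[offset:offset + length]
    return None


if __name__ == "__main__":
    data = "AAAAABBBBBBB"
    idx = ["D01:20150101:1:A:0:5", "D02:20150102:2:A:5:7"]
    print(read_edoc(idx, data, "D02"))
```

Storing the offset and length in the index means the target eDoc can be extracted without parsing the other documents on the Page.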
As illustrated in Figure 10, the system or module is initiated by receiving input from a database or a user, such as the source or details of an eDoc (901). The source eDoc is then parsed for further processing (902), the destination eDoc is identified and loaded for later updating (903), and the predetermined mapping instructions are loaded (904). The system then checks whether there are more mapping instructions (905). If there are, the system validates whether the data of the predetermined source column fulfils the predetermined requirement (907), performs the computation on the data of the predetermined source column (908), sums the result of the computation with the data of the predetermined destination column and updates the final result to the predetermined destination column (909). This process is carried on until there are no more mapping instructions (905), at which point the system stores the updated destination eDoc back into the database (906).
The eDoc Filing System is an account-centric system that acts as a display, transmission, storage and processing medium from end to end without requiring any other transformation or normalization.
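The mapping loop of Figure 10 can be illustrated with the following sketch. It assumes, for illustration only, that source and destination eDocs are flat dictionaries of numeric columns and that each mapping instruction is a dictionary with hypothetical keys `source_column`, `requirement`, `compute` and `destination_column`.

```python
def apply_mappings(source, destination, instructions):
    """Apply mapping instructions from a source eDoc to a destination eDoc.

    For each instruction: check the source column against the requirement,
    compute a result, and add it to the destination column.
    Returns an updated copy of the destination.
    """
    updated = dict(destination)
    for instruction in instructions:               # loop until no more instructions
        value = source[instruction["source_column"]]
        if not instruction["requirement"](value):  # requirement not fulfilled: skip
            continue
        result = instruction["compute"](value)
        column = instruction["destination_column"]
        updated[column] = updated.get(column, 0) + result  # sum into destination
    return updated


if __name__ == "__main__":
    source = {"amount": 100, "quantity": 0}
    destination = {"total": 50}
    instructions = [
        {"source_column": "amount", "requirement": lambda v: v > 0,
         "compute": lambda v: v * 2, "destination_column": "total"},
        {"source_column": "quantity", "requirement": lambda v: v > 0,
         "compute": lambda v: v, "destination_column": "count"},
    ]
    print(apply_mappings(source, destination, instructions))
```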
Electronic File (eFile) is an electronic folio (similar to a file in conventional manual filing systems) where all types of documents with different data types can be stored together in an account-centric manner.
The Filing System logically stores all data and information that relate to a single account in an Electronic File (eFile), in chronological order. Electronic Documents (eDocs) are stored as sequential strings of data mapped to a data dictionary, and may include multiple data types in each string (e.g. image files, binary files, comma separated format, XML or any of the nearly 500 data formats in existence today). This allows the storage of any type of data within one record. The way eDoc stores its data provides near real-time data mining without the need for data modeling.
eDoc is a data storage format comprising strings containing multiple rows, each preceded by a unique row code RxxV, where Rxx is the row number and V the version number. Multiple rows of data of various row types make up an eDoc. All data is stored in variable-length or fixed-length columns. Each row contains multiple columns separated by terminators. There are special terminators for the start and end of DxxV (documents), RxxV (rows), etc. eDoc is designed for change: various versions of RxxV and DxxV can exist concurrently.
eDoc can be converted to XML and vice versa. eDoc is similar to XML in that its data also has separators, identifiers and tags, but eDoc has additional system fields that provide new functionality. If required, XML is used as a universal transmission document and passed to other systems, where data can be normalized to tables. Tables 1.0 and 2.0 further describe the terminators (separators), identifiers and tags.
eDoc String
Example of eDoc String data structure (stored in LxxV):
Figure imgf000017_0001
Terminators (separator) coding structure
Figure imgf000017_0002
Figure imgf000018_0002
Table 1.0
LDSRC coding structure
Figure imgf000018_0001
Table 2.0
Each Document contains only one Document Identifier (such as RID0), which is stored in the first Section. The Document Identifier contains details such as creator details, document details, update history, attributes, etc. Furthermore, the eDoc String data structure is also an N-dimensional data structure, in which another eDoc String can be encapsulated within the u[ ... u] and stored in a Column. The LDSRC Codes also represent the GIS of a stored eDoc String. To retrieve the eDoc String, the LDSRC Codes are used to locate it.
Figure imgf000019_0001
Figure imgf000020_0001
eDict
As illustrated in Figure 11, the Electronic Dictionary (eDict), or metadata, is used to describe the attributes and behavior of each ledger (LxxV), document (DxxV) and Rowtype (RxxV). At the LxxV level, the ledger identifier, the eDoc updating method (FIFO, LIFO, Update or Overwrite) and the number of eDocs to be kept in the eLedger are predefined in the Ledger eDict. At the DxxV level, the document types that can be stored are predefined in the Document eDict. At the RxxV level, the Rowtype eDict is divided into three parts: first, general attributes such as name, data type and data length; second, display attributes such as font type, size and color; third, computation attributes such as data validation and computation. Tables 3.0, 4.0 and 5.0 show examples of the metadata or library predefined for Ledger, Document and Rowtype.
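As a rough illustration of the three-part Rowtype eDict, the following Python sketch models one column definition with general, display and computation attributes. The class name `ColumnDict`, its field names and the sample `quantity` entry are hypothetical; the actual eDict fields are those listed in Table 5.0.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDict:
    """One Rowtype eDict entry, split into the three parts described above."""
    name: str                                     # general attribute
    data_type: type                               # general attribute
    max_length: int                               # general attribute
    display: dict = field(default_factory=dict)   # display attributes, e.g. font
    validate: object = None                       # computation attribute (callable)

    def check(self, value) -> bool:
        """Return True if value satisfies the type, length and validation rule."""
        if not isinstance(value, self.data_type):
            return False
        if len(str(value)) > self.max_length:
            return False
        return self.validate is None or bool(self.validate(value))


# Hypothetical entry for a quantity column that must be a non-negative integer.
qty = ColumnDict("quantity", int, 6, {"font": "Arial"}, lambda v: v >= 0)
```

Keeping the rules in metadata rather than in program code is what would let a Rowtype change version without a schema migration, under the assumptions of this sketch.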
Ledger eDict - Definition
Figure imgf000021_0001
Figure imgf000022_0002
Table 3.0
Document eDict - Definition
Figure imgf000022_0001
Figure imgf000023_0001
Figure imgf000024_0002
Table 4.0
Rowtype eDict - Definition
Figure imgf000024_0001
Figure imgf000025_0001
Table 5.0
Example of eDict structure
Figure imgf000026_0001
The Electronic Ledger (eLedger) is where summaries or derivatives of an eFile are kept in variable-length or fixed-length format, allowing greater flexibility and fast retrieval. The information in an eLedger can be deleted and modified. Each eFile can have multiple eLedgers if required (for speedy reporting purposes). The method by which each eDoc updates the eLedger is predefined in the eLedger dictionary. The eLedger can contain n copies of eDocs arranged in FIFO or LIFO manner; or a new eDoc can overwrite the existing eDoc in the eLedger; or the update can manipulate only data from certain column(s) in the eDoc against the predefined column(s) in the eLedger. The system may further include a Zero Balancing function, where every transaction can be traced and no information is ever deleted, which means everything will always balance (to the last cent). All transactions have a copy in the Transaction Ledger, so changes to any account are immediately verifiable and problems can be isolated. This may also make the system naturally SOX compliant (Sarbanes-Oxley Act of 2002). The system may further include Reverse Processing, where a new eLedger can be generated or regenerated from an eFile based on a new or updated configuration.
As illustrated in Figure 12, the eLedger contains an example customer profile that includes customer details (RNA6 - Name and Address Rowtype) and summaries of the total items bought, such as apples, oranges and pears, daily (R320 - 32-day Rowtype) and monthly (R130 - 13-month Rowtype) for the year 2014. The summaries in the eLedger are populated from the daily transactions in the eFile.
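The four eLedger update methods can be sketched in Python as follows. This is an illustrative reading only: the function name `update_eledger`, the use of dictionaries for eDocs, the eviction order under LIFO and the choice to add column values together under Update are all assumptions that the description does not spell out.

```python
def update_eledger(ledger: list, edoc: dict, method: str, keep: int = 3,
                   columns: tuple = ()) -> list:
    """Return a new eLedger list after applying one eDoc with the given method."""
    if method == "FIFO":        # keep arrival order, drop the oldest when full
        return (ledger + [edoc])[-keep:]
    if method == "LIFO":        # newest first, drop from the tail (assumed)
        return ([edoc] + ledger)[:keep]
    if method == "OVERWRITE":   # the new eDoc replaces whatever was stored
        return [edoc]
    if method == "UPDATE":      # accumulate selected columns only (assumed)
        summary = dict(ledger[0]) if ledger else {}
        for col in columns:
            summary[col] = summary.get(col, 0) + edoc.get(col, 0)
        return [summary]
    raise ValueError(f"unknown update method: {method}")


# Daily purchases accumulated into a single summary row.
daily = []
for purchase in ({"apple": 2, "pear": 1}, {"apple": 3}):
    daily = update_eledger(daily, purchase, "UPDATE", columns=("apple",))
```

In this sketch `keep` plays the role of the number of eDocs to be kept in the eLedger, which the Ledger eDict is described as predefining.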
Rowtype Header & Footer
Figure imgf000027_0001
Table 6.0
All Rowtypes contain a Header with 8 columns and a Footer with 4 columns, as shown in Table 6.0 above. The row code (RWCD) of the Rowtype Header indicates its uniqueness among other Rowtypes of the same type that appear within a Section, while ledger (RWLG), account 1 (RWA1), account 2 (RWA2) and company & department (RWCO) indicate the location of the Rowtype in the database. The security field (RWSE) of the Rowtype Footer is used to ensure that only the right user(s) can access the row, and the checksum (RWCS) ensures that the data within the row is not corrupted.
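The role of the two footer fields named above can be sketched in Python. The checksum algorithm is not stated in the description, so CRC32 from the standard `zlib` module is used here purely as a stand-in; the function names and the representation of RWSE as a set of user names are likewise assumptions.

```python
import zlib


def row_checksum(columns: list) -> int:
    """CRC32 over the joined column values (the algorithm is an assumption)."""
    return zlib.crc32("|".join(columns).encode("utf-8"))


def make_row(columns: list, allowed_users: set) -> dict:
    """Attach footer fields named after RWSE and RWCS to a row."""
    return {"data": list(columns),
            "RWSE": set(allowed_users),
            "RWCS": row_checksum(columns)}


def read_row(row: dict, user: str) -> list:
    """Return the row data only if the user is permitted and the data is intact."""
    if user not in row["RWSE"]:
        raise PermissionError(f"{user} may not access this row")
    if row_checksum(row["data"]) != row["RWCS"]:
        raise ValueError("row data is corrupted")
    return row["data"]


row = make_row(["apple", "3"], {"alice"})
```

Checking access and integrity per row, rather than per table, matches the description's placement of these fields in every Rowtype Footer.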
Subsequent Documents (SubDoc)
As illustrated in Figure 13, Subsequent Documents (subDocs) are created when the system splits a Doc so that it can be debited/credited to the relevant accounts; each subDoc is appended as a string, one after another. The Main Doc and its subDoc(s) have the same document identifier. For example, an invoice with document identifier D232 may have a subDoc to debit the customer account and subDoc(s) to credit the Apple, Orange and Pear Stock (referring to the example in Figure 2).
Reserve and Commit
Reserve and Commit is a process in which a set of predefined requirements must be met before any updating can take place. For example, for an invoice, the requirements are that the customer must have sufficient credit to be debited from the account and that there must be sufficient stock to be stocked out before the process is committed.
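The invoice example can be sketched as a check-then-apply routine in Python. The function name `reserve_and_commit` and the dictionary layout of the invoice, credit and stock records are hypothetical; the sketch also ignores concurrency, which a real system would need to handle.

```python
def reserve_and_commit(invoice: dict, credit: dict, stock: dict) -> bool:
    """Apply an invoice only if every predefined requirement is satisfied."""
    customer, amount = invoice["customer"], invoice["amount"]
    # Reserve phase: check all requirements without changing anything.
    if credit.get(customer, 0) < amount:
        return False
    for item, qty in invoice["items"].items():
        if stock.get(item, 0) < qty:
            return False
    # Commit phase: all checks passed, so apply every update.
    credit[customer] -= amount
    for item, qty in invoice["items"].items():
        stock[item] -= qty
    return True


credit = {"C1": 100}
stock = {"apple": 5}
first = reserve_and_commit(
    {"customer": "C1", "amount": 40, "items": {"apple": 2}}, credit, stock)
second = reserve_and_commit(
    {"customer": "C1", "amount": 10, "items": {"apple": 10}}, credit, stock)
```

Because nothing is modified until every check has passed, a rejected invoice leaves both the credit and the stock records untouched.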
Header + Index + Data
As illustrated in Figure 14, the eFiles are stored in an RDBMS table, where the table comprises Control, Index and Data sections. The Control section contains the key and details about the Page. The Index is used to locate each eDoc in a Page; the indexing is done in a horizontal manner to create a sub-filing system within a filing system. The Data section is where the eFile is stored.
Figure imgf000029_0001
Each account contains an eFile, and the eFile contains a number of eDocs. The eFile is chopped into Pages according to the Page size before being stored in the RDBMS. Page numbering begins from the Relative Page: when a new Page is added, the existing Relative Page is advanced to Page 1 and the newly added Page is numbered 0, and so forth. The Relative Page is also relative to the system; an enquiry will always start from the Relative Page.
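One possible reading of the paging scheme is sketched below in Python, assuming that the newest Page always carries number 0 and older Pages shift up by one each time a Page is added. The `PAGE_SIZE` value and the function name `append_edoc` are illustrative, and the sketch keeps each eDoc whole within a Page, which is a simplification of chopping the eFile by Page size.

```python
PAGE_SIZE = 16  # assumed page size in characters, for illustration only


def append_edoc(pages: list, edoc: str, page_size: int = PAGE_SIZE) -> None:
    """Append an eDoc to the newest page, opening a new page when it is full.

    pages[0] is always the newest page (number 0 in this reading); inserting
    a new page at the front advances every older page number by one.
    """
    if not pages or len(pages[0]) + len(edoc) > page_size:
        pages.insert(0, "")   # the new page becomes page 0
    pages[0] += edoc


pages = []
for edoc in ("A" * 10, "B" * 10, "C" * 5):
    append_edoc(pages, edoc)
```

With this numbering an enquiry that starts at Page 0 reaches the most recent eDocs first, without needing to know how many Pages the eFile holds.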
The Control section may also include the following:
lg - ledger identifier
acl - account 2
lpgn - last page no
ssq - start document sequence no
sln - start Page line no
esq - end document sequence no
eln - end Page line no
date - last updated date
st - the status of the eFile such as deleted
co - company and department
bal - balance of all eDocs
Transaction Processing
Transaction Processing ensures that any eDocs to be stored in the database are Sarbanes-Oxley (SOX) compliant. This is achieved by making sure that the status of each storing and updating process is reported back to Transaction Processing; in this case, the eDoc sequence number is used. The process is considered complete when the storing and updating at the Transaction eFile and eLedger and at the Master eFile and eLedger have executed successfully.
As illustrated in Figure 14, the Transaction Processing System processes an eDoc transaction by first receiving an eDoc from a program (1001). It then stores the received eDoc into the Transaction eFile using the Paging and Indexing Module (1002). Thereafter, it updates the received eDoc to the Transaction eLedger using the Paging and Indexing Module (1003) and verifies whether the Transaction eLedger was updated successfully (1004). If so, the system stores the received eDoc into the Master eFile using the Paging and Indexing Module (1005) and then updates the received eDoc to the Master eLedger using the Mapping Module (1006). The system then verifies the Master eLedger update, going to step 1005 or else to step 1008 to restart the process. Finally, if the Master eLedger was updated successfully, the system returns the update status.
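The store-update-verify sequence can be sketched in Python as a loop over the four targets that reports which steps completed. This is a simplified illustration: the function name `process_transaction`, the in-memory lists standing in for the eFiles and eLedgers, and the use of a missing target to model a failed write are assumptions, and the restart path of the flowchart is omitted.

```python
TARGETS = ("txn_efile", "txn_eledger", "master_efile", "master_eledger")


def process_transaction(edoc: dict, store: dict) -> dict:
    """Run one eDoc through the four storing/updating steps and report status."""
    status = {"seq": edoc["seq"], "steps": [], "ok": False}
    for target in TARGETS:
        try:
            store[target].append(edoc)   # stand-in for the real write
        except KeyError:
            return status                # stop at the first failed step
        status["steps"].append(target)   # step reported back as done
    status["ok"] = True                  # complete only when all four succeed
    return status


store = {name: [] for name in TARGETS}
result = process_transaction({"seq": 1, "amount": 40}, store)
```

Returning the eDoc sequence number together with the list of completed steps mirrors the description's point that the status of each storing and updating process is reported back.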
Data Mining
Referring to Figure 15, the master flow of the Data Mining Program is illustrated. When Data Mining is triggered, it triggers a Question Formulation Module, in which users can enquire on any data available in the database (1101). From there, the Question Formulation Module generates the enquiries based on the information the users have chosen (1102). To process the enquiries, the Question Formulation Module triggers a Parallel Processing Module to handle location assignment and the control ledger of the server (1103). Each assigned location then triggers the Data Processing Module, which handles two tasks: (i) searching for and retrieving the requested eDoc(s) from the database using the Read Module, and (ii) parsing the retrieved eDoc(s) and getting the value from the specified column using the Retrieval Module (1104). The Data Processing Module then triggers a Cumulative Summary Module to further process the enquiries (1105). Finally, all output is sent to the Display Output Module, which handles displaying the output (1106).
Referring to Figure 16, the process flowchart of the Question Formulation Module is illustrated. First, users choose the department in which they would like to enquire (1201). For example, if they are interested in the sales for a particular month, they may choose the Sales department, or if they want to investigate customer information, they may opt for the Customer department. Each of these departments is stored in one ledger. Next, within that department, they choose whether to enquire in the eFile, eLedger or eSummary: the eFile consists of all transaction data made by an account, the eLedger consists of the latest transaction data made by an account, and the eSummary consists of a summary of data from either the eFile or the eLedger (1202). Next, users choose which document they want to look into (1203). For instance, in the Sales department, users may be interested in the Receipt document. If users want to mine data on certain accounts only, they can enter the account names (1205); if there is more than one account, each account name is followed by a comma (,) and the next account name. Users may then choose to look at only some attributes or at all attributes (1206). If users want to specify a parameter on a chosen attribute, they choose the symbol for the parameter, e.g. '==' or '>', and then enter the parameter (1208, 1209). If they have more queries to run, they can start again from the beginning (1210). Once all input is complete, the module triggers the Parallel Processing Module to handle the search across the different server locations and to collect the results from each server (1211).
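The user selections gathered by the Question Formulation Module can be pictured as a small structured enquiry. The Python sketch below is illustrative only: the `Enquiry` class, its field names and the `formulate` helper are hypothetical, and only the comma-separated account entry and the choice among eFile, eLedger and eSummary come from the description.

```python
from dataclasses import dataclass

REPOSITORIES = ("eFile", "eLedger", "eSummary")


@dataclass
class Enquiry:
    department: str        # each department is stored in one ledger
    repository: str        # "eFile", "eLedger" or "eSummary"
    document: str          # e.g. "Receipt"
    accounts: list         # an empty list means all accounts
    attributes: list       # an empty list means all attributes
    operator: str = ""     # e.g. "==" or ">"
    parameter: str = ""


def split_names(text: str) -> list:
    """Split comma-separated user input into a clean list of names."""
    return [part.strip() for part in text.split(",") if part.strip()]


def formulate(department, repository, document, accounts="", attributes="",
              operator="", parameter="") -> Enquiry:
    """Build an Enquiry from raw user input."""
    if repository not in REPOSITORIES:
        raise ValueError(f"unknown repository: {repository}")
    return Enquiry(department, repository, document,
                   split_names(accounts), split_names(attributes),
                   operator, parameter)


enquiry = formulate("Sales", "eFile", "Receipt", "Alice, Bob", "total", ">", "100")
```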
Referring to Figure 17, the process flowchart of the Parallel Processing Module is illustrated. The Parallel Processing Module assigns the server location. First, the module checks whether the server at the location has responded (1301). If the assigned server has not responded, the module prompts an error message and ends the process (1303). If there is a response from the server, the module triggers the Data Processing Module to further process the enquiry (1304). After each server has finished processing the data, the module compiles the results from each server and updates them to the control ledger (1305).
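The fan-out to several server locations can be sketched with Python's standard `concurrent.futures` thread pool. The function name `run_on_servers`, the use of callables to stand in for the Data Processing Module at each location and the use of `None` to model a server that has not responded are assumptions. Unlike the flowchart, which ends the process on a non-responding server, this sketch records the error for that location and continues with the others.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_servers(servers: dict, enquiry) -> dict:
    """Send one enquiry to every server in parallel and compile the results.

    `servers` maps a location name to a callable that stands in for the Data
    Processing Module there; None models a server that did not respond.
    """
    control_ledger = {}
    with ThreadPoolExecutor() as pool:
        futures = {}
        for location, handler in servers.items():
            if handler is None:
                control_ledger[location] = "error: no response"
            else:
                futures[location] = pool.submit(handler, enquiry)
        # Collect each location's outcome into the control ledger.
        for location, future in futures.items():
            control_ledger[location] = future.result()
    return control_ledger


servers = {"KL": lambda q: [q + "-1"], "PG": None, "JB": lambda q: [q + "-2"]}
ledger = run_on_servers(servers, "receipt")
```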
Referring to Figure 18, the process flowchart of the Data Processing Module is illustrated. First, the module checks whether the account from the input is null (1402). If the account is null, the module searches the database on all accounts (1404). Otherwise, if users have requested specific accounts, the module searches the database on the specified accounts (1403). The Read Module then retrieves the requested document based on the ledger and repository specified by users (1405). Lastly, the Retrieval Module retrieves the requested value based on the specified column name and parameter (1406).
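The two stages of the Data Processing Module can be sketched in Python as a read step followed by a retrieval step. The function names, the dictionary-of-lists database and the operator table are illustrative assumptions; only the '==' and '>' symbols appear in the description, and '<' is added here for symmetry.

```python
import operator

OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


def read_module(database: dict, accounts: list) -> list:
    """Return eDocs for the given accounts, or for all accounts if none given."""
    wanted = accounts or list(database)
    return [doc for acc in wanted for doc in database.get(acc, [])]


def retrieval_module(docs: list, column: str, op: str = "", param=None) -> list:
    """Extract `column` from each eDoc, keeping values that match the parameter."""
    values = [doc[column] for doc in docs if column in doc]
    if not op:
        return values
    return [v for v in values if OPS[op](v, param)]


database = {"A": [{"total": 50}, {"total": 150}], "B": [{"total": 200}]}
large = retrieval_module(read_module(database, []), "total", ">", 100)
```

Here the enquiry runs directly against the stored documents, with no intermediate load into a separate warehouse table, which is the point the description makes about avoiding ETL.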
Referring to Figure 19, the process flowchart of the Cumulative Summary Module is illustrated. First, users pass input either to get data from the ledger or to update a value inside the ledger (1501). If the mode is update, the module locates the ledger based on the ledger identifier from the input (1503). If the ledger exists, the module gets the value from the input and updates it to the column specified in the input (1504). This update process is also known as pigeon-holing because it summarizes and accumulates data from the eFile or eLedger and updates it to the specified column. It also marks the latest position in the file since the last time the file was processed. For instance, if the summary is a person's total weekly spending, the module accumulates that person's daily spending from the beginning to the end of the week from the person's eFile or eLedger, and marks the last time the file was processed (1505). If the mode is not update, the module locates the eSummary ledger based on the ledger identifier specified by users (1506) and then retrieves the value from the column name specified in the input (1507). Finally, the module triggers the Display Output Module to handle the output display (1508).
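The pigeon-holing behavior, including the marker for the last processed position, can be sketched in Python as follows. The function names, the `position` key and the representation of the eFile as a list of dictionaries are assumptions made for illustration.

```python
def update_summary(summary: dict, efile: list, column: str) -> dict:
    """Accumulate new eFile entries into a summary column (pigeon-holing).

    summary["position"] marks how far the eFile has already been processed,
    so entries are never counted twice on a later run.
    """
    start = summary.get("position", 0)
    for edoc in efile[start:]:
        summary[column] = summary.get(column, 0) + edoc.get(column, 0)
    summary["position"] = len(efile)
    return summary


def cumulative_summary(summaries: dict, ledger_id: str, column: str,
                       mode: str, efile=()):
    """Update mode accumulates into the ledger; any other mode reads the column."""
    if mode == "update":
        if ledger_id in summaries:   # update only if the ledger exists
            update_summary(summaries[ledger_id], list(efile), column)
        return None
    return summaries.get(ledger_id, {}).get(column)


summaries = {"weekly": {}}
efile = [{"spend": 10}, {"spend": 5}]
cumulative_summary(summaries, "weekly", "spend", "update", efile)
efile.append({"spend": 7})
cumulative_summary(summaries, "weekly", "spend", "update", efile)
total = cumulative_summary(summaries, "weekly", "spend", "read")
```

Because only entries after the stored position are read, each update touches just the new transactions rather than rescanning the whole eFile.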
Referring to Figure 20, the process flowchart of the Display Output Module is illustrated. First, the module checks whether the input passed to it is null (1602). If the input is not null, the module prints it to the users' screen in the predefined output design (1603), which can take the form of a table, chart or graph. If the input is null, "No Result Retrieved" is displayed on the users' screen (1604).
Referring to Figure 21, the system for mining data is illustrated. The system comprises a Question Formulation Module (1) for receiving search criteria and search filtering configurations from a user; a Parallel Processing Module (2) for establishing one or more communication links between a user server and one or more database servers; a Data Processing Module (3) configured to receive the search criteria and search filtering configurations of the user from the Question Formulation Module, to generate filtering rules based on the received information, to apply the generated rules to retrieve related documents from the one or more database servers, and to extract specified information requested by the user from the retrieved documents; and a Display Output Module (6) for displaying the outputs of the extraction. Preferably, the Data Processing Module includes a Read Module (4) for retrieving the requested document based on the ledger and repository specified by users and a Retrieval Module (5) for retrieving the requested value based on the specified column name and parameter.
A legacy system must go through an ETL process and then load the data into a Data Warehouse before it is able to mine the data, whereas eMS Data Mining is able to mine the data directly without going through this conventional process. Hence, Data Mining in eMS is simple, fast and near real-time. The advantages of eMS Data Mining over legacy system data mining can be summarised as follows: (i) it allows multiple users to mine data using the eMS Account-centric file or ledger; (ii) all Business Data and Files are held in Account-centric Files, by customer, by Stock Code, by HR or by General Ledger (GL). The Customer File contains a complete chronological history of the documents, e.g. applications, invoices, payments, etc., and can be used for Detail Analysis. Customers are also linked by group for group analysis; (iii) the Account-centric Customer File is useful for analysis by postcode and other personal details; (iv) each Salesman can access his customer details but NOT save a copy, because eMS can provide the detailed information; (v) processing can be distributed and fast; (vi) different presentation tools can be used on the Analysis Data. The files are constantly updated and Data Mining can be done in real time.

Claims

1. A method for mining data from documents stored in an Electronic Document (eDoc) format in real time without the need for going through an Extract, Transform, and Load (ETL) process, comprising the steps of:
receiving, by a Question Formulation Module (1), search criteria and search filtering configurations from a user;
establishing, by a Parallel Processing Module (2), one or more communication links between a user server and one or more database servers;
transmitting, by the Question Formulation Module (1), the search criteria and search filtering configurations of the user to a Data Processing Module (3) for generating filtering rules based on the received information and applying the generated rules to retrieve related documents from the one or more database servers;
extracting, by the Data Processing Module (3), specific information requested by the user from the retrieved documents; and
displaying, by a Display Output Module (6), the outputs of the extraction.
2. A method according to claim 1, further comprising the steps of:
accumulatively retrieving, by a Cumulative Summary Module, related documents and extracting specific information from the retrieved documents over a predetermined period of time; and
generating, by the Cumulative Summary Module, a summary of the extracted information.
3. A method according to claim 1 or claim 2, wherein the Question Formulation Module (1) is configured to execute the instructions of:
determining, by the user, search criteria and configurations; and
transmitting the selected criteria to the Data Processing Module (3) upon the establishment of the communication link between the user server and the database servers.
4. A method according to any one of claims 1 to 3, wherein the Parallel Processing Module (2) is configured to execute the instructions of:
determining the database server for data mining assignment;
checking the availabilities of the selected database servers;
establishing communication links between the user server and the database servers;
activating the Data Processing Module (3) to receive search criteria and configurations from the Question Formulation Module (1); and
compiling and updating the outcomes of each server to a control ledger.
5. A method according to any one of claims 1 to 4, wherein the Data Processing Module (3) includes a Read Module (4) and a Retrieval Module (5).
6. A method according to claim 5, wherein the Read Module (4) is configured to execute the instructions of:
receiving inputs from the Question Formulation Module (1);
retrieving the documents based on the inputs of the user from the database servers; and
activating the Retrieval Module (5) to process the documents.
7. A method according to claim 5 and claim 6, wherein the Retrieval Module (5) is configured to execute the instructions of extracting specified information from retrieved documents based on the inputs of the user.
8. A method according to any one of claims 1 to 7, wherein the Cumulative Summary Module is configured to execute the instructions of:
receiving predefined input consisting of ledger identifier, column name, value and mode;
checking if the mode of the module is update mode;
in the event that the module is in an update mode:
locating an eSummary ledger based on the ledger identifier from the predefined input; and
updating a specified column with the latest values if the eSummary ledger is available;
in the event that the module is not in an update mode:
locating the eSummary ledger; and
extracting the specified information from the specified column.
9. A system for inquiring or mining data from documents stored in an Electronic Document (eDoc) format in real time without the need for going through an Extract, Transform, and Load (ETL) process, comprising:
a Question Formulation Module (1) for receiving search criteria and search filtering configurations from a user;
a Parallel Processing Module (2) for establishing one or more communication links between a user server and one or more database servers;
a Data Processing Module (3) configured to receive the search criteria and search filtering configurations of the user from the Question Formulation Module (1), to generate filtering rules based on the received information, to apply the generated rules to retrieve related documents from the one or more database servers, and to extract specified information requested by the user from the retrieved documents; and
a Display Output Module (6) for displaying the outputs of the extraction.
10. A system according to claim 9, wherein the Question Formulation Module (1) comprises:
a computer-executable instruction for generating a list of choices for the user to select, wherein the choices include information relating to department, repository, document, attribute, and parameter; and
a computer-executable instruction for transmitting the selected criteria to the Data Processing Module (3) upon the establishment of the communication link between the user server and the database servers.
11. A system according to claim 9 or claim 10, wherein the Parallel Processing Module (2) comprises:
a computer-executable instruction for determining the database server for data mining assignment;
a computer-executable instruction for checking the availabilities of the selected database servers;
a computer-executable instruction for establishing communication links between the user server and the database servers;
a computer-executable instruction for activating the Data Processing Module (3); and
a computer-executable instruction for compiling and updating the outcomes of each server to a control ledger.
12. A system according to any one of claims 9 to 11, wherein the Data Processing Module (3) includes a Read Module (4) and a Retrieval Module (5).
13. A system according to claim 12, wherein the Read Module (4) comprises:
a computer-executable instruction for receiving inputs from the Question Formulation Module (1);
a computer-executable instruction for retrieving the documents based on the inputs of the user from the database servers; and
a computer-executable instruction for activating the Retrieval Module (5) to process the documents.
14. A system according to claim 12 and claim 13, wherein the Retrieval Module (5) comprises a computer-executable instruction for extracting specified information from retrieved documents based on the inputs of the user.
15. A system according to any one of claims 9 to 14, further comprising a Cumulative Summary Module, the module comprises:
a computer-executable instruction for accumulatively retrieving related documents and extracting information from the retrieved documents over a predetermined period of time; and
a computer-executable instruction for generating a summary of the extracted information.
PCT/MY2015/050126 2014-10-13 2015-10-13 A method for mining electronic documents and system thereof WO2016060551A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2014703016 2014-10-13

Publications (1)

Publication Number Publication Date
WO2016060551A1 true WO2016060551A1 (en) 2016-04-21

Family

ID=55746991

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2015/050126 WO2016060551A1 (en) 2014-10-13 2015-10-13 A method for mining electronic documents and system thereof

Country Status (1)

Country Link
WO (1) WO2016060551A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010037332A1 (en) * 2000-04-27 2001-11-01 Todd Miller Method and system for retrieving search results from multiple disparate databases
US20070106671A1 (en) * 2005-11-08 2007-05-10 Fujitsu Limited Computer-readable recording medium storing data collection program and data collection apparatus
US20100198881A1 (en) * 2007-03-02 2010-08-05 E-Manual System Sdn. Bhd. Method of data storage and management

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157788A (en) * 2021-04-13 2021-07-23 福州外语外贸学院 Big data mining method and system
CN113157788B (en) * 2021-04-13 2024-02-13 福州外语外贸学院 Big data mining method and system
CN113743082A (en) * 2021-09-16 2021-12-03 京东科技信息技术有限公司 Data processing method, system, storage medium and electronic equipment
CN113743082B (en) * 2021-09-16 2024-04-05 京东科技信息技术有限公司 Data processing method, system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US20190332606A1 (en) A system and method for processing big data using electronic document and electronic file-based system that operates on RDBMS
US8396894B2 (en) Integrated repository of structured and unstructured data
US9009201B2 (en) Extended database search
US20170228356A1 (en) System Generator Module for Electronic Document and Electronic File
WO2016060547A1 (en) Emulating manual system of filing using electronic document and electronic file
US20170039286A1 (en) Integrated data mining system architecture for extraction, processing and consumption of user data for customizing search engine output and other applications
JP2008511936A (en) Method and system for semantic identification in a data system
CN108052668A (en) The endowed method and system of intelligence based on commodity code
CN106296385A (en) A kind of book keeping operation section purpose arranges and recommends method
US20170235757A1 (en) Electronic processing system for electronic document and electronic file
Walker Python Data Cleaning Cookbook: Modern techniques and Python tools to detect and remove dirty data and extract key insights
US20170235727A1 (en) Electronic Filing System for Electronic Document and Electronic File
WO2016060551A1 (en) A method for mining electronic documents and system thereof
Uvidia Fassler et al. Moving towards a methodology employing knowledge discovery in databases to assist in decision making regarding academic placement and student admissions for universities
US8504552B2 (en) Query based paging through a collection of values
US20170235747A1 (en) Electronic Document and Electronic File
JP4024267B2 (en) Supplier guidelines system
CN115048456A (en) User label generation method and device, computer equipment and readable storage medium
CN117813601A (en) System and method for enabling relevant data to be extracted from multiple documents
CN109408704B (en) Fund data association method, system, computer device and storage medium
CN113344674A (en) Product recommendation method, device, equipment and storage medium based on user purchasing power
JP4927150B2 (en) Trade settlement related data management system and method
US9208224B2 (en) Business content hierarchy
US9069812B1 (en) Method and apparatus for assembling a business document
WO2021024966A1 (en) Company similarity calculation server and company similarity calculation method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15850638

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15850638

Country of ref document: EP

Kind code of ref document: A1