GB2579564A - System and method for generating secure electronic documents - Google Patents

System and method for generating secure electronic documents

Info

Publication number
GB2579564A
Authority
GB
United Kingdom
Prior art keywords
data
file
payslip
pdf
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1819701.2A
Other versions
GB201819701D0 (en
Inventor
Cox Charles
Hancock Stephen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smith & Ouzman Ltd
Original Assignee
Smith & Ouzman Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smith & Ouzman Ltd
Priority to GB1819701.2A
Publication of GB201819701D0
Publication of GB2579564A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/105Human resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/64Protecting data integrity, e.g. using checksums, certificates or signatures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Economics (AREA)
  • Computer Hardware Design (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Bioethics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Means of document generation comprising the steps of inputting at least one data input in a first data format 71, storing mapping templates 80, matching each data input to a mapping template 80, generating a formattable data object for each data input according to the matched mapping template. Said means of document generation further comprises outputting the or each formattable data object, converting the formattable data objects into a second data format 82, and outputting data in the second data format 85. The outputted data may be secure, a payslip, or financial report. X and Y axis locations, font size, or data font may be used to match data to the mapping template. The first data input may be a PDF, spreadsheet, delimited file, fixed length file, or XML. The second data format may be in paper, braille, audio, or electronically displayed formats. This invention aims to reduce the chance of human error while changing document formats by mapping relevant data fields across formats.

Description

System and Method for Generating Secure Electronic Documents
The present invention relates to a system and method for generating secure electronic documents. In particular, the present invention relates to a system and method for generating electronic payslips.
It is known to provide digital payslips to employees. Accountancy software providers, such as SAGE™, offer tools for generation of payslips, but the format of the payslip output is fixed, so that the design, structure and branding of the payslip cannot be altered by a user. Furthermore, the data content of the payslip that is output is not the primary purpose of the tool, such that the payslip output is restricted by the data that is required for accountancy purposes. It has been found that existing tools do not allow a user to vary the format or content of their payslips and that existing payslip generation systems are overly restrictive. If a user wishes to modify the content and format of payslips, this would often require manual changes to be made, which increases the risk of human error and is a security risk given that payslips contain personal data that must be protected. It has been found that solutions allowing a user to create bespoke payslips are labour intensive and there are technical problems in ensuring that payslips can be generated securely and accurately. Payslips are required to be generated frequently, and accurate and timely production is of utmost importance. For computer-implemented generation of secure electronic documents, this means that failure or delay in document output must be avoided. It has been found that existing tools do not allow a user to control the output format of their payslips efficiently and any changes made to the input or output format affect the reliability of the generation process.
The present invention sets out to alleviate the problems described above by providing an improved system and method for generation of secure electronic documents, which is both efficient and reliable.
In one aspect, the present invention provides a method of document generation comprising the steps of: inputting at least one data input in a first data format; storing at least one mapping template; matching each data input to a mapping template; generating a formattable data object for each data input according to the matched mapping template; outputting the or each formattable data object; converting the or each formattable data object into a second data format; and outputting data in the second data format.
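By way of illustration only, the claimed sequence of steps can be sketched as follows. This is a minimal sketch and not the implementation of the invention: it assumes a delimited file as the first data format and plain text as the second, and the names MAPPING, OUTPUT and generate_documents, together with the column headings, are hypothetical.

```python
import csv
import io
from string import Template

# Hypothetical mapping template: input column -> field of the data object.
MAPPING = {"Emp No": "employee_id", "Net": "net_pay", "Period": "period"}

# Hypothetical output template controlling the second data format (plain text here).
OUTPUT = Template("Payslip for $employee_id ($period): net pay $net_pay")


def generate_documents(delimited_text: str) -> list[str]:
    """Read input in a first format, map it to data objects, output a second format."""
    reader = csv.DictReader(io.StringIO(delimited_text))
    documents = []
    for row in reader:
        # The formattable data object holds only the mapped fields.
        data_object = {field: row[column] for column, field in MAPPING.items()}
        documents.append(OUTPUT.substitute(data_object))
    return documents
```

Because the data object is independent of both the input layout and the output layout, either template could be exchanged without touching the other in this sketch.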
It is understood that the mapping template is a virtual representation of the relative position and expected alphanumeric content of the data input.
Preferably, the first data format of the or each data input is any one or more of a portable document format (pdf); a spreadsheet; a delimited data file; a line listing print file; a fixed length data file; or a tagged file format, such as XML.
Preferably, the second data format is output as any one or more of a paper document; a braille document; an audio output; a display on an electronic device.
Preferably, the data output in the second data format is a financial report and/or a payslip.
The method of the present invention also allows for generation of reports using the individual data fields.
Preferably, the data is output according to an output template.
Preferably, the data output is secure.
Preferably, the data output is encrypted as individual data fields.
The present invention allows for secure extraction of individual elements of data or data fields.
Preferably, at least one data field is stored as a hashed value.
By storing individual data fields as hashed values, the present invention enables fast, secure searching.
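One way in which hashed values could support searching is sketched below. The specification does not name a hash function or say whether it is keyed; the use of HMAC-SHA256, the key SEARCH_KEY, the normalisation rule and all function names are assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret key; the specification does not say how hashes are keyed.
SEARCH_KEY = b"example-key-for-illustration-only"


def hash_field(value: str) -> str:
    """Return a keyed hash of a normalised field value."""
    normalised = value.strip().upper().encode("utf-8")
    return hmac.new(SEARCH_KEY, normalised, hashlib.sha256).hexdigest()


def build_index(records: list[dict]) -> dict[str, list[int]]:
    """Map each hashed employee identifier to the positions of its records."""
    index: dict[str, list[int]] = {}
    for position, record in enumerate(records):
        index.setdefault(hash_field(record["employee_id"]), []).append(position)
    return index


def find_records(index: dict[str, list[int]], employee_id: str) -> list[int]:
    """Look up records by hashing the query, without comparing plaintext values."""
    return index.get(hash_field(employee_id), [])
```

In this sketch a search reduces to a single dictionary lookup on the hash, so the stored identifier never needs to be decrypted to answer an exact-match query.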
Preferably, the at least one data input is arranged in data fields.
Preferably, at least one data field of the at least one data input is a unique identifier.
The method of the present invention allows for rapid and accurate generation of documents with a reduced processing load meaning that the risk of failure is significantly reduced. The data extraction of the present invention is structured such that the method is fast and secure. Furthermore, the targeted solution of the present invention requires much less storage and is more reliable, allowing location, extraction and secure saving of each element of the payslip.
The present invention offers a significant improvement for accurate and reliable generation of secure documents, particularly of electronic documents such as payslips. The method of the present invention is a flexible solution allowing the user to manage the output format of the electronic document regardless of the format of the input from which data is extracted. The present invention is reliable in ensuring that personally identifiable information (PII) is securely handled without the risk of failure.
Preferably, the or each data input is matched to the mapping template using any one or more of the data location on the x-axis; data location on the y-axis; data font; data font size.
Preferably, the or each data input is matched to the mapping template using the alphanumeric content of the data.
In a further aspect the present invention provides a system for document generation comprising: an input for receiving at least one data input in a first data format; at least one mapping template; a processor for matching each data input to a mapping template; an output for outputting a formattable data output for each data input according to the matched mapping template, wherein the data output is in a second data format and the second data format is different from the first data format.
Preferably, the system further comprises a display for displaying the formattable data output.
Preferably, the display is any of a personal computer, tablet, or smart phone.
Within this specification embodiments have been described in a way which enables a clear and concise specification to be written, but it is intended and will be appreciated that embodiments may be variously combined or separated without departing from the invention. For example, it will be appreciated that all preferred features described herein are applicable to all aspects of the invention described herein and vice versa.
Detailed Description
These and other characteristics of the present technology will be more fully understood by reference to the following detailed description in conjunction with the attached drawings, in which:
Figure 1 is a flow diagram showing an overview of the inputs and outputs of the present invention;
Figure 2 is a flow diagram showing the data extraction process from multiple, single page pdf files in an example of the present invention;
Figure 3 is a flow diagram showing the data extraction process from a multiple-page pdf file containing multiple payslips in a further example of the present invention;
Figure 4 is a flow diagram showing the method of data extraction from a single pdf file containing more than one payslip to a page in a further example of the present invention;
Figure 5 is a flow diagram showing the processing steps for a batch of pdf single page files in an example of the present invention;
Figure 6 is a flow diagram showing the processing steps for a multiple-page pdf file containing multiple payslips in an example of the present invention;
Figure 7 is a flow diagram showing the processing steps for data supplied in the form of a spreadsheet;
Figure 8 is a flow diagram showing the processing steps for data supplied in line listing format;
Figure 9 is a flow diagram showing the processing steps for data supplied as a fixed length data file;
Figure 10 is a flow diagram showing the processing steps for data supplied in a tagged file format, for example in XML format;
Figure 11 is a flow diagram showing the processing steps for data supplied in a delimited format, such as csv;
Figure 12 is a flow diagram showing the validation and saving steps of the method of the present invention;
Figure 13 is a schematic overview of data output in accordance with the present invention.
Referring to Figure 1, an overview of the possible data input and data outputs of the system and method of the present invention is shown. The system of the present invention allows the input of data in multiple formats. In the example shown in Figure 1, data is input as any one of a portable document format ("pdf") file 1; a data file 2; a spreadsheet 3; a delimited data file 4; a line listing print file 5; or a tagged file 6. For each format, the method of the present invention analyses the data available to identify the data that needs to be extracted. The data is extracted at step 7 from each file format and stored in a payslip data store 8. The data is partially encrypted to ensure the security of the data stored.
There are various possible output formats; for example, the data can be output as a paper written report or payslip 9, which can be presented using regular sized print or large print. Other possible output formats include an audio file 10; a paper braille report 11; or a software application on a mobile device 13. The mobile device can be a mobile 'phone, a smartphone; a tablet; or other mobile electronic device 13. Alternatively, the data output is sent to a secure website, from which data can be viewed on a visual display unit, for example on a tablet 14, or personal computer 15, or similar device.
Referring to Figure 2, the generation of payslips from multiple, individual pdf files is shown. In the example shown, the multiple pdf files each have a different format. The method starts with uploading of multiple pdfs at step 20. A batch number is allocated to the whole set of uploaded files at step 21 before the pdf file is validated at step 22. The method checks, at step 23, if the pdf file has been validated. That is, the system checks whether each file is a properly formatted pdf file. If the pdf file is not validated, at step 24, the system asks whether it is possible to ascertain employee details from the file; for example the method extracts useful information from the file name; such as the employee number or a National Insurance number or other unique identifier, to identify the employee. If it is not possible to ascertain the employee details from the uploaded file, at step 25, the issue of the invalid pdf is logged, and the pdf file is rejected. The logging of the invalid pdf is reported; for example on a website list of rejections. The invalid pdf will not be useable by the system and will need to be replaced.
If, at step 24, it is possible to ascertain the employee details from the file, the further identifying data that has been gathered is also logged at step 26, in addition to logging the issue of the invalid pdf; the file/pdf is rejected. For example, the system checks whether an employee unique identifier can be extracted from the file. The system then asks, at step 27, whether there are more pdfs to process. If there are more pdfs to process, the method returns to step 22 to validate the pdf files and proceed through step 23 to step 27, until the method reaches step 27 and there are no more pdfs to process.
When there are no more pdfs to process, the method proceeds to step 28 to ask if the system has processed at least one valid pdf file. If it is determined that no valid pdf files have been processed, the method stops at step 30. However, if at step 28, the system has processed at least one valid pdf, the method proceeds to step 31 to flag the batch of files as awaiting passing; that is, to await further processing.
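The validation and rejection branch of Figure 2 (steps 22 to 31) might be sketched as follows. The structural check shown is deliberately simplified, looking only for the "%PDF-" header and an "%%EOF" marker near the end of the file; the file naming convention and the names looks_like_pdf, employee_from_name and triage_batch are assumptions for illustration and are not taken from the specification.

```python
import re
from typing import Optional

# Assumed naming convention, e.g. "payslip_E12345.pdf"; purely illustrative.
EMPLOYEE_ID_IN_NAME = re.compile(r"(E\d{5})")


def looks_like_pdf(content: bytes) -> bool:
    """Cheap structural check: header marker present and EOF marker near the end."""
    return content.startswith(b"%PDF-") and b"%%EOF" in content[-1024:]


def employee_from_name(file_name: str) -> Optional[str]:
    """Try to recover an employee identifier from the file name."""
    match = EMPLOYEE_ID_IN_NAME.search(file_name)
    return match.group(1) if match else None


def triage_batch(files: dict[str, bytes]) -> dict:
    """Split a batch into valid files and logged rejections."""
    valid, rejects = [], []
    for name, content in files.items():
        if looks_like_pdf(content):
            valid.append(name)
        else:
            # Record any identifier recovered from the name alongside the rejection.
            rejects.append({"file": name, "employee_id": employee_from_name(name)})
    # The batch is flagged only if at least one valid pdf was processed.
    status = "awaiting passing" if valid else "stopped"
    return {"valid": valid, "rejects": rejects, "status": status}
```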
The method and system of the present invention is suitable for processing a variety of input pdfs into a payslip output or similar financial report. Multiple pre-determined pdf formats are validated according to pre-set maps or mapping templates. However, it is understood that in alternative embodiments of the present invention the mapping of the data is not taken from pre-set templates but is automatically extracted. The automatic generation of a mapping template is achieved by using common terms to represent fields in the payslip. The automatically-generated map is then ascertained by locating expected field terms within the pdf or data file. In one embodiment of the automatic map generation steps, a separate file; for example, a spreadsheet is supplied in the form of a transcript of the pdf payslip, and the transcript is used to locate the matching fields within the pdf payslip or data file.
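A possible reading of the automatic map generation is sketched below. The vocabulary COMMON_TERMS, the function generate_map and, in particular, the rule that a value is the nearest text object to the right of its label on the same line are assumptions; the specification does not state how a value is located relative to its field term.

```python
# Hypothetical vocabulary of common payslip labels -> canonical field names.
COMMON_TERMS = {
    "net pay": "net_pay",
    "gross pay": "gross_pay",
    "ni number": "ni_number",
    "tax code": "tax_code",
}


def generate_map(text_objects: list[tuple[float, float, str]]) -> dict[str, tuple[float, float]]:
    """Build a field -> (x, y) map from the positions of recognised labels.

    Assumption: each value is the nearest text object to the right of its
    label on the same line.
    """
    mapping: dict[str, tuple[float, float]] = {}
    for x, y, text in text_objects:
        field = COMMON_TERMS.get(text.strip().rstrip(":").lower())
        if field is None:
            continue  # not a recognised field term
        to_the_right = [(ox, oy) for ox, oy, _ in text_objects if oy == y and ox > x]
        if to_the_right:
            mapping[field] = min(to_the_right)  # smallest x, i.e. nearest neighbour
    return mapping
```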
In the embodiments of Figures 2, 3 and 4, the pre-set maps allow the system to locate and examine text objects, i.e. alphanumeric content. Within the pdf file the pre-set maps are arranged and recorded according to the field that is to be located. The field is located according to several properties, including: location on the x-axis; location on the y-axis; font; font size; and alphanumeric content. For example, the system of the present invention uses the x, y co-ordinates stored in the pre-set map to compare and extract data from the input pdf. The required fields are located, identified and then recorded, with optional encryption, to be processed into a bespoke payslip. The output format of the payslip is also mapped and stored as a template.
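Locating a field from the properties listed above can be sketched as follows. The classes TextObject and FieldSpec, the function locate_field and the positional tolerance are hypothetical; the specification names the properties used for matching but not a data structure or a tolerance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextObject:
    """A piece of text found on a page, with its position and styling."""
    x: float
    y: float
    font: str
    size: float
    text: str


@dataclass(frozen=True)
class FieldSpec:
    """One entry of a mapping template (names are illustrative)."""
    name: str
    x: float
    y: float
    font: Optional[str] = None    # None means "do not check"
    size: Optional[float] = None  # None means "do not check"


def locate_field(spec: FieldSpec, page: list[TextObject],
                 tolerance: float = 1.0) -> Optional[str]:
    """Return the text at the mapped position, or None if nothing matches."""
    for obj in page:
        if abs(obj.x - spec.x) > tolerance or abs(obj.y - spec.y) > tolerance:
            continue
        if spec.font is not None and obj.font != spec.font:
            continue
        if spec.size is not None and obj.size != spec.size:
            continue
        return obj.text
    return None
```

Returning None for an unmatched field corresponds to the null value that the method records for later checking.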
Referring to Figure 2, at step 23, if the pdf is validated, the system initialises a new payslip object at step 40 and a lookup field counter is set to 1, at step 41. At step 42, the system locates the field on the individual page of the pdf file according to the x and y co-ordinates that are mapped in the database for the field counter. The system checks whether, using the mapped co-ordinates, the corresponding field on the page has been found. At step 43, if the corresponding field has been found, the method proceeds to store the field data in the payslip object at step 44. For each file that has been validated, the system works through the pre-set template/map to identify each text object field that is included in the map to read and store each piece of data that is required. If the field is not found, then a null value is recorded for later checking.
The system asks at step 45 if the system has read all the fields as expected from the mapping and incrementally increases the lookup field counter at step 46 to repeat the location and storage of field data through steps 42 to 46 until all expected fields have been read as expected from the mapping. The method then proceeds to step 47 and validates and saves the payslip. The steps of validation and saving the payslip are described in more detail with respect to Figure 12. For example, validation includes a check that all required fields have been found and that all identified fields are of the format specified. For example, for a UK payslip, it is common to identify the National Insurance Number and this number must be validated to be in the correct format.
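The field loop and the subsequent validation might look like the sketch below. The pattern used for the National Insurance number is a simplification (two letters, six digits and a suffix letter from A to D) and does not model the prefixes that the official rules exclude; the function and field names are hypothetical.

```python
import re
from typing import Optional

# Simplified National Insurance number shape: two letters, six digits, suffix A-D.
# Official rules also exclude certain prefixes; that is not modelled here.
NI_PATTERN = re.compile(r"^[A-Z]{2}\d{6}[A-D]$")


def extract_fields(template: list[str], found: dict[str, str]) -> dict[str, Optional[str]]:
    """Walk the template in order; missing fields are stored as None (null)."""
    payslip: dict[str, Optional[str]] = {}
    counter = 1  # lookup field counter, starting at 1 as in the flow diagram
    while counter <= len(template):
        name = template[counter - 1]
        payslip[name] = found.get(name)
        counter += 1
    return payslip


def validate(payslip: dict[str, Optional[str]], required: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the payslip passes."""
    problems = [f"missing {name}" for name in sorted(required) if payslip.get(name) is None]
    ni = payslip.get("ni_number")
    if ni is not None and not NI_PATTERN.match(ni.replace(" ", "").upper()):
        problems.append("ni_number has wrong format")
    return problems
```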
After validating and saving the payslip, the system asks whether there are more pdfs to process, at step 48. If there are no more pdfs to process, at step 31, the system flags the batch as awaiting passing. If there are more pdfs to process, the method returns to validate the pdf file, at step 22. At step 31, when the system flags the batch as awaiting passing, the method stops at step 30.
When the last pdf file has been processed, the database record for that batch is updated to record the number of payslip files processed, the number of payslip files rejected, and the number of payslips rejected. The system of the present invention validates and saves the payslips at step 47 by saving extracted data and not saving a copy of the pdf file itself. This represents a significant improvement because the mapping of the present invention is a structured method to extract data and ensure that individual elements of data are secure. This is particularly advantageous in the field of the present invention where secure and efficient extraction is of utmost importance. The data is extracted as per a separate mapping and is then encrypted as individual elements, or data fields. In addition, certain fields are stored as hashed values to enable fast and secure searching. This would not be possible if the payslip were stored as a pdf. The data extraction of the present invention enables the system to produce reports using extracted data. Report generation would not be possible from a payslip stored as a pdf. For example, the present invention allows for searching of data to report all payslips that have a net value above a certain threshold value. It would not be possible to search a pdf payslip in this way. Furthermore, the data extraction method of the present invention also allows for more efficient data storage than would be possible for payslips stored as a pdf. The individually extracted elements of data of the present invention require much less storage space than the original pdf payslip documents.
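The threshold report mentioned above can be illustrated with an in-memory SQLite table. The table layout, the function payslips_above and the sample rows are invented for illustration, and the fields are held in plain form here for brevity, whereas the method describes encrypting individual fields.

```python
import sqlite3


def payslips_above(conn: sqlite3.Connection, threshold: float) -> list[tuple[str, float]]:
    """Report every payslip whose net pay exceeds the threshold."""
    rows = conn.execute(
        "SELECT employee_id, net_pay FROM payslip WHERE net_pay > ? ORDER BY net_pay DESC",
        (threshold,),
    )
    return rows.fetchall()


# Made-up sample data standing in for the payslip data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payslip (employee_id TEXT, period TEXT, net_pay REAL)")
conn.executemany(
    "INSERT INTO payslip VALUES (?, ?, ?)",
    [("E001", "2018-11", 1850.00), ("E002", "2018-11", 2410.50), ("E003", "2018-11", 990.25)],
)
```

A query of this kind operates on one stored field per payslip, which is why it is available for extracted data but not for payslips held only as pdf files.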
Referring to Figure 3, the method of the present invention is described in respect of payslips presented as a single pdf containing multiple payslips, with each payslip on a separate page and having a different output format. In the embodiment described, the layout of the text in the pdf document would have been pre-determined and a mapping template configured in advance to map the fields from the pdf input to the output payslip. However, it is understood that in alternative embodiments of the present invention the mapping of the data is not taken from pre-set mapping templates but is automatically extracted.
The method starts at step 50 and, at step 51, uploads the multiple page pdf, with a batch number allocated to the uploaded file at step 52. At step 53, the system asks whether the entire pdf file is validated; that is, whether the whole file is a properly formatted pdf. If the pdf is successfully validated at step 54, the method proceeds to initialise a new payslip object at step 57. However, if the pdf is not successfully validated the issue is logged and the file is added to a list to be recorded as a reject before the method ends at step 56.
Following the initialising of a new payslip object at step 57, the method sets the lookup field counter to 1. At step 59, the system locates the field on the page as identified by the x and y co-ordinates stored in the mapping template. The method asks at step 60 whether a corresponding field is identified at those co-ordinates. If a corresponding field is identified, the data is stored in the payslip object at step 62. If no corresponding field is identified the method moves directly to ask whether, at step 63, the system has read all the fields as expected from the mapping. If the system has not read all of the fields as expected from the mapping template, at step 64, the lookup field counter is incrementally increased before returning to step 59 and locating the field on the page according to the mapping template. The system uses the mapping template to locate all data stored in memory for each field that matches a field of the mapping template. If a field is matched, the text is read from the pdf for that field and stored in memory for later validation. If a field is not found, a null value is stored in memory for that field ready for later validation. When the last field in the mapping template has been searched for and recorded, a check is made to ensure that all required fields have been found and that all fields have the required format.
At step 63, when the system has read all the fields expected from the mapping template, the method moves to validate and save the payslip data. The method improves efficiency and reliability by validating and saving the extracted data only and does not re-save the entire pdf file.
Thus, data processing is much faster than for data saved in a pdf file. Furthermore, there is an additional benefit that the data storage requirement is significantly reduced. The present invention allows the system to operate more efficiently and also allows data fields to be searchable; for example for report generation. Each field of data that has been extracted from the pdf file is saved in a corresponding field of the formattable data object as defined by the mapping template. Validation and saving of the extracted data is described in more detail with respect to Figure 12. If, at step 65, the system determines that not all fields pass validation, the payslip record is added to the rejected payslip list.
At step 66, the method then asks whether there are more pages to process. If there are more pages to process, the method proceeds to incrementally increase the page counter at step 67 before repeating steps 57 to 66, until the system determines that there are no more pages to process. At this stage, the system updates the database record for the pdf batch to record the number of payslips that have been processed and the number of payslips that have been rejected. At step 66, when it is found that there are no more pages to process, the system proceeds to flag the file as awaiting passing at step 68, which indicates that the batch is ready for proof reading and checking, and the method stops at step 69.
Referring to Figure 4, the method of the present invention is described with respect to a multiple-page, single pdf that contains multiple payslips per pdf page. In the example shown, each pdf page has a different output format. In the embodiment described, the layout of the text in the pdf document would have been pre-determined and a mapping template configured to map the fields from the pdf input to the formattable data object for later conversion into an output payslip. However, it is understood that in alternative embodiments of the present invention the mapping of the data is not taken from pre-set templates but is automatically extracted.
Referring to Figure 4, the method starts at step 70 and at step 71, the multiple page pdf is uploaded. At step 72, a batch number is allocated to the uploaded file and, at step 73, the whole file is validated to ensure that it is a properly formatted pdf file. If the system determines, at step 74, that the pdf is not valid, at step 75, the file is recorded as an invalid file and added to a list of rejected files. The method would then stop at step 76.
If the pdf is validated at step 74, the method proceeds to step 77 to initialise offsets. That is, when processing a pdf that includes multiple payslips on a single page, the present invention refers to the mapping to derive how many payslips are expected on each pdf page and where the top y co-ordinate is for each of those payslips. The offset enables the system to record the position of the first payslip on the pdf page. The mapping template is configured to locate the data to extract from each payslip. The method of the present invention uses the same mapping template for each payslip on the page and moves from one payslip to the next according to the offset amount that is recorded. The offset amount corresponds to the y co-ordinate distance between payslips. For each payslip of the multiple payslips per page, the system at step 78 initialises a new payslip object before, at step 79, setting the lookup field counter to 1. At step 80, the method uses a pre-determined mapping template to iterate through each text object field included in the template, using co-ordinates saved therein, to locate the associated field on the pdf page by matching the co-ordinates of the mapping template to the same co-ordinates on the pdf page. It is understood that the mapping template can use x co-ordinates; y co-ordinates; data font; data font size; and/or the alphanumeric content of the data to match the template to the data input. If, at step 81, the system finds a corresponding field on the pdf page, the data found is read and stored as a payslip object at step 82. If, at step 81, no corresponding field is found on the page, a null value is recorded for later validation. In the embodiment described, the layout of the text in the pdf document would have been pre-determined and a mapping template configured to map the fields from the pdf input to the data object output; for example to be converted into an output payslip.
However, it is understood that in alternative embodiments of the present invention the mapping of the data is not taken from pre-set templates but is automatically extracted.
When the last field of the mapping template has been searched for and recorded, the method proceeds, at step 83, to ask whether all expected fields from the mapping template have been read. The system also checks whether the data that has been read and recorded is in the correct format. If the format is incorrect, the data is added to the rejected payslip list for later checking. If not all fields have been read, the lookup field counter is incrementally increased at step 84 before the method repeats steps 80 to 83 until all fields expected from the mapping template have been read. The system then validates and saves the payslip data at step 85. The method improves efficiency and reliability by validating and saving the extracted data only and does not re-save the entire pdf file. Each field of data that has been extracted from the pdf file is saved in a corresponding database field as defined by the mapping template. Validation and saving of the data is described in more detail with respect to Figure 12. If, at step 85, the system determines that not all fields pass validation, the payslip record is added to the rejected payslip list.
At step 86, the system asks whether there are more payslips on the page and, if the answer is yes, the method proceeds to step 87 to offset the y co-ordinate of the mapping template to the next payslip position before initialising a new payslip object at step 78. The system uses the offset recorded at step 77 to move to the next data object on the same page. The method then repeats steps 79 to 85 to ensure that data is also extracted from any further data objects/payslips on the page. This is repeated until, at step 88, it is determined that there are no more pages to process. At step 89, the system then flags the file as awaiting passing before the method stops at step 90.
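The use of a y offset to reuse one mapping template for several payslips on a page can be sketched as follows. The sketch assumes that template y values are measured downward from the top of each payslip and that page y values decrease down the page; the page representation and the name extract_stacked_payslips are hypothetical.

```python
from typing import Optional

TextAt = dict[tuple[float, float], str]  # (x, y) -> text found on the page


def extract_stacked_payslips(page: TextAt, template: dict[str, tuple[float, float]],
                             top_y: float, offset: float,
                             count: int) -> list[dict[str, Optional[str]]]:
    """Apply one template to `count` payslips stacked down a single page.

    Template entries are (x, distance below the top of the payslip);
    `offset` is the vertical distance between consecutive payslips.
    """
    payslips = []
    for index in range(count):
        origin_y = top_y - index * offset  # shift the template to the next payslip
        payslip = {name: page.get((x, origin_y - dy)) for name, (x, dy) in template.items()}
        payslips.append(payslip)
    return payslips
```

Only the origin changes between payslips, so a single template serves every payslip on the page, and a field that is absent is again recorded as None.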
Referring to Figure 5, in an alternative embodiment of the method and system of the present invention, multiple, single-page pdf payslips are processed to be made available to the employee without alteration. For example, an employer may wish to make an employee's pay history available through a mobile application or through an e-payslip website. This embodiment of the present invention allows for prior-issued payslips to be made available unchanged and ensures that all payslips are associated with the correct employee.
The method of Figure 5 starts at step 100 and multiple pdfs are uploaded at step 101. Batch numbers are allocated to the whole set of the uploaded files at step 102 before the system validates the pdf files at step 103. The system checks whether the pdf validates at step 104 to ensure that the files are properly formatted. If the pdf does not validate, the system checks at step 105 whether it is possible to ascertain the employee details, i.e. an employee unique identifier, from the file. At step 106, if it is not possible to ascertain the employee details from the file, the issue is logged, and the file is added to the list of rejects before checking whether there are more pdfs to process. If it is possible to ascertain the employee details from the file, these details are recorded together with the invalidity of the pdf and the payslip is added to a list of rejects at step 108.
Following both steps 106 and 108, the system checks whether there are more pdfs to process, at step 107. If there are no more pdfs to process, the system checks whether at least one valid pdf has been processed, at step 117. If no valid pdf has been processed, the method stops at step 120. If at least one valid pdf has been processed, the system flags the file as awaiting passing, which indicates that the batch is ready for proof reading and checking, and the method stops at step 120. If at step 107, the system detects that there are more pdfs to process, the method returns to step 103 to validate a pdf file.
At step 104, if the pdf is validated, a new data object is initialised at step 109. A mapping template is used to set the lookup field counter to the employee identifier field. At step 111, the employee identifier field is located on the page by locating the x, y co-ordinates recorded in the mapping template. The system asks at step 112 if the corresponding field on the single page pdf has been identified. If the field has not been identified, at step 113 the issue is logged, and the payslip is added to a list of rejects. If, at step 112, a corresponding field is found on the page of the pdf, the payslip object is updated with the employee identifier and the pdf file name. The pdf file is then saved in the database at step 115. A record is also inserted into the database for the payslip at step 116.
The method steps 103 to 116 are repeated until, at step 118 the system identifies that there are no more pdfs to process. The method then proceeds to step 119 to flag the last file as awaiting passing, which indicates that the batch is ready for proof reading and checking, and the method stops at step 120. The system also records the number of payslip files that have been processed, the number of payslip files that have been rejected and the number of payslips rejected.
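The batch flow of Figure 5 can be sketched as follows. This is a simplified illustration rather than the claimed implementation: each pdf is represented by a dictionary of (x, y) co-ordinates to text, with `None` standing in for a file that failed validation, and the names `BatchResult` and `process_batch` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]


@dataclass
class BatchResult:
    accepted: Dict[str, str] = field(default_factory=dict)        # file name -> employee id
    rejects: List[Tuple[str, str]] = field(default_factory=list)  # (file name, reason)
    status: Optional[str] = None


def process_batch(files: Dict[str, Optional[Dict[Coord, str]]],
                  id_coord: Coord) -> BatchResult:
    """Associate each valid single-page pdf with an employee identifier."""
    result = BatchResult()
    for name, page in files.items():
        if page is None:                       # pdf did not validate
            result.rejects.append((name, "invalid pdf"))
            continue
        employee_id = page.get(id_coord)       # look up the mapped co-ordinates
        if employee_id is None:
            result.rejects.append((name, "employee identifier not found"))
            continue
        result.accepted[name] = employee_id    # payslip object: id + file name
    if result.accepted:                        # at least one valid pdf processed
        result.status = "awaiting passing"
    return result


files = {"a.pdf": {(50, 30): "E001"}, "b.pdf": None, "c.pdf": {(0, 0): "header"}}
result = process_batch(files, id_coord=(50, 30))
```

The counts of processed and rejected files mentioned above follow directly from `len(result.accepted)` and `len(result.rejects)`.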
Referring to Figure 6, in a further alternative embodiment of the present invention a multiple page, pdf document containing multiple individual payslips is processed to present payslips as an extracted but unchanged pdf document. The method of the present invention ensures that the extracted payslips are associated with the correct employee. The method starts at step 130 and at step 131 uploads the multiple page pdf file. At step 132 a batch number is allocated to the uploaded file before the system validates the pdf file at step 133 to check that the pdf file is a properly formatted pdf file. If the pdf does not validate at step 134, the system logs the issue at step 135 and adds the file to a list of rejects before the method stops at step 136.
At step 134, if the pdf file is validated the system sets the lookup field to the employee identifier field at step 137 before initialising a new payslip/data object at step 138. At step 139, the system uses a mapping template to locate the required field on the page using the x, y coordinates mapped in a database for the field counter. The mapping template may also use data font; data font size; and/or the alphanumeric content of the data to match the mapping template to the input data. The method asks at step 140 whether the employee identifier was found on the page. If the employee identifier was not found, the issue is logged and the file is added to a list of rejects at step 149. The system moves on to step 145 to ascertain if there are more payslips to process. At step 140, if the employee identifier was found on the page, the payslip object is updated with the employee identifier and the pdf file name, at step 141.
The system then extracts the payslip page to its own pdf, at step 142 and the pdf is saved in the database at step 143. At step 144, the payslip record is inserted into the database for the payslip. The system then asks at step 145 whether there are more pages to process. If there are more pages of the multiple page pdf to process, then the page counter is incrementally increased at step 146 and a new payslip object is initialised at step 138. The method repeats steps 138 to 145 until there are no more pages to process. The system then flags the file as awaiting passing, which indicates that the batch is ready for proof reading and checking, and the method stops at step 148.
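A minimal sketch of the per-page loop of Figure 6 is given below. It assumes that each page of the multiple page pdf has already been reduced to a dictionary of (x, y) co-ordinates to text, and it uses the alphanumeric content of the field (a hypothetical identifier format `E` followed by three digits) as the additional match criterion. The output file naming scheme is also an assumption made for the example.

```python
import re
from typing import Dict, List, Tuple

Coord = Tuple[int, int]
ID_PATTERN = re.compile(r"E\d{3}")   # hypothetical employee identifier format


def split_by_employee(pages: List[Dict[Coord, str]], id_coord: Coord):
    """Return one record per page with a recognised identifier, plus rejected pages."""
    records, rejects = [], []
    for page_no, page in enumerate(pages, start=1):   # page counter
        text = page.get(id_coord, "")
        if ID_PATTERN.fullmatch(text):
            records.append({
                "employee_id": text,
                "page": page_no,
                "file_name": f"{text}_page{page_no}.pdf",  # assumed naming scheme
            })
        else:
            rejects.append(page_no)                   # identifier not found
    return records, rejects


pages = [{(50, 30): "E001"}, {(50, 30): "n/a"}, {(50, 30): "E003"}]
records, rejects = split_by_employee(pages, (50, 30))
```

Extracting the page itself into its own pdf would require a pdf library and is deliberately left out of the sketch.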
Referring to Figures 1 and 7, the method and system of the present invention also accepts data supplied in spreadsheet format. At step 150 the method starts, and the spreadsheet is then uploaded at step 151. At step 152, a batch number is allocated to the uploaded file before the spreadsheet file is validated at step 153. The extracted data is validated according to a prior agreed data format. If the system determines that the spreadsheet has not been validated, the issue is logged at step 155, and the file is added to a list of rejects before the method stops at step 156.
If the spreadsheet is validated at step 154, the method proceeds to initialise a new payslip object at step 157 before a row counter is set according to the first data row at step 158. The system asks, at step 159, if data was found in that row. If no data was found in the row, the method asks at step 160 whether there are more rows to process. The system of the present invention systematically reads each row of data to identify, extract and store all required data. If there are no more rows to process at step 160, the method proceeds to ask if there are more worksheets to process at step 167. If there are no more worksheets to process the file is flagged as awaiting passing at step 169. If, at step 160, the system indicates that there are more rows to process, the row counter is incrementally increased at step 161 before returning to step 159 to ask if data was found in that row.
If at step 159, data is found in that row, the system reads all the data from the row by matching the row of data to a mapping template in step 162. The mapping template for a spreadsheet is pre-determined to map the rows and columns of data in the spreadsheet and identify the data fields required for payslip generation. The system then stores the field data in a payslip object at step 163 before validating and saving the payslip at step 164, which is described in more detail with respect to Figure 12. The method then asks if there are more rows to process at step 165. If there are no more rows to process the method moves to ask, at step 167, whether there are more worksheets in the spreadsheet to process. If there are no more worksheets to process the file is flagged as awaiting passing at step 169.
If at step 165, the system detects that there are more rows of the spreadsheet to process, the row counter/index is incrementally increased at step 166 before the method continues through steps 159 to 164 after asking whether any data was found in that row. This process is repeated until all rows with data have been read and data extracted and stored in the payslip object.
If, at step 167, the system detects that there are more worksheets to process, the worksheet counter/index is incrementally increased at step 168 before a new payslip object is initialised at step 157 and the method returns to step 158 with the row counter set to the first data row before proceeding through steps 159 to 167 as previously described. When it is determined that there are no more rows and no more worksheets to process, all required files are flagged as awaiting passing at step 169, which indicates that the batch is ready for proof reading and checking, and the method stops at step 170.
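The worksheet and row loops of Figure 7 can be sketched as below. The sketch assumes the spreadsheet has already been loaded into plain Python lists (one list of rows per worksheet) and that the mapping template is a dictionary of field names to column indexes. `read_workbook` and `column_map` are hypothetical names.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def read_workbook(workbook: Dict[str, List[Row]],
                  column_map: Dict[str, int],
                  first_data_row: int = 1) -> List[Dict[str, Optional[str]]]:
    """Read every non-empty data row of every worksheet into a payslip dictionary."""
    payslips = []
    for sheet_name, rows in workbook.items():        # worksheet counter
        for row in rows[first_data_row:]:            # row counter, skipping the header
            if not any(cell not in (None, "") for cell in row):
                continue                             # no data found in this row
            payslip = {"worksheet": sheet_name}
            for field, col in column_map.items():    # match row to mapping template
                payslip[field] = row[col]
            payslips.append(payslip)
    return payslips


workbook = {
    "January": [["ID", "Net"], ["E001", "1500.00"], [None, None], ["E002", "1725.50"]],
    "February": [["ID", "Net"], ["E001", "1510.00"]],
}
payslips = read_workbook(workbook, {"employee_id": 0, "net_pay": 1})
```

Empty rows are skipped rather than treated as the end of the data, which mirrors the check at step 159.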
Referring to Figures 1 and 8, for older payslip systems the payslip data is stored as a line listing (print file format), which is typically printed using a band or dot matrix printer. The system and method of the present invention can map and extract data from a line listing as described with respect to Figure 8.
The method of converting line listing data starts at step 190 before the line listing is uploaded at step 191. A batch number is allocated to the uploaded file at step 192 and each row of text is read. The system proceeds to validate the uploaded file at step 193 to check that the data is in the correct format and that all required fields have been found within the page. If the file is not validated at step 194, the system logs the issue and adds the file to a list of rejects at step 195 before the method stops at step 196.
If, at step 194, the file is validated the method proceeds to set the line index to 1, at step 197. A new payslip object is initialised at step 198 and the system reads the line of text corresponding to the current index line, at step 199. The system consults the mapping template to identify the field counters mapped to the current index row, at step 200. At step 201, the system then updates the payslip object fields for each field identified from the mapping template in the current row of the line listing. Each field in the print file corresponding to the fields of the mapping template is saved.
At step 202, the method checks whether all expected fields have been read from the mapping template. If not all expected fields have been read, then the line index is incrementally increased before steps 199 to 202 are repeated. If, at step 202, all expected fields have been read from the mapping template, the method proceeds to step 204 to validate and save the payslip. The method then checks at step 205 whether there are more pages to process. If there are more pages to process, the page counter is incrementally increased at step 206 before the method repeats steps 197 to 205 for a new line and payslip object. If, at step 205, the system notes that there are no more pages to process the file is flagged as awaiting passing at step 207. This indicates that the batch is ready for proof reading and checking, and the method stops at step 208. The database record is also updated for the batch of line listing data to record the number of payslips processed and the number of payslips rejected.
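The line-by-line reading of Figure 8 can be sketched as follows. The sketch assumes one payslip per page, pages separated by a form feed character, and a mapping template that maps a line number to the fields found on that line with their start and end columns. All names and the sample layout are hypothetical.

```python
from typing import Dict, List, Tuple

# line number within a page -> [(field name, start column, end column)]
Template = Dict[int, List[Tuple[str, int, int]]]


def read_line_listing(text: str, template: Template) -> List[Dict[str, str]]:
    """Extract one payslip per page from a print file using column positions."""
    expected = {name for fields in template.values() for name, _, _ in fields}
    payslips = []
    for page in text.split("\f"):                    # assumed page separator
        payslip: Dict[str, str] = {}                 # new payslip object
        for line_index, line in enumerate(page.splitlines(), start=1):
            for name, start, end in template.get(line_index, []):
                payslip[name] = line[start:end].strip()
            if expected.issubset(payslip):           # all expected fields read
                break
        if expected.issubset(payslip):
            payslips.append(payslip)
    return payslips


listing = "EMP E001  SMITH\nNET    1500.00\n\fEMP E002  JONES\nNET    1725.50\n"
template = {1: [("employee_id", 4, 8), ("surname", 10, 15)],
            2: [("net_pay", 4, 14)]}
payslips = read_line_listing(listing, template)
```

Pages on which not every expected field is found are simply dropped here; in the described method they would be logged and rejected.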
Referring to Figures 1 and 9, the system and method of the present invention is suitable for converting fixed length data files to an agreed data output, for example as an electronic payslip. A fixed length data file is understood to be a fixed unit of memory/number of bytes and each field within that data is a fixed unit of memory/number of bytes. The method and system of the present invention reads the data file and extracts data as required.
Referring to Figure 9, there is shown a method of processing a payslip file that is supplied as a fixed length data file format. The method starts at step 220 and the fixed length data file is uploaded at step 222. The payslip fixed length data file that has been uploaded is given a batch number at step 223 and is then processed systematically by reading each record. At step 224 the fixed length file is validated to check that the data is in the expected format. The method also checks that all required fields have been found within the record. At step 225 if the file is not validated, the method proceeds to log the issue and add the invalid payslip record to a reject list before the method ends at step 227.
At step 225 if the file is validated, the method proceeds to set the record index to 1 at step 228 and initialise a new payslip object at step 229. The fixed length data file is then matched to a mapping template to identify the required fields for extraction. At step 230, the extracted data is recorded at the previously set index and, at step 231, is input into the corresponding field of the payslip object as defined in the mapping template.
The method proceeds to check at step 232 whether all fields set out in the mapping template have been read from the validated fixed length file. If not all fields have been read, the record index is incrementally increased at step 233. The method then initialises a new payslip object and steps 229 to 232 are repeated until all expected fields that are set out in the mapping template have been read. The method then proceeds to validate and save the payslip object at step 234, which is described in more detail with respect to Figure 12.
The method asks whether there are more records to process at step 235. If there are more records to process, the method returns to step 233 to incrementally increase the record index and repeat steps 229 to 234 until there are no more records to process at step 235. The system then flags the file as awaiting passing at step 236. This indicates that the batch is ready for proof reading and checking, and the method stops at step 237. The database record is also updated for the batch of fixed length data to record the number of payslips processed and the number of payslips rejected.
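Reading a fixed length data file reduces to slicing each record at known byte offsets. The sketch below assumes a hypothetical 22-byte record layout (`LAYOUT`) encoded as ASCII; neither the layout nor the function name comes from the specification.

```python
from typing import Dict, List, Tuple

# field name -> (offset, length) in bytes; hypothetical record layout
LAYOUT: Dict[str, Tuple[int, int]] = {
    "employee_id": (0, 4),
    "surname": (4, 10),
    "net_pay": (14, 8),
}
RECORD_LENGTH = 22


def read_fixed_length(data: bytes) -> List[Dict[str, str]]:
    """Split the file into fixed-size records and slice each field by offset."""
    if len(data) % RECORD_LENGTH != 0:
        raise ValueError("file length is not a multiple of the record length")
    payslips = []
    for start in range(0, len(data), RECORD_LENGTH):   # record index
        record = data[start:start + RECORD_LENGTH]
        payslips.append({
            name: record[off:off + size].decode("ascii").strip()
            for name, (off, size) in LAYOUT.items()
        })
    return payslips


# Build two sample records with explicit padding so the widths are exact.
data = ("E001" + "SMITH".ljust(10) + "1500.00".rjust(8)
        + "E002" + "JONES".ljust(10) + "1725.50".rjust(8)).encode("ascii")
payslips = read_fixed_length(data)
```

The length check plays the role of a basic format validation before any record is read.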
Referring to Figures 1 and 10, the system of the present invention is also used to extract data from tagged files, such as XML or HTML.
Referring to Figure 10, the method starts at step 250 and the tagged file is uploaded at step 251. A batch number is allocated to the uploaded file at step 252 and the file is processed by systematically reading each payslip element. At step 253, the uploaded file is then parsed before being validated at step 254. The system validates the uploaded file by checking against the required data format and a check is made to ensure that all required fields have been found. If the data format or fields are not as required, the tagged file is not validated at step 254 and the issue is logged, and the file added to a list of rejects at step 255 before the method stops at step 256.
At step 254, if the uploaded tagged file is validated, the method proceeds to step 257 to initialise a new payslip object and at step 258 to locate the payslip start tag. The mapping for the tagged file will include a reference to a defined start tag, which can be in a variety of formats. The method checks at step 259 whether the start tag has been found. If the start tag has not been found the system checks whether it is at the end of the file. If no start tag has been found and the file is not at the end, the issue is logged at step 261 and the file is added to a list of rejects because the tagged file is detected to not be well formed. At step 260, if the end of the file has been reached, the file is flagged as awaiting passing at step 262.
If, at step 259, the start tag has been found, the method proceeds to step 263 to read through the tagged file elements and match the file elements to the mapping template to extract the field data and store the data in the payslip object until the end tag of the payslip file is reached. When the end tag is reached the method proceeds to step 264 to validate and save the payslip data. The XML tags and attributes have previously been saved during the initial mapping set-up process, mapped to the payslip object fields and stored in the configuration settings for the file template. Thus, the method is able to produce payslips in multiple languages. By way of example, within a payslip a tag is set up called "Net Pay" with an attribute "Net Pay Display". In an English-language payslip file, the value of that attribute would be "Net Pay" but in a French-language payslip the value would be "Salaire Net". In addition to the "Net Pay Display" attribute, the tag would also have an element, for example "Amount" that will have a value for the Net Pay to be displayed.
After validation and saving of the payslip, which is described in more detail with respect to Figure 12, the system checks if the end of the file has been reached at step 265. If the end of the file has not been reached the method returns to step 257 to initialise a new payslip object and repeat steps 258 to 266 until the end of the file is reached and the file is awaiting passing. This indicates that the batch is ready for proof reading and checking, and the method stops at step 267. The database record is also updated for the batch of tagged files to record the number of payslips processed and the number of payslips rejected.
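The tagged-file extraction, including the language-dependent display attribute, can be sketched with Python's standard `xml.etree.ElementTree` module. XML names cannot contain spaces, so the sketch uses the hypothetical names `NetPay`, `Display` and `Amount` in place of "Net Pay", "Net Pay Display" and "Amount"; the `Payslip` start tag and the sample document are also assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

SAMPLE = """<Payslips>
  <Payslip>
    <EmployeeId>E001</EmployeeId>
    <NetPay Display="Net Pay"><Amount>1500.00</Amount></NetPay>
  </Payslip>
  <Payslip>
    <EmployeeId>E002</EmployeeId>
    <NetPay Display="Salaire Net"><Amount>1725.50</Amount></NetPay>
  </Payslip>
</Payslips>"""


def read_tagged(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Extract one payslip dictionary per <Payslip> element."""
    root = ET.fromstring(xml_text)            # raises ParseError if not well formed
    payslips = []
    for element in root.iter("Payslip"):      # assumed payslip start tag
        net_pay = element.find("NetPay")
        if net_pay is None:
            continue                          # required element missing
        payslips.append({
            "employee_id": element.findtext("EmployeeId"),
            "net_pay_label": net_pay.get("Display"),   # language-specific label
            "net_pay": net_pay.findtext("Amount"),
        })
    return payslips


payslips = read_tagged(SAMPLE)
```

Because the label travels with the data as an attribute, the same mapping serves both the English and the French payslip.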
Referring to Figures 1 and 11, the system of the present invention is also used to extract data from delimited data files.
Figure 11 shows the method of processing payslip data supplied in a delimited file format.
The method starts at step 280 and the delimited file is uploaded at step 281. The uploaded file is allocated a batch number at step 282 before the delimited file is validated at step 283. The method validates the delimited file by checking the format of the data and the fields on each payslip. If the delimited file is not validated, at step 285 the system proceeds to log the issue and add the file to a list of rejects before the method stops at step 286.
If, at step 284, the file is validated the method initialises a new payslip object at step 287 before a row counter is set to the first data row at step 288. The system asks whether data is found in that row at step 289. If no data is found, the method checks whether there are more rows to process in step 290. If no more rows are identified for processing, the method proceeds to step 293 to flag the file as awaiting passing. If the system identifies at step 290 that there are more rows to process, the row counter is incrementally increased at step 291 before checking if data is found in that row at step 289.
At step 289, when data is found in the row, the system matches the data found to the mapping template and reads all the data in the row at step 294. The field data identified by matching to the mapping template is stored in the payslip/data object at step 295. The payslip/data object is then validated and saved at step 296, which is described in more detail with respect to Figure 12. The method then checks at step 297 if there are more rows to process. If there are more rows to process the method proceeds to incrementally increase the row counter at step 298. The steps 289 to 296 are then repeated until, at step 297, there are no more rows to process and the file is flagged, at step 293, as awaiting passing. This indicates that the batch is ready for proof reading and checking, and the method stops at step 299. The database record is also updated for the batch of delimited files to record the number of payslips processed and the number of payslips rejected.
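The delimited-file loop of Figure 11 can be sketched with Python's standard `csv` module. The pipe delimiter, the column positions and the function name `read_delimited` are assumptions made for the example.

```python
import csv
import io
from typing import Dict, List


def read_delimited(text: str, column_map: Dict[str, int],
                   delimiter: str = "|") -> List[Dict[str, str]]:
    """Read each non-empty row and pick out the mapped columns."""
    payslips = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row in reader:                               # row counter
        if not any(cell.strip() for cell in row):
            continue                                 # no data found in this row
        payslips.append({name: row[col].strip()
                         for name, col in column_map.items()})
    return payslips


text = "E001|SMITH|1500.00\n\nE002|JONES|1725.50\n"
payslips = read_delimited(text, {"employee_id": 0, "net_pay": 2})
```

Only the columns named in the mapping are kept, so unmapped data such as the surname in this sample never reaches the payslip object.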
Referring to Figure 12, the validation of the extracted payslip object is described in more detail. The format of the payslip/data object is known to the system due to previous data mapping that enables the system to know what fields will be on the payslip object and where they will be. The expected fields vary between mapping templates, which allows the system of the present invention to adapt to user requirements. Data mapping is created for each mapping template configured for the system, which allows for differences between payslip objects that are output. As previously described with respect to Figures 2 to 11, only fields that have been matched to a mapping template are extracted from the input file and stored within the payslip/data object. It is understood that in alternative embodiments of the present invention, the mapping of the data is not taken from pre-set templates but is automatically extracted. The automatic generation of a mapping template is achieved using common terms to represent fields in the payslip. The automatically-generated map is then ascertained by locating expected field terms within the pdf or data file. In one possible embodiment of the automatic map generation, a separate file, for example a spreadsheet, is supplied in the form of a transcript of the pdf payslip, and the transcript is used to locate the matching fields within the pdf payslip or data file.
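One way to picture the automatic map generation is a search for common field terms in the text of a payslip. The sketch below is an illustration under stated assumptions only: the list of terms (`COMMON_TERMS`), the line-and-column form of the resulting map and the function name are all hypothetical.

```python
from typing import Dict, List, Tuple

# field name -> labels commonly used for that field (hypothetical list)
COMMON_TERMS: Dict[str, List[str]] = {
    "employee_id": ["Employee No", "Emp ID"],
    "net_pay": ["Net Pay"],
}


def generate_map(lines: List[str]) -> Dict[str, Tuple[int, int]]:
    """Map each field to the (line, column) where one of its terms first appears."""
    mapping: Dict[str, Tuple[int, int]] = {}
    for row, line in enumerate(lines):
        for field, terms in COMMON_TERMS.items():
            if field in mapping:
                continue                      # keep the first location found
            for term in terms:
                col = line.find(term)
                if col != -1:
                    mapping[field] = (row, col)
                    break
    return mapping


lines = ["Employee No: E001   Name: SMITH",
         "Gross Pay: 2000.00",
         "Net Pay: 1500.00"]
mapping = generate_map(lines)
```

A field that matches none of its terms is left out of the map and would need manual mapping.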
The validation process starts at step 300 and at step 301 the field index is set to 1. The system then checks the look up field specification in the database. The database is checked to see if the fields of the payslip object are defined in the required format. If the field is in the required format the method proceeds to step 304 to ask whether the supplied field data meets the required specification. If the supplied fields do not match the required specification the system records the issue with the field at step 305. If the supplied field does match the required specification, the field counter is incrementally increased at step 306. The method then asks whether the expected end of the field list has been reached. If the expected end of the field list has not been reached the field index look up is incrementally increased at step 314 and the method returns to look up the field specification in the database, repeating steps 302 to 307 until the end of the expected field list is reached.
At step 307 when the end of the expected field list has been reached, the method proceeds to check whether the inspected field counter equals the number of expected fields. If the inspected field counter does not equal the expected fields at step 308, the method checks whether there is an employee identifier for the payslip at step 309. If there is an employee identifier then this is recorded at step 312 with the rejected payslip and the system updates the fields in the database for that current payslip at step 313 before returning the payslip object as "not validated", which is understood to be recorded as "false" for the validation. After indicating that the payslip object is not validated the method proceeds as previously described with respect to Figures 2 to 11.
At step 308, if the inspected field counter is equal to the number of expected fields, the system checks at step 320 whether any fields were recorded as having an issue. If any issues with the payslip object have previously been recorded the method proceeds to step 309 to ask whether there is an employee identifier for this payslip. If there is no employee identifier, the payslip object is recorded as rejected and returned as not validated (steps 310 and 311) before the method proceeds as previously described with respect to Figures 2 to 11.
At step 308, if the inspected field counter is equal to the number of expected fields and no issues were recorded and noted at step 320, the system updates all fields in the database for the current payslip object at step 321 and the payslip object is returned as validated or "true" at step 322 to proceed as previously described with respect to Figures 2 to 11.
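The validation of Figure 12 can be sketched as a loop over the expected fields that counts the fields inspected and records any field that fails its specification. This is a simplification of the flow described above: the field specifications are expressed as regular expressions (`FIELD_SPECS`, a hypothetical stand-in for the database look-up), and a field is counted as inspected whenever it is present.

```python
import re
from typing import Dict, List, Tuple

# field name -> pattern the value must match (hypothetical specification)
FIELD_SPECS: Dict[str, str] = {
    "employee_id": r"E\d{3}",
    "net_pay": r"\d+\.\d{2}",
}


def validate_payslip(payslip: Dict[str, str]) -> Tuple[bool, List[str]]:
    """Return (validated, fields with issues) for one payslip object."""
    inspected = 0
    issues: List[str] = []
    for name, pattern in FIELD_SPECS.items():     # walk the expected field list
        value = payslip.get(name)
        if value is None:
            continue                              # missing field is not counted
        inspected += 1
        if not re.fullmatch(pattern, value):
            issues.append(name)                   # record the issue with the field
    # Valid only if every expected field was inspected and none had an issue.
    valid = inspected == len(FIELD_SPECS) and not issues
    return valid, issues


ok = validate_payslip({"employee_id": "E001", "net_pay": "1500.00"})
bad = validate_payslip({"employee_id": "E001", "net_pay": "abc"})
missing = validate_payslip({"employee_id": "E001"})
```

The boolean result corresponds to the "true"/"false" value returned to the calling extraction method.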
Referring to Figure 13, the system of the present invention takes the validated and saved payslip object data, as previously described with respect to Figures 1 to 12 and processes the data from a range of input formats to output data in a format required by a user. The payslip data, at step 400 can be output at step 401 as a paper document, including as a braille paper document, or as an audio output. If required, the payslip data can be output via a firewall 401 to the internet 403 to be accessed via a mobile application 404. The mobile application 404 can be of any format that is required by a user. Alternatively, the payslip data can be accessed by an "e-payslip" website 405, where the payslip object data is presented in a payslip or as a report. The system and method of the present invention allows for the payslip object data to be presented in any format required and the output can be re-formatted later and supplied to a user in different formats and layout; for example, to be presented as a report or a payslip or a data summary.
It should be understood that various changes and modifications to the presently preferred embodiments described herein will be apparent to those skilled in the art. Such changes and modifications can be made without departing from the spirit and scope of the present invention and without diminishing its attendant advantages. It is therefore intended that such changes and modifications are covered by the appended claims.
GB1819701.2A 2018-12-03 2018-12-03 System and method for generating secure electronic documents Withdrawn GB2579564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1819701.2A GB2579564A (en) 2018-12-03 2018-12-03 System and method for generating secure electronic documents

Publications (2)

Publication Number Publication Date
GB201819701D0 GB201819701D0 (en) 2019-01-16
GB2579564A true GB2579564A (en) 2020-07-01

Family

ID=65024786

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1819701.2A Withdrawn GB2579564A (en) 2018-12-03 2018-12-03 System and method for generating secure electronic documents

Country Status (1)

Country Link
GB (1) GB2579564A (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None *

Also Published As

Publication number Publication date
GB201819701D0 (en) 2019-01-16


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)