CN111027285B - Method and system for automatically extracting order information from pdf format order

Info

Publication number
CN111027285B
Authority
CN
China
Prior art keywords
order
information
page
file
pdf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911297269.XA
Other languages
Chinese (zh)
Other versions
CN111027285A (en)
Inventor
曾振环
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Upstream Software Co ltd
Original Assignee
Nanjing Upstream Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Upstream Software Co ltd
Priority to CN201911297269.XA
Publication of CN111027285A
Application granted
Publication of CN111027285B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to the technical field of pdf document editing and regular-expression text processing, and discloses a method for automatically extracting order information from an order file in pdf format. The method comprises the following steps: parsing a customer order file in pdf format; merging the paged, blocked information, sorted by character string position, line by line into a plain text file; and capturing and extracting the key order information from the merged plain text file. With this method and system, when an operator at a foreign trade company imports the order information in a customer's pdf file into the company's foreign trade information management database, the detailed order information can be extracted and imported into the company database automatically by a program. This greatly improves the operator's working efficiency with the foreign trade information management system, saves a large amount of time and improves the user experience.

Description

Method and system for automatically extracting order information from pdf format order
Technical Field
The invention relates to the technical field of pdf document editing and regular-expression text processing, and in particular to a method and a system for automatically extracting order information from pdf format orders.
Background
The pdf format is a file format widely adopted internationally. Because its content cannot be freely edited and it is highly universal and standardized, customers in business activities, especially overseas customers in the foreign trade field, generally use pdf files to transmit important information such as order forms. In a foreign trade company, a salesperson is usually required to enter the detailed order information from a pdf file, sent by a customer as an email attachment, into the company's foreign trade information management system. Because of the particular structure of the pdf format, this step has usually been done manually with a keyboard, page by page, line by line and field by field, which is time-consuming, laborious and error-prone. A method and a system that take into account both the structural characteristics of the pdf format and the information characteristics of order files, and that automatically extract the detailed order information from the customer's pdf file and import it into the company database, can therefore greatly improve the working efficiency of operators using a foreign trade information management system, save a large amount of time and improve the user experience.
Disclosure of Invention
The invention provides a method and a system for automatically extracting order information from pdf format orders. They automatically extract the key order information from pdf order files sent by customers and import it into the database of the company's foreign trade information management system, which can markedly improve the working efficiency of foreign trade operators and the user experience. This solves the problem that this step has conventionally been performed manually with a keyboard, page by page, line by line and field by field, which is time-consuming, laborious and error-prone.
The invention provides a method for automatically extracting order information from an order file in pdf format, which comprises the following steps:
S1, parsing a customer order file in pdf format to obtain paged, blocked information sorted by character string position;
S2, merging the paged, blocked information sorted by character string position line by line into a plain text file;
S3, according to the characteristics of the pdf file information in the customer order, using regular expression programming to capture and extract the key order information from the merged plain text file.
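The three steps above can be pictured as a simple pipeline. The following is a minimal sketch only: the function names, the `(x, y, text)` data shape and the toy stand-ins are assumptions made for illustration and are not taken from the patent.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical data shapes, not taken from the patent:
# a positioned string is (x, y, text); a page is a list of positioned strings.
PositionedString = Tuple[float, float, str]
Page = List[PositionedString]


def extract_order_info(
    source: bytes,
    parse: Callable[[bytes], List[Page]],       # S1: pdf -> positioned strings per page
    merge: Callable[[List[Page]], str],         # S2: pages -> plain text
    capture: Callable[[str], Dict[str, str]],   # S3: plain text -> key information
) -> Dict[str, str]:
    """Run the three steps in order and return the captured key information."""
    pages = parse(source)
    text = merge(pages)
    return capture(text)


# Toy stand-ins so the sketch runs; a real S1 would parse actual pdf content streams.
def toy_parse(source: bytes) -> List[Page]:
    return [[(72.0, 700.0, source.decode("ascii"))]]


def toy_merge(pages: List[Page]) -> str:
    # Top to bottom (larger y first), then left to right within a page.
    return "\n".join(
        " ".join(text for _, _, text in sorted(page, key=lambda s: (-s[1], s[0])))
        for page in pages
    )


def toy_capture(text: str) -> Dict[str, str]:
    match = re.search(r"Order Number:\s+?(\d{7})", text)
    return {"order_number": match.group(1)} if match else {}


result = extract_order_info(b"Order Number: 1234567", toy_parse, toy_merge, toy_capture)
print(result)
```

Passing the three steps in as callables keeps each stage replaceable, which mirrors the separation between parsing, merging and capturing described in the text.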
Preferably, parsing the customer order file in pdf format in step S1 includes the following steps:
S101, parsing the pdf customer order file page by page and searching it for Tj or TJ tags (the text-showing operators of a pdf content stream);
S102, obtaining the character string content and its position information from the Tj or TJ tags;
S103, parsing the pdf customer order file page by page, searching it for l or re tags (the line and rectangle path operators), and obtaining the position information of each drawn line or rectangle;
S104, deriving the position range of the table block in the order file from the positions of the drawn lines or rectangles;
S105, using the position range of the table block to judge whether each character string obtained from the Tj or TJ tags lies inside the table, and accordingly dividing the character strings in each page into two types: those that belong to the table block and those that do not.
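Steps S101 to S105 can be illustrated on a toy, already-decoded content stream. This is a rough sketch under strong assumptions: the sample stream is invented, each text string is assumed to sit in its own `BT ... ET` block positioned by a single `Td`, only `re` rectangles are used to find the table range (the `l` line operator is ignored), and escaped parentheses, compression, fonts and `Tm`/`cm` matrices are not handled. All function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

PositionedString = Tuple[float, float, str]   # (x, y, text)
Box = Tuple[float, float, float, float]       # (x0, y0, x1, y1)

# Invented sample stream for illustration only.
STREAM = """
BT /F1 10 Tf 72 700 Td (Order Number: 1234567) Tj ET
100 500 300 100 re S
BT /F1 10 Tf 110 580 Td (ITEM01) Tj ET
BT /F1 10 Tf 210 580 Td [(RE) -20 (D)] TJ ET
"""

# "x y Td" followed by either "(string) Tj" or "[array] TJ" inside one BT ... ET block.
TEXT_RE = re.compile(
    r"BT\b.*?(-?[\d.]+)\s+(-?[\d.]+)\s+Td\s+"
    r"(?:\((.*?)\)\s*Tj|\[(.*?)\]\s*TJ)\s*ET",
    re.S,
)
# "x y width height re" appends a rectangle to the current path.
RECT_RE = re.compile(r"(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)\s+re\b")


def parse_strings(stream: str) -> List[PositionedString]:
    """S101/S102: collect string content and position from Tj/TJ."""
    found = []
    for m in TEXT_RE.finditer(stream):
        x, y = float(m.group(1)), float(m.group(2))
        if m.group(3) is not None:
            text = m.group(3)
        else:
            # A TJ array mixes strings and kerning numbers; keep only the strings.
            text = "".join(re.findall(r"\((.*?)\)", m.group(4)))
        found.append((x, y, text))
    return found


def table_range(stream: str) -> Optional[Box]:
    """S103/S104: bounding box of all rectangles, taken as the table block."""
    rects = [tuple(float(v) for v in m.groups()) for m in RECT_RE.finditer(stream)]
    if not rects:
        return None
    return (
        min(x for x, y, w, h in rects),
        min(y for x, y, w, h in rects),
        max(x + w for x, y, w, h in rects),
        max(y + h for x, y, w, h in rects),
    )


def classify(strings: List[PositionedString], box: Optional[Box]):
    """S105: split strings into those inside the table block and the rest."""
    in_table, outside = [], []
    for s in strings:
        x, y, _ = s
        inside = box is not None and box[0] <= x <= box[2] and box[1] <= y <= box[3]
        (in_table if inside else outside).append(s)
    return in_table, outside


strings = parse_strings(STREAM)
box = table_range(STREAM)
in_table, outside = classify(strings, box)
```

In practice an established pdf parsing library would replace the regular expressions used here; they only serve to make the role of the four operators visible.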
Preferably, merging into the plain text file in step S2 includes the following steps:
S201, dividing the character strings in each page that do not belong to the table block into blocks by position and sorting them row by row;
S202, assembling the character strings in each page that belong to the table block, according to the table's row and column layout, into a table expressed as several rows of character strings, and calculating the starting row position of the table;
S203, inserting the whole table, row by row, into the non-table character strings according to the row position order within each page, thereby forming a plain text page;
S204, merging the plain text pages in page order and outputting them as a plain text file.
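Steps S201 to S204 amount to sorting strings by position and splicing the table rows in at the right height. The sketch below assumes pdf coordinates with the origin at the bottom left (larger y is higher on the page), a hypothetical tolerance for grouping strings into one line, and table rows that have already been rendered as strings; none of these details come from the patent.

```python
from typing import List, Tuple

PositionedString = Tuple[float, float, str]   # (x, y, text)


def group_lines(strings: List[PositionedString], tol: float = 2.0) -> List[Tuple[float, str]]:
    """S201: group strings with (nearly) equal y into lines, ordered top to bottom."""
    lines: List[Tuple[float, List[PositionedString]]] = []
    for s in sorted(strings, key=lambda s: -s[1]):
        if lines and abs(lines[-1][0] - s[1]) <= tol:
            lines[-1][1].append(s)
        else:
            lines.append((s[1], [s]))
    return [
        (y, " ".join(text for _, _, text in sorted(items, key=lambda it: it[0])))
        for y, items in lines
    ]


def merge_page(
    non_table: List[PositionedString],
    table_rows: List[str],
    table_top_y: float,
) -> str:
    """S203: insert the table rows before the first line that lies below the table top."""
    out: List[str] = []
    inserted = False
    for y, text in group_lines(non_table):
        if not inserted and table_rows and y < table_top_y:
            out.extend(table_rows)
            inserted = True
        out.append(text)
    if not inserted:
        out.extend(table_rows)
    return "\n".join(out)


def merge_pages(pages: List[str]) -> str:
    """S204: concatenate plain text pages in page order."""
    return "\n".join(pages)


page_text = merge_page(
    non_table=[
        (72.0, 700.0, "Order Number: 1234567"),
        (72.0, 450.0, "Total: 2"),
        (300.0, 700.0, "Date: 2019/12/17"),
    ],
    table_rows=["ITEM01\tRED", "ITEM02\tBLUE"],
    table_top_y=600.0,
)
document_text = merge_pages([page_text, "Page 2 text"])
```

The tolerance `tol` is a design choice: strings on the same visual line often differ slightly in baseline, so exact equality of y would split them.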
Preferably, each row of the table is expressed as one row character string, and the column information of the table is expressed within that string by fixed column lengths and column spacers.
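One way to picture the fixed column length and column spacer is the following sketch. The column start positions, the width of 8 characters and the "|" spacer are assumptions chosen for the example; the patent does not specify them, and a tab character could serve as the spacer equally well.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple

PositionedString = Tuple[float, float, str]   # (x, y, text)


def table_to_rows(
    cells: List[PositionedString],
    column_starts: List[float],
    width: int = 8,
    spacer: str = "|",
) -> List[str]:
    """Render table cells as row strings with fixed-width columns and a spacer."""
    rows: Dict[float, Dict[int, str]] = defaultdict(dict)
    for x, y, text in cells:
        # Column index = last column whose left edge is at or before x.
        col = bisect_right(column_starts, x) - 1
        rows[y][col] = text
    lines = []
    for y in sorted(rows, reverse=True):          # top row first
        padded = [
            rows[y].get(i, "").ljust(width)[:width]   # pad or truncate to the fixed length
            for i in range(len(column_starts))
        ]
        lines.append(spacer.join(padded))
    return lines


row_strings = table_to_rows(
    cells=[
        (110.0, 580.0, "ITEM01"), (210.0, 580.0, "RED"),
        (110.0, 560.0, "ITEM02"), (210.0, 560.0, "BLUE"),
    ],
    column_starts=[100.0, 200.0],
)
```

This sketch groups cells by exact y value, which assumes every cell in a row shares the same baseline; cells left of the first column are silently dropped.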
Preferably, in step S3, the pdf file information in the customer order is classified by its characteristics into the following two types:
non-table order key information, for which a regular expression matching the format in which the customer describes the order information in the pdf file is used to capture and extract the corresponding order key information from the merged plain text file;
and table order key information, for which a regular expression matching the format of the table information, as the customer describes the order information in the pdf file, is used to capture and extract the corresponding order key information in the table from the merged plain text file.
A system for automatically extracting order information from an order file in pdf format, comprising:
the analysis module is used for analyzing the customer order file in the pdf format to obtain paging block information which is ordered according to the character string position;
the merging module is used for merging the paging block information sequenced according to the character string positions into a plain text file line by line;
and the capturing module is used for capturing and extracting order key information from the combined plain text file by adopting regular expression programming according to the characteristics of pdf file information in the customer order.
Preferably, the parsing module includes:
the character analysis module is used for analyzing the pdf file page by page, searching Tj or TJ labels from the pdf file page by page, and acquiring character string contents and position information of the character string contents from the Tj or TJ labels;
the line drawing analysis module is used for analyzing the pdf file page by page, searching l or re labels from the pdf file, acquiring the position information of the line drawing or rectangle drawing, and synthesizing the position range of the form blocks in the order file according to the positions of a plurality of line drawing or rectangles;
and the table analysis module is used for comparing and judging whether the character strings acquired from the Tj or TJ labels belong to the character strings in the table according to the position range of the table blocks, and dividing the character strings in each page into two types according to the character strings, wherein one type belongs to the table blocks and the other type does not belong to the table blocks.
Preferably, the merging module includes:
the single page merging module is used for dividing the character strings which do not belong to the table blocks in each page according to the positions and sequencing the character strings line by line; integrating the character strings belonging to the table blocks in each page into a table expressed by a plurality of rows of character strings according to the shape of the rows and the columns of the table, and calculating the initial row position of the table; inserting the whole form into the non-form character string in rows according to the row position sequence in each page, and combining to form a plain text page;
and the multi-page merging module is used for merging and outputting each plain text page into a plain text file according to the page sequence.
Preferably, the capturing module includes:
the non-form capturing module is used for capturing and extracting corresponding order key information from the combined plain text file by adopting regular expression programming corresponding to the format of the order key information according to the format characteristics of the client describing the order information in the pdf file;
the table capturing module is used for capturing and extracting the corresponding order key information in the table from the merged plain text file by adopting regular expression programming corresponding to the format of the table information according to the format characteristics of the client describing the order information in the pdf file.
The invention has the following beneficial effects:
With the method and system for automatically extracting order information from pdf format orders, when an operator at a foreign trade company imports the order information in a customer's pdf file into the company's foreign trade information management database, the former manual entry page by page, line by line and field by field is replaced: the detailed order information in the customer's pdf file is extracted and imported into the company database automatically by a program. This greatly improves the operator's working efficiency with the foreign trade information management system, saves a large amount of time and improves the user experience.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the system structure of the present invention.
Description of the embodiments
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIGS. 1-2, a method for automatically extracting order information from an order file in pdf format includes the following steps:
S1, parsing a customer order file in pdf format to obtain paged, blocked information sorted by character string position;
S2, merging the paged, blocked information sorted by character string position line by line into a plain text file;
S3, according to the characteristics of the pdf file information in the customer order, using regular expression programming to capture and extract the key order information from the merged plain text file.
In step S3, to extract non-table order key information such as the order number (Order Number) from the customer order, the following regular expression (1) may be used:
Order Number:\s+?(\d{7}) (1)
In the regular expression above:
"Order Number:" means that the extracted order information character string must be preceded by the identifier "Order Number:";
"\s+" means one or more whitespace characters, such as spaces;
"?" after "+" makes the match lazy, so that as few whitespace characters as possible are consumed;
"\d{7}" means a character string of exactly 7 digits;
"()" means that the character string matching the pattern inside the parentheses is captured and used as the return value.
When regular expression (1) is applied to the merged plain text file, the character strings that match this pattern are captured, and the captured 7-digit strings are the return values.
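As a small illustration, regular expression (1) can be applied with Python's `re` module as follows. The sample text and the variable names are invented for the example.

```python
import re

# Regular expression (1) from the description.
ORDER_NUMBER_RE = re.compile(r"Order Number:\s+?(\d{7})")

# Invented sample of merged plain text.
sample_text = "Customer: Example Trading\nOrder Number:  1234567\nShip to: Example Port"

match = ORDER_NUMBER_RE.search(sample_text)
order_number = match.group(1) if match else None

# "\s+" requires at least one whitespace character, so this variant does not match.
no_space = ORDER_NUMBER_RE.search("Order Number:1234567")
```

Note that the pattern does not anchor the end of the number, so for a longer digit run it would capture only the first 7 digits.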
In step S3, to capture and extract table order key information from the order page, the following regular expression (2) may be used:
(\w+)\t(\w+)\t([\d\/]+)\t(\d+)\t(\w+)\t(\w+)\t([\d\/]+)\t(\d+)\t(\w+)\t(\w+)\t([\d\/]+)\t(\d+) (2)
In the regular expression above:
"\w" means a word character (a letter, digit or underscore);
"\t" means a tab character;
"\d" means a digit character;
"+" means one or more;
"[ ]" means any one of the characters listed inside the brackets;
"\/" means a literal slash character "/".
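Regular expression (2) captures twelve groups, which read naturally as three repetitions of the same four-field pattern. The sketch below applies it to an invented tab-separated row and splits the groups into records of four; what each field means (for example item, colour, date, quantity) is an assumption, since the description does not name the columns.

```python
import re
from typing import List, Tuple

# Regular expression (2) from the description, split across lines for readability.
TABLE_ROW_RE = re.compile(
    r"(\w+)\t(\w+)\t([\d\/]+)\t(\d+)\t"
    r"(\w+)\t(\w+)\t([\d\/]+)\t(\d+)\t"
    r"(\w+)\t(\w+)\t([\d\/]+)\t(\d+)"
)


def parse_table_rows(text: str) -> List[Tuple[str, ...]]:
    """Return one 4-tuple per repeated field group in every matching row."""
    records: List[Tuple[str, ...]] = []
    for m in TABLE_ROW_RE.finditer(text):
        groups = m.groups()
        records.extend(groups[i:i + 4] for i in range(0, 12, 4))
    return records


# Invented sample: one table row followed by a non-table line.
sample_text = (
    "A001\tRED\t2019/12/17\t100\t"
    "A002\tBLUE\t2019/12/18\t200\t"
    "A003\tGREEN\t2019/12/19\t300\n"
    "Total: 600"
)
records = parse_table_rows(sample_text)
```

Because the pattern hard-codes three field groups per row, a customer whose table has a different column count would need a different expression, which matches the description's point that the expression follows each customer's format.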
In this technical solution, parsing the customer order file in pdf format in step S1 comprises the following steps:
S101, parsing the pdf customer order file page by page and searching it for Tj or TJ tags;
S102, obtaining the character string content and its position information from the Tj or TJ tags;
S103, parsing the pdf customer order file page by page, searching it for l or re tags, and obtaining the position information of each drawn line or rectangle;
S104, deriving the position range of the table block in the order file from the positions of the drawn lines or rectangles;
S105, using the position range of the table block to judge whether each character string obtained from the Tj or TJ tags lies inside the table, and accordingly dividing the character strings in each page into two types: those that belong to the table block and those that do not.
In this technical solution, merging into the plain text file in step S2 comprises the following steps:
S201, dividing the character strings in each page that do not belong to the table block into blocks by position and sorting them row by row;
S202, assembling the character strings in each page that belong to the table block, according to the table's row and column layout, into a table expressed as several rows of character strings, and calculating the starting row position of the table;
S203, inserting the whole table, row by row, into the non-table character strings according to the row position order within each page, thereby forming a plain text page;
S204, merging the plain text pages in page order and outputting them as a plain text file.
In this technical solution, each row of the table is expressed as one row character string, and the column information of the table is expressed within that string by fixed column lengths and column spacers.
In this technical solution, in step S3, the pdf file information in the customer order is classified by its characteristics into the following two types:
non-table order key information, for which a regular expression matching the format in which the customer describes the order information in the pdf file is used to capture and extract the corresponding order key information from the merged plain text file;
and table order key information, for which a regular expression matching the format of the table information, as the customer describes the order information in the pdf file, is used to capture and extract the corresponding order key information in the table from the merged plain text file.
A system for automatically extracting order information from an order file in pdf format, comprising:
the analysis module 10 is used for analyzing the customer order file in the pdf format to obtain paging block information which is ordered according to the character string position;
the merging module 20 is configured to merge the paged and blocked information sequenced according to the character string positions line by line into a plain text file;
the capturing module 30 is configured to capture and extract order key information from the combined plain text file by adopting regular expression programming according to characteristics of pdf file information in the customer order.
In this technical solution, the parsing module 10 includes:
the character analysis module 101 is used for analyzing the pdf file page by page, searching for a Tj or TJ label from the pdf file page by page, and acquiring character string content and position information of the character string content from the Tj or TJ label;
the line drawing analysis module 102 is used for analyzing the pdf file page by page, searching l or re labels from the pdf file, acquiring the position information of the line drawing or the rectangle drawing, and synthesizing the position range of the form blocks in the order file according to the positions of a plurality of line drawing or a plurality of rectangles;
the table parsing module 103 is configured to compare and determine, according to the location range of the table block, whether the character string obtained from the Tj or TJ tag belongs to a character string in the table, and divide the character string in each page into two types according to the comparison, where one type belongs to the table block and the other type does not belong to the table block.
In this embodiment, the merging module 20 includes:
a single page merging module 201, configured to divide the strings in each page that do not belong to the table blocks according to the positions and sort the strings line by line; the character strings belonging to the table blocks in each page are synthesized into a table expressed by a plurality of rows of character strings according to the shape of the rows and the columns of the table, and the initial row positions of the table are calculated; inserting the whole form into the non-form character string in rows according to the row position sequence in each page, and combining to form a plain text page;
the multi-page merging module 202 is configured to merge and output each plain text page into a plain text file according to the page order.
In this technical solution, the capturing module 30 includes:
the non-table capturing module 301 is configured to capture and extract, for the non-table order key information, the corresponding order key information from the merged plain text file by adopting regular expression programming corresponding to the format of the order key information according to the format characteristics described by the client for the order information in the pdf file;
the table capturing module 302 is configured to capture and extract, for the table type order key information, the corresponding order key information in the table from the merged plain text file by adopting regular expression programming corresponding to the format of the table information according to the format characteristics described by the client for the order information in the pdf file.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A method for automatically extracting order information from pdf format order files, comprising the steps of:
S1, analyzing a customer order file in a pdf format to obtain paging block information which is ordered according to character string positions;
S2, merging the paging and blocking information sequenced according to the character string positions line by line into a plain text file;
S3, according to the characteristics of pdf file information in the customer order, adopting regular expression programming, and capturing and extracting key information of the order from the combined plain text file;
in step S1, the customer order file in pdf format is parsed, including the steps of:
S101, analyzing pdf customer order files page by page, and searching Tj or TJ labels from the pdf customer order files;
S102, acquiring character string content and position information of the character string content from Tj or TJ tags;
S103, analyzing the pdf customer order file page by page, searching for the l or re label from the pdf customer order file, and obtaining position information of a drawn line or a drawn rectangle;
S104, according to the positions of a plurality of drawn lines or a plurality of rectangles, the position range of the table block in the order file is synthesized;
S105, comparing and judging whether the character strings obtained from the Tj or TJ labels belong to the character strings in the table according to the position range of the table blocks, and dividing the character strings in each page into two types according to the character strings, wherein one type belongs to the table blocks and the other type does not belong to the table blocks;
merging into the plain text file in step S2 comprises the following steps:
S201, dividing character strings which do not belong to table blocks in each page according to positions and sequencing the character strings row by row;
S202, integrating the character strings belonging to the table blocks in each page into a table expressed by a plurality of rows of character strings according to the shape of the rows and the columns of the table, and calculating the initial row position of the table;
S203, inserting the whole table into the non-table character string in rows according to the row position sequence in each page, and combining to form a plain text page;
S204, merging and outputting each plain text page into a plain text file according to the page sequence;
in step S3, the pdf file information in the customer order is classified by its characteristics into the following two types:
non-form order key information, according to format characteristics of a customer describing order information in the pdf file, adopting regular expression programming corresponding to the order key information format, capturing and extracting corresponding order key information from the combined plain text file;
and the form order key information adopts regular expression programming corresponding to the form information format according to the format characteristics of the client describing the order information in the pdf file, and captures and extracts the corresponding order key information in the form from the merged plain text file.
2. The method for automatically extracting order information from pdf-formatted order files of claim 1 wherein: each row of information in a table expressed in rows of strings is expressed in a row of strings, and column information of the table is expressed in a column of fixed length and column spacers in the row of strings.
3. A system for implementing the method for automatically extracting order information from pdf-formatted order files of claim 2, comprising:
the analysis module (10) is used for analyzing the customer order file in the pdf format to obtain paging block information which is ordered according to the character string position;
the merging module (20) is used for merging the paging block information sequenced according to the character string positions into a plain text file line by line;
and the capturing module (30) is used for capturing and extracting order key information from the combined plain text file by adopting regular expression programming according to the characteristics of pdf file information in the customer order.
4. A system as claimed in claim 3, wherein the parsing module (10) comprises:
the character analysis module (101) is used for analyzing the pdf file page by page, searching Tj or TJ labels from the pdf file page by page, and acquiring character string contents and position information of the character string contents from the Tj or the TJ labels;
the line drawing analysis module (102) is used for analyzing the pdf file page by page, searching l or re labels from the pdf file, acquiring the position information of the line drawing or the rectangle drawing, and synthesizing the position range of the form block in the order file according to the positions of a plurality of line drawing or a plurality of rectangles;
and the table analysis module (103) is used for comparing and judging whether the character strings acquired from the Tj or TJ labels belong to the character strings in the table according to the position range of the table blocks, and dividing the character strings in each page into two types according to the character strings, wherein one type belongs to the table blocks and the other type does not belong to the table blocks.
5. The system according to claim 4, wherein the merging module (20) comprises:
a single page merging module (201) for dividing the character strings which do not belong to the table blocks in each page according to the positions and sequencing the character strings row by row; integrating the character strings belonging to the table blocks in each page into a table expressed by a plurality of rows of character strings according to the shape of the rows and columns of the table, and calculating the initial row position of the table; inserting the whole form into the non-form character string in rows according to the row position sequence in each page, and combining to form a plain text page;
and the multi-page merging module (202) is used for merging and outputting each plain text page into a plain text file according to the page sequence.
6. A system as claimed in claim 3, wherein the capturing module (30) comprises:
the non-form capturing module (301) is used for capturing and extracting corresponding order key information from the combined plain text file by adopting regular expression programming corresponding to the format of the order key information according to the format characteristics of the client describing the order information in the pdf file;
and the table capturing module (302) is used for capturing and extracting the corresponding order key information in the table from the merged plain text file by adopting regular expression programming corresponding to the format of the table information according to the format characteristics of the client describing the order information in the pdf file.
CN201911297269.XA 2019-12-17 2019-12-17 Method and system for automatically extracting order information from pdf format order Active CN111027285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911297269.XA CN111027285B (en) 2019-12-17 2019-12-17 Method and system for automatically extracting order information from pdf format order

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911297269.XA CN111027285B (en) 2019-12-17 2019-12-17 Method and system for automatically extracting order information from pdf format order

Publications (2)

Publication Number Publication Date
CN111027285A (en) 2020-04-17
CN111027285B (en) 2023-06-16

Family

ID=70209589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911297269.XA Active CN111027285B (en) 2019-12-17 2019-12-17 Method and system for automatically extracting order information from pdf format order

Country Status (1)

Country Link
CN (1) CN111027285B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001768A (en) * 2020-07-10 2020-11-27 苏宁云计算有限公司 E-commerce platform shop opening method and device based on robot process automation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081736A (en) * 2009-11-27 2011-06-01 株式会社理光 Equipment and method for extracting enclosing rectangles of characters from portable electronic documents
CN103530574A (en) * 2013-09-23 2014-01-22 中山大学 Method for inserting and extracting hidden information based on English PDF document
CN109062874A (en) * 2018-06-12 2018-12-21 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of financial data
CN110516048A (en) * 2019-09-02 2019-11-29 苏州朗动网络科技有限公司 The extracting method, equipment and storage medium of list data in pdf document

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9785982B2 (en) * 2011-09-12 2017-10-10 Doco Labs, Llc Telecom profitability management
US20160055376A1 (en) * 2014-06-21 2016-02-25 iQG DBA iQGATEWAY LLC Method and system for identification and extraction of data from structured documents
CN106331354B (en) * 2016-08-26 2019-06-04 商客通尚景科技(上海)股份有限公司 A kind of short message information extracting and analysis method
US20190122043A1 (en) * 2017-10-23 2019-04-25 Education & Career Compass Electronic document processing
CN108595402A (en) * 2018-04-28 2018-09-28 西安极数宝数据服务有限公司 A kind of system of extraction PDF form datas
CN109614596B (en) * 2018-12-13 2020-07-07 税友软件集团股份有限公司 Electronic bill processing method, device and system

Also Published As

Publication number Publication date
CN111027285A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN107622230B (en) PDF table data analysis method based on region identification and segmentation
CN107729526B (en) Text structuring method
CN109669933B (en) Transaction data intelligent processing method and device and computer readable storage medium
US11010543B1 (en) Systems and methods for table extraction in documents
CN108984593A (en) The method that multi-format text keeps off typing and compares
CN112926299B (en) Text comparison method, contract review method and auditing system
CN106407450A (en) File searching method and apparatus
GB2487600A (en) System for extracting data from an electronic document
CN110909123A (en) Data extraction method and device, terminal equipment and storage medium
WO2019041442A1 (en) Method and system for structural extraction of figure data, electronic device, and computer readable storage medium
CN111027285B (en) Method and system for automatically extracting order information from pdf format order
CN106777259A (en) The method and device of structured message in adaptive decimation HTML Table labels
GB2588251A (en) Partial perceptual image hashing for invoice deconstruction
CN110287784A (en) A kind of annual report text structure recognition methods
CN106649308B (en) Word segmentation and word library updating method and system
CN111291547B (en) Template generation method, device, equipment and medium
CN114065719A (en) Document processing method and device, electronic equipment and computer readable storage medium
CN111966640A (en) Document file identification method and system
CN109145879B (en) Method, equipment and storage medium for identifying printing font
CN111985881A (en) Intelligent contract review system and method
CN116796707A (en) Document multi-format data filling and modularized automatic generation method
CN107145947B (en) Information processing method and device and electronic equipment
CN106815196B (en) Soft text display frequency statistical method and device
CN114417820A (en) Content filtering method for target object
CN114495138A (en) Intelligent document identification and feature extraction method, device platform and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant