CN112395874A - Order information correction method, device, equipment and storage medium - Google Patents

Order information correction method, device, equipment and storage medium

Info

Publication number
CN112395874A
Authority
CN
China
Prior art keywords
information
order
corrected
target
text box
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011339777.2A
Other languages
Chinese (zh)
Inventor
张斌
彭佳玮
陈凯歌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sensetime International Pte Ltd
Original Assignee
Sensetime International Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sensetime International Pte Ltd
Priority to CN202011339777.2A
Publication of CN112395874A
Priority to PCT/IB2021/055848 (WO2022112857A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0633 Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635 Processing of requisition or of purchase orders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/14 Travel agencies

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for correcting order information is disclosed. The method includes: obtaining order information to be corrected according to a text recognition result of an order; determining target search information from the text recognition result; acquiring, in a preset search mode, order reference information matched with the target search information; and correcting the order information to be corrected by using the order reference information to obtain target order information.

Description

Order information correction method, device, equipment and storage medium
Technical Field
The present disclosure relates to computer vision technologies, and in particular, to a method, an apparatus, a device, and a storage medium for correcting order information.
Background
Currently, OCR (Optical Character Recognition) technology is widely used in many fields and industries, and most text characters in text data images can be recognized by the technology. However, errors may occur in the information extracted from the OCR results due to the accuracy of the OCR results. How to obtain accurate information from the OCR results is still under further study.
Disclosure of Invention
The embodiment of the disclosure provides a correction scheme of order information.
According to an aspect of the present disclosure, there is provided a method for correcting order information, the method including: obtaining order information to be corrected according to a text recognition result of the order; determining target search information from the text recognition result; acquiring order reference information matched with the target search information in a preset search mode; and correcting the order information to be corrected by using the order reference information to obtain target order information.
In combination with any one of the embodiments provided by the present disclosure, the target search information includes a part of the content of the order information to be corrected; the partial content includes at least one of the following: a subject name; at least one element of the order information to be corrected.
In combination with any embodiment provided by the present disclosure, the obtaining of the order reference information matched with the target search information in a preset search manner includes at least one of the following: accessing a setting database to acquire order reference information matched with the target search information from the setting database; and obtaining order reference information matched with the target search information through the Internet.
In combination with any one of the embodiments provided by the present disclosure, the setting database includes reference unit information of a plurality of levels, and the reference unit information of the lowest level of the plurality of levels corresponds to a plurality of reference subject names.
In combination with any one of the embodiments provided by the present disclosure, the setting database stores first reference information corresponding to a reference subject name; the determining of target search information from the text recognition result includes: acquiring unit information of the lowest level in the order information to be corrected according to the level division in the setting database; the obtaining of the order reference information matched with the target search information from the setting database includes: determining, in the reference unit information of the lowest level in the setting database, target unit information matched with the unit information of the lowest level in the order information to be corrected; determining a target subject name which meets a preset condition in a plurality of reference subject names corresponding to the target unit information; and obtaining order reference information matched with the target search information according to the first reference information corresponding to the target subject name.
In combination with any one of the embodiments provided by the present disclosure, the setting database stores second reference information corresponding to a reference subject name; the obtaining of the order reference information matched with the target search information from the setting database includes: acquiring unit information of the lowest level in the order information to be corrected according to the level division in the setting database; determining, in the reference unit information of the lowest level in the setting database, target unit information matched with the unit information of the lowest level in the order information to be corrected; determining a target subject name which meets a preset condition in a plurality of reference subject names corresponding to the target unit information; and obtaining order reference information matched with the target search information according to the reference unit information of each level corresponding to the target subject name and the second reference information corresponding to the target subject name.
In combination with any one of the embodiments provided by the present disclosure, the determining a target subject name that meets a preset condition in a plurality of reference subject names corresponding to the target unit information includes: matching the subject names corresponding to the order information to be corrected with a plurality of reference subject names corresponding to the target unit information respectively; and determining the reference subject name with the highest matching score and exceeding a first set threshold value as the target subject name.
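As an illustration of the preset condition described above, the following minimal sketch scores each reference subject name against the recognized subject name and keeps the best one only if it exceeds a threshold. The disclosure does not specify how the matching score is computed; the use of `difflib.SequenceMatcher`, the function names `match_score` and `pick_target_subject`, and the threshold value 0.8 are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def match_score(a: str, b: str) -> float:
    """Similarity in [0, 1]. SequenceMatcher is an illustrative choice,
    not the scoring method of the disclosure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_target_subject(
    recognized_name: str,
    reference_names: Sequence[str],
    first_threshold: float = 0.8,  # hypothetical "first set threshold"
) -> Optional[str]:
    """Return the reference subject name with the highest score,
    provided that score exceeds the threshold; otherwise None."""
    best_name: Optional[str] = None
    best_score = 0.0
    for name in reference_names:
        score = match_score(recognized_name, name)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > first_threshold else None
```

With this sketch, a recognized name containing a single OCR error such as "Grand Plaza Hote1" would still select "Grand Plaza Hotel" from the candidates, while a name that resembles no candidate yields `None`.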
In combination with any one of the embodiments provided by the present disclosure, the obtaining, through the internet, the order reference information matched with the target search information includes: searching in the Internet according to partial content of the order information to be corrected to obtain at least one piece of reference information matched with the target search information; matching reference information corresponding to the target search information with the order information to be corrected; and acquiring the order reference information with the highest matching score and exceeding a second set threshold.
In combination with any embodiment provided by the present disclosure, the method further comprises: adding the order reference information acquired from the Internet and the subject name corresponding to the order information to be corrected to the information corresponding to the reference unit information of the lowest level in the setting database.
In combination with any embodiment provided by the present disclosure, the method further comprises: updating the information corresponding to the reference unit information of the lowest level in the setting database according to the order reference information acquired from the Internet and the subject name corresponding to the order information to be corrected.
In combination with any embodiment provided by the present disclosure, the order information to be corrected includes at least address information, and the at least one element included in the order information to be corrected includes at least one of the following: an administrative area; a postal code. The reference unit information of the plurality of levels included in the setting database includes reference administrative area information or postal code information.
In combination with any embodiment provided by the present disclosure, the obtaining order information to be corrected according to the text recognition result of the order includes: acquiring a text recognition result of the order, wherein the text recognition result comprises a plurality of text boxes; determining a first text box containing key information from the plurality of text boxes, wherein the key information comprises partial content of the order information to be corrected, and the partial content comprises at least one element in the order information to be corrected and at least one of keywords indicating the order information to be corrected; combining at least part of the text boxes according to the first text box to obtain a combined text box; and acquiring the order information to be corrected from the combined text box.
According to an aspect of the present disclosure, there is provided an apparatus for correcting order information, the apparatus including: the acquisition unit is used for acquiring the information of the order to be corrected according to the text recognition result of the order; a determination unit configured to determine target search information from the text recognition result; the matching unit is used for acquiring order reference information matched with the target search information in a preset search mode; and the correcting unit is used for correcting the order information to be corrected by using the order reference information to obtain target order information.
In combination with any one of the embodiments provided by the present disclosure, the target search information includes a part of the content of the order information to be corrected; the partial content includes at least one of the following: a subject name; at least one element of the order information to be corrected.
In combination with any embodiment provided by the present disclosure, the matching unit is specifically configured to at least one of: accessing a setting database to acquire order reference information matched with the target search information from the setting database; and obtaining order reference information matched with the target search information through the Internet.
In combination with any one of the embodiments provided by the present disclosure, the setting database includes reference unit information of a plurality of levels, and the reference unit information of the lowest level of the plurality of levels corresponds to a plurality of reference subject names.
In combination with any one of the embodiments provided by the present disclosure, the setting database stores first reference information corresponding to a reference subject name; the determining unit is specifically configured for: acquiring unit information of the lowest level in the order information to be corrected according to the level division in the setting database; the matching unit is specifically configured for: determining, in the reference unit information of the lowest level in the setting database, target unit information matched with the unit information of the lowest level in the order information to be corrected; determining a target subject name which meets a preset condition in a plurality of reference subject names corresponding to the target unit information; and obtaining order reference information matched with the target search information according to the first reference information corresponding to the target subject name.
In combination with any one of the embodiments provided by the present disclosure, the setting database stores second reference information corresponding to a reference subject name; the matching unit is specifically configured for: acquiring unit information of the lowest level in the order information to be corrected according to the level division in the setting database; determining, in the reference unit information of the lowest level in the setting database, target unit information matched with the unit information of the lowest level in the order information to be corrected; determining a target subject name which meets a preset condition in a plurality of reference subject names corresponding to the target unit information; and obtaining order reference information matched with the target search information according to the reference unit information of each level corresponding to the target subject name and the second reference information corresponding to the target subject name.
In combination with any embodiment provided by the present disclosure, the matching unit is configured to determine, among a plurality of reference subject names corresponding to the target unit information, a target subject name that meets a preset condition, and specifically configured to: matching the subject names corresponding to the order information to be corrected with a plurality of reference subject names corresponding to the target unit information respectively; and determining the reference subject name with the highest matching score and exceeding a first set threshold value as the target subject name.
In combination with any one of the embodiments provided by the present disclosure, the matching unit is specifically configured to: searching in the Internet according to partial content of the order information to be corrected to obtain at least one piece of reference information matched with the target search information; matching reference information corresponding to the target search information with the order information to be corrected; and acquiring the order reference information with the highest matching score and exceeding a second set threshold.
In combination with any embodiment provided by the present disclosure, the apparatus further includes an adding unit, configured to add the order reference information acquired from the internet and the subject name corresponding to the order information to be corrected to information corresponding to reference unit information at a lowest level in the setting database.
In combination with any embodiment provided by the present disclosure, the apparatus further includes an updating unit, configured to update information corresponding to reference unit information of a lowest level in the setting database according to the order reference information acquired from the internet and a subject name corresponding to the order information to be corrected.
In combination with any embodiment provided by the present disclosure, the order information to be corrected includes at least address information, and the at least one element included in the order information to be corrected includes at least one of the following: an administrative area; a postal code. The reference unit information of the plurality of levels included in the setting database includes reference administrative area information or postal code information.
In combination with any one of the embodiments provided by the present disclosure, the obtaining unit is specifically configured to: acquiring a text recognition result of the object to be processed, wherein the text recognition result comprises a plurality of text boxes; determining a first text box containing key information from the plurality of text boxes, wherein the key information comprises partial content of the order information to be corrected, and the partial content comprises at least one element in the order information to be corrected and at least one of keywords indicating the order information to be corrected; combining at least part of the text boxes according to the first text box to obtain a combined text box; and acquiring the order information to be corrected from the combined text box.
According to an aspect of the present disclosure, an electronic device is provided, which includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement a method for correcting order information according to any embodiment of the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for correcting order information according to any one of the embodiments of the present disclosure.
According to the order information correction method, device, equipment and storage medium of one or more embodiments of the present disclosure, order information to be corrected is obtained according to a text recognition result of an order, target search information is determined from the text recognition result, order reference information matched with the target search information is obtained in a preset search mode, the order reference information is used to correct the order information to be corrected to obtain the target order information, and accurate target order information can be quickly obtained from the text recognition result of the order.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart of a method for correcting order information according to at least one embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of a setting database in a method for correcting order information according to at least one embodiment of the present disclosure;
Figs. 3A, 3B, and 3C are schematic diagrams of an information extraction method according to at least one embodiment of the present disclosure;
Fig. 4 is a schematic diagram of an apparatus for correcting order information according to at least one embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 illustrates a flow chart of a method of correcting order information according to some embodiments of the present disclosure. As shown in fig. 1, the method includes steps 101 to 104.
In step 101, order information to be corrected is obtained according to the text recognition result of the order.
In an embodiment of the present disclosure, the order on which text recognition is performed includes at least one of: an order image; an order in the form of an electronic document, such as a PDF document. It will be appreciated by those skilled in the art that the order may also be of another type suitable for text recognition.
In one example, text boxes contained in an order may be obtained by performing text detection on the order, and the text characters in the obtained text boxes may then be recognized by performing text recognition on them, so as to obtain a text recognition result. Alternatively, text recognition, such as OCR, may be performed directly on the order to be processed to obtain a text recognition result of the text boxes included in the order. The embodiment of the present disclosure does not limit the specific method for obtaining the text recognition result.
The order information to be corrected is obtained from the text recognition result of the order according to a set rule. For example, in the case where the order information to be corrected contains address information, the address information to be corrected may be acquired from the text recognition result according to the rule of the address information.
In step 102, target search information is determined from the text recognition result. The target search information is information related to the order information to be corrected, or information capable of reflecting the characteristics of the order information to be corrected.
In one example, the target search information includes a part of the content of the order information to be corrected; the partial content includes at least one of the following: a subject name; at least one element of the order information to be corrected. Taking address information as an example, the partial content indicated by the target search information may include a subject name (e.g., a person's name, a place name, etc.) to which the address information belongs, and at least one element included in the address information (e.g., each administrative district, the zip code corresponding to each administrative district, etc.).
In step 103, obtaining order reference information matched with the target search information through a preset search mode.
In the embodiment of the present disclosure, the order reference information matched with the target search information may be acquired by accessing a setting database. The setting database stores a plurality of reference subject names and corresponding reference information. For example, in a case where the order information to be corrected is address information, the setting database stores a plurality of subject names and corresponding address information; according to the subject name corresponding to the order information to be corrected, such as "XX hotel", and a postal code, a matching "XX hotel" may be searched for in the setting database, and the corresponding address information is used as the order reference information.
In the embodiment of the present disclosure, the order reference information matched with the target search information may also be acquired through the internet. Still taking the address information as an example, a search engine may be used to search in the internet according to the subject name and the zip code corresponding to the order information to be corrected, and the information corresponding to the retrieved matching subject name is used as the order reference information.
In the embodiment of the present disclosure, the order reference information matched with the target search information may also be acquired from both the setting database and the Internet. In the case where order reference information is obtained from both the setting database and the Internet, either one of them, or a designated one, may be used as the target order reference information; in the case where order reference information is obtained only from the Internet, the setting database may be updated with the Internet search result.
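The combination of the two search modes can be sketched as follows. This is only one possible arrangement under stated assumptions: the setting database is modelled as a plain dictionary from subject name to reference information, the database result is treated as the designated one when it exists, and `search_internet` is a hypothetical callable standing in for whatever search engine access is used.

```python
from typing import Callable, Dict, Optional

# Hypothetical signature: takes a subject name, returns reference info or None.
InternetSearch = Callable[[str], Optional[str]]


def find_reference(
    subject_name: str,
    setting_database: Dict[str, str],
    search_internet: InternetSearch,
) -> Optional[str]:
    """Look up order reference information, preferring the setting database."""
    # Assumption: the database entry is the "designated" source when present.
    reference = setting_database.get(subject_name)
    if reference is not None:
        return reference

    # Otherwise fall back to the Internet search.
    reference = search_internet(subject_name)
    if reference is not None:
        # Only the Internet produced a result, so the setting database
        # is updated with it for later lookups.
        setting_database[subject_name] = reference
    return reference
```

Querying the database first is a design choice of this sketch, not a requirement of the disclosure, which also allows both sources to be queried at the same time.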
In step 104, the order information to be corrected is corrected by using the order reference information to obtain target order information.
According to the order information correction method, device, equipment and storage medium of at least one embodiment of the present disclosure, order information to be corrected is obtained according to a text recognition result of an order, target search information is determined from the text recognition result, order reference information matched with the target search information is obtained in a preset search mode, and the order information to be corrected is corrected by using the order reference information to obtain target order information. Accurate target order information can thus be obtained quickly from the text recognition result of the order.
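Steps 101 to 104 can be sketched end to end for the address example. Everything concrete here is an assumption made for illustration: the extraction rule (a recognized line beginning with "Address:"), the six-digit postal code pattern used as the target search information, the dictionary `reference_by_postcode` standing in for the preset search mode, and the choice to correct by replacing the recognized address with the reference address.

```python
import re
from typing import Dict, List, Optional


def correct_order_address(
    text_lines: List[str],
    reference_by_postcode: Dict[str, str],
) -> Optional[str]:
    """Return the target address, or None if no address line is recognized."""
    # Step 101: obtain the address to be corrected with a simple rule
    # (hypothetical: the recognized line starts with "Address:").
    to_correct: Optional[str] = None
    for line in text_lines:
        if line.lower().startswith("address:"):
            to_correct = line.split(":", 1)[1].strip()
            break
    if to_correct is None:
        return None

    # Step 102: determine target search information, here a postal code
    # (hypothetical format: exactly six digits).
    found = re.search(r"\b\d{6}\b", to_correct)
    if found is None:
        return to_correct  # nothing to search with; leave uncorrected

    # Step 103: obtain matching order reference information.
    reference = reference_by_postcode.get(found.group())

    # Step 104: correct by replacing the recognized address with the
    # reference address when one was found.
    return reference if reference is not None else to_correct
```

For example, given the recognized lines `["Hotel: XX Hotel", "Address: 12 Nanjlng Rd, Shanghal 200001"]` and a reference entry for postal code "200001", the sketch returns the stored reference address in place of the misrecognized one.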
Address databases in the related art generally support only queries from a subject name to an address, and tolerate errors only at the beginning and end of the input word. The correction method provided by the embodiments of the present disclosure acquires the matched order reference information according to target search information determined from the text recognition result, and the target search information may be part of the content of the order information to be corrected, such as a subject name or an element of the order information to be corrected. Therefore, even if the order information contains wrong information, or even a wrong subject name, the method can use other information in the order information as the target search information to acquire the order reference information and correct the order information to be corrected, which gives it higher fault tolerance.
In addition, since the acquisition of the target search information is independent of the text arrangement mode of the order, the method for correcting the order information provided by at least one embodiment of the disclosure is applicable to orders with different layouts.
In some embodiments, from the text recognition result of the order, a subject name corresponding to the order information to be corrected may be acquired as the target search information. The subject name and the to-be-corrected order information are, for example, key-value pair information, where the subject name indicates an attribute, and the to-be-corrected order information indicates a value of the attribute.
In an example, the order information to be corrected may be address information, the subject corresponding to the address information is the object to which the address information belongs, and the corresponding subject name is the name of that object. For example, in the case where the object to which the address information belongs is a person, the corresponding subject name is the person's name; if the object to which the address information belongs is a place, the corresponding subject name is a place name. The order information to be corrected may also be identity information, in which case the subject name corresponding to the identity information is a person's name. It should be understood by those skilled in the art that the order information to be corrected may also be other types of information, and the disclosure is not limited thereto.
In some embodiments, the setting database may include reference unit information of a plurality of levels, and each piece of reference unit information of the lowest level of the plurality of levels corresponds to a plurality of reference subject names. In the setting database, the reference unit information is organized and stored by level from top to bottom, and reference unit information of a lower level has a narrower corresponding range or a lower authority. The lowest level is the reference unit information with the smallest corresponding range or the lowest authority. Taking a setting database for storing address information as an example, the reference unit information of the plurality of levels included in the setting database includes reference administrative area information and/or postal code information, and the reference unit information of the lowest level includes the name of the administrative area with the smallest range and/or the postal code corresponding to the smallest administrative area.
In one example, the reference unit information in the setting database may be stored in a tree structure, and non-leaf nodes of different levels store reference unit information of different levels, and the leaf nodes are used for storing reference subject names belonging to nodes of a previous level.
In some embodiments, the setting database further stores first reference information corresponding to each reference subject name. The first reference information is usually complete information corresponding to the reference subject name, and includes reference unit information of each level and specific reference information corresponding to the reference subject name. Taking address information as an example, the first reference information may be complete address information, including information of the administrative districts of each level and the specific address corresponding to the reference subject name, such as a street and/or a unit. The first reference information is obtained in advance, and is reference information corresponding to the reference subject name that has relatively high credibility and accuracy.
Taking the order information to be corrected as the address information in a hotel order as an example, the reference unit information of the multiple levels in the setting database may be administrative districts of the multiple levels. The tree structure for storing the address information may be a rooted tree, in which the root node has no actual meaning; the child nodes of the root node may be used to store the travel agency of the order (e.g., XX travel agency), and the remaining non-leaf nodes may be used to store the administrative district components or zip codes of a country; each leaf node may store an object name, and each leaf node may also store the complete address information corresponding to the object name. In the subtree corresponding to the same travel agency, every non-leaf node is unique, and the parent node of a non-leaf node represents the administrative district immediately above it.
Fig. 2 is a schematic structural diagram of a setting database in a method for correcting order information according to at least one embodiment of the present disclosure. As shown in FIG. 2, the subtree of a travel agency can be constructed according to a top-down (shallow-to-deep) hierarchy of country-province-city-district. In some cases, the level below the district may also include sub-areas, and an administrative district level may also be replaced with postal codes, for example, constructed as country-province-postal code-district. It should be understood by those skilled in the art that the above is merely an example, that the zip code may be substituted for any administrative area level, and that the present disclosure is not limited thereto.
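The tree organization described above can be illustrated with a minimal sketch. The class name `Node`, the helpers `insert` and `find_unit`, and the sample hotel and address are hypothetical and are not taken from the disclosure; as a simplification, the leaf entries (reference subject name and first reference information) are kept as a dictionary on the lowest-level unit node rather than as separate leaf node objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One non-leaf node: a travel agency, an administrative district, or a zip code."""
    label: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Leaf entries attached to a lowest-level unit:
    # reference subject name -> first reference information (complete address).
    subjects: Dict[str, str] = field(default_factory=dict)


def insert(root: Node, agency: str, levels: List[str],
           subject: str, full_address: str) -> Node:
    """Store a subject under agency -> level 1 -> ... -> lowest level."""
    node = root.children.setdefault(agency, Node(agency))
    for unit in levels:
        # setdefault keeps each non-leaf node unique within the agency subtree.
        node = node.children.setdefault(unit, Node(unit))
    node.subjects[subject] = full_address
    return node


def find_unit(root: Node, agency: str, levels: List[str]) -> Optional[Node]:
    """Walk the tree top-down; return the lowest-level unit node or None."""
    node = root.children.get(agency)
    for unit in levels:
        if node is None:
            return None
        node = node.children.get(unit)
    return node
```

With this layout, a path such as country-province-postal code-district is simply a different `levels` list, so replacing an administrative level with a zip code does not require a different structure.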
For the setting database for storing address information, the reference administrative district information of each level stored in the tree structure can be obtained according to the administrative division table of each country and the correspondence table of the postal code and the administrative district, which are published on the internet; the reference subject name stored by the leaf node and the corresponding first reference information can be obtained by manual labeling.
In the case that the first reference information corresponding to each reference subject name is also stored in the setting database, the order reference information corresponding to the order information to be corrected may be obtained in the following manner.
First, the unit information of the lowest hierarchy in the order information to be corrected may be acquired as target search information according to hierarchy division in the setting database.
Taking the order information to be corrected as the address information of a hotel order as an example, the unit information of each hierarchy included in the order information to be corrected can be obtained according to the hierarchy division of addresses in the setting database, that is, according to the tree structure of the database. For example, by splitting the order information to be corrected according to the tree structure "country-province-city-district" in the database, the administrative district information of each level included in the address information can be obtained. Here, the lowest-level administrative district information may be used as the target search information. For example, if the smallest administrative area included in the address information is a sub-area, the sub-area information may be used as the target search information; if the smallest administrative area included in the address information is a district, the district information may be used as the target search information; other cases are similar and will not be described in detail.
Next, among the reference unit information of the lowest hierarchy level in the setting database, target unit information that matches the unit information of the lowest hierarchy level in the order information to be corrected is determined. That is, the position at which the unit information of the lowest hierarchy in the order information to be corrected is stored is located in the tree structure of the setting database, and the reference unit information at that position, which corresponds to (matches) the unit information of the lowest hierarchy, is used as the target unit information.
Then, a target subject name that meets a preset condition is determined among the plurality of reference subject names corresponding to the target unit information.
Each piece of reference unit information of the lowest hierarchy in the setting database corresponds to a plurality of reference subject names, and thus a target subject name can be determined among the plurality of reference subject names according to a preset condition.
In one example, the subject name corresponding to the order information to be corrected may be matched against each of the plurality of reference subject names corresponding to the target unit information, and the reference subject name whose matching score is the highest and exceeds a first set threshold is determined as the target subject name.
And finally, obtaining order reference information matched with the target searching information according to the first reference information corresponding to the target subject name.
In the case where the first reference information corresponding to each reference subject name is stored in the setting database, the order reference information of the order information to be corrected is obtained according to the first reference information corresponding to the determined target subject name. The reference information stored in the setting database has relatively high reliability and accuracy, and correcting the order information to be corrected with this reference information yields more accurate target order information.
In some embodiments, the setting database stores second reference information corresponding to each reference subject name. The second reference information is reference information other than the reference unit information of each level, and is generally more specific than the reference unit information of each level. Taking the order information to be corrected as address information included in a hotel order as an example, the second reference information may be, for example, the street and/or unit where the hotel is located. The second reference information is obtained in advance, and is reference information corresponding to the reference subject name that has relatively high credibility and accuracy.
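The lookup steps above (lowest-level unit, target unit, target subject name, first reference information) can be sketched as follows. This is only an illustration under stated assumptions: the disclosure does not specify the matching function, so `difflib.SequenceMatcher` from the Python standard library is swapped in as the scoring method; the lowest-level unit is matched by exact key; and the function names, the threshold value 0.8, and the sample data are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def match_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; an assumed stand-in for the matching score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup_reference(lowest_units: Dict[str, Dict[str, str]],
                     lowest_unit: str,
                     subject_name: str,
                     first_threshold: float = 0.8) -> Optional[str]:
    """Return the first reference information for the best-matching subject.

    lowest_units maps each lowest-level reference unit to its
    {reference subject name: first reference information} entries.
    """
    subjects = lowest_units.get(lowest_unit)   # target unit information
    if not subjects:
        return None
    best_name, best_score = None, 0.0
    for name in subjects:
        score = match_score(subject_name, name)
        if score > best_score:
            best_name, best_score = name, score
    # The target subject name must have the highest score AND exceed the threshold.
    if best_name is None or best_score <= first_threshold:
        return None
    return subjects[best_name]
```

For instance, an OCR result that drops one letter of a hotel name still scores well above 0.8 against the correct reference subject name, while an unrelated name returns `None` and leaves the order information to be handled by another source.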
When the second reference information corresponding to each reference subject name is stored in the setting database, the method for determining the target subject name is similar to the method described above, except that after the target subject name is determined, the order reference information corresponding to the order information to be corrected is obtained according to the reference unit information of each level corresponding to the target subject name and the second reference information corresponding to the target subject name.
According to the reference unit information of each level corresponding to the target subject name and the second reference information corresponding to the target subject name, the complete information of the target subject name can be obtained, and the order information to be corrected is corrected according to the complete information, so that more accurate and complete target order information can be obtained.
In some embodiments, the order reference information corresponding to the order information to be corrected may also be acquired from the internet according to the target search information.
In an example, a search may be performed on the internet according to a part of the content of the order information to be corrected, for example, the subject name or at least one element, to obtain order reference information corresponding to at least one subject name; the order reference information corresponding to each subject name is then matched with the order information to be corrected, and the reference information whose matching score is the highest and exceeds a second set threshold is obtained.
Still taking the order information to be corrected as the address information of a hotel order as an example, the target search information may include a zip code included in the address information and/or administrative area information of one of the levels. By searching according to the subject name corresponding to the address information, namely the hotel name, together with at least one element included in the order information to be corrected, a plurality of pieces of address information that may be the hotel address can be acquired from the internet. The address information acquired from the internet is fuzzily matched with the order information to be corrected component by component, and the address information whose matching score is the highest and exceeds the second set threshold can be used as the order reference information for correcting the order information to be corrected, so that more accurate hotel address information can be obtained.
In the case where the matching scores of two or more pieces of address information are the same, any one of the pieces of address information may be retained and the other pieces of address information may be deleted.
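A minimal sketch of the component-wise fuzzy matching and tie handling described above is given below. The scoring rule (the average `difflib.SequenceMatcher` ratio over the address components present in both records), the component keys such as `city` and `zip`, the function names, and the threshold value are all assumptions made for illustration; the disclosure does not prescribe them.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

Address = Dict[str, str]  # component name -> value, e.g. {"city": ..., "zip": ...}


def component_score(candidate: Address, ocr: Address) -> float:
    """Average fuzzy similarity over components present in both addresses."""
    keys = [k for k in ocr if k in candidate]
    if not keys:
        return 0.0
    total = sum(
        SequenceMatcher(None, candidate[k].lower(), ocr[k].lower()).ratio()
        for k in keys
    )
    return total / len(keys)


def pick_reference(candidates: List[Address], ocr: Address,
                   second_threshold: float = 0.8) -> Optional[Address]:
    """Keep the best-scoring candidate if it exceeds the second threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = component_score(cand, ocr)
        # Strict '>' keeps the first of several equally scored candidates
        # and drops the rest, which is one way of retaining "any one" of them.
        if score > best_score:
            best, best_score = cand, score
    return best if best_score > second_threshold else None
```

The candidate returned by `pick_reference` may contain components (for example a street) that are missing from the recognized address, which is how an incomplete address can be completed.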
In the embodiment of the present disclosure, the organization and storage of each level of administrative district and postal code corresponding to the administrative district in the address database can be set according to the regulations of the target country, so that the correction method can be easily extended to the correction of the itinerary information of the destination of any country.
In some embodiments, the search according to the target search information may be performed first in the setting database and then on the internet.
When the target subject name matching the subject name corresponding to the order information to be corrected does not exist in the setting database, the reference information corresponding to the order information to be corrected and the subject name corresponding to the order information to be corrected, which are acquired from the internet, may be added to the information corresponding to the reference unit information at the lowest level in the setting database, that is, the subject name is added to the reference subject name corresponding to the reference unit information at the corresponding lowest level. For the setting database of the tree structure, the subject name corresponding to the order information to be corrected and the order reference information are stored in the leaf nodes of the tree structure, and become the newly added reference subject name and the corresponding first reference information.
In the case where the reference information acquired from the setting database is inconsistent with the reference information acquired from the internet, the information corresponding to the reference unit information of the lowest hierarchy in the setting database may be updated according to the reference information corresponding to the order information to be corrected acquired from the internet and the subject name corresponding to the order information to be corrected. That is, the reference information acquired from the internet for the order information to be corrected is used to replace the reference information of the target subject name corresponding to the reference unit information of the lowest level in the setting database. For a setting database with a tree structure, this means that the reference information originally stored for the reference subject name in the leaf node of the tree structure is replaced with the reference information acquired from the internet, so that the reference information of the reference subject name is updated.
In one example, before the reference information of the reference subject name is updated, the latest update time of the reference information acquired from the internet for the order information to be corrected may be obtained, and whether to update the reference information of the reference subject name may be determined based on that update time. For example, if the latest update time is within a set time range, such as within the last year or within the last 6 months, the update may be performed; conversely, if the latest update time is outside the set time range, a prompt message may be output so that a technician decides whether to perform the update, thereby avoiding an erroneous update.
In the embodiment of the present disclosure, the reference information acquired from the internet is used to add to and update the setting database, so that the reliability and accuracy of the reference information acquired from the setting database can be improved, and the order information to be corrected can be corrected more accurately.
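The add and update rules, together with the freshness check, can be sketched as below. The function name `reconcile`, the returned status strings, and the one-year default are hypothetical; the sketch assumes that the freshness check applies only to replacing an existing entry, since the text mentions it only for updates.

```python
from datetime import datetime, timedelta
from typing import Dict


def reconcile(subjects: Dict[str, str],
              subject_name: str,
              web_info: str,
              web_last_updated: datetime,
              now: datetime,
              max_age: timedelta = timedelta(days=365)) -> str:
    """Add or update one leaf entry of the setting database.

    subjects maps reference subject names to reference information for one
    lowest-level reference unit. Returns a status string for the caller.
    """
    if subject_name not in subjects:
        # No matching reference subject name: add a new leaf entry.
        subjects[subject_name] = web_info
        return "added"
    if subjects[subject_name] == web_info:
        return "unchanged"
    if now - web_last_updated <= max_age:
        # Internet information is recent enough: replace the stored entry.
        subjects[subject_name] = web_info
        return "updated"
    # Stale internet information: leave the entry and ask a technician.
    return "needs_review"
```

Returning a status rather than printing a prompt keeps the decision logic separate from however the prompt message is actually delivered to the technician.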
When applying for a visa for outbound travel, hotel information needs to be filled in and a hotel itinerary needs to be provided for examination. Text recognition and information extraction for hotel itineraries can reduce cumbersome manual filling and simplify the review process; however, because the accuracy of OCR is limited, errors may occur in the information extracted from the OCR results.
In the related art, an N-gram model is usually used for correcting a text recognition result. However, training of the N-gram model depends on a thesaurus, and a thesaurus of address information, especially of foreign names, is usually incomplete, so the N-gram model does not perform well enough when correcting text recognition results of orders such as hotel orders.
By applying the order information correction method provided by at least one embodiment of the disclosure to automatic visa processing, the hotel address information in the text recognition result of a hotel itinerary can be corrected, for example, erroneous information in the hotel address is corrected or an incomplete hotel address is completed, so that the accuracy and reliability of automatic visa information filling are improved, the user experience is improved, and the approval process is accelerated. In addition, the correction method can perform the correction using reference information acquired from the internet, or update the setting database according to reference information acquired from the internet, so that the problem of an incomplete thesaurus can be alleviated and a better correction effect can be obtained.
In the embodiment of the present disclosure, the order information to be corrected at least includes address information and identity information, in which case, the order information to be corrected may be obtained from a text recognition result of the order to be processed by the following method.
Firstly, a text recognition result of the order is obtained, and the text recognition result comprises a plurality of text boxes.
Next, a first text box containing key information is determined from the plurality of text boxes. The key information may include a part of the content of the order information to be corrected, where the part of the content includes at least one of the following: at least one element of the order information to be corrected, and a keyword indicating the order information to be corrected.
In the case that the order information to be corrected is address information, the key information may include the element "zip code" in the address information, and in the case that the region to which the address information belongs is known, the number of digits of the zip code can be determined. Taking the order information to be corrected as a Thailand address as an example, since a Thailand zip code is a 5-digit number, it can be determined that the key information is a 5-digit number. In this step, a text box containing a 5-digit number is determined as a first text box. The recognized content may also include digit strings longer than 5 digits, for example, a text box containing an 8-digit number; to avoid extra discrimination operations, only text boxes containing a number of exactly 5 digits may be determined as the first text box in practical applications.
In some embodiments, a found zip code may also be looked up in a list of the zip codes of the region to which it is expected to belong, to confirm that the found zip code is indeed a zip code of that region.
In the case that the region to which the address information belongs is unknown, the numbers of digits of postal codes around the world can be taken into account, and the key information can be determined as a number of 4 to 9 digits. In this step, each text box containing a number of 4 to 9 digits is determined as a first text box. In one possible implementation, to avoid extra discrimination operations, only text boxes containing a number of exactly 4 to 9 digits may be determined as the first text box, that is, text boxes containing numbers of 10 or more digits are disregarded.
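As an illustration of the zip code rules above, the following sketch finds candidate first text boxes with a regular expression. It assumes that "containing only N digits" means a run of exactly N digits that is not part of a longer digit string; the function names and sample texts are hypothetical.

```python
import re
from typing import List, Optional, Pattern


def zip_pattern(digits: Optional[int] = None) -> Pattern[str]:
    """Pattern for an isolated digit run.

    digits=None models an unknown region (4 to 9 digits);
    digits=5 models, for example, a Thailand address.
    """
    if digits is None:
        return re.compile(r"(?<!\d)\d{4,9}(?!\d)")
    # The lookarounds reject runs that are part of a longer number.
    return re.compile(rf"(?<!\d)\d{{{digits}}}(?!\d)")


def find_first_boxes(texts: List[str],
                     digits: Optional[int] = None) -> List[int]:
    """Return the indices of text boxes that contain a zip code candidate."""
    pattern = zip_pattern(digits)
    return [i for i, text in enumerate(texts) if pattern.search(text)]
```

A candidate found this way could additionally be checked against a zip code list of the region, as described above, before it is accepted as key information.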
The key information may further include administrative district information, which is an element of the address information, for example, "Thailand"; in this case, among the plurality of text boxes, a text box containing text content such as "Thailand" is determined as the first text box.
The key information may further include a keyword indicating the order information to be corrected. Taking the order information to be corrected as an address as an example, the keywords include "address" and keywords indicating an address in other languages. In the present application, the form of the keyword is not limited, and may include various expressions such as a full name and an abbreviation.
And then, combining at least part of the text boxes according to the first text box to obtain a combined text box.
In the embodiment of the present disclosure, the text box to be merged is determined based on the first text box. For example, the text box to be merged may be determined according to the position relationship with the first text box, and the text box to be merged may be merged to obtain a merged text box.
And finally, acquiring the order information to be corrected from the combined text box.
The order information to be corrected can be extracted from the combined text box according to the content contained in the combined text box or the format information of the combined text box, or according to the content contained in the combined text box and the format information of the combined text box.
In the embodiment of the disclosure, by determining a first text box containing key information in a plurality of text boxes contained in a text recognition result of an order to be processed, merging at least part of the text boxes according to the first text box to obtain a merged text box, and acquiring order information to be corrected from the merged text box, efficient information processing can be performed in the order to be processed according to the key information in the order information to be corrected.
In some embodiments, the text boxes may be merged in the following manner, resulting in a merged text box.
First, a positional relationship between each of the plurality of text boxes other than the first text box and the first text box is acquired. The positional relationship includes an orientation relationship of the other text box (i.e., any one of the text boxes other than the first text box or the specified text box) to the first text box, for example, above or below the first text box, and a distance to the first text box, for example, a pixel distance in a vertical direction to the first text box, and a pixel distance in a horizontal direction. Wherein the distance between the text boxes is determined according to the distance between the center points of the two text boxes.
Then, among the other text boxes, each text box whose positional relationship with the first text box falls within a set range is determined as a second text box. For example, a text box above the first text box may be determined as a second text box, or a text box whose pixel distance from the first text box in the vertical direction is within a set threshold may be determined as a second text box, and so on.
Then, the first text box and the second text boxes are taken as the text boxes to be merged and are merged to obtain the merged text box.
In the embodiment of the present disclosure, the text boxes to be merged are determined according to the positional relationship between the plurality of text boxes in the text recognition result and the first text box containing the key information, and the text boxes to be merged are then merged. In this way, the objects to be merged are narrowed to a range related to the order information to be corrected, the amount of information to be processed is reduced, and the information processing efficiency is improved.
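The selection of second text boxes by positional relationship can be sketched as follows. The sketch assumes image coordinates in which y grows downward, measures the distance between center points as stated above, and uses one particular set range (at or above the first text box and within `max_dy` pixels vertically); the class `TextBox`, the function name, and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextBox:
    text: str
    x0: float  # left
    y0: float  # top (image coordinates, y grows downward)
    x1: float  # right
    y1: float  # bottom

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def select_second_boxes(boxes: List[TextBox], first: TextBox,
                        max_dy: float) -> List[TextBox]:
    """Pick boxes on the same line as or above `first`, within max_dy pixels."""
    _, first_cy = first.center
    selected = []
    for box in boxes:
        if box is first:
            continue
        _, cy = box.center
        dy = first_cy - cy          # positive when the box is above `first`
        if 0 <= dy <= max_dy:
            selected.append(box)
    return selected
```

The first text box together with the boxes returned here would form the text boxes to be merged.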
The merging of the text boxes to be merged may be performed on a line basis. That is, the text boxes to be merged are merged according to the line to which each text box belongs in the text boxes to be merged to obtain the merged text box.
And under the condition that the number of the text boxes belonging to the same line in the text boxes to be merged is one, determining one text box belonging to the same line as one merged text box.
And under the condition that the number of the text boxes belonging to the same line in the text boxes to be merged is multiple, merging the multiple text boxes belonging to the same line to obtain a merged text box.
Fig. 3A shows an exemplary merged result. As shown in FIG. 3A, the combined text box comprises a plurality of lines of combined text boxes, including combined text boxes 301-303, wherein each line of combined text box is obtained by combining one or more text boxes contained in the line.
In the embodiment of the present disclosure, the text boxes to be merged are merged according to the line to which each text box belongs, so that a merged text box corresponding to each line is obtained, which facilitates subsequent information processing.
In some embodiments, for a plurality of text boxes belonging to the same line, merging two adjacent text boxes if the distance between the two adjacent text boxes is smaller than a first threshold; and combining every two adjacent text boxes meeting the conditions in the same line to obtain a combined text box corresponding to the line. The first threshold may be specifically determined according to format characteristics of the order information to be corrected.
For a plurality of text boxes belonging to the same line, in the case that the distance between adjacent text boxes is greater than or equal to the first threshold, it indicates that the two adjacent text boxes may not be related contents, but belong to different order information to be corrected, and therefore the adjacent text boxes are not merged.
And under the condition that more than one combined text box is obtained by combining the adjacent text boxes in the same line, determining the combined text box corresponding to the line according to the position relation between the obtained combined text box and the first text box. For example, the merged text box closest to the first text box in the horizontal direction is taken as the final merged text box.
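The line-wise merging rule can be sketched as below for the text boxes of a single line. As a simplifying assumption, the distance between two adjacent text boxes is taken here as the horizontal gap between their facing edges, and each box is reduced to a `(text, x0, x1)` tuple; the function names and sample values are hypothetical.

```python
from typing import List, Tuple

LineBox = Tuple[str, float, float]  # (text, left x, right x) within one line


def merge_line(boxes: List[LineBox], first_threshold: float) -> List[LineBox]:
    """Merge horizontally adjacent boxes whose gap is below the threshold."""
    merged: List[LineBox] = []
    for text, x0, x1 in sorted(boxes, key=lambda b: b[1]):
        if merged and x0 - merged[-1][2] < first_threshold:
            prev_text, prev_x0, prev_x1 = merged[-1]
            merged[-1] = (prev_text + " " + text, prev_x0, max(prev_x1, x1))
        else:
            # Gap too large: likely unrelated content, so start a new group.
            merged.append((text, x0, x1))
    return merged


def pick_closest(groups: List[LineBox], first_center_x: float) -> LineBox:
    """If a line yields several groups, keep the one nearest the first text box."""
    return min(groups, key=lambda g: abs((g[1] + g[2]) / 2 - first_center_x))
```

Here `merge_line` implements the first-threshold rule, and `pick_closest` implements the selection of the final merged text box for the line by horizontal proximity to the first text box.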
In the embodiment of the present disclosure, by limiting the merging condition between adjacent text boxes in the same line, text boxes with irrelevant content can be prevented from being merged into the merged text box, and the accuracy of information processing is improved.
In some embodiments, the order information to be corrected may be obtained from the merged text box according to the format characteristics of the order to be processed.
The format characteristics of the order to be processed comprise distance characteristics between each line of text, font characteristics of each line of text, position relation characteristics between the texts and the like.
According to the format characteristics, the target direction for obtaining the order information to be corrected can be determined, and the order information to be corrected can be obtained according to the target direction.
For example, when the order information to be corrected is address information and the key information is a zip code, since the zip code is usually located at the end of the address information, it can be determined that the order information to be corrected is located above the first text box, and thus a target direction for extracting the order information to be corrected can be determined, and the extraction is performed according to the target direction.
For another example, when the order information to be corrected is address information and the key information is the keyword "address" indicating the address information, since the keyword "address" is usually located at the head of the address information, it can be determined that the order information to be corrected is located below the first text box; thus a target direction for extracting the order information to be corrected can be determined, and the extraction can be performed according to the target direction.
In the embodiment of the disclosure, the target direction is determined according to the format characteristics of the order to be processed, and the order information to be corrected is obtained from the combined text box according to the target direction, so that the efficiency of information processing can be improved.
In some embodiments, the target directions include a first target direction and a second target direction, the first target direction is used to indicate a direction for traversing the merged text box in the process of locating the area where the order information to be corrected is located, and the second target direction is used to indicate a direction for reading the order information to be corrected from the area where the order information to be corrected is located.
In one example, with the first text box as a starting position, the merged text boxes are traversed in the first target direction until a merged text box containing further key information is found; then, with that further key information as a starting position, the merged text boxes are traversed in the second target direction until the merged text box containing the key information of the first text box is found, and the content traversed in the second target direction is acquired. The key information may include a keyword indicating the order information to be corrected, at least one element of the order information to be corrected, a part of the content of the order information to be corrected, and the like. Taking the order information to be corrected as an address as an example, the keywords indicating the address information include "address" and keywords indicating an address in other languages.
Referring to the exemplary merged text boxes shown in fig. 3A, the key information is the zip code 10110, and the first text box containing the zip code 10110 is taken as the starting position; that is, the merged text boxes are traversed upward from the merged text box 301 until the merged text box 302 where the key information "Address" is located is found. Then, with the key information "Address" as a starting position, the merged text boxes are traversed downward until the merged text box 301 where the key information "zip code 10110" is located is found, and the content traversed downward is acquired as the order information to be corrected. It should be noted that an English keyword such as "address" is not limited to a particular capitalization of some or all of its letters, and this can be adjusted according to the actual situation. That is, in actual recognition and other processing, variants such as "ADDRESS" and "Address" may all be recognized as "address" and processed in the same manner.
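The two-pass traversal of fig. 3A can be sketched as follows, assuming the merged text boxes have already been reduced to a list of text lines ordered from top to bottom. The function name `extract_between`, the default keyword, and the sample lines are hypothetical; the keyword comparison is case-insensitive, in line with the note on capitalization above.

```python
from typing import List, Optional


def extract_between(lines: List[str], zip_index: int,
                    keyword: str = "address") -> Optional[List[str]]:
    """Traverse upward from the zip code line to the keyword line,
    then read downward from the keyword line back to the zip code line.

    lines holds the merged text boxes from top to bottom;
    zip_index is the index of the line containing the zip code.
    """
    start = None
    # First target direction: upward, from the first text box.
    for i in range(zip_index, -1, -1):
        if keyword in lines[i].lower():
            start = i
            break
    if start is None:
        return None
    # Second target direction: downward, from the keyword to the zip code.
    return lines[start:zip_index + 1]
```

When no keyword line exists above the zip code, the function returns `None`, and a different stopping rule (for example one based on the spacing between lines) would be needed.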
In one example, the method further comprises: and acquiring the distance between the adjacent combined text boxes. Wherein the adjacent merged text box comprises two merged text boxes adjacent in a vertical direction. The plurality of merged text boxes obtained from the text recognition result include a plurality of pairs of adjacent merged text boxes. As shown in FIG. 3B, the merged text boxes 311-314 include adjacent merged text boxes 311-312, adjacent merged text boxes 312-313, and adjacent merged text boxes 313-314.
With the first text box as a starting position, the merged text boxes are traversed in the first target direction until a pair of adjacent merged text boxes whose distance satisfies a first set condition is found. Here, traversing a merged text box comprises acquiring the text content of the merged text box and acquiring the distance between the merged text box and the adjacent merged text box traversed before it. Then, with the merged text box traversed first in the pair whose distance satisfies the first set condition as a starting position, the merged text boxes are traversed in the second target direction until the merged text box where the key information is located is found, and the content traversed in the second target direction is acquired. The distance between adjacent merged text boxes satisfies the first set condition when the distance is greater than a first inter-box distance threshold.
Referring to the exemplary merged text boxes shown in fig. 3B, the key information is the zip code 10400, and the first text box containing "10400" is taken as the starting position; that is, the merged text boxes are traversed upward starting from the merged text box 311. Taking the traversal to the merged text box 312 as an example, the traversal includes acquiring the content of the merged text box 312 and acquiring the distance between the merged text box 312 and the merged text box 311. The distance between two text boxes may be the pixel distance between their center points in the vertical direction, or the pixel distance between corresponding positions of the two text boxes; for example, in the case that the two text boxes are left-aligned, the corner points at the upper left corner or the lower left corner of the two text boxes may be used as the two vertices for determining the distance, and the pixel distance between the two vertices may be used as the distance between the two text boxes. The distance between two text boxes may also be determined in other similar ways; the specific implementation is not limited in the present application and includes, but is not limited to, the above cases. In the case that the distance between the merged text box 312 and the merged text box 311 does not satisfy the first set condition, that is, the distance is less than or equal to the first inter-box distance threshold, the upward traversal continues.
When it is detected that the distance between the merged text box 314 and the merged text box 313 satisfies the first set condition, that is, the distance is greater than the first inter-box distance threshold, the upward traversal stops. Next, the merged text box 313, which is the one traversed first of the pair formed by the merged text boxes 314 and 313, is taken as the starting position, and the merged text boxes are traversed downward until the merged text box 311 containing the key information, the zip code "10400", is found; the content traversed downward is obtained as the order information to be corrected.
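The two-pass traversal described above may be sketched as follows. This is a simplified illustration only: it assumes the merged text boxes are already ordered bottom-up starting from the box containing the key information, that the inter-box distances have been precomputed, and that a single fixed threshold is used. The function and parameter names are hypothetical.

```python
def extract_region(boxes_bottom_up, gaps, threshold):
    """Return the boxes of the region, in top-down reading order.

    boxes_bottom_up[0] is the box holding the key information.
    gaps[i] is the distance between boxes_bottom_up[i] and
    boxes_bottom_up[i + 1].
    """
    last = 0
    # First pass (first target direction, here upward).
    for i, gap in enumerate(gaps):
        if gap > threshold:
            # First set condition met: the box above is outside the region.
            break
        last = i + 1
    # Second pass (second target direction, here downward): read from the
    # box where the upward pass stopped back to the key-information box.
    return list(reversed(boxes_bottom_up[: last + 1]))


# Box labels follow the reference numerals of fig. 3B; gap values are made up.
region = extract_region(["311", "312", "313", "314"], [12, 13, 40], 20)
```

Here the gap between 313 and 314 exceeds the threshold, so the region consists of the boxes 313, 312 and 311, read top-down.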
In the embodiment of the present disclosure, there is no limitation on the relationship between the directions in which the first target direction and the second target direction point, that is, the first target direction and the second target direction may be at an angle, for example, the first target direction and the second target direction may be opposite (i.e., 180 °), or the first target direction and the second target direction may be the same (i.e., 0 °).
In one example, when the key information is located at the beginning of the order information to be corrected, the first target direction may indicate a downward traversal: the merged text boxes are traversed downward until the key information is found or a pair of adjacent merged text boxes whose distance satisfies the first set condition is found. In this case, the first target direction is the same as the second target direction; the traversed region is traversed again in the second target direction, and the traversed content is obtained as the order information to be corrected.
In some embodiments, the adjacent merged text box is taken as a target adjacent merged text box, and the first inter-box distance threshold corresponding to the target adjacent merged text box is determined according to at least one of the following: the height of the first traversed merged text box in the target adjacent merged text boxes; the distance between the merged text boxes contained in the traversed adjacent merged text box and the height of the merged text box traversed first. Wherein the target adjacent merged text box is two adjacent merged text boxes for which a first inter-box distance threshold is to be determined. In an embodiment of the present disclosure, the first inter-box distance threshold corresponding to each pair of adjacent merged text boxes may be different.
In one example, the first inter-box distance threshold is determined according to a height of a first traversed one of the target adjacent merged text boxes.
Taking the first inter-box distance threshold corresponding to the adjacent merged text boxes 311 and 312 in fig. 3B as an example: because the merged text boxes are traversed from bottom to top while locating the area where the order information to be corrected is located, the adjacent merged text boxes 311 and 312 are the first pair traversed in this example, and their first inter-box distance threshold may be determined according to the height of the merged text box 311. For example, the first inter-box distance threshold is set to 0.65 * mean_height1, where mean_height1 is the height of the merged text box 311.
In one example, the first inter-box distance threshold may be determined based on a distance between the merged text boxes contained by the traversed adjacent merged text boxes and a height of the merged text box traversed first. The first traversed combined text box is the first traversed combined text box in the process of positioning the area where the order information to be corrected is located.
Taking the first inter-box distance threshold corresponding to the adjacent merged text boxes 312 and 313 in fig. 3B as an example, the threshold may be determined according to the distance between the traversed adjacent merged text boxes 311 and 312 and the height of the merged text box 311 traversed first. For example, the first inter-box distance threshold is set to threshold = mean1_distance + standard1_deviation, where mean1_distance is the distance between the adjacent merged text boxes 311 and 312, standard1_deviation is the disturbance value corresponding to the merged text boxes 311 and 312, standard1_deviation = 0.25 * height1, and height1 is, for example, the height of the merged text box 311.
In the case where more than one pair of adjacent text boxes have been traversed, taking the first inter-box distance threshold corresponding to adjacent text boxes 313 and 314 in fig. 3B as an example, the first inter-box distance threshold corresponding to target adjacent text boxes 313 and 314 may be determined based on the distance between traversed adjacent merged text boxes 311 and 312, the distance between adjacent merged text boxes 312 and 313, and the height of the first traversed merged text box 311.
In one example, the first inter-box distance threshold corresponding to the target adjacent merged text boxes may be determined as follows: acquiring an updated inter-box distance of the target adjacent merged text boxes, where the updated inter-box distance is obtained by a weighted summation of the distance between the merged text boxes contained in the reference adjacent merged text boxes and the updated inter-box distance of the reference adjacent merged text boxes, the reference adjacent merged text boxes being the pair of adjacent merged text boxes closest to the target adjacent merged text boxes; acquiring an updated disturbance value of the target adjacent merged text boxes, where the updated disturbance value is obtained by a weighted summation of the disturbance value of the adjacent merged text boxes traversed first and the absolute value of a distance difference, the distance difference being the difference between the updated inter-box distance of the target adjacent merged text boxes and the distance between the merged text boxes contained in the reference adjacent merged text boxes, and the disturbance value being determined according to the height of the merged text box traversed first; and determining the first inter-box distance threshold of the target adjacent merged text boxes according to the updated inter-box distance and the updated disturbance value.
Still taking the first inter-box distance threshold corresponding to the adjacent text boxes 313 and 314 in fig. 3B as an example, the updated inter-box distance corresponding to the adjacent text boxes 313 and 314 is obtained first: new_mean = 0.6 * mean_distance + 0.4 * mean2_distance, where mean_distance is the updated inter-box distance of the reference adjacent merged text boxes 312 and 313 and, consistent with the weighted summation described above, mean2_distance is the distance between the merged text boxes 312 and 313. In this example, the updated inter-box distance of every pair of adjacent merged text boxes is obtained in the same manner, except for the pair traversed first, whose updated inter-box distance is simply the distance between its two merged text boxes. Next, the updated disturbance value is obtained: new_deviation = 0.6 * standard1_deviation + 0.4 * abs(mean2_distance - new_mean), where standard1_deviation is the disturbance value corresponding to the merged text boxes 311 and 312, determined, for example, from the height of the merged text box 311, and mean2_distance and new_mean are as defined above. Finally, the first inter-box distance threshold corresponding to the target adjacent merged text boxes 313 and 314 is determined from the updated inter-box distance and the updated disturbance value obtained above.
It will be understood by those skilled in the art that the above numerical values of the respective parameters are only for example and are not intended to be limiting, and the numerical values of the respective parameters and the weighting coefficient values may be determined according to actual needs.
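The threshold examples above can be combined into one illustrative Python sketch. The coefficients 0.65, 0.25, 0.6 and 0.4 are the example values given above. The function name and the final step, which adds the updated inter-box distance and the updated disturbance value by analogy with threshold = mean1_distance + standard1_deviation, are assumptions of this sketch, since the present disclosure does not fix how the two values are combined.

```python
def next_threshold(observed_gaps, height1):
    """First inter-box distance threshold for the next pair of boxes.

    observed_gaps: distances of the adjacent pairs already traversed,
                   in traversal order (empty for the first pair).
    height1:       height of the merged text box traversed first.
    """
    if not observed_gaps:
        # First pair: threshold depends only on the first box height.
        return 0.65 * height1

    deviation1 = 0.25 * height1      # disturbance value of the first pair
    mean = observed_gaps[0]          # updated distance of the first pair
    if len(observed_gaps) == 1:
        return mean + deviation1

    # Each later pair blends the previous updated distance with the
    # distance observed in its reference (closest traversed) pair.
    for gap in observed_gaps[1:]:
        mean = 0.6 * mean + 0.4 * gap

    deviation = 0.6 * deviation1 + 0.4 * abs(observed_gaps[-1] - mean)
    # Assumption: combine the two values by addition.
    return mean + deviation
```

Because each threshold is derived from the gaps actually observed, a document with looser line spacing yields a larger threshold than one with tight spacing, which is the fault tolerance the disturbance value is meant to provide.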
For the merged text boxes shown in fig. 3B, when the above method of determining the first inter-box distance threshold is applied during the upward traversal from the merged text box 311, it is detected that the distance between the merged text box 314 and the merged text box 313 is greater than the corresponding first inter-box distance threshold, so the traversal stops. The merged text box 313, which is the one traversed first of the pair formed by the merged text boxes 314 and 313, is then taken as the starting position, and the merged text boxes are traversed downward until the merged text box 311 where the key information is located is found; the content traversed downward is acquired.
In the embodiment of the present disclosure, the disturbance value is set for the distance threshold, and the current distance threshold is updated according to the traversed distance between the adjacent merged text boxes and the merged text box traversed first, so that the fault tolerance of the information extraction method provided in the embodiment of the present disclosure is improved, and the order information to be corrected can be extracted more effectively.
In some embodiments, after the order information to be corrected is extracted, the subject name corresponding to the order information to be corrected may be determined from the merged text boxes outside the area where the order information to be corrected is located, according to the target direction and the positional relationship with that area.
In files of various formats, the text box closest to the area where the extracted order information to be corrected is located is usually the text box of the subject name corresponding to that information. Taking the partial screenshot of the hotel order shown in fig. 3B as an example, the text box above the extracted address information contains the name of the hotel, which is the subject of the address information. The same applies to files such as business cards and shopping orders: the text box closest to the area where the address information, identity information, or the like is located is the text box containing the name of the subject of that information.
In one example, the subject name corresponding to the order information to be corrected may be determined by the following method.
First, the merged text box closest to the area where the order information to be corrected is located in the first target direction is determined. Next, with that merged text box as the starting position, the merged text boxes are traversed in the first target direction until a pair of adjacent merged text boxes whose distance satisfies a second set condition is found. Then, of the pair of adjacent merged text boxes whose distance satisfies the second set condition, the merged text box traversed first is taken as the starting position, the merged text boxes outside the area where the order information to be corrected is located are traversed in the second target direction, and the content traversed in the second target direction is acquired.
Taking the merged text boxes shown in fig. 3C as an example, the content contained in the merged text boxes 321 to 322 is the order information to be corrected extracted according to the order information correction method described in any embodiment of the present disclosure, and the area where the merged text boxes 321 to 322 are located may be determined as the area where the order information to be corrected is located. Among the merged text boxes determined according to the text recognition result, other than the merged text boxes 321 to 322, the merged text box closest to the area where the order information to be corrected is located in the first target direction (the direction of the search traversal, upward in this example) is the merged text box 323 (characters in a non-target language exist between the merged text box 322 and the merged text box 323 and are ignored, as shown in the gray part). The merged text boxes are traversed upward, starting with the merged text box 323. Because the distance between the merged text box 323 and the adjacent merged text box above it exceeds the second inter-box distance threshold, the second set condition is satisfied (the second set condition is also considered satisfied when there is no other merged text box above the merged text box 323). The merged text box 323 is therefore taken as the starting position, and the merged text boxes outside the area where the order information to be corrected is located are traversed downward, which in this example covers only the merged text box 323. The content "XXXXXX Hotel" in that merged text box is thus determined as the subject name of the order information to be corrected.
In some embodiments, when the merged text boxes are traversed in the first target direction with the merged text box as the starting position, any merged text box that is not above the area where the order information to be corrected is located is ignored; that is, any merged text box that does not intersect, in the horizontal direction, with the merged text boxes where the order information to be corrected is located is ignored.
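One possible way to express the horizontal-intersection check is sketched below. Representing each box and the region by a (left, right) pair of x-coordinates in pixels, and the two function names, are assumptions of this illustration.

```python
def intersects_horizontally(box, region):
    """True if the x-extents of the box and the region overlap.

    box, region: (left, right) x-coordinates in pixels.
    """
    box_left, box_right = box
    region_left, region_right = region
    return box_left < region_right and region_left < box_right


def candidates_above(boxes, region):
    """Keep only boxes whose x-extent overlaps the region's x-extent."""
    return [b for b in boxes if intersects_horizontally(b, region)]
```

A box lying entirely to the left or right of the region, such as a logo or a page number in a margin, is dropped before the subject-name traversal begins.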
In one example, where the traversed merged text box contains a closing bracket ")" but no matching opening bracket "(", the distance condition between adjacent merged text boxes may be ignored, and the merged text boxes continue to be traversed in the first target direction until the opening bracket "(" is found; whether to stop traversing is then determined based on the distance condition between adjacent merged text boxes.
In one example, where the currently traversed merged text box contains a complete pair of brackets "()" or contains no bracket, the second inter-box distance threshold may be set to 0.6 * mean_height, where mean_height is the average height of the adjacent merged text boxes. Those skilled in the art will appreciate that the above coefficient is an example and is not limited by the present disclosure.
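The bracket handling and the second inter-box distance threshold may be sketched together as follows. The sketch assumes an upward traversal in which a closing bracket is met before its opening bracket, uses the example coefficient 0.6, and introduces a hypothetical function name and list-based inputs for illustration.

```python
def find_stop_index(texts, heights, gaps):
    """Index of the last box kept when traversing upward for the subject name.

    texts, heights: box contents and heights, ordered bottom-up.
    gaps[i]:        distance between box i and box i + 1.
    """
    pending = 0   # closing brackets still waiting for their "("
    last = 0
    for i, text in enumerate(texts):
        pending = max(pending + text.count(")") - text.count("("), 0)
        last = i
        if i == len(gaps):
            # No box above: the second set condition counts as satisfied.
            break
        if pending > 0:
            # Unmatched ")": ignore the distance condition, keep going.
            continue
        mean_height = (heights[i] + heights[i + 1]) / 2
        if gaps[i] > 0.6 * mean_height:
            break
    return last
```

With this bookkeeping, a subject name whose bracketed part wraps onto a second line is kept whole even when the gap between its two lines would otherwise exceed the threshold.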
The information extraction method provided by any embodiment of the present disclosure can be applied to images or electronic documents of various formats, including at least one of the following: an image or electronic document (e.g., a PDF document) of a hotel order, an airplane itinerary, a passport, an identification card, and the like. By applying the information extraction method to such images or electronic documents, the order information to be corrected of the corresponding type can be extracted from them; the order information to be corrected includes at least one of the following: address information, trip information, identity information, and the like.
Fig. 4 shows a device for correcting order information according to at least one embodiment of the present disclosure. The device includes: an obtaining unit 401, configured to obtain order information to be corrected according to a text recognition result of an order; a determining unit 402, configured to determine target search information from the text recognition result; a matching unit 403, configured to obtain order reference information matched with the target search information in a preset search manner; and a correcting unit 404, configured to correct the order information to be corrected by using the order reference information to obtain target order information.
In some embodiments, the target search information includes partial content of the order information to be corrected; the partial content includes at least one of a subject name and at least one element.
In some embodiments, the matching unit is specifically configured to perform at least one of the following: accessing a setting database to acquire, from the setting database, order reference information matched with the target search information; and obtaining order reference information matched with the target search information through the Internet.
In some embodiments, the setting database includes reference unit information of a plurality of levels, and the reference unit information of the lowest level among the plurality of levels corresponds to a plurality of reference subject names.
In some embodiments, the setting database stores first reference information corresponding to a reference subject name. The determining unit is specifically configured to acquire the unit information of the lowest level in the order information to be corrected according to the level division in the setting database. Obtaining the order reference information matched with the target search information from the setting database includes: determining, among the reference unit information of the lowest level in the setting database, target unit information matched with the unit information of the lowest level in the order information to be corrected; determining, among the plurality of reference subject names corresponding to the target unit information, a target subject name that meets a preset condition; and obtaining the order reference information matched with the target search information according to the first reference information corresponding to the target subject name.
In some embodiments, the setting database stores second reference information corresponding to a reference subject name. The matching unit is specifically configured to: acquire the unit information of the lowest level in the order information to be corrected according to the level division in the setting database; determine, among the reference unit information of the lowest level in the setting database, target unit information matched with the unit information of the lowest level in the order information to be corrected; determine, among the plurality of reference subject names corresponding to the target unit information, a target subject name that meets a preset condition; and obtain the order reference information matched with the target search information according to the reference unit information of each level corresponding to the target subject name and the second reference information corresponding to the target subject name.
In some embodiments, when determining, among the plurality of reference subject names corresponding to the target unit information, the target subject name that meets the preset condition, the matching unit is specifically configured to: match the subject name corresponding to the order information to be corrected against each of the plurality of reference subject names corresponding to the target unit information; and determine the reference subject name that has the highest matching score and whose matching score exceeds a first set threshold as the target subject name.
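As a non-limiting sketch, the database lookup and subject-name matching described above might look as follows in Python. The nested-dictionary layout of the setting database, the use of a postal code as the lowest-level unit, the similarity measure (the standard library's difflib.SequenceMatcher ratio), the threshold value 0.8, and all names and sample entries are assumptions of this illustration; the present disclosure does not prescribe a particular scoring function.

```python
from difflib import SequenceMatcher

# Hypothetical setting database: lowest-level unit (here a postal code)
# -> reference subject names -> first reference information.
SETTING_DB = {
    "10400": {
        "XXXXXX Hotel": "reference address A",
        "YYYYYY Hostel": "reference address B",
    },
}


def find_reference(db, lowest_unit, subject_name, first_threshold=0.8):
    """Return the reference information of the best-matching subject name.

    Returns None when the unit is unknown or when no reference subject
    name scores above first_threshold.
    """
    candidates = db.get(lowest_unit, {})
    best_name, best_score = None, 0.0
    for name in candidates:
        score = SequenceMatcher(None, subject_name.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score <= first_threshold:
        return None
    return candidates[best_name]


# A recognition error ("1" instead of "l") still matches the right entry.
info = find_reference(SETTING_DB, "10400", "XXXXXX Hote1")
```

Restricting the comparison to the subject names registered under the matched lowest-level unit keeps the candidate set small, so even a simple string similarity can separate the intended entry from the others.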
In some embodiments, the matching unit is specifically configured to: search the Internet according to partial content of the order information to be corrected to obtain at least one piece of reference information matched with the target search information; match the reference information corresponding to the target search information against the order information to be corrected; and acquire, as the order reference information, the reference information that has the highest matching score and whose matching score exceeds a second set threshold.
In some embodiments, the apparatus further includes an adding unit, configured to add the order reference information acquired from the internet and a subject name corresponding to the order information to be corrected to information corresponding to a reference unit information at a lowest level in the setting database.
In some embodiments, the apparatus further includes an updating unit, configured to update information corresponding to reference unit information at a lowest level in the setting database according to the order reference information acquired from the internet and a subject name corresponding to the order information to be corrected.
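The adding and updating operations described above can be illustrated with a single helper. The nested-dictionary layout (lowest-level unit, then subject name, then reference information) and the function name are assumptions of this sketch, not a structure defined by the present disclosure.

```python
def upsert_reference(db, lowest_unit, subject_name, reference_info):
    """Add a new entry, or overwrite an existing one, under a lowest-level unit.

    db maps lowest-level unit -> subject name -> reference information.
    """
    db.setdefault(lowest_unit, {})[subject_name] = reference_info
    return db


db = {}
upsert_reference(db, "10400", "XXXXXX Hotel", "address found online")   # add
upsert_reference(db, "10400", "XXXXXX Hotel", "newer address")          # update
```

Storing results obtained from the Internet in this way lets later orders for the same subject be corrected from the setting database without another search.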
In some embodiments, the order information to be corrected includes at least address information, and the at least one element included in the order information to be corrected includes at least one of an administrative region and a postal code; the reference unit information of the plurality of levels included in the setting database includes reference administrative region information or postal code information.
In some embodiments, the obtaining unit is specifically configured to: acquire a text recognition result of the object to be processed, the text recognition result including a plurality of text boxes; determine, from the plurality of text boxes, a first text box containing key information, where the key information includes partial content of the order information to be corrected, and the partial content includes at least one of: at least one element in the order information to be corrected, and a keyword indicating the order information to be corrected; merge at least some of the text boxes according to the first text box to obtain merged text boxes; and acquire the order information to be corrected from the merged text boxes.
According to an aspect of the present disclosure, an electronic device is provided, which includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement a method for correcting order information according to any embodiment of the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for correcting order information according to any one of the embodiments of the present disclosure.
According to the order information correction method, device, equipment and storage medium of one or more embodiments of the present disclosure, order information to be corrected is obtained according to a text recognition result of an order, target search information is determined from the text recognition result, order reference information matched with the target search information is obtained in a preset search mode, the order reference information is used to correct the order information to be corrected to obtain the target order information, and accurate target order information can be quickly obtained from the text recognition result of the order.
Fig. 5 is an electronic device provided in at least one embodiment of the present disclosure, and the electronic device includes a memory and a processor, where the memory is used to store computer instructions executable on the processor, and the processor is used to implement the method for correcting order information according to any embodiment of the present disclosure when executing the computer instructions.
At least one embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method for correcting order information according to any one of the embodiments of the present disclosure.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (15)

1. A method for correcting order information, which is characterized in that the method comprises the following steps:
obtaining order information to be corrected according to a text recognition result of the order;
determining target search information from the text recognition result;
acquiring order reference information matched with the target search information in a preset search mode;
and correcting the order information to be corrected by using the order reference information to obtain target order information.
2. The method according to claim 1, wherein the target search information includes a part of the content of the order information to be corrected;
the partial content includes at least one of a subject name and at least one element.
3. The method according to claim 1 or 2, wherein the obtaining of the order reference information matched with the target search information by a preset search mode includes at least one of the following:
accessing a setting database to acquire order reference information matched with the target search information from the setting database;
and obtaining order reference information matched with the target search information through the Internet.
4. The method of claim 3, wherein the setting database comprises reference unit information of a plurality of levels, and each piece of reference unit information of the lowest level of the plurality of levels corresponds to a plurality of reference subject names.
5. The method according to claim 4, wherein the setting database stores first reference information corresponding to a reference subject name;
the determining of target search information from the text recognition result includes:
acquiring unit information of the lowest level in the order information to be corrected according to the level division in the setting database;
the obtaining of the order reference information matched with the target search information from the setting database includes:
determining, among the reference unit information of the lowest level in the setting database, target unit information that matches the unit information of the lowest level in the order information to be corrected;
determining, among a plurality of reference subject names corresponding to the target unit information, a target subject name that meets a preset condition;
and obtaining order reference information matched with the target search information according to the first reference information corresponding to the target subject name.
6. The method according to claim 4, wherein the setting database stores second reference information corresponding to a reference subject name;
the determining of target search information from the text recognition result includes:
acquiring unit information of the lowest level in the order information to be corrected according to the level division in the setting database;
the obtaining of the order reference information matched with the target search information from the setting database includes:
determining, among the reference unit information of the lowest level in the setting database, target unit information that matches the unit information of the lowest level in the order information to be corrected;
determining, among a plurality of reference subject names corresponding to the target unit information, a target subject name that meets a preset condition;
and obtaining order reference information matched with the target search information according to the reference unit information of each hierarchy corresponding to the target subject name and the second reference information corresponding to the target subject name.
7. The method according to claim 5 or 6, wherein the determining a target subject name meeting a preset condition from a plurality of reference subject names corresponding to the target unit information comprises:
matching the subject name corresponding to the order information to be corrected against each of the plurality of reference subject names corresponding to the target unit information;
and determining, as the target subject name, the reference subject name whose matching score is the highest and exceeds a first set threshold.
8. The method according to any one of claims 3 to 7, wherein the obtaining of the order reference information matching the target search information via the internet comprises:
searching the Internet according to partial content of the order information to be corrected to obtain at least one piece of reference information matched with the target search information;
matching the at least one piece of reference information with the order information to be corrected;
and acquiring, as the order reference information, the reference information whose matching score is the highest and exceeds a second set threshold.
9. The method of claim 8, further comprising:
adding the order reference information acquired from the Internet and the subject name corresponding to the order information to be corrected into the information corresponding to the reference unit information of the lowest level in the setting database.
10. The method of claim 8, further comprising:
updating the information corresponding to the reference unit information of the lowest level in the setting database according to the order reference information acquired from the Internet and the subject name corresponding to the order information to be corrected.
11. The method according to any one of claims 4 to 10, wherein the order information to be corrected includes at least address information, and at least one element included in the order information to be corrected includes at least one of: an administrative area and a postal code; and the reference unit information of multiple levels included in the setting database comprises reference administrative area information and/or postal code information.
12. The method according to claim 11, wherein the obtaining the order information to be corrected according to the text recognition result of the order comprises:
acquiring a text recognition result of the order, wherein the text recognition result comprises a plurality of text boxes;
determining a first text box containing key information from the plurality of text boxes, wherein the key information comprises partial content of the order information to be corrected, and the partial content comprises at least one of: at least one element in the order information to be corrected, and a keyword indicating the order information to be corrected;
combining at least part of the text boxes according to the first text box to obtain a combined text box;
and acquiring the order information to be corrected from the combined text box.
13. An apparatus for correcting order information, the apparatus comprising:
an acquisition unit configured to obtain order information to be corrected according to a text recognition result of an order;
a determination unit configured to determine target search information from the text recognition result;
a matching unit configured to acquire, by a preset search mode, order reference information that matches the target search information;
and a correcting unit configured to correct the order information to be corrected by using the order reference information to obtain target order information.
14. An electronic device, comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method of any one of claims 1 to 12 when executing the computer instructions.
15. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 12.
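
As an illustration of the database lookup path recited in claims 1, 5 and 7, the following is a minimal Python sketch, not the claimed implementation. It assumes that the lowest-level unit is a postal code, that the first reference information is a reference address, and that matching is done with the standard-library `difflib.SequenceMatcher`. The names `SETTING_DB`, `Order`, `match_score`, `find_target_subject` and `correct_order`, the sample data, and the threshold value 0.8 are all hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical "setting database": each lowest-level unit (assumed here to be
# a postal code) maps reference subject names to their reference address.
SETTING_DB = {
    "100080": {
        "Acme Trading Co.": "1 Example Road, Haidian District, Beijing 100080",
        "Apex Foods Ltd.": "8 Sample Street, Haidian District, Beijing 100080",
    },
}

FIRST_THRESHOLD = 0.8  # assumed value standing in for the "first set threshold"


@dataclass
class Order:
    subject_name: str  # subject name from text recognition, possibly misread
    postal_code: str   # lowest-level unit information of the order
    address: str       # order information to be corrected


def match_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is a stand-in for any matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_target_subject(order: Order) -> Optional[str]:
    """Return the highest-scoring reference subject name if it exceeds the threshold."""
    candidates = SETTING_DB.get(order.postal_code, {})
    if not candidates:
        return None
    best = max(candidates, key=lambda name: match_score(order.subject_name, name))
    if match_score(order.subject_name, best) > FIRST_THRESHOLD:
        return best
    return None


def correct_order(order: Order) -> Order:
    """Overwrite name and address with the reference entry when a match exists."""
    target = find_target_subject(order)
    if target is None:
        return order  # no reliable reference found: leave the order unchanged
    reference_address = SETTING_DB[order.postal_code][target]
    return Order(target, order.postal_code, reference_address)
```

With this sample data, calling `correct_order(Order("Acme Tradlng Co.", "100080", "1 Exarnple Rd"))` returns an order carrying the reference name and address, because the misread name still scores above the assumed threshold. An order whose postal code or subject name has no sufficiently close reference entry is returned unchanged, which is where a fallback such as the Internet search of claim 8 could take over.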
CN202011339777.2A 2020-11-25 2020-11-25 Order information correction method, device, equipment and storage medium Pending CN112395874A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011339777.2A CN112395874A (en) 2020-11-25 2020-11-25 Order information correction method, device, equipment and storage medium
PCT/IB2021/055848 WO2022112857A1 (en) 2020-11-25 2021-06-30 Method and apparatus for correcting order information, and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011339777.2A CN112395874A (en) 2020-11-25 2020-11-25 Order information correction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112395874A (en) 2021-02-23

Family

ID=74603919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011339777.2A Pending CN112395874A (en) 2020-11-25 2020-11-25 Order information correction method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112395874A (en)
WO (1) WO2022112857A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120322A (en) * 2022-01-26 2022-03-01 深圳爱莫科技有限公司 Order commodity quantity identification result correction method and processing equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090248671A1 (en) * 2008-03-28 2009-10-01 Daisuke Maruyama Information classification system, information processing apparatus, information classification method and program
CN110442702A (en) * 2019-08-15 2019-11-12 北京上格云技术有限公司 Searching method, device, readable storage medium storing program for executing and electronic equipment
CN110674396A (en) * 2019-08-28 2020-01-10 北京三快在线科技有限公司 Text information processing method and device, electronic equipment and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050137991A1 (en) * 2003-12-18 2005-06-23 Bruce Ben F. Method and system for name and address validation and correction
WO2009005492A1 (en) * 2007-06-29 2009-01-08 United States Postal Service Systems and methods for validating an address
CN107239453B (en) * 2016-03-28 2020-10-02 平安科技(深圳)有限公司 Information writing method and device
CN109784235A (en) * 2018-12-29 2019-05-21 广东益萃网络科技有限公司 Method for automatically inputting, device, computer equipment and the storage medium of paper form

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120322A (en) * 2022-01-26 2022-03-01 深圳爱莫科技有限公司 Order commodity quantity identification result correction method and processing equipment
CN114120322B (en) * 2022-01-26 2022-05-10 深圳爱莫科技有限公司 Order commodity quantity identification result correction method and processing equipment

Also Published As

Publication number Publication date
WO2022112857A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
EP3971731A1 (en) Fence address-based coordinate data processing method and apparatus, and computer device
CN109145281B (en) Speech recognition method, apparatus and storage medium
US7769778B2 (en) Systems and methods for validating an address
CN110909725A (en) Method, device and equipment for recognizing text and storage medium
US9645979B2 (en) Device, method and program for generating accurate corpus data for presentation target for searching
CN109739997B (en) Address comparison method, device and system
US20120102002A1 (en) Automatic data validation and correction
CN110516011B (en) Multi-source entity data fusion method, device and equipment
US20190114313A1 (en) User interface for contextual document recognition
CN111522901B (en) Method and device for processing address information in text
CN107748778B (en) Method and device for extracting address
CN111652176B (en) Information extraction method, device, equipment and storage medium
CN111931077A (en) Data processing method and device, electronic equipment and storage medium
CN112395418A (en) Method and device for extracting target object in webpage and electronic equipment
CN115470307A (en) Address matching method and device
CN112395874A (en) Order information correction method, device, equipment and storage medium
WO2009005492A1 (en) Systems and methods for validating an address
JP5192413B2 (en) Data integration apparatus and data integration method
CN113761137A (en) Method and device for extracting address information
CN111178349A (en) Image identification method, device, equipment and storage medium
US20220121881A1 (en) Systems and methods for enabling relevant data to be extracted from a plurality of documents
CN112396056B (en) Method for high-accuracy line division of text image OCR result
JPH08221510A (en) Device and method for processing form document
CN115185986A (en) Method and device for matching provincial and urban area address information, computer equipment and storage medium
JP2655087B2 (en) Character recognition post-processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40039018; Country of ref document: HK)