WO2020140698A1 - Method, device and server for acquiring table data - Google Patents

Method, device and server for acquiring table data

Info

Publication number
WO2020140698A1
WO2020140698A1 · PCT/CN2019/124101 · CN2019124101W
Authority
WO
WIPO (PCT)
Prior art keywords
morphological
rectangular
coordinates
image data
image
Prior art date
Application number
PCT/CN2019/124101
Other languages
English (en)
French (fr)
Inventor
张林江
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Publication of WO2020140698A1

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06V  IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00  Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40  Document-oriented image-based pattern recognition
    • G06V30/41  Analysis of document content
    • G06V30/413  Classification of content, e.g. text, photographs or tables

Definitions

  • This specification belongs to the field of Internet technology, and particularly relates to a method, device and server for acquiring table data.
  • For this type of text data (for example, contract documents), the usual acquisition method is to perform optical character recognition directly on image data, such as scanned pictures containing the text data, so as to recognize and extract the text information in the image data and obtain electronic file data of the corresponding text.
  • However, the table data in the text data is different from the above-mentioned individual text characters: in addition to characters, it also has certain graphic features, for example, dividing lines. The structure of table data is therefore more complicated, and it is more difficult to recognize.
  • When the existing data acquisition method is used to identify the table data in image data, errors are likely to occur: for example, the dividers in the table may be mistakenly recognized as numbers, or the text characters in row N, column M of the table may be misaligned, and so on. Therefore, there is an urgent need for a method that can accurately identify and completely restore the table data in image data.
  • The purpose of this specification is to provide a method, device and server for acquiring table data, so as to solve the technical problem of large error and inaccuracy in the existing table-data extraction method, and to identify the table content in image data efficiently and accurately and restore it completely.
  • A method for acquiring table data includes: acquiring image data of text to be processed; extracting a combined graph from the image data, where the combined graph is a graph including morphological vertical lines and morphological horizontal lines; dividing the combined graph into multiple rectangular units, where the rectangular units each carry position coordinates; performing optical character recognition on the rectangular units to determine the text information contained in each of them; and combining the rectangular units containing text information according to their position coordinates to obtain the table data.
  • An apparatus for acquiring table data includes: an acquiring module for acquiring image data of text to be processed; an extracting module for extracting a combined graph from the image data, where the combined graph is a graph including crossing morphological vertical lines and morphological horizontal lines; a segmentation module for dividing the combined graph into multiple rectangular units, where the rectangular units each carry position coordinates; an identification module for performing optical character recognition on the rectangular units to determine the text information contained in each of them; and a combination module for combining the rectangular units containing text information according to their position coordinates to obtain the table data.
  • A server includes a processor and a memory for storing processor-executable instructions.
  • When the processor executes the instructions, it obtains the image data of the text to be processed; extracts a combined graph from the image data, where the combined graph is a graph including morphological vertical lines and morphological horizontal lines; divides the combined graph into multiple rectangular units, where the rectangular units each carry position coordinates; performs optical character recognition on the rectangular units to determine the text information they contain; and combines the rectangular units containing text information according to their position coordinates to obtain the table data.
  • In the method, device and server for acquiring table data provided in this specification, the combined graph is first obtained and extracted according to the graphic features of the morphological vertical lines and morphological horizontal lines in the image data; the combined graph is then divided into multiple rectangular units, and optical character recognition is performed on each rectangular unit to obtain the text information it contains; finally, the rectangular units containing text information are combined according to their position coordinates to restore the complete table data. This solves the technical problem of large error and inaccuracy in the existing table-data extraction method, so that the table content in the image data can be identified efficiently and accurately and restored completely.
  • FIG. 1 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;
  • FIG. 2 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;
  • FIG. 3 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;
  • FIG. 4 is a schematic diagram of an embodiment of a method for acquiring table data provided by an embodiment of this specification in a scenario example;
  • FIG. 5 is a schematic diagram of an embodiment of a flow of a method for acquiring table data provided by an embodiment of this specification
  • FIG. 6 is a schematic diagram of an embodiment of a structure of a server provided by an embodiment of this specification.
  • FIG. 7 is a schematic diagram of an embodiment of a structure of an apparatus for acquiring table data provided by an embodiment of this specification.
  • With the existing acquisition method, a graphic structure such as a separator bar in the table data may be mistakenly recognized as a text character, or text information at different positions in the table data may be misaligned during recognition and extraction. That is, when the table data in image data is processed by the existing acquisition method, the effect is often not ideal, and there is a technical problem of large error and inaccuracy in extracting the table data.
  • Considering this, this specification specifically analyzes the different characteristics of the two kinds of objects that table data contains at the same time: text characters and graphic structures.
  • Image structural features such as lines are used to find, in the image data, a combined graph that may form table data; the combined graph is then divided into multiple rectangular units, and optical character recognition is performed on each rectangular unit separately to obtain its text information.
  • According to the position coordinates of the rectangular units, the rectangular units containing text information are combined to restore and reconstruct the complete table data of the image, thereby solving the technical problem of large error and inaccuracy in the existing table-data extraction method. The table content in the image data can thus be identified efficiently and accurately and completely restored.
  • The embodiments of the present specification provide a method for acquiring table data.
  • The method may be specifically applied in an image data processing system including multiple servers,
  • for example, a legal-contract processing system for scanned pictures.
  • the above system may specifically include a server for identifying and acquiring form data in text data from image data.
  • Specifically, the server can extract the combined graph from the acquired image data of the text to be processed by detecting the morphological vertical lines and morphological horizontal lines in the image data; it then divides the combined graph into multiple rectangular units according to the coordinates, and performs optical character recognition on each of the rectangular units to identify and determine the text information contained in each; finally, according to the coordinates of the rectangular units, it combines and splices the rectangular units containing text information to obtain the complete table data.
  • the server can be understood as a service server that is applied to the business system side and can implement functions such as data transmission and data processing.
  • the server may be an electronic device with data calculation, storage, and network interaction functions; or a software program that runs on the electronic device and provides support for data processing, storage, and network interaction.
  • the number of the servers is not specifically limited.
  • the server may specifically be one server, or several servers, or a server cluster formed by several servers.
  • the form data acquisition method provided in the embodiment of the present specification can be used to process the image data containing the contract received by the legal platform to extract the form data in the contract.
  • the legal platform can distribute the image data containing the contract to be entered by the user to the server on the platform that is used to obtain the form data.
  • The above-mentioned legal platform can be specifically used to identify and extract text information in user-uploaded image data containing contracts (such as scanned pictures or photos of contracts), so as to convert the contract contents into electronic file data.
  • As shown in FIG. 2, the server may first pre-process the image to reduce error interference and improve the accuracy of the subsequent identification and acquisition of table data.
  • In specific implementation, the server may be configured with OpenCV (the Open Source Computer Vision Library).
  • The above OpenCV can be understood as an open-source API function library for computer vision.
  • The function code contained in the library has been optimized, so calling it and computing with it is relatively efficient.
  • the server can call the corresponding function code through the above OpenCV to efficiently perform data processing on the image data.
  • The server can first convert the image data to obtain the corresponding grayscale image, and then perform Gaussian smoothing on the grayscale image to filter out the more obvious noise information and improve the accuracy of the image data, thereby completing the preprocessing of the image data.
  • the image data is converted into a grayscale image only as an example for schematic description.
  • the image data may also be converted into a binary map first, and then subsequent table data acquisition may be performed based on the binary map. This specification is not limited.
  • The server can first scan and retrieve the graphic structural features (such as structural elements) in the image data based on morphology, so as to find, in the image data, graphics that differ from text characters, carry certain graphic features, and may form a table: the combined graph.
  • a specific frame image in the image data is taken as an example, for example, the fifth page image in the image data including the contract is taken as an example.
  • the server can scan and search the morphological vertical line and the morphological horizontal line in the frame image.
  • The above-mentioned morphological vertical lines and morphological horizontal lines can be understood as structural elements related to graphics, as distinct from text characters; refer to Figure 3.
  • the morphological vertical line may specifically be an image unit or a structural element that contains a straight line segment along the vertical direction in the image.
  • the above-mentioned morphological horizontal line may specifically be an image unit or a structural element that contains a straight line segment along the horizontal direction in the image.
  • the server can search for the structural elements in the image by calling the getStructuringElement function, and find all the morphological vertical lines and morphological horizontal lines from it.
  • the above-listed method of obtaining the morphological vertical line and the morphological horizontal line from the image by calling the getStructuringElement function is only a schematic illustration.
  • the morphological vertical line and the morphological horizontal line in the image may also be obtained in other suitable ways. This specification is not limited.
  • In a data table, each morphological horizontal line usually intersects one or more morphological vertical lines. Therefore, after obtaining the morphological vertical lines and morphological horizontal lines in the frame image, the server can further search for graphs containing intersecting morphological vertical lines and morphological horizontal lines as combined graphs that may form table data, so as to avoid subsequent processing of graphic structures that obviously lack the graphic features of table data and to improve processing efficiency.
  • In specific implementation, the morphological horizontal lines and morphological vertical lines can be extracted directly on the original image, with the extracted lines overlaid at their extraction positions.
  • After obtaining the above-mentioned combined graph, which has relatively obvious data-table characteristics and may form table data, the combined graph can be further inspected: by checking whether it meets preset table format requirements, it can be determined more accurately whether the combined graph is a data table.
  • the above-mentioned preset table format requirements can be specifically understood as a rule set for describing graphic features of data tables different from other graphic structures.
  • Specifically, since each grid graphic (or rectangular frame, see Figure 3) in a data table is designed to be filled with specific characters, the minimum area of a grid graphic should be able to accommodate at least one complete character. Therefore, the following rule for the graphic area feature may be set: the minimum area of a grid graphic in the data table should be greater than a preset area threshold. Also, based on people's usual typesetting habits, a table is normally centered when the table data is edited. Therefore, the following rule for the graphic position feature can also be set: the absolute value of the difference between the distance from the left border of the data table to the left border of the image and the distance from the right border of the data table to the right border of the image is less than a preset distance threshold.
  • In addition, the following rule for the quantity feature of the graphics may be set: the number of grid graphics in the data table is greater than or equal to a preset quantity threshold (for example, 2), and so on.
  • In order to determine whether the extracted combined graph meets the preset table format requirements, in specific implementation, the points where a morphological horizontal line and a morphological vertical line occupy the same image position can first be retrieved as intersection points, and the position coordinates of each intersection point of the combined graph in the frame image can then be determined.
  • An intersection point can be specifically understood as the pixel at a position in the frame image where a morphological vertical line and a morphological horizontal line of the combined graph intersect. See Figure 3 for details.
  • In specific implementation, the server can search for and obtain the coordinates of the intersection points of the combined graph in the image by calling the OpenCV bitwise_and function.
  • The OpenCV bitwise_and function listed above is only a schematic illustration.
  • the server may also obtain the coordinates of the intersection points in the combined graph in other suitable ways according to specific conditions. This specification is not limited.
  • The server may further search the graphic structural elements of the above combined graph for graphic elements having a rectangular (or square) structure (that is, the grids of the corresponding table) as the rectangular frames in the combined graph.
  • the server may search for and obtain the rectangular frame in the combination graph by calling the findContours function.
  • The above-listed method of calling the findContours function is only a schematic illustration.
  • the server may also obtain the rectangular frame in the combination diagram in other suitable ways according to specific conditions. This specification is not limited.
  • Based on the determined intersection coordinates and the rectangular frames in the combined graph, the server may determine, through position comparison, the endpoint coordinates of the four endpoints of each rectangular frame in the combined graph. Furthermore, according to the endpoint coordinates of the rectangular frames, it can be determined whether the combined graph meets the preset table format requirements.
  • the server may calculate the length and width of the rectangular frame according to the coordinates of the endpoints of the rectangular frame, and then calculate the area of the rectangular frame based on the length and width. Then compare the area of the rectangular frame with the preset area threshold. If the area of each rectangular frame in the combination diagram is greater than the preset area threshold, it can be determined that the combination diagram meets the preset table format requirements.
  • The server can also compare the abscissa values of the endpoint coordinates of the rectangular frames in the combined graph, take the endpoint with the smallest abscissa as an endpoint on the left border of the combined graph, and use its abscissa as the abscissa of the left border; the distance between the left border of the combined graph and the left border of the image can then be calculated from this abscissa and recorded as d1.
  • Similarly, by comparing the abscissa values of the endpoints, the server takes the endpoint with the largest abscissa as an endpoint on the right border of the combined graph and uses its abscissa as the abscissa of the right border; the distance between the right border of the combined graph and the right border of the image is then calculated from this abscissa and recorded as d2.
  • the server may calculate the absolute value of the difference between d1 and d2, and compare the absolute value of the above difference with a preset distance threshold. If the absolute value of the above-mentioned difference is less than or equal to the preset distance threshold, it can be determined that the entire combination picture is located at the center of the image, that is, the preset table format requirements are met.
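The two format rules above (minimum cell area, and centering via the d1/d2 comparison) can be sketched in plain Python; the threshold values and the (x, y, w, h) cell representation are assumptions, not values from the specification:

```python
def meets_table_format(cells, image_width, min_area=100, max_offset=20):
    # cells: list of (x, y, w, h) rectangular frames in image coordinates.
    # min_area and max_offset are assumed thresholds.
    if not cells:
        return False
    # Rule 1: every grid graphic must exceed the preset area threshold.
    if any(w * h <= min_area for (x, y, w, h) in cells):
        return False
    # Rule 2: the table should be roughly centered in the image.
    left = min(x for (x, y, w, h) in cells)       # table's left border
    right = max(x + w for (x, y, w, h) in cells)  # table's right border
    d1 = left                   # distance to the image's left border
    d2 = image_width - right    # distance to the image's right border
    return abs(d1 - d2) <= max_offset

cells = [(50, 10, 100, 30), (150, 10, 100, 30)]
print(meets_table_format(cells, image_width=300))  # centered, large cells
```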
  • the server may determine that the currently extracted combination diagram is indeed a data table in the image. Subsequent text information can be extracted from the combined image.
  • the server may first divide the above combined image into a plurality of rectangular units.
  • Each rectangular unit corresponds one-to-one to a rectangular frame in the combined graph; however, unlike a rectangular frame, which is a purely graphical structural element,
  • each rectangular unit also contains text characters or blank-state information.
  • separate optical character recognition can be performed on each rectangular unit to accurately identify the text characters in the rectangular unit and determine the text information contained in each rectangular unit.
  • the server may first determine the contour line enclosing the rectangular frame as the dividing line according to the endpoint coordinates of the rectangular frame, and then may cut along the contour line to divide the rectangular unit corresponding to the rectangular frame from the combined diagram. For example, see Figure 4.
  • the coordinates of the four endpoints of a rectangular frame in the combination diagram are A (15, 60), B (15, 40), C (30, 40), and D (30, 60).
  • the server can start from the endpoint A, keep the abscissa 15 unchanged, and find the endpoint with a different ordinate, namely endpoint B, and then connect endpoint A to endpoint B according to a preset division rule.
  • the server starts from the endpoint B, keeps the ordinate 40 unchanged, and finds the endpoint with different abscissas, that is, the endpoint C, and then connects the endpoint B to the endpoint C according to the preset division rule.
  • the server starts from the endpoint C, keeps the abscissa 30 unchanged according to the preset division rule, and finds the endpoint with a different ordinate, namely the endpoint D, and then connects the endpoint C to the endpoint D.
  • the server starts from the endpoint D and keeps the ordinate 60 unchanged according to the preset division rule, and finds the endpoint with different abscissas, that is, endpoint A, and then connects the endpoint D to the endpoint A.
  • a closed connecting line can be obtained: A to B to C to D to A, which is the outline of the rectangular frame.
  • the server may use the outline as a dividing line, and divide the rectangular frame containing the text information in the combined image along the outline to obtain the corresponding rectangular unit.
  • each rectangular unit in the combined graph can be divided.
  • the above-mentioned manner of dividing the rectangular unit is just to better explain the embodiments of the present specification.
  • other suitable methods may also be used to divide a plurality of rectangular units from the combined diagram according to specific circumstances. This specification is not limited.
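Using the endpoint coordinates from the A/B/C/D walk-through above, cutting out a rectangular unit amounts to slicing the axis-aligned region the four endpoints enclose (a sketch; in practice the slices would be taken from the combined graph's image):

```python
import numpy as np

def crop_cell(image, endpoints):
    # endpoints: the four (x, y) corner coordinates of a rectangular frame,
    # as in the specification's A/B/C/D example.
    xs = [p[0] for p in endpoints]
    ys = [p[1] for p in endpoints]
    # Slice the enclosed region (inclusive of both borders).
    return image[min(ys):max(ys) + 1, min(xs):max(xs) + 1]

img = np.arange(100 * 100).reshape(100, 100)
# Endpoints from the example: A(15, 60), B(15, 40), C(30, 40), D(30, 60).
cell = crop_cell(img, [(15, 60), (15, 40), (30, 40), (30, 60)])
print(cell.shape)  # (21, 16)
```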
  • In the process of dividing the combined image, the server also generates position coordinates for each rectangular unit according to the endpoint coordinates of the corresponding rectangular frame.
  • the above position coordinates can be understood as a kind of parameter data used to indicate the position of the rectangular unit in the image of the combined image or describe the positional relationship between the rectangular unit in the image of the combined image and other adjacent rectangular units.
  • the server may calculate the coordinates of the center point of the rectangular frame as the position coordinates of the corresponding rectangular unit according to the endpoint coordinates of the four endpoints of the rectangular frame.
  • The server may also first calculate the center-point coordinates of each rectangular frame, and then, according to a preset arrangement order (for example, top to bottom and left to right), determine the row number and column number of each rectangular unit from the center-point coordinates and use them as the position coordinates of the corresponding rectangular unit.
  • the rectangular frame A is located in the first row and second column of the combined diagram, that is, the corresponding row number is 1 and the column number is 2, so "1-2" can be used as The position coordinates of the rectangular unit corresponding to the rectangular frame A.
  • the above-listed methods for determining the position coordinates of the rectangular unit are only schematic illustrations. During specific implementation, according to the specific situation, other suitable methods may also be used to determine the position coordinates of the rectangular unit. This specification is not limited.
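The "row number-column number" position coordinates can be derived from the center points as sketched below; the grouping tolerance `tol` is an assumed parameter the specification does not mention:

```python
def assign_row_col(centers, tol=5):
    # centers: (x, y) center points of the rectangular frames.
    # Centers whose y values differ by less than `tol` share a row;
    # columns run left to right within each row.
    ordered = sorted(centers, key=lambda p: (p[1], p[0]))  # top to bottom
    coords = {}
    row, col, row_y = 0, 0, None
    for (x, y) in ordered:
        if row_y is None or abs(y - row_y) >= tol:
            row += 1       # start a new row
            col = 1
            row_y = y
        else:
            col += 1       # next column in the same row
        coords[(x, y)] = f"{row}-{col}"
    return coords

centers = [(20, 10), (60, 10), (20, 50), (60, 50)]
print(assign_row_col(centers))
```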
  • The server can then perform optical character recognition (OCR, Optical Character Recognition) on each of the multiple rectangular units to determine the text characters in each rectangular unit, and thereby determine the text information contained in each rectangular unit. If no text characters are recognized in a rectangular unit, the text information of that rectangular unit is left blank. In this way, multiple rectangular units containing the corresponding text information can be obtained.
  • the server may combine and combine the rectangular units containing the text information obtained above according to the position coordinates of each rectangular unit.
  • the rectangular unit containing text information can be set at the position of the first row and the second column according to the position coordinates "1-2" of the rectangular unit.
  • a plurality of rectangular units containing text information are sequentially set to corresponding positions, so that a complete data table can be restored.
  • the above-mentioned combination mode is only a schematic illustration. During specific implementation, other combination methods can also be used to perform combination splicing according to other types of position coordinates. This specification is not limited.
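The final combination step can be sketched as follows: given each rectangular unit's "row-column" position coordinate and its recognized text (the OCR call itself is out of scope here), the units are spliced back into a grid, with unrecognized cells left blank:

```python
def assemble_table(cells):
    # cells: maps a "row-col" position coordinate to the OCR text of
    # that rectangular unit; unmentioned positions stay blank.
    positions = [tuple(map(int, pos.split("-"))) for pos in cells]
    n_rows = max(r for r, c in positions)
    n_cols = max(c for r, c in positions)
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for pos, text in cells.items():
        r, c = map(int, pos.split("-"))
        table[r - 1][c - 1] = text  # place the unit at its coordinates
    return table

cells = {"1-1": "Name", "1-2": "Amount", "2-1": "Alice", "2-2": "100"}
print(assemble_table(cells))
```

The cell texts here are hypothetical placeholders; in the contract scenario they would come from the OCR step applied to each rectangular unit.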
  • According to the above method, the server can separately detect the table data in each image of the image data containing the contract to be processed, acquire the table data whenever it is determined to exist, extract the complete table data from the image data, and feed the extracted table data back to the legal platform, so that the electronic file data of the contract can be organized, generated and stored.
  • After the server obtains the morphological vertical lines and morphological horizontal lines, further feature-strengthening processing can be performed on them to make the obtained morphological vertical lines and morphological horizontal lines clearer.
  • The above feature-strengthening treatment may specifically be morphological processing, and may specifically include erosion and/or dilation.
  • By sliding a window the size of the convolution kernel over the frame image, the data value of the pixel at the window position can be reset (to 0 or 1).
  • Erosion may be performed first, followed by dilation.
  • The above erosion can be understood as an AND operation: pixels near the foreground boundary are eroded according to the size of the convolution kernel (that is, the values of the corresponding pixels are reset to 0), so the foreground object becomes smaller. This reduces the white area around a morphological vertical or horizontal line, achieving the effect of removing white noise; at the same time, it can also break apart structural elements that are adjacent or even connected to the morphological vertical or horizontal line.
  • The morphological vertical or horizontal line may then be dilated after the erosion.
  • The above dilation can be understood as an OR operation.
  • Through dilation, the eroded image can be enlarged and restored, yielding relatively clear morphological vertical and horizontal lines of unchanged size.
  • The method for acquiring table data provided in this specification first obtains and extracts the combined graph according to the graphic features of the morphological vertical lines and morphological horizontal lines in the image data; it then divides the combined graph into multiple rectangular units and performs optical character recognition on each rectangular unit to obtain the text information it contains; finally, it combines the rectangular units containing text information according to their position coordinates to restore the complete table data, thereby solving the technical problem of large error and inaccuracy in the existing table-data extraction method.
  • an embodiment of the present specification also provides a method for acquiring table data, where the method is specifically applied to the server side.
  • the method may include the following:
  • The above-mentioned text to be processed may specifically be a contract to be processed, a charter to be processed, or a specification to be processed.
  • the image data of the text to be processed may be a scanned image containing the text content, a photo containing the text content, or a video containing the text content.
  • the specific content and form of the image data of the text to be processed above are not limited in this specification.
  • S53: Extract a combined graph from the image data, where the combined graph is a graph including morphological vertical lines and morphological horizontal lines.
  • the above morphological vertical line and morphological horizontal line can be specifically understood as a structural element related to graphics that is different from text characters.
  • the morphological vertical line may specifically be an image unit or a structural element that contains a straight line segment along the vertical direction in the image.
  • the above-mentioned morphological horizontal line may specifically be an image unit or a structural element that contains a straight line segment along the horizontal direction in the image.
  • The above-mentioned combined graph can be understood as the portion of the image data that has graphic features similar to table data, for example, a graph composed of intersecting morphological vertical-line and horizontal-line structural elements.
  • The above-mentioned extraction of the combined graph from the image data may include the following: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and horizontal lines to obtain the combined graph.
  • The above searching may include the following: calling the getStructuringElement function in OpenCV to construct line-shaped structural elements, which are then used to find the morphological vertical lines and morphological horizontal lines in the image data.
  • The above-listed method of obtaining the morphological vertical and horizontal lines by calling the getStructuringElement function is only a schematic illustration. In specific implementation, the morphological vertical and horizontal lines in the image may also be obtained in other suitable ways; this specification is not limited thereto.
  • The morphological vertical and horizontal lines obtained in the above manner also carry their position information within the image data, so the corresponding lines can be connected according to that position information to obtain the combined graph.
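As a hedged sketch of how line-shaped structural elements isolate the morphological lines: the real implementation would call OpenCV's getStructuringElement together with erosion and dilation, while here the same effect is simulated in pure Python, and `min_len` (the kernel length) is an assumed value:

```python
def horizontal_mask(img, min_len=3):
    """Keep only pixels belonging to horizontal runs of at least
    min_len foreground pixels -- the role a 1 x min_len structuring
    element plays when used with erosion followed by dilation."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        x = 0
        while x < w:
            if img[y][x]:
                start = x
                while x < w and img[y][x]:
                    x += 1
                if x - start >= min_len:      # long enough to be a line
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out

def vertical_mask(img, min_len=3):
    """Same idea along columns, implemented via transposition."""
    t = [list(col) for col in zip(*img)]
    return [list(col) for col in zip(*horizontal_mask(t, min_len))]

def combine(h_mask, v_mask):
    """OR the two masks to connect the lines into the combined graph."""
    return [[a | b for a, b in zip(ra, rb)] for ra, rb in zip(h_mask, v_mask)]

# A cross of lines plus one stray "text" pixel at (0, 0).
img = [
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
grid = combine(horizontal_mask(img), vertical_mask(img))
```

Short runs of foreground pixels (text characters) are filtered out; only the long vertical and horizontal lines survive into the combined graph.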
  • S55: Divide the combined graph into a plurality of rectangular units, where the rectangular units each carry position coordinates.
  • The above rectangular unit can be understood as an image unit that corresponds one-to-one with a rectangular frame in the combined graph but, unlike the rectangular frame itself, also contains the text information (filled-in text characters, or blank).
  • Each rectangular frame can be understood as a rectangular or square graphic element composed of two morphological vertical lines and two morphological horizontal lines; it contains only graphic features.
  • Each rectangular frame can be regarded as one cell of the table.
  • Dividing the combined graph into a plurality of rectangular units may include the following: obtaining the coordinates of the intersection points in the combined graph; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the intersection coordinates; and dividing the combined graph into a plurality of rectangular units according to those endpoint coordinates.
  • An intersection point can be understood as a pixel at a position where a morphological vertical line and a morphological horizontal line in the combined graph intersect.
  • The coordinates of the intersection points in the combined graph can be searched for and obtained by calling the bitwise_and function in OpenCV.
  • The rectangular frames in the combined graph can be searched for and obtained by calling the findContours function in OpenCV; this is only a schematic illustration. In specific implementation, the server may also obtain the rectangular frames in other suitable ways according to the specific situation; this specification is not limited thereto.
  • OpenCV (Open Source Computer Vision Library) is an open-source computer vision library. The server can call the corresponding function code through OpenCV to process the image data efficiently.
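The intersection search can be sketched as a pixelwise AND of the two line masks, which mirrors what OpenCV's bitwise_and does on the vertical-line and horizontal-line images; the tiny 3 x 3 masks below are purely illustrative:

```python
def intersections(v_mask, h_mask):
    """Return (x, y) coordinates where the vertical-line mask AND the
    horizontal-line mask are both foreground -- the pixelwise AND that
    OpenCV's bitwise_and performs on the two line images."""
    return [(x, y)
            for y, (rv, rh) in enumerate(zip(v_mask, h_mask))
            for x, (a, b) in enumerate(zip(rv, rh))
            if a and b]

v = [[0, 1, 0] for _ in range(3)]      # one vertical line at x = 1
h = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]  # one horizontal line at y = 1
pts = intersections(v, h)              # the single crossing point
```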
  • In some embodiments, dividing the combined graph into a plurality of rectangular units may include the following: determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph; determining dividing lines according to those endpoint coordinates; and dividing the combined graph into a plurality of rectangular units along the dividing lines.
  • Determining the endpoint coordinates of the rectangular frames according to the intersection coordinates may, in specific implementation, include the following: comparing the positions of the intersection points with each rectangular frame so as to identify, from among the intersection points, the four endpoints of that frame, and thereby determine the coordinates of the endpoints of each rectangular frame.
  • Determining the dividing lines according to the endpoint coordinates may include the following: determining, from the coordinates of the four endpoints of each rectangular frame, the outline enclosing that frame as the corresponding dividing line. Subsequent division can then be performed along these dividing lines to obtain each rectangular unit from the combined graph.
  • the method further includes the following content: generating position coordinates of the rectangular units according to the coordinates of the end points of the rectangular frame.
  • The position coordinates of a rectangular unit can be understood as parameter data used to indicate the position of the rectangular unit within the combined graph, or to describe the positional relationship between the rectangular unit and its adjacent rectangular units within the combined graph.
  • In specific implementation, the coordinates of the center point of each rectangular frame may be calculated from its four endpoint coordinates and used as the position coordinates of the corresponding rectangular unit. Alternatively, the center-point coordinates of each rectangular frame may be calculated first, the rectangular units then sorted by those coordinates in a preset order (for example, top to bottom and left to right), and the row number and column number of each sorted rectangular unit determined as the position coordinates of the corresponding rectangular unit.
  • The above-listed methods for determining the position coordinates of the rectangular units are only schematic illustrations. In specific implementation, other suitable methods may also be used according to the specific situation; this specification is not limited thereto.
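The center-point sorting described above can be sketched as follows. This is a simplified pure-Python sketch, not the embodiments' implementation; `row_tol`, the tolerance for deciding whether two centers lie in the same table row, is an assumed parameter:

```python
def cell_positions(frames, row_tol=5):
    """Assign (row, column) position coordinates to rectangular frames.

    Each frame is an (x1, y1, x2, y2) tuple. Center points are sorted
    top-to-bottom then left-to-right; centers whose vertical gap is
    within row_tol fall into the same table row.
    """
    centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0, (x1, y1, x2, y2))
               for (x1, y1, x2, y2) in frames]
    centers.sort(key=lambda c: (c[1], c[0]))     # top-to-bottom, left-to-right
    positions, row, col, last_cy = {}, -1, 0, None
    for cx, cy, frame in centers:
        if last_cy is None or cy - last_cy > row_tol:
            row, col, last_cy = row + 1, 0, cy   # start a new table row
        else:
            col += 1                             # next cell in the same row
        positions[frame] = (row, col)
    return positions

# Four hypothetical 10 x 10 cells arranged in a 2 x 2 grid, given unsorted.
frames = [(10, 0, 20, 10), (0, 10, 10, 20), (0, 0, 10, 10), (10, 10, 20, 20)]
positions = cell_positions(frames)
```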
  • S57: Perform optical character recognition on the plurality of rectangular units respectively to determine the text information contained in each of the rectangular units.
  • In specific implementation, optical character recognition may be performed separately on each of the plurality of rectangular units to recognize the text characters in each unit, and thereby determine the text information contained in each rectangular unit.
  • If a rectangular unit contains no text characters, its text information may be left blank.
  • In specific implementation, the rectangular units containing text information may be stitched together according to their position coordinates, with units whose coordinates are adjacent placed next to each other so that each unit's text is located at the position of the corresponding data, thereby obtaining the complete table data.
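The recombination step can be sketched as placing each unit's recognized text at its (row, column) position coordinates; the cell contents below are hypothetical examples:

```python
def stitch(cells):
    """Rebuild the table from (row, col, text) triples by placing each
    rectangular unit's text at its position coordinates."""
    if not cells:
        return []
    n_rows = max(r for r, _, _ in cells) + 1
    n_cols = max(c for _, c, _ in cells) + 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for r, c, text in cells:
        table[r][c] = text
    return table

# Hypothetical recognized cell texts with their position coordinates.
cells = [(0, 0, "Item"), (1, 1, "9.99"), (0, 1, "Price"), (1, 0, "Widget")]
table = stitch(cells)
```

Cells missing from the input (blank units) simply remain empty strings in the restored table.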
  • In this way, the combined graph is obtained by acquiring and extracting graphic features such as the morphological vertical and horizontal lines in the image data; the combined graph is then divided into a plurality of rectangular units, optical character recognition is performed on each rectangular unit to obtain the text information it contains, and the rectangular units containing text information are recombined according to their position coordinates to restore the complete table data. This solves the technical problem of large, inaccurate errors in extracting table data that exists in existing methods, so that the table content in the image data can be recognized efficiently and accurately and restored completely.
  • In some embodiments, the method may further include preprocessing the image data of the text to be processed, where the preprocessing includes: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data to filter out noise interference.
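A minimal sketch of the two preprocessing steps. The luminance weights and the [1, 2, 1]/4 kernel are conventional choices assumed here, not values mandated by the embodiments, which may use any grayscale conversion and Gaussian kernel:

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel to one grayscale value using the
    common luminance weights (a conventional choice assumed here)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def smooth_row(row):
    """Apply a 1-D [1, 2, 1] / 4 kernel, a small separable
    approximation of Gaussian smoothing that damps isolated noise."""
    out = list(row)
    for i in range(1, len(row) - 1):
        out[i] = (row[i - 1] + 2 * row[i] + row[i + 1]) / 4.0
    return out

smoothed = smooth_row([0, 0, 4, 0, 0])  # the noise spike is spread and damped
```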
  • The above-mentioned extraction of the combined graph from the image data may include the following: searching for and obtaining morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and horizontal lines to obtain the combined graph.
  • The above searching may include the following: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data through the getStructuringElement function.
  • In some embodiments, the method may further include performing feature enhancement processing on the obtained morphological vertical and horizontal lines respectively, where the feature enhancement processing includes at least one of the following: erosion processing and dilation processing.
  • In specific implementation, the morphological vertical and horizontal lines may first be eroded, and the eroded lines may then be dilated.
  • The erosion processing can eliminate white noise around the foreground of the morphological vertical and horizontal lines, making them clearer, but it also shrinks their graphic elements. Therefore, after eroding the morphological vertical and horizontal lines, the dilation processing can restore them to their original size, yielding clearer morphological vertical and horizontal lines of constant size.
  • It should be noted that the combined graph merely has graphic features similar to table data; it may not actually be table data.
  • For example, the large text character "田" ("Tian") also has graphic features similar to table data. Therefore, the extracted combined graph can be checked to determine whether it meets preset table format requirements, so as to determine more accurately whether it is real table data; data processing is then performed only on combined graphs determined to be table data, reducing wasted resources and improving processing efficiency.
  • In some embodiments, the method may further include: obtaining the coordinates of the intersection points in the combined graph, where an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line in the combined graph intersect; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the intersection coordinates; and determining, according to those endpoint coordinates, whether the combined graph meets the preset table format requirements.
  • The coordinates of the intersection points in the combined graph can be searched for and obtained by calling the bitwise_and function in OpenCV.
  • the server may also obtain the coordinates of the intersection points in the combined graph in other suitable ways according to specific conditions. This specification is not limited.
  • The rectangular frames in the combined graph can be searched for and obtained by calling the findContours function; this is only a schematic illustration. In specific implementation, the server may also obtain the rectangular frames in other suitable ways according to the specific situation; this specification is not limited thereto.
  • The above-mentioned preset table format requirement can be understood as a set of rules describing the graphic features that distinguish a data table from other graphic structures.
  • The specific rules included in the preset table format requirements can be set flexibly according to the specific situation. For example, considering that, unlike other graphics, each cell graphic (or rectangular frame) of a data table is designed to hold specific characters, the minimum area of each cell in the data table should at least accommodate one complete character. Accordingly, the following rule on graphic area may be set: the minimum area of a cell graphic in the data table should be greater than a preset area threshold. Likewise, based on common typesetting habits, table data is usually centered on the page when edited.
  • Accordingly, the following rule on centering may be set: the absolute value of the difference between the distance from the left border of the data table to the left border of the image and the distance from the right border of the data table to the right border of the image should be less than a preset distance threshold.
  • The following rule on the number of graphics may also be set: the number of cell graphics in the data table is greater than or equal to a preset quantity threshold (for example, 2), and so on.
  • In some embodiments, determining whether the combined graph meets the preset table format requirements according to the endpoint coordinates of the rectangular frames may include the following: calculating the area of each rectangular frame from its endpoint coordinates, and detecting whether that area is greater than a preset area threshold; if the area of the rectangular frame is greater than the preset area threshold, it is determined that the combined graph meets the preset table format requirements.
  • In some embodiments, determining whether the combined graph meets the preset table format requirements according to the endpoint coordinates of the rectangular frames may also include the following: determining the abscissas of the left and right borders of the combined graph from the endpoint coordinates of its rectangular frames; calculating, from the abscissa of the left border, the distance between the left border of the combined graph and the left border of the image data, recorded as the first distance; calculating, from the abscissa of the right border, the distance between the right border of the combined graph and the right border of the image data, recorded as the second distance; calculating the absolute value of the difference between the first distance and the second distance and comparing it with a preset distance threshold; if the absolute value of the distance difference is less than the preset distance threshold, it is determined that the combined graph meets the preset table format requirements.
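The three rules above (cell count, cell area, horizontal centering) can be sketched together in one check; the thresholds `min_area` and `dist_tol` are assumed illustrative values, not values fixed by the embodiments:

```python
def looks_like_table(frames, img_width, min_area=100, dist_tol=20):
    """Check the preset table format rules described above: minimum
    cell count, per-cell area threshold, and horizontal centering.
    Each frame is an (x1, y1, x2, y2) rectangle inside an image that
    is img_width pixels wide."""
    if len(frames) < 2:                        # quantity rule
        return False
    for x1, y1, x2, y2 in frames:              # area rule
        if (x2 - x1) * (y2 - y1) <= min_area:
            return False
    left = min(f[0] for f in frames)           # table's left border
    right = max(f[2] for f in frames)          # table's right border
    first = left                               # distance to left image border
    second = img_width - right                 # distance to right image border
    return abs(first - second) < dist_tol      # centering rule
```

For example, two 100 x 30 cells spanning x = 40 to 240 in a 280-pixel-wide image are centered and pass; a single cell fails the quantity rule.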
  • In some embodiments, dividing the combined graph into a plurality of rectangular units may include the following: determining the dividing lines according to the endpoint coordinates of the rectangular frames; dividing the combined graph into a plurality of rectangular units along the dividing lines; and generating the position coordinates of each rectangular unit from the endpoint coordinates of its corresponding rectangular frame.
  • the image data of the text to be processed may specifically include: a scanned image or a photograph containing a contract to be processed.
  • the image data of the to-be-processed text listed above is only for better explaining the embodiments of the present specification.
  • the image data of the text to be processed may also include image data of other types and contents, for example, a video screenshot containing a manual to be processed, etc. This specification is not limited.
  • In the method for obtaining the table data, the combined graph is first obtained and extracted according to graphic features such as the morphological vertical and horizontal lines in the image data; the combined graph is then divided into multiple rectangular units, optical character recognition is performed on each rectangular unit to obtain the text information it contains, and the rectangular units containing text information are recombined according to their position coordinates to restore the complete table data. This solves the technical problem of large, inaccurate errors in extracting table data that exists in existing methods, enabling efficient, accurate recognition and complete restoration of the table content in the image data. In addition, after the combined graph is extracted, graphic factors such as the intersection points and rectangular frames it contains are used to detect whether the extracted combined graph is really tabular data in the text, thereby avoiding misidentifying non-table data as a table, reducing errors, and improving the accuracy of obtaining table data.
  • An embodiment of this specification also provides a server including a processor and a memory for storing processor-executable instructions.
  • When the processor executes the instructions, the following steps may be performed: acquiring image data of text to be processed; extracting a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, where the rectangular units each carry position coordinates; performing optical character recognition on the rectangular units respectively to determine the text information each contains; and combining the rectangular units containing text information according to their position coordinates to obtain the table data.
  • In specific implementation, this specification also provides another specific server, where the server includes a network communication port 601, a processor 602, and a memory 603; these structures are connected by internal cables so that they can perform specific data interactions.
  • the network communication port 601 may be specifically used to input image data of text to be processed
  • The processor 602 may be specifically used to: extract a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; divide the combined graph into a plurality of rectangular units, where the rectangular units each carry position coordinates; perform optical character recognition on the rectangular units respectively to determine the text information each contains; and combine the rectangular units containing text information according to their position coordinates to obtain the table data.
  • The memory 603 may be specifically used to store the image data of the text to be processed input via the network communication port 601, and to store the corresponding instruction programs executed by the processor 602.
  • the network communication port 601 may be a virtual port that is bound to different communication protocols so that different data can be sent or received.
  • the network communication port may be port 80 responsible for web data communication, port 21 responsible for FTP data communication, or port 25 responsible for mail data communication.
  • the network communication port may also be a physical communication interface or a communication chip.
  • For example, it can be a wireless mobile network communication chip, such as GSM or CDMA; it can also be a Wi-Fi chip; or it can be a Bluetooth chip.
  • The processor 602 can be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on.
  • The memory 603 may include multiple levels. In a digital system, anything that can store binary data can be a memory; in an integrated circuit, a circuit with a storage function but no physical form is also called a memory, such as RAM or a FIFO; in a system, a storage device with a physical form is also called a memory, such as a memory stick or a TF card.
  • Based on the above table data acquisition method, the embodiments of this specification also provide a computer storage medium storing computer program instructions which, when executed, implement: acquiring image data of text to be processed; extracting a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; dividing the combined graph into a plurality of rectangular units, where the rectangular units each carry position coordinates; performing optical character recognition on the rectangular units respectively to determine the text information each contains; and combining the rectangular units containing text information according to their position coordinates to obtain the table data.
  • The storage medium includes, but is not limited to, random access memory (RAM), read-only memory (ROM), cache, hard disk drive (HDD), or memory card.
  • the memory may be used to store computer program instructions.
  • the network communication unit may be an interface configured to perform network connection communication according to the standard specified by the communication protocol.
  • the embodiment of the present specification also provides an apparatus for acquiring table data.
  • the apparatus may specifically include the following structural modules:
  • the obtaining module 71 can be specifically used to obtain image data of text to be processed
  • the extracting module 72 may be specifically used for extracting a combined image from the image data, wherein the combined image is a graph including vertical morphological lines and horizontal morphological lines;
  • the segmentation module 73 may be specifically used to segment the combined image into multiple rectangular units, where the multiple rectangular units each carry position coordinates;
  • the recognition module 74 may be specifically configured to perform optical character recognition on the plurality of rectangular units respectively and determine the text information contained in the plurality of rectangular units respectively;
  • the combining module 75 can be specifically used to combine rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
  • the extraction module 72 may specifically include the following structural units:
  • the first search unit may specifically be used to search for and obtain morphological vertical lines and morphological horizontal lines in the image data
  • the connecting unit may specifically be used to connect the morphological vertical line and the morphological horizontal line to obtain the combined graph.
  • the apparatus may further specifically include a detection module, configured to detect whether the combination graph meets a preset table format requirement.
  • the detection module may specifically include the following structural units:
  • the obtaining unit may be specifically configured to obtain the coordinates of the intersection point in the combined graph, where the intersection point may specifically be a pixel point at a position where the morphological vertical line and the morphological horizontal line intersect in the combined map;
  • the second search unit may specifically be used to search for and obtain a rectangular frame in the combination diagram
  • the first determining unit may specifically be used to determine the coordinates of the end point of the rectangular frame according to the coordinates of the intersection in the combined graph;
  • the second determining unit may be specifically configured to determine whether the combined image meets the preset table format requirements according to the endpoint coordinates of the rectangular frame.
  • the second determining unit may be specifically configured to calculate the area of the rectangular frame according to the coordinates of the endpoints of the rectangular frame; and detect whether the area of the rectangular frame is greater than a preset area threshold.
  • the segmentation module 73 may specifically include the following structural units:
  • the third determining unit can be specifically used to determine the dividing line according to the coordinates of the end points of the rectangular frame
  • the dividing unit may specifically be used to divide the combined image into a plurality of rectangular units according to the dividing line, and generate position coordinates of the rectangular unit corresponding to the rectangular frame according to the end point coordinates of the rectangular frame.
  • In some embodiments, the apparatus may further include a preprocessing module for preprocessing the image data of the text to be processed, where the preprocessing may specifically include: converting the image data into a grayscale image; and/or performing Gaussian smoothing on the image data, etc.
  • the image data of the text to be processed may specifically include: a scanned image or a photograph containing a contract to be processed.
  • the image data of the to-be-processed text listed above is only for better explaining the embodiments of the present specification.
  • the image data of the text to be processed may also include image data of other types and contents, for example, a video screenshot containing a manual to be processed, etc. This specification is not limited.
  • the units, devices, or modules explained in the foregoing embodiments may be specifically implemented by computer chips or entities, or implemented by products with certain functions.
  • the functions are divided into various modules and described separately.
  • the functions of each module may be implemented in one or more software and/or hardware, or the modules that implement the same function may be implemented by a combination of multiple submodules or subunits.
  • the device embodiments described above are only schematic.
  • The division of units is only a division of logical functions. In actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical, or other forms.
  • In the table data acquisition device provided by the embodiments of this specification, the extraction module obtains the combined graph according to the morphological vertical and horizontal lines in the image data; the segmentation module divides the combined graph into multiple rectangular units; the recognition module performs optical character recognition on each rectangular unit to obtain the text information it contains; and the combination module recombines the rectangular units containing text information according to their position coordinates to restore the complete table data. This solves the technical problem of large, inaccurate errors in extracting table data in existing methods, achieving efficient, accurate recognition and complete restoration of the table content in the image data. The device also detects, according to the intersection points, rectangular frames, and other graphic factors contained in the combined graph, whether the extracted combined graph is really tabular data in the text, thereby avoiding misidentifying non-table data as a table, reducing errors, and improving the accuracy of obtaining table data.
  • The method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component, or even as both software modules implementing the method and structures within the hardware component.
  • program modules include routines, programs, objects, components, data structures, classes, etc. that perform specific tasks or implement specific abstract data types.
  • This specification can also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network.
  • program modules may be located in local and remote computer storage media including storage devices.


Abstract

A method, apparatus and server for acquiring table data. The method includes: acquiring image data of a text to be processed; extracting a combined graph from the image data, the combined graph being a figure containing intersecting morphological vertical lines and morphological horizontal lines; splitting the combined graph into multiple rectangular units; performing optical character recognition on the rectangular units separately to determine the text information of each; and combining the rectangular units containing text information according to their position coordinates to obtain the table data. A combined graph is first obtained according to graphical features such as the morphological vertical and horizontal lines in the image data; the combined graph is then split into multiple rectangular units which are recognized separately by optical character recognition to obtain their text information, and the units are recombined according to their position coordinates to restore the table data, thereby solving the technical problem of large errors and inaccuracy in existing table-extraction methods.

Description

Method, apparatus and server for acquiring table data. Technical field
This specification belongs to the field of Internet technology, and in particular relates to a method, an apparatus and a server for acquiring table data.
Background
Daily life and work often involve a class of text data (for example, contract documents) that, besides individual text characters (for example, plain character symbols), also contains table data (for example, a statistical list of prices). In some scenarios this table data has high information value and carries the content people care most about.
Usually, data acquisition methods directly perform optical character recognition on image data, such as scanned pictures, containing the text data, so as to recognize and extract the text information in the image data and obtain electronic file data of the corresponding text.
Such data acquisition methods work relatively well when recognizing and extracting individual text characters from image data. Table data, however, differs from individual text characters: besides the text information carried by its characters, it also has certain graphical features, for example divider lines and divider frames. Compared with individual characters, tables have a more complex structure and are harder to recognize, so existing acquisition methods easily make errors when recognizing table data in image data: for example, a divider bar in a table may be mis-recognized as a digit, or the recognized characters of row N, column M may be misaligned. A method that can accurately recognize and completely restore the table data in image data is therefore urgently needed.
Summary of the invention
The purpose of this specification is to provide a method, an apparatus and a server for acquiring table data, so as to solve the technical problem of large errors and inaccuracy in existing table-extraction methods and to achieve efficient, accurate recognition that completely restores the table content in image data.
The method, apparatus and server for acquiring table data provided by this specification are implemented as follows:
A method for acquiring table data, including: acquiring image data of a text to be processed; extracting a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; splitting the combined graph into multiple rectangular units, where the multiple rectangular units each carry position coordinates; performing optical character recognition on the multiple rectangular units separately to determine the text information each of them contains; and combining the rectangular units containing text information according to their position coordinates to obtain table data.
An apparatus for acquiring table data, including: an acquisition module for acquiring image data of a text to be processed; an extraction module for extracting a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; a splitting module for splitting the combined graph into multiple rectangular units, each carrying position coordinates; a recognition module for performing optical character recognition on each rectangular unit separately to determine the text information each contains; and a combination module for combining the rectangular units containing text information according to their position coordinates to obtain table data.
A server, including a processor and a memory storing processor-executable instructions, where the processor, when executing the instructions, implements: acquiring image data of a text to be processed; extracting a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; splitting the combined graph into multiple rectangular units, each carrying position coordinates; performing optical character recognition on each rectangular unit separately to determine the text information each contains; and combining the rectangular units containing text information according to their position coordinates to obtain table data.
A computer-readable storage medium storing computer instructions which, when executed, implement: acquiring image data of a text to be processed; extracting a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; splitting the combined graph into multiple rectangular units, each carrying position coordinates; performing optical character recognition on each rectangular unit separately to determine the text information each contains; and combining the rectangular units containing text information according to their position coordinates to obtain table data.
With the method, apparatus and server for acquiring table data provided by this specification, a combined graph is first extracted according to graphical features such as the morphological vertical and horizontal lines in the image data; the combined graph is then split into multiple rectangular units, and optical character recognition is performed on each unit separately to obtain the text information it contains; the rectangular units containing text information are then combined according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy in existing table-extraction methods, achieving efficient, accurate recognition that completely restores the table content in the image data.
Brief description of the drawings
To explain the technical solutions in the embodiments of this specification more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some of the embodiments recorded in this specification; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an embodiment of the table-data acquisition method provided by the embodiments of this specification, applied in an example scenario;
Fig. 2 is a schematic diagram of an embodiment of the table-data acquisition method provided by the embodiments of this specification, applied in an example scenario;
Fig. 3 is a schematic diagram of an embodiment of the table-data acquisition method provided by the embodiments of this specification, applied in an example scenario;
Fig. 4 is a schematic diagram of an embodiment of the table-data acquisition method provided by the embodiments of this specification, applied in an example scenario;
Fig. 5 is a schematic flow diagram of an embodiment of the table-data acquisition method provided by the embodiments of this specification;
Fig. 6 is a schematic structural diagram of an embodiment of the server provided by the embodiments of this specification;
Fig. 7 is a schematic structural diagram of an embodiment of the table-data acquisition apparatus provided by the embodiments of this specification.
Detailed description
To help those skilled in the art better understand the technical solutions in this specification, these solutions are described below clearly and completely with reference to the drawings of the embodiments. The described embodiments are obviously only a part, not all, of the embodiments of this specification. All other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort shall fall within the scope of protection of this specification.
Existing data acquisition methods are mostly designed to recognize individual text characters in image data containing the text to be processed, and therefore have good accuracy when recognizing and extracting the text information represented by such characters. Some types of text data, however, such as contract texts, also contain tables. Tables are structurally more complex than individual characters: besides text characters, they also have certain graphical features, for example graphical-morphology structures, which makes their recognition, extraction and reconstruction more complex and difficult. When existing acquisition methods recognize and extract such table data directly, they easily confuse text characters with graphical features and cannot precisely distinguish and process the two, leading to errors: for example, graphical structures such as divider bars may be mis-recognized as text characters, or the text information extracted from different positions of the table may be misaligned. In short, existing acquisition methods often perform poorly on the table data in image data, with large extraction errors and poor accuracy.
Addressing the root cause of the above problems, this specification analyzes the different recognition characteristics of the two kinds of objects a table simultaneously contains: text characters and graphical structures. It first acquires image-structure features such as morphological vertical lines and morphological horizontal lines to find, in the image data, combined graphs that may form table data; it then splits such a combined graph into multiple rectangular units, performing optical character recognition on each unit separately to obtain its text information; finally, the rectangular units containing text information are combined according to their position coordinates to restore and reconstruct the complete table data of the image. This solves the technical problem of large errors and inaccuracy in existing table-extraction methods, achieving efficient, accurate recognition that completely restores the table content in the image data.
An implementation of this specification provides a method for acquiring table data, which can be applied in an image-data processing system containing multiple servers, for example a system for processing scanned pictures of legal contracts.
The system may include a server responsible for recognizing and acquiring, from the image data, the table data within the text data. In a specific implementation, this server can detect graphical structural features such as morphological vertical and horizontal lines to extract a combined graph from the acquired image data of the text to be processed; split the combined graph into multiple rectangular units according to coordinates; perform optical character recognition on each of the rectangular units separately so as to recognize and determine the text information each contains; and then combine and stitch the rectangular units containing text information according to their coordinates to obtain the complete table data.
In this implementation, the server can be understood as a business server on the business-system side capable of data transmission, data processing and similar functions. Specifically, the server may be an electronic device with data computation, storage and network-interaction functions, or a software program running in such a device that supports data processing, storage and network interaction. The number of servers is not limited here: it may be one server, several servers, or a server cluster formed of several servers.
In an example scenario, referring to Fig. 1, the table-data acquisition method provided by the embodiments of this specification can be applied to process image data containing contracts received by a legal-affairs platform, so as to extract the table data in the contracts.
In this scenario, the legal-affairs platform can dispatch the image data containing the contract to be processed, as input by the user, to the server in the platform responsible for acquiring table data.
The legal-affairs platform is used to recognize and extract the text information in the user-uploaded image data containing contracts (for example, scanned pictures or photos of contracts), convert the contract content into electronic file data, and save it in the platform's database for convenient retrieval and management.
After receiving the image data containing the contract, the server may, referring to Fig. 2, first preprocess the image to reduce error interference and improve the precision of the subsequent recognition and acquisition of table data.
Specifically, the server may be configured with OpenCV (Open Source Computer Vision Library). OpenCV can be understood as an API function library of computer-vision source code; the function code it contains has been optimized, so calling and computing with it is relatively efficient. In a specific implementation, the server can call the corresponding function code through OpenCV to process the image data efficiently.
Specifically, the server may first convert the image data to a grayscale image, then apply Gaussian smoothing to the grayscale image to filter out its more obvious noise and improve the precision of the image data, thereby completing the preprocessing. Note that converting the image data to a grayscale image is only an illustrative example of preprocessing: in a specific implementation, depending on the scenario and precision requirements, the image data may instead first be converted to a binary image, with the subsequent table-data acquisition based on that binary image. This specification does not limit this.
After preprocessing the image data containing the contract, the server may first scan and search, based on morphology, the graphical structural features in the image data (for example, structural elements), so as to find, among the image data, figures that differ from individual text characters, have certain graphical features, and may form a table: combined graphs.
In a specific implementation, take a particular frame of the image data as an example, for instance the fifth page of the image data containing the contract. The server can scan and search that frame for morphological vertical lines and morphological horizontal lines.
A morphological vertical line or morphological horizontal line can be understood as a graph-related structural element distinct from text characters; see Fig. 3. A morphological vertical line may specifically be an image unit or structural element of the image containing a straight-line segment in the vertical direction; a morphological horizontal line may specifically be one containing a straight-line segment in the horizontal direction.
Specifically, the server may call the getStructuringElement function to search the structural elements in the image and find all the morphological vertical and horizontal lines. Note that obtaining the morphological vertical and horizontal lines by calling getStructuringElement is only an illustrative description; in a specific implementation, the morphological vertical and horizontal lines in the image may also be obtained in other suitable ways as appropriate. This specification does not limit this.
Considering that in table data each morphological horizontal line mostly intersects one or more of the morphological vertical lines, after obtaining the morphological vertical and horizontal lines of the frame the server can further search out figures containing intersecting morphological vertical and horizontal lines as combined graphs that may form table data. This avoids subsequent processing of graphical structures that obviously lack the graphical features of table data and improves processing efficiency.
In this scenario, to prevent the recognized and extracted morphological horizontal and vertical lines from becoming misaligned, the extraction may be performed directly on the original image, with the extracted morphological horizontal and vertical lines overlaid at the positions they were extracted from.
Having obtained a combined graph that has fairly obvious graphical features of a data table and may form table data, the combined graph can be further tested: by checking whether it satisfies preset table-format requirements, whether it is a data table can be judged more precisely.
The preset table-format requirements can be understood as a rule set describing the graphical features that distinguish a data table from other graphical structures.
For example, unlike other figures, every cell figure of a data table (also called a rectangular frame; see Fig. 3) is designed to be filled with specific characters, that is, the minimum area of each cell figure in a data table should at least accommodate one complete character. A rule on area can therefore be set: the minimum area of a cell figure in a data table should be greater than a preset area threshold. Considering people's usual typesetting habit of centering table data when editing it, a rule on position can also be set: the absolute value of the difference between the distance from the table's left boundary to the image's left boundary and the distance from the table's right boundary to the image's right boundary should be smaller than a preset distance threshold. Further considering that the purpose of a table is usually to list at least two or more pieces of data for comparison, so that differences between them are shown more clearly, a rule on the number of cells can also be set: the number of cell figures in a data table should be greater than or equal to a preset count threshold (for example, 2).
Of course, the specific rules listed above for the preset table-format requirements are only intended to better illustrate the implementations of this specification. In a specific implementation, rules of other types or content may be introduced as the preset table-format requirements according to the specific application scenario and processing requirements. This specification does not limit this.
In this scenario, to determine whether the extracted combined graph satisfies the preset table-format requirements, the server may, in a specific implementation, first search the combined graph for points where a morphological horizontal line and a morphological vertical line occupy the same image position, treat these as intersection points, and then determine the position coordinates of each intersection point of the combined graph within the frame.
An intersection point can be understood as a pixel of the frame at a position where a morphological vertical line and a morphological horizontal line of the combined graph cross; see Fig. 3.
Specifically, the server may call the OpenCV bitwise_and function to search for and obtain the coordinates of the intersection points of the combined graph in the image. Again, obtaining the intersection coordinates through bitwise_and is only an illustrative description; in a specific implementation, the server may obtain the coordinates of the intersection points of the combined graph in other suitable ways as appropriate. This specification does not limit this.
Meanwhile, the server can further search the combined graph for graphical structural elements, finding graph elements with a rectangular (or square) structure (each corresponding to one cell of the table) as the rectangular frames of the combined graph; see Fig. 3.
Specifically, the server may call the findContours function to search for and obtain the rectangular frames in the combined graph. Again, obtaining the rectangular frames of the combined graph through findContours is only an illustrative description; in a specific implementation, the server may obtain them in other suitable ways as appropriate. This specification does not limit this.
Further, from the determined intersection coordinates and the rectangular frames of the combined graph, the server can determine, by position comparison, the endpoint coordinates of the four endpoints of each rectangular frame, and then judge from these endpoint coordinates whether the combined graph satisfies the preset table-format requirements.
For example, the server may compute a rectangular frame's length and width from its endpoint coordinates, and from these compute the frame's area, then compare the area with the preset area threshold. If the area of every rectangular frame in the combined graph is greater than the preset area threshold, the combined graph can be judged to satisfy the preset table-format requirements.
As another example, the server may compare the abscissa values of the endpoint coordinates of the rectangular frames in the combined graph, take the endpoint with the smallest abscissa as an endpoint on the left boundary of the combined graph and its abscissa as the abscissa of the left boundary, and from this compute the distance between the combined graph's left boundary and the image's left boundary, denoted d1. Similarly, by comparing the abscissas of the endpoints, the server finds the endpoint with the largest abscissa as an endpoint on the right boundary, takes its abscissa as the abscissa of the right boundary, and from this computes the distance between the combined graph's right boundary and the image's right boundary, denoted d2. The server can then compute the absolute value of the difference between d1 and d2 and compare it with the preset distance threshold. If this absolute difference is less than or equal to the preset distance threshold, the combined graph as a whole can be judged to be centered in the image, that is, to satisfy the preset table-format requirements.
Of course, the above ways of judging whether the combined graph satisfies the preset table-format requirements are only intended to better illustrate the implementations of this specification. In a specific implementation, the two judgment methods may be combined, or other suitable judgment methods introduced, according to the specific situation and precision requirements. This specification does not limit this.
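The two example checks can be sketched together in a few lines. The threshold values, the function name and the simplification of taking the left/right boundaries directly from the cell boxes are assumptions of this sketch:

```python
# Illustrative thresholds, not values from the disclosure.
MIN_AREA = 100    # smallest cell area that could hold one character
MAX_OFFSET = 20   # allowed left/right margin imbalance, in pixels

def passes_format_checks(boxes, image_width):
    """boxes: list of (x, y, w, h) cell rectangles of one candidate table."""
    if len(boxes) < 2:                        # a table compares >= 2 cells
        return False
    if any(w * h <= MIN_AREA for _, _, w, h in boxes):
        return False                          # some cell too small for text
    left = min(x for x, _, _, _ in boxes)     # d1: gap to left image edge
    right_edge = max(x + w for x, _, w, _ in boxes)
    d1, d2 = left, image_width - right_edge   # d2: gap to right image edge
    return abs(d1 - d2) <= MAX_OFFSET         # roughly centered
```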
After determining that the combined graph conforms to the preset table format, the server can conclude that the currently extracted combined graph is indeed a data table in the image, and can proceed to extract its text information.
Considering that the combined graph usually contains multiple cell figures or rectangular frames, recognizing and extracting the text information of the whole combined graph directly is prone to problems such as misalignment. The server may therefore first split the combined graph into multiple rectangular units. Each rectangular unit corresponds one-to-one to a rectangular frame of the combined graph; unlike a rectangular frame, which is a purely graphical structural element, each rectangular unit contains text characters or a blank state. Optical character recognition can then be performed on each rectangular unit separately, so that the text characters in it are recognized accurately and the text information each rectangular unit contains is determined.
Specifically, the server may first determine, from the endpoint coordinates of a rectangular frame, the contour line enclosing the frame as the splitting line, and then cut along the contour line to split from the combined graph the rectangular unit corresponding to that frame. For example, referring to Fig. 4, suppose the four endpoints of a rectangular frame in the combined graph are A(15, 60), B(15, 40), C(30, 40) and D(30, 60). The server may start from endpoint A and, following a preset division rule, keep the abscissa 15 unchanged and find the endpoint with a different ordinate, namely endpoint B, then connect A and B. From B, it keeps the ordinate 40 unchanged and finds the endpoint with a different abscissa, namely C, connecting B and C. From C, it keeps the abscissa 30 unchanged and finds the endpoint with a different ordinate, namely D, connecting C and D. Finally, from D, it keeps the ordinate 60 unchanged and finds the endpoint with a different abscissa, namely A, connecting D and A. This yields a closed connecting line A to B to C to D to A, the contour line of the rectangular frame. Using this contour line as the splitting line, the server can cut along it to separate, from the combined graph, the rectangular frame containing text information, obtaining the corresponding rectangular unit.
Each rectangular unit of the combined graph can be split out in the above manner. Of course, the splitting method listed above is only intended to better illustrate the implementations of this specification. In a specific implementation, other suitable ways of splitting the combined graph into multiple rectangular units may be used as appropriate. This specification does not limit this.
Note that, in the course of splitting the combined graph, the server also generates the position coordinates corresponding to the rectangular units from the endpoint coordinates of the rectangular frames.
The position coordinates can be understood as parameter data indicating the position of a rectangular unit in the image of the combined graph, or describing the positional relation between a rectangular unit and the other, adjacent rectangular units in that image.
Specifically, the server may compute the coordinates of a frame's center point from the endpoint coordinates of its four endpoints and use them as the position coordinates of the corresponding rectangular unit. Alternatively, the server may first compute the center coordinates of each rectangular frame and then, in a preset arrangement order, for example top-to-bottom then left-to-right, determine from these center coordinates the row number and column number of each rectangular unit as its position coordinates. For example, if the center coordinates of frame A place it in the first row, second column of the combined graph, its row number is 1 and its column number is 2, so "1-2" can be used as the position coordinates of the rectangular unit corresponding to frame A. These ways of determining the position coordinates of a rectangular unit are, again, only illustrative; in a specific implementation, other suitable ways may be used as appropriate. This specification does not limit this.
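The row/column numbering scheme described above might be sketched as follows. The function name, the row tolerance and the grouping-by-center-y approach are assumptions of this sketch:

```python
def index_cells(boxes, row_tol=5):
    """Map (x, y, w, h) cell boxes to (row, col) numbers, top-to-bottom
    then left-to-right, grouping boxes whose center y values are within
    row_tol pixels into one row."""
    centers = [(x + w / 2.0, y + h / 2.0, (x, y, w, h))
               for (x, y, w, h) in boxes]
    centers.sort(key=lambda c: c[1])          # sort by center y first
    rows, current, last_y = [], [], None
    for cx, cy, box in centers:
        if last_y is not None and cy - last_y > row_tol:
            rows.append(current)              # start a new row
            current = []
        current.append((cx, box))
        last_y = cy
    rows.append(current)
    table = {}
    for r, row in enumerate(rows, start=1):
        for c, (_, box) in enumerate(sorted(row), start=1):
            table[(r, c)] = box               # "r-c" position coordinate
    return table
```

For a 2x2 grid of boxes, `index_cells` assigns (1, 1) to the top-left box and (2, 2) to the bottom-right one.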
After splitting the combined graph into the multiple corresponding rectangular units, the server can perform optical character recognition (OCR) on each of the rectangular units separately to determine the text characters in each unit and thus the text information each contains. If no text characters are recognized in a rectangular unit, the text information of that unit is set to empty. This yields multiple rectangular units, each containing its corresponding text information.
Further, the server can combine and stitch the resulting rectangular units containing text information according to their position coordinates. For example, according to the position coordinates "1-2", the rectangular unit containing that text information is placed at the position of row 1, column 2. Placing the multiple rectangular units containing text information at their corresponding positions one by one in this way restores the complete data table. Of course, the combination method listed above is only illustrative; in a specific implementation, other combination methods based on other types of position coordinates may also be used. This specification does not limit this.
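The reassembly step can be sketched as below; `cell_text` stands in for the per-cell OCR results (the OCR call itself, for example through an engine such as Tesseract, is outside this sketch), and the function name is an assumption:

```python
def assemble(cell_text):
    """cell_text: dict mapping (row, col) -> recognized string ('' if the
    cell was blank). Returns the restored table as a list of rows."""
    n_rows = max(r for r, _ in cell_text)
    n_cols = max(c for _, c in cell_text)
    return [[cell_text.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]
```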
In the above manner, the server can detect table data in each image of the image data containing the contract to be processed, acquire the table data where it is determined to exist, and thus extract the complete table data from the image data, feeding the extracted table data back to the legal-affairs platform so that electronic file data for the contract can be compiled and saved.
In another example scenario, to make the table lines in the acquired table data clearer and thereby improve the precision of the subsequent optical character recognition of the text information, the server may, after obtaining the morphological vertical and horizontal lines of the frame by scanning and searching, additionally apply feature-enhancement processing to the obtained morphological vertical and horizontal lines so that they become clearer.
The feature-enhancement processing may specifically be a morphological processing, and may include erosion and/or dilation. In a specific implementation, based on morphological processing, the region of the convolution kernel is slid over the frame so that the value of the pixel at the center of the region is reset (to 0 or 1). Specifically, erosion may be performed first, followed by dilation.
Erosion can be understood as an AND operation: according to the size of the convolution kernel, pixels near the foreground are eroded (the values of the corresponding pixels are reset to 0), shrinking the foreground objects. This reduces the white regions around the morphological vertical or horizontal lines, achieving the effect of removing white noise; it can also disconnect structural elements that are adjacent to, or even connected with, those lines.
Since erosion causes the structural elements of the image to shrink, after the erosion is finished, dilation can be applied to the eroded morphological vertical or horizontal lines.
Dilation can be understood as an OR operation, the opposite of erosion: dilation enlarges and restores the eroded image, yielding relatively clear morphological vertical and horizontal lines of unchanged size.
As the above example scenarios show, the table-data acquisition method provided by this specification extracts a combined graph according to graphical features such as the morphological vertical and horizontal lines in the image data; splits the combined graph into multiple rectangular units, performing optical character recognition on each unit separately to obtain the text information it contains; and then combines the rectangular units containing text information according to their position coordinates to restore the complete table data. This solves the technical problem of large errors and inaccuracy in existing table-extraction methods, achieving efficient, accurate recognition that completely restores the table content in the image data.
Referring to Fig. 5, the embodiments of this specification also provide a method for acquiring table data, applied on the server side. In a specific implementation, the method may include the following.
S51: acquire image data of a text to be processed.
In this embodiment, the text to be processed may be a contract to be processed, a charter to be processed, a specification to be processed, and so on. Correspondingly, its image data may be a scanned picture containing the text content, a photo containing it, a video containing it, and so on. This specification does not limit the specific content and form of the image data of the text to be processed.
S53: extract a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines.
In this embodiment, a morphological vertical or horizontal line can be understood as a graph-related structural element distinct from text characters. A morphological vertical line may specifically be an image unit or structural element containing a straight-line segment in the vertical direction; a morphological horizontal line may specifically be one containing a straight-line segment in the horizontal direction.
In this embodiment, the combined graph can be understood as a combined figure in the image data with graphical features similar to table data, for example one that likewise contains the graphical structural elements of intersecting morphological vertical and horizontal lines.
In this embodiment, extracting the combined graph from the image data may, in a specific implementation, include: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
In this embodiment, searching for and obtaining the morphological vertical and horizontal lines in the image data may include: calling the getStructuringElement function in OpenCV to search the structural elements in the image and find the morphological vertical and horizontal lines in the image data. This is only illustrative; in a specific implementation, the morphological vertical and horizontal lines may also be obtained in other suitable ways as appropriate. This specification does not limit this.
In this embodiment, the morphological vertical and horizontal lines obtained in this way also carry position information within the image data, so the corresponding morphological vertical and horizontal lines can be connected according to this position information to obtain the combined graph.
S55: split the combined graph into multiple rectangular units, where the multiple rectangular units each carry position coordinates.
In this embodiment, a rectangular unit can be understood as an image unit corresponding one-to-one to a rectangular frame of the combined graph but distinct from it, containing text information (for example, filled with text characters or left empty).
In this embodiment, a rectangular frame can be understood as a rectangular or square graph element composed of two morphological vertical segments and two morphological horizontal segments and containing only graphical features. Each rectangular frame can be regarded as one cell of the table.
In this embodiment, splitting the combined graph into multiple rectangular units may, in a specific implementation, include: obtaining the intersection coordinates in the combined graph; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the intersection coordinates in the combined graph; and splitting the combined graph into multiple rectangular units according to the endpoint coordinates of the rectangular frames.
In this embodiment, an intersection point can be understood as a pixel at a position where a morphological vertical line and a morphological horizontal line of the combined graph cross.
In this embodiment, in a specific implementation, the intersection coordinates of the combined graph in the image can be obtained by calling the OpenCV bitwise_and function. This is only illustrative; the server may also obtain the coordinates of the intersection points in other suitable ways as appropriate. This specification does not limit this.
In this embodiment, in a specific implementation, the rectangular frames of the combined graph can be searched for and obtained by calling the findContours function in OpenCV. This too is only illustrative; the server may also obtain the rectangular frames in other suitable ways as appropriate. This specification does not limit this.
In this embodiment, OpenCV (Open Source Computer Vision Library) can be understood as an API function library of computer-vision source code whose function code has been optimized, so calling and computing with it is relatively efficient. In a specific implementation, the server can call the corresponding function code through OpenCV to process the image data efficiently.
In this embodiment, splitting the combined graph into multiple rectangular units according to the endpoint coordinates of the rectangular frames may include: determining the endpoint coordinates of the rectangular frames according to the intersection coordinates in the combined graph; determining splitting lines according to the endpoint coordinates of the rectangular frames; and splitting the combined graph into multiple rectangular units according to the splitting lines.
In this embodiment, determining the endpoint coordinates of the rectangular frames according to the intersection coordinates in the combined graph may include: comparing the positions of the intersection coordinates in the combined graph with the rectangular frames, so as to determine, among the intersection points, the four endpoints of each rectangular frame and hence the endpoint coordinates of each frame.
In this embodiment, determining the splitting lines according to the endpoint coordinates of the rectangular frames may include: determining, from the coordinates of the four endpoints of each rectangular frame, the contour line enclosing the frame as the corresponding splitting line. The combined graph can subsequently be cut along these splitting lines to obtain the rectangular units.
In this embodiment, while splitting the combined graph into multiple rectangular units, the method further includes: generating the position coordinates of the rectangular units according to the endpoint coordinates of the rectangular frames.
In this embodiment, the position coordinates of a rectangular unit can be understood as parameter data indicating the position of the rectangular unit in the image of the combined graph, or describing the positional relation between the rectangular unit and other adjacent rectangular units in that image.
In this embodiment, in a specific implementation, the coordinates of a frame's center point can be computed from the endpoint coordinates of its four endpoints and used as the position coordinates of the corresponding rectangular unit. Alternatively, the center coordinates of each rectangular frame can be computed first, and the rectangular units arranged in a preset order, for example top-to-bottom then left-to-right, according to these center coordinates, with the row and column numbers of the sorted units used as their position coordinates. These ways of determining the position coordinates of the rectangular units are only illustrative; in a specific implementation, other suitable ways may be used as appropriate. This specification does not limit this.
S57: perform optical character recognition on the multiple rectangular units separately to determine the text information each of them contains.
In this embodiment, in a specific implementation, separate optical character recognition can be performed on each of the multiple rectangular units so that the text characters in each unit are recognized and the text information each unit contains is determined.
In this embodiment, in a specific implementation, when no text characters are recognized in a rectangular unit, the text information of that unit can be set to empty.
S59: combine the rectangular units containing text information according to their position coordinates to obtain table data.
In this embodiment, in a specific implementation, rectangular units containing text information whose position coordinates are adjacent can be stitched together, and each unit containing text information placed at the position indicated by its coordinates, thereby combining them into the complete table data.
In this embodiment, since the combined graph is extracted according to graphical features such as the morphological vertical and horizontal lines in the image data, then split into multiple rectangular units on which optical character recognition is performed separately to obtain the text information each contains, and the units containing text information are then combined according to their position coordinates to restore the complete table data, the technical problem of large errors and inaccuracy in existing table-extraction methods is solved, achieving efficient, accurate recognition that completely restores the table content in the image data.
In one embodiment, to reduce noise interference and improve the precision of table-data acquisition, after the image data of the text to be processed is acquired, the method may further include: preprocessing the image data of the text to be processed, where the preprocessing includes converting the image data to a grayscale image and/or applying Gaussian smoothing to the image data to filter out noise interference. Of course, the preprocessing methods listed above are only intended to better illustrate the implementations of this specification; in a specific implementation, other suitable processing may be used according to the situation and precision requirements. This specification does not limit this.
In one embodiment, extracting the combined graph from the image data may, in a specific implementation, include: searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data; and connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
In one embodiment, searching for and obtaining the morphological vertical and horizontal lines in the image data may, in a specific implementation, include: searching for and obtaining them through the getStructuringElement function.
In one embodiment, to keep the obtained morphological vertical and horizontal lines clear and reduce the error they introduce into subsequent text-information recognition, after the morphological vertical and horizontal lines in the image data are searched for and obtained, the method may further include: applying feature-enhancement processing to the obtained morphological vertical and horizontal lines, where the feature enhancement includes at least one of erosion and dilation.
In this embodiment, in a specific implementation, erosion may be applied to the morphological vertical and horizontal lines first, followed by dilation of the eroded lines.
In this embodiment, erosion eliminates the white noise produced by the foreground of the morphological vertical and horizontal lines, making them clearer, but it also shrinks their graph elements. Therefore, after eroding the morphological vertical and horizontal lines, dilation can restore them to clearer lines of unchanged size.
In one embodiment, considering that the combined graph merely has graphical features similar to table data and may not actually be table data (for example, a large text character such as "田" also has graphical features similar to table data), the extracted combined graph can be tested to determine whether it satisfies the preset table-format requirements, so that whether the combined graph is genuine table data is judged more precisely. Subsequent data processing can then be performed only on combined graphs determined to be table data, reducing wasted resources and improving processing efficiency.
In one embodiment, after the combined graph is extracted from the image data, the method may, in a specific implementation, further include: obtaining the intersection coordinates in the combined graph, where an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line of the combined graph cross; searching for and obtaining the rectangular frames in the combined graph; determining the endpoint coordinates of the rectangular frames according to the intersection coordinates in the combined graph; and determining, according to the endpoint coordinates of the rectangular frames, whether the combined graph satisfies the preset table-format requirements.
In this embodiment, in a specific implementation, the intersection coordinates of the combined graph in the image can be obtained by calling the OpenCV bitwise_and function. This is only illustrative; the server may also obtain the coordinates of the intersection points in other suitable ways as appropriate. This specification does not limit this.
In this embodiment, in a specific implementation, the rectangular frames of the combined graph can be searched for and obtained by calling the findContours function. This too is only illustrative; the server may also obtain the rectangular frames in other suitable ways as appropriate. This specification does not limit this.
In this embodiment, the preset table-format requirements can be understood as a rule set describing the graphical features that distinguish a data table from other graphical structures.
In a specific implementation, the specific rules contained in the preset table-format requirements can be set flexibly as appropriate. For example, unlike other figures, every cell figure (rectangular frame) of a data table is designed to be filled with specific characters, that is, the minimum area of each cell figure should at least accommodate one complete character; a rule on area can therefore be set: the minimum area of a cell figure in a data table should be greater than a preset area threshold. Given the usual typesetting habit of centering table data when editing it, a rule on position can also be set: the absolute value of the difference between the distance from the table's left boundary to the image's left boundary and the distance from the table's right boundary to the image's right boundary should be smaller than a preset distance threshold. Given that tables are usually used to list at least two or more pieces of data for comparison, so that differences between them are shown more clearly, a rule on the number of cells can also be set: the number of cell figures in a data table should be greater than or equal to a preset count threshold (for example, 2).
Of course, the specific rules listed above for the preset table-format requirements are only intended to better illustrate the implementations of this specification. In a specific implementation, rules of other types or content may be introduced as the preset table-format requirements according to the specific application scenario and processing requirements. This specification does not limit this.
In one embodiment, determining, according to the endpoint coordinates of the rectangular frames, whether the combined graph satisfies the preset table-format requirements may, in a specific implementation, include: computing the area of a rectangular frame according to its endpoint coordinates; and detecting whether the area of the rectangular frame is greater than the preset area threshold. If it is, the combined graph is judged to satisfy the preset table-format requirements.
In one embodiment, this determination may, in a specific implementation, also include: determining, from the endpoint coordinates of the rectangular frames in the combined graph, the abscissas of the combined graph's left and right boundaries respectively; computing, from the left-boundary abscissa, the distance between the combined graph's left boundary and the left boundary of the image data, denoted the first distance; computing, from the right-boundary abscissa, the distance between the combined graph's right boundary and the right boundary of the image data, denoted the second distance; computing the absolute value of the difference between the first and second distances, comparing it with the preset distance threshold, and detecting whether this absolute difference is smaller than the preset distance threshold. If it is, the combined graph is judged to satisfy the preset table-format requirements.
Of course, the above ways of judging whether the combined graph satisfies the preset table-format requirements are only intended to better illustrate the implementations of this specification. In a specific implementation, the two judgment methods may be combined, or other suitable judgment methods introduced, according to the specific situation and precision requirements. This specification does not limit this.
In one embodiment, splitting the combined graph into multiple rectangular units may, in a specific implementation, include: determining splitting lines according to the endpoint coordinates of the rectangular frames; splitting the combined graph into multiple rectangular units according to the splitting lines; and generating, according to the endpoint coordinates of the rectangular frames, the position coordinates of the rectangular units corresponding to the frames.
In one embodiment, the image data of the text to be processed may specifically include a scanned image or photo containing a contract to be processed, and the like. Of course, the examples of image data listed above are only intended to better illustrate the implementations of this specification. In a specific implementation, depending on the application scenario and processing requirements, the image data of the text to be processed may also include image data of other types and content, for example video screenshots containing a specification to be processed. This specification does not limit this.
As can be seen from the above, the table-data acquisition method provided by the embodiments of this specification extracts a combined graph according to graphical features such as the morphological vertical and horizontal lines in the image data; splits the combined graph into multiple rectangular units, performing optical character recognition on each unit separately to obtain the text information it contains; and combines the units containing text information according to their position coordinates to restore the complete table data, solving the technical problem of large errors and inaccuracy in existing table-extraction methods and achieving efficient, accurate recognition that completely restores the table content in the image data. Moreover, after the combined graph is extracted, graphical factors such as the intersection points and rectangular frames it contains are used to check whether the extracted combined graph really is table data in the text, preventing non-table data from being mistakenly recognized as a table, reducing errors, and improving the precision of the acquired table data.
The embodiments of this specification also provide a server, including a processor and a memory for storing processor-executable instructions. In a specific implementation, the processor may perform the following steps according to the instructions: acquiring image data of a text to be processed; extracting a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; splitting the combined graph into multiple rectangular units, each carrying position coordinates; performing optical character recognition on each rectangular unit separately to determine the text information each contains; and combining the rectangular units containing text information according to their position coordinates to obtain table data.
To execute the above instructions more accurately, referring to Fig. 6, this specification also provides another specific server, which includes a network communication port 601, a processor 602 and a memory 603, connected by internal cables so that the components can exchange data.
The network communication port 601 may specifically be used to input the image data of the text to be processed;
The processor 602 may specifically be used to extract a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; split the combined graph into multiple rectangular units, each carrying position coordinates; perform optical character recognition on each rectangular unit separately to determine the text information each contains; and combine the rectangular units containing text information according to their position coordinates to obtain table data.
The memory 603 may specifically be used to store the image data of the text to be processed input via the network communication port 601, as well as the corresponding instruction programs on which the processor 602 operates.
In this implementation, the network communication port 601 may be a virtual port bound to different communication protocols for sending or receiving different data, for example port 80 for web data communication, port 21 for FTP data communication, or port 25 for mail data communication. The network communication port may also be a physical communication interface or communication chip, for example a wireless mobile-network communication chip such as GSM or CDMA, a Wi-Fi chip, or a Bluetooth chip.
In this implementation, the processor 602 may be implemented in any suitable way, for example as a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by that (micro)processor, as logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on. This specification is not limited in this respect.
In this implementation, the memory 603 may comprise multiple levels. In a digital system, anything that can store binary data may be a memory; in an integrated circuit, a circuit with a storage function but no physical form is also called a memory, such as RAM or a FIFO; in a system, a storage device with a physical form is also called a memory, such as a memory module or a TF card.
The embodiments of this specification also provide a computer storage medium based on the above table-data acquisition method, the computer storage medium storing computer program instructions which, when executed, implement: acquiring image data of a text to be processed; extracting a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines; splitting the combined graph into multiple rectangular units, each carrying position coordinates; performing optical character recognition on each rectangular unit separately to determine the text information each contains; and combining the rectangular units containing text information according to their position coordinates to obtain table data.
In this implementation, the storage medium includes, but is not limited to, random access memory (RAM), read-only memory (ROM), cache, a hard disk drive (HDD) or a memory card. The memory may be used to store the computer program instructions. The network communication unit may be an interface, set according to standards specified by communication protocols, for network-connection communication.
In this implementation, the specific functions and effects implemented by the program instructions stored in this computer storage medium can be explained by comparison with other implementations and are not repeated here.
Referring to Fig. 7, at the software level the embodiments of this specification also provide an apparatus for acquiring table data, which may specifically include the following structural modules:
an acquisition module 71, which may specifically be used to acquire image data of a text to be processed;
an extraction module 72, which may specifically be used to extract a combined graph from the image data, where the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines;
a splitting module 73, which may specifically be used to split the combined graph into multiple rectangular units, where the multiple rectangular units each carry position coordinates;
a recognition module 74, which may specifically be used to perform optical character recognition on the multiple rectangular units separately to determine the text information each of them contains;
a combination module 75, which may specifically be used to combine the rectangular units containing text information according to their position coordinates to obtain table data.
In one embodiment, the extraction module 72 may specifically include the following structural units:
a first search unit, which may specifically be used to search for and obtain the morphological vertical lines and morphological horizontal lines in the image data;
a connection unit, which may specifically be used to connect the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
In one embodiment, the apparatus may further include a detection module for detecting whether the combined graph satisfies the preset table-format requirements. The detection module may specifically include the following structural units:
an acquisition unit, which may specifically be used to obtain the intersection coordinates in the combined graph, where an intersection point may specifically be a pixel at a position where a morphological vertical line and a morphological horizontal line of the combined graph cross;
a second search unit, which may specifically be used to search for and obtain the rectangular frames in the combined graph;
a first determination unit, which may specifically be used to determine the endpoint coordinates of the rectangular frames according to the intersection coordinates in the combined graph;
a second determination unit, which may specifically be used to determine, according to the endpoint coordinates of the rectangular frames, whether the combined graph satisfies the preset table-format requirements.
In one embodiment, the second determination unit may specifically be used to compute the area of a rectangular frame according to its endpoint coordinates, and to detect whether the area of the rectangular frame is greater than the preset area threshold.
In one embodiment, the splitting module 73 may specifically include the following structural units:
a third determination unit, which may specifically be used to determine splitting lines according to the endpoint coordinates of the rectangular frames;
a splitting unit, which may specifically be used to split the combined graph into multiple rectangular units according to the splitting lines, and to generate, according to the endpoint coordinates of the rectangular frames, the position coordinates of the rectangular units corresponding to the frames.
In one embodiment, the apparatus may further include a preprocessing module for preprocessing the image data of the text to be processed, where the preprocessing may specifically include: converting the image data to a grayscale image; and/or applying Gaussian smoothing to the image data; and so on.
In one embodiment, the image data of the text to be processed may specifically include a scanned image or photo containing a contract to be processed, and the like. Of course, the examples of image data listed above are only intended to better illustrate the implementations of this specification. In a specific implementation, depending on the application scenario and processing requirements, the image data of the text to be processed may also include image data of other types and content, for example video screenshots containing a specification to be processed. This specification does not limit this.
Note that the units, apparatuses and modules described in the above embodiments may specifically be implemented by computer chips or entities, or by products with certain functions. For convenience of description, the above apparatus is described with its functions divided into modules. Of course, when implementing this specification, the functions of the modules may be realized in one or more pieces of software and/or hardware, and a module realizing one function may be implemented by a combination of multiple sub-modules or sub-units. The apparatus embodiments described above are only schematic; for example, the division into units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of other forms.
As can be seen from the above, with the table-data acquisition apparatus provided by the embodiments of this specification, the extraction module extracts a combined graph according to graphical features such as the morphological vertical and horizontal lines in the image data; the splitting module and the recognition module split the combined graph into multiple rectangular units and perform optical character recognition on each unit separately to obtain the text information it contains; and the combination module combines the units containing text information according to their position coordinates to restore the complete table data, solving the technical problem of large errors and inaccuracy in existing table-extraction methods and achieving efficient, accurate recognition that completely restores the table content in the image data. Moreover, after the combined graph is extracted, the detection module checks, according to graphical factors such as the intersection points and rectangular frames contained in the combined graph, whether the extracted combined graph really is table data in the text, preventing non-table data from being mistakenly recognized as a table, reducing errors, and improving the precision of the acquired table data.
Although this specification provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-creative means. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only one. When an actual apparatus or client product executes, the steps may be executed in the order of the embodiments or drawings, or in parallel (for example, in a parallel-processor or multi-threaded environment, or even a distributed data-processing environment). The terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, product or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product or device. Without further limitation, the presence of additional identical or equivalent elements in the process, method, product or device comprising the stated elements is not excluded. Words such as "first" and "second" denote names and do not denote any particular order.
Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the devices it contains for realizing various functions can also be regarded as structures within the hardware component. Or even, the devices for realizing various functions can be regarded as both software modules implementing the method and structures within the hardware component.
This specification can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes and the like that perform specific tasks or implement specific abstract data types. This specification can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.
From the description of the above implementations, those skilled in the art can clearly understand that this specification can be implemented by software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solution of this specification, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, etc.) to execute the methods described in the embodiments of this specification or in certain parts of the embodiments.
The embodiments in this specification are described progressively; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. This specification can be used in numerous general-purpose or special-purpose computer-system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Although this specification has been depicted through embodiments, those of ordinary skill in the art know that it has many variations and changes that do not depart from its spirit, and it is hoped that the appended claims cover such variations and changes without departing from the spirit of this specification.

Claims (16)

  1. A method for acquiring table data, comprising:
    acquiring image data of a text to be processed;
    extracting a combined graph from the image data, wherein the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines;
    splitting the combined graph into multiple rectangular units, wherein the multiple rectangular units each carry position coordinates;
    performing optical character recognition on the multiple rectangular units separately to determine the text information each of them contains;
    combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
  2. The method according to claim 1, wherein extracting the combined graph from the image data comprises:
    searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data;
    connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
  3. The method according to claim 1, wherein, after extracting the combined graph from the image data, the method further comprises:
    obtaining the coordinates of the intersection points in the combined graph, wherein an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line of the combined graph intersect;
    searching for and obtaining the rectangular frames in the combined graph;
    determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph;
    determining, according to the endpoint coordinates of the rectangular frames, whether the combined graph satisfies preset table-format requirements.
  4. The method according to claim 3, wherein determining, according to the endpoint coordinates of the rectangular frames, whether the combined graph satisfies the preset table-format requirements comprises:
    computing the area of a rectangular frame according to the endpoint coordinates of the rectangular frame;
    detecting whether the area of the rectangular frame is greater than a preset area threshold.
  5. The method according to claim 3, wherein splitting the combined graph into multiple rectangular units comprises:
    determining splitting lines according to the endpoint coordinates of the rectangular frames;
    splitting the combined graph into multiple rectangular units according to the splitting lines, and generating, according to the endpoint coordinates of the rectangular frames, the position coordinates of the rectangular units corresponding to the rectangular frames.
  6. The method according to claim 1, wherein, after acquiring the image data of the text to be processed, the method further comprises:
    preprocessing the image data of the text to be processed, wherein the preprocessing comprises: converting the image data into a grayscale image; and/or applying Gaussian smoothing to the image data.
  7. The method according to claim 1, wherein the image data of the text to be processed comprises: a scanned image or photo containing a contract to be processed.
  8. An apparatus for acquiring table data, comprising:
    an acquisition module for acquiring image data of a text to be processed;
    an extraction module for extracting a combined graph from the image data, wherein the combined graph is a figure containing intersecting morphological vertical lines and morphological horizontal lines;
    a splitting module for splitting the combined graph into multiple rectangular units, wherein the multiple rectangular units each carry position coordinates;
    a recognition module for performing optical character recognition on the multiple rectangular units separately to determine the text information each of them contains;
    a combination module for combining the rectangular units containing text information according to the position coordinates of the rectangular units to obtain table data.
  9. The apparatus according to claim 8, wherein the extraction module comprises:
    a first search unit for searching for and obtaining the morphological vertical lines and morphological horizontal lines in the image data;
    a connection unit for connecting the morphological vertical lines and the morphological horizontal lines to obtain the combined graph.
  10. The apparatus according to claim 8, further comprising a detection module, the detection module comprising:
    an acquisition unit for obtaining the coordinates of the intersection points in the combined graph, wherein an intersection point is a pixel at a position where a morphological vertical line and a morphological horizontal line of the combined graph intersect;
    a second search unit for searching for and obtaining the rectangular frames in the combined graph;
    a first determination unit for determining the endpoint coordinates of the rectangular frames according to the coordinates of the intersection points in the combined graph;
    a second determination unit for determining, according to the endpoint coordinates of the rectangular frames, whether the combined graph satisfies preset table-format requirements.
  11. The apparatus according to claim 10, wherein the second determination unit is specifically configured to compute the area of a rectangular frame according to the endpoint coordinates of the rectangular frame, and to detect whether the area of the rectangular frame is greater than a preset area threshold.
  12. The apparatus according to claim 10, wherein the splitting module comprises:
    a third determination unit for determining splitting lines according to the endpoint coordinates of the rectangular frames;
    a splitting unit for splitting the combined graph into multiple rectangular units according to the splitting lines, and generating, according to the endpoint coordinates of the rectangular frames, the position coordinates of the rectangular units corresponding to the rectangular frames.
  13. The apparatus according to claim 8, further comprising a preprocessing module for preprocessing the image data of the text to be processed, wherein the preprocessing comprises: converting the image data into a grayscale image; and/or applying Gaussian smoothing to the image data.
  14. The apparatus according to claim 8, wherein the image data of the text to be processed comprises: a scanned image or photo containing a contract to be processed.
  15. A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the steps of the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium storing computer instructions which, when executed, implement the steps of the method according to any one of claims 1 to 7.
PCT/CN2019/124101 2019-01-04 2019-12-09 Method, apparatus and server for acquiring table data WO2020140698A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910006706.1 2019-01-04
CN201910006706.1A CN110008809B (zh) 2019-01-04 2019-01-04 Method, apparatus and server for acquiring table data

Publications (1)

Publication Number Publication Date
WO2020140698A1 true WO2020140698A1 (zh) 2020-07-09

Family

ID=67165348

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124101 WO2020140698A1 (zh) 2019-01-04 2019-12-09 Method, apparatus and server for acquiring table data

Country Status (2)

Country Link
CN (1) CN110008809B (zh)
WO (1) WO2020140698A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881883A * 2020-08-10 2020-11-03 晶璞(上海)人工智能科技有限公司 A table-document extraction method based on convolutional feature extraction and morphological processing
CN112364834A * 2020-12-07 2021-02-12 上海叠念信息科技有限公司 A table-recognition restoration method based on deep learning and image processing
CN112712014A * 2020-12-29 2021-04-27 平安健康保险股份有限公司 Table-picture structure parsing method, system, device and readable storage medium
CN114926852A * 2022-03-17 2022-08-19 支付宝(杭州)信息技术有限公司 Table recognition and reconstruction method, apparatus, device, medium and program product

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008809B (zh) * 2019-01-04 2020-08-25 阿里巴巴集团控股有限公司 Method, apparatus and server for acquiring table data
CN110675384B (zh) * 2019-09-24 2022-06-07 广东博智林机器人有限公司 Image processing method and apparatus
CN111126409B (zh) * 2019-12-26 2023-08-18 南京巨鲨显示科技有限公司 A medical-image region recognition method and system
CN111160234B (zh) * 2019-12-27 2020-12-08 掌阅科技股份有限公司 Table recognition method, electronic device and computer storage medium
CN111027521B (zh) * 2019-12-30 2023-12-29 上海智臻智能网络科技股份有限公司 Text processing method and system, data processing device and storage medium
CN111325110B (zh) * 2020-01-22 2024-04-05 平安科技(深圳)有限公司 OCR-based table-layout restoration method, apparatus and storage medium
CN113343740B (zh) * 2020-03-02 2022-05-06 阿里巴巴集团控股有限公司 Table detection method, apparatus, device and storage medium
CN111460774B (zh) * 2020-04-02 2023-06-30 北京易优联科技有限公司 Method and apparatus for restoring data in a curve, storage medium, electronic device
CN111640130A (zh) * 2020-05-29 2020-09-08 深圳壹账通智能科技有限公司 Table restoration method and apparatus
CN111757182B (zh) * 2020-07-08 2022-05-31 深圳创维-Rgb电子有限公司 Image artifact-screen detection method, device, computer device and readable storage medium
CN111985506A (zh) * 2020-08-21 2020-11-24 广东电网有限责任公司清远供电局 A chart-information extraction method, apparatus and storage medium
CN112200117B (zh) * 2020-10-22 2023-10-13 长城计算机软件与***有限公司 Table recognition method and apparatus
CN112733855B (zh) * 2020-12-30 2024-04-09 科大讯飞股份有限公司 Table structuring method, table restoration device and apparatus with storage function
CN112861736B (zh) * 2021-02-10 2022-08-09 上海大学 Document-table content recognition and information-extraction method based on image processing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130016381A1 (en) * 2011-07-12 2013-01-17 Fuji Xerox Co., Ltd. Image processing apparatus, non-transitory computer readable medium storing program and image processing method
CN104462044A * 2014-12-16 2015-03-25 上海合合信息科技发展有限公司 Table-image recognition and editing method and apparatus
CN105426856A * 2015-11-25 2016-03-23 成都数联铭品科技有限公司 A method for recognizing text in image tables
CN108132916A * 2017-11-30 2018-06-08 厦门市美亚柏科信息股份有限公司 Method for parsing PDF table data, and storage medium
CN110008809A * 2019-01-04 2019-07-12 阿里巴巴集团控股有限公司 Method, apparatus and server for acquiring table data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6996295B2 (en) * 2002-01-10 2006-02-07 Siemens Corporate Research, Inc. Automatic document reading system for technical drawings
CN107622230B (zh) * 2017-08-30 2019-12-06 中国科学院软件研究所 A PDF table-data parsing method based on region recognition and segmentation
CN107943857A (zh) * 2017-11-07 2018-04-20 中船黄埔文冲船舶有限公司 Method, apparatus, terminal device and storage medium for automatically reading AutoCAD tables
CN109086714B (zh) * 2018-07-31 2020-12-04 国科赛思(北京)科技有限公司 Table recognition method, recognition system and computer apparatus


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881883A * 2020-08-10 2020-11-03 晶璞(上海)人工智能科技有限公司 A table-document extraction method based on convolutional feature extraction and morphological processing
CN112364834A * 2020-12-07 2021-02-12 上海叠念信息科技有限公司 A table-recognition restoration method based on deep learning and image processing
CN112712014A * 2020-12-29 2021-04-27 平安健康保险股份有限公司 Table-picture structure parsing method, system, device and readable storage medium
CN112712014B (zh) * 2020-12-29 2024-04-30 平安健康保险股份有限公司 Table-picture structure parsing method, system, device and readable storage medium
CN114926852A * 2022-03-17 2022-08-19 支付宝(杭州)信息技术有限公司 Table recognition and reconstruction method, apparatus, device, medium and program product

Also Published As

Publication number Publication date
CN110008809A (zh) 2019-07-12
CN110008809B (zh) 2020-08-25

Similar Documents

Publication Publication Date Title
WO2020140698A1 (zh) Method, apparatus and server for acquiring table data
US20210256253A1 (en) Method and apparatus of image-to-document conversion based on ocr, device, and readable storage medium
WO2019119966A1 (zh) 文字图像处理方法、装置、设备及存储介质
CN110942074B (zh) 字符切分识别方法、装置、电子设备、存储介质
CN109753953B (zh) 图像中定位文本的方法、装置、电子设备和存储介质
CN109410215A (zh) 图像处理方法、装置、电子设备及计算机可读介质
CN105469027A (zh) 针对文档图像的水平和垂直线检测和移除
CN109948521B (zh) 图像纠偏方法和装置、设备及存储介质
US20180082456A1 (en) Image viewpoint transformation apparatus and method
US20190266431A1 (en) Method, apparatus, and computer-readable medium for processing an image with horizontal and vertical text
CN113642584A (zh) 文字识别方法、装置、设备、存储介质和智能词典笔
CN114359932B (zh) 文本检测方法、文本识别方法及装置
CN115719356A (zh) 图像处理方法、装置、设备和介质
CN116844177A (zh) 一种表格识别方法、装置、设备及存储介质
CN112651953B (zh) 图片相似度计算方法、装置、计算机设备及存储介质
CN108304840B (zh) 一种图像数据处理方法以及装置
CN113486881A (zh) 一种文本识别方法、装置、设备及介质
CN112507938A (zh) 一种文本图元的几何特征计算方法及识别方法、装置
CN115620321B (zh) 表格识别方法及装置、电子设备和存储介质
CN114120305B (zh) 文本分类模型的训练方法、文本内容的识别方法及装置
US11570331B2 (en) Image processing apparatus, image processing method, and storage medium
JP2012003358A (ja) 背景判別装置、方法及びプログラム
CN115019321A (zh) 一种文本识别、模型训练方法、装置、设备及存储介质
CN114511862A (zh) 表格识别方法、装置及电子设备
CN114140805A (zh) 图像处理方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19907609

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19907609

Country of ref document: EP

Kind code of ref document: A1