CN113496119A - Method, electronic device and computer readable medium for extracting tuple data in table - Google Patents

Method, electronic device and computer readable medium for extracting tuple data in table

Info

Publication number
CN113496119A
CN113496119A
Authority
CN
China
Prior art keywords
attribute
type
group
tuple data
cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010199516.9A
Other languages
Chinese (zh)
Other versions
CN113496119B (en)
Inventor
林得苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pai Tech Co ltd
Original Assignee
Pai Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pai Tech Co ltd
Priority to CN202010199516.9A
Publication of CN113496119A
Application granted
Publication of CN113496119B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure disclose methods, electronic devices, and computer-readable media for extracting tuple data in a table. One embodiment of the method comprises: acquiring a table and a predetermined attribute list; determining first tuple data corresponding to the attribute list based on the table and the attribute list; determining the type of the extension mechanism of the table; determining new tuple data according to the table, the type of the extension mechanism, and the first tuple data; and determining, based on the first tuple data and the new tuple data, the tuple data in the table corresponding to the attribute list as the result tuple data. The method automatically extracts tuple data from a table given a predetermined attribute list, without manual intervention; it is applicable to various table types, generalizes well, and makes it convenient for users to extract tuple data in tables.

Description

Method, electronic device and computer readable medium for extracting tuple data in table
Technical Field
The disclosed embodiments relate to the field of information extraction, and in particular to a method, an electronic device, and a computer-readable medium for extracting tuple data in a table.
Background
Information extraction (information extraction) may generally refer to the extraction of specific event or fact information from a source document. Information extraction techniques may be used for automatic classification, extraction, reconstruction, etc. of content. Extracting data from a table generally means analyzing the content of the table by using an information extraction technique to extract specific content or information from the table.
Disclosure of Invention
The embodiment of the disclosure provides a method for extracting tuple data in a table.
In a first aspect, an embodiment of the present disclosure provides a method for extracting tuple data in a table, the method including: acquiring a table and a predetermined attribute list; determining first tuple data corresponding to the attribute list based on the table and the attribute list, wherein the first tuple data consists of the first attribute value in the table corresponding to each attribute name in the attribute list; determining the type of the extension mechanism of the table, wherein the extension mechanism is a rule for obtaining, based on the first tuple data, the tuple data in the table corresponding to each attribute name in the attribute list; determining new tuple data according to the table, the type of the extension mechanism, and the first tuple data; and determining, based on the first tuple data and the new tuple data, the tuple data in the table corresponding to the attribute list as the result tuple data.
In some embodiments, the method further comprises: in response to the result tuple data corresponding to the target tuple data of the attribute list, sending the target tuple data to a device supporting a display function and to a storage device, controlling the device to display the target tuple data, and controlling the storage device to store the target tuple data.
In some embodiments, determining the type of the extension mechanism of the table includes: determining the table type of the table, wherein table types comprise a non-extensible group type and an extensible group type; in response to the table type being the non-extensible group type, determining the type of the extension mechanism to be the non-extensible group mechanism type; and in response to the table type being the extensible group type, determining the type of the extension mechanism to be the extensible group mechanism type.
In some embodiments, the extensible group type refers to a table type in which cells may span multiple rows or columns and in which certain cell contents are of the time type; specifically, Table 3 below is a table of the extensible group type.
In some embodiments, determining, based on the table and the attribute list, the first tuple data corresponding to the attribute list includes: inputting the table into a pre-trained first neural network to generate a table code; for each attribute name in the attribute list, inputting the attribute name into the first neural network to generate an attribute code; for each attribute name in the attribute list, determining the cell in the table corresponding to the attribute name based on the table code and the attribute code of the attribute name; and determining the first tuple data corresponding to the attribute list according to the cells in the table corresponding to the attribute names in the attribute list.
In some embodiments, for each attribute name in the attribute list, determining the cell in the table corresponding to the attribute name based on the table code and the attribute code of the attribute name includes: generating a first tensor feature based on the table code and the attribute code; inputting the first tensor feature into a pre-trained second neural network to obtain the likelihood that each cell in the table corresponds to the attribute name; and determining, based on the determined likelihoods, the cell corresponding to the attribute name from the table.
In some embodiments, generating the first tensor feature based on the table code and the attribute code comprises: obtaining a tensor representation based on the table code and the attribute code; and adding position information of the table to the tensor representation to generate the first tensor feature, wherein the position information of the table comprises position information in the horizontal direction and position information in the vertical direction.
In some embodiments, generating new tuple data from the table, the type of the extension mechanism, and the first tuple data comprises: in response to the type of the extension mechanism being the extensible group mechanism type, executing an intra-group extension operation and an inter-group extension operation in the table based on the first tuple data to generate new tuple data; and in response to the type of the extension mechanism being the non-extensible group mechanism type, executing an intra-group extension operation in the table based on the first tuple data to generate new tuple data.
In a second aspect, an embodiment of the present disclosure provides a terminal device, where the terminal device includes: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the first aspect.
In a third aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
The embodiment of the disclosure provides a method for extracting tuple data in a table, which includes: obtaining the table and a predetermined attribute list; determining first tuple data corresponding to the attribute list based on the table and the attribute list; determining the type of the extension mechanism of the table; determining new tuple data according to the table, the type of the extension mechanism, and the first tuple data; and determining, based on the first tuple data and the new tuple data, the tuple data in the table corresponding to the attribute list as the result tuple data.
One of the above-described embodiments of the present disclosure has the following advantageous effects: based on the table and the predetermined attribute list, the cell holding the first target value corresponding to each attribute name in the predetermined attribute list, and thus the first tuple data corresponding to the predetermined attribute list, can be predicted automatically; the prediction works on tables of any form without manually defined rule templates. Tables are classified according to their structure, and different types of extension mechanisms are used to determine new tuple data for the different table types. Based on the first tuple data and the new tuple data, the result tuple data is determined. According to the embodiments of the present disclosure, tuple data can be extracted automatically from a table and a predetermined attribute list without manual intervention; the approach is applicable to various table types, generalizes well, and makes it convenient for users to extract tuple data in tables.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an architectural diagram of an exemplary system in which some embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram of some embodiments of a method of extracting tuple data in a table according to the present disclosure;
FIG. 3 is a flow diagram for one embodiment of a first training step for training a first neural network, in accordance with the present disclosure;
FIG. 4 is a flow diagram of one embodiment of a method of determining the cell in a table corresponding to an attribute name according to the present disclosure;
FIG. 5 is a flow diagram for one embodiment of a method for generating new tuple data according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It is noted that references to "a," "an," and "the" in this disclosure are illustrative rather than limiting; those skilled in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which an embodiment of the disclosed method of extracting tuple data in a table may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as an information extraction application, a data analysis application, a natural language processing application, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various terminal devices having a display screen, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the above-listed terminal apparatuses. It may be implemented as multiple software or software modules (e.g., to provide form entry, text entry, etc.), or as a single software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a server that extracts tuple data from a form input by the terminal apparatuses 101, 102, 103 and outputs the resultant tuple data, a server that processes a form input by the terminal apparatuses 101, 102, 103 and outputs tuple data, or the like. The server may perform processing such as tuple data extraction on the received table, and feed back a processing result (e.g., result tuple data) to the terminal device.
It should be noted that the method for extracting tuple data in the table provided by the embodiment of the present disclosure may be executed by the server 105 or by the terminal device.
It should be noted that the server 105 may also store tables locally, and may directly extract tuple data from a local table to obtain the result tuple data; in this case, the exemplary system architecture 100 may not include the terminal devices 101, 102, 103 and the network 104.
It should be noted that the terminal devices 101, 102, and 103 may also be installed with a tuple data extraction application, and in this case, the method of extracting tuple data in the table may also be executed by the terminal devices 101, 102, and 103. At this point, the exemplary system architecture 100 may also not include the server 105 and the network 104.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (for example, for providing a meta data extraction service), or may be implemented as a single software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of some embodiments of a method of extracting tuple data in a table in accordance with the present disclosure is shown. The method for extracting tuple data in a table comprises the following steps:
Step 201, a table and a predetermined attribute list are obtained.
In some embodiments, an execution body of the method of extracting tuple data in a table (e.g., the terminal device shown in fig. 1) may obtain the table. The table is two-dimensional structured data composed of row headers, column headers, and cells. The coordinates of each cell in the table consist of coordinates in two dimensions, the vertical direction and the horizontal direction; the coordinates represent the row number and the column number of the cell, and the row number and the column number may be as small as 0. Adjacent cells may be merged into one cell, and the coordinate of the merged cell is the coordinate of its top-left minimum cell. A cell formed by merging multiple columns of cells is called a column-merged cell, and a cell formed by merging multiple rows of cells is called a row-merged cell. The content of a cell may be of various types, including but not limited to: numeric, time, and text. The contents of the row headers and the column headers may be of the text type.
In some embodiments, the execution body may obtain a predetermined attribute list. The attribute list is a set of human-defined attributes, and each component in the attribute list is an attribute name. The attribute list corresponds to the table: the table may contain cell contents corresponding to an attribute name, that is, the attribute name has a hidden semantic correspondence with one or more cells in the table. The table may also contain no cell contents corresponding to an attribute name, that is, the attribute name has no hidden semantic correspondence with any cell in the table.
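As an illustration only, the table and attribute list described above could be represented as follows. This is a minimal sketch in Python; the class and field names are assumptions introduced here for exposition and are not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    row: int        # vertical coordinate, 0-based; a merged cell keeps its top-left coordinate
    col: int        # horizontal coordinate, 0-based
    row_span: int   # > 1 for a row-merged cell
    col_span: int   # > 1 for a column-merged cell
    content: str    # numeric, time, or text content, stored here as text

@dataclass
class Table:
    h: int          # number of rows
    w: int          # number of columns
    cells: list[Cell]

# A predetermined attribute list is simply a set of human-defined attribute names;
# some names may have no corresponding cell in a given table.
attribute_list = ["serial number", "project name", "total investment amount",
                  "use amount of raised funds", "construction period"]
```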
Specifically, as shown in table 1 below, the cell coordinate in which the attribute name "serial number" is located is (0,0), the cell coordinate in which the attribute name "item name" is located is (0,1), and the cell coordinate in which the attribute name "total" is located is (4, 0). For the attribute name "examination and approval document number", the corresponding cell does not exist in the following table, so that the coordinate of the cell where the attribute name "examination and approval document number" is located is null.
Table 1
(Table 1 is rendered as an image in the original publication.)
Step 202, based on the table and the attribute list, the first tuple data corresponding to the attribute list is determined.
In some embodiments, the first tuple data is composed of the first attribute value in the table corresponding to each attribute name in the attribute list. Optionally, in Table 1, the first tuple data corresponding to the attribute list "serial number, project name, total investment amount, use amount of raised funds, construction period" is "1, new generation memory interface chip research-and-development and industrialization project, 101785.00, 101785.00, 3 years".
The execution body may input the table into the pre-trained first neural network. For a table with h cells per column and w cells per row, a matrix of h × w dimensions is input into the pre-trained first neural network. Optionally, cell contents may be in a variety of formats, including but not limited to one of: numerical content, textual content, and so on. Each element in the h × w matrix is the content of one cell in the table. For the content of each cell, the first neural network outputs a 768-dimensional vector, so the table code output by the first neural network for the table is a tensor of h × w × 768 dimensions. Optionally, a tensor is a generalization of the concept of a vector; a vector is a first-order tensor, and a tensor may be viewed as a set of quantities defined relative to the coordinates of a three-dimensional Cartesian coordinate system. Optionally, the table is encoded as a tensor of h × w × 768 dimensions, where h and w are the number of rows and columns of the table, respectively.
Optionally, the execution body may input each attribute name in the attribute list into the first neural network. The attribute name is content in text format and is input into the first neural network as a 1 × 1 item. For each attribute name, the first neural network outputs a 768-dimensional vector, which is expanded into a tensor of h × w × 768 dimensions, where h and w are the number of rows and columns of the table, respectively. The expansion copies the element values of the vector along the row and column directions. The resulting h × w × 768-dimensional tensor is the attribute code of the attribute name.
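The table encoding and the attribute-code expansion described above can be sketched as follows. This is an illustrative assumption: `first_nn` and its `encode_table`/`encode_text` methods are hypothetical stand-ins for the pre-trained first neural network, not an API defined in the disclosure; only the shapes follow the text.

```python
import numpy as np

def build_codes(table_cells, attribute_name, first_nn, h, w):
    # Table code: one 768-dimensional vector per cell -> (h, w, 768) tensor.
    table_code = first_nn.encode_table(table_cells)   # hypothetical call, shape (h, w, 768)

    # Attribute code: a single 768-dimensional vector for the attribute name ...
    attr_vec = first_nn.encode_text(attribute_name)   # hypothetical call, shape (768,)

    # ... expanded to (h, w, 768) by copying the element values along the
    # row and column directions, as described above.
    attr_code = np.tile(attr_vec, (h, w, 1))
    return table_code, attr_code
```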
In some optional implementations of some embodiments, for each attribute name in the attribute list, a tensor of h × w × 1536 dimensions is obtained based on the table code and the attribute code, where h and w are the number of rows and columns of the table, respectively. Optionally, the tensor may be obtained by concatenation, or by feeding the concatenation through several fully connected layers, in which case the tensor's dimensions are determined by those layers. The position information of the table is added to this tensor to generate the first tensor feature of h × w × 1538 dimensions. The position information of the table comprises the position information of each cell in the horizontal direction and in the vertical direction. Specifically, for the cell at coordinates (i, j), where i is the row coordinate and j is the column coordinate, the position information in the horizontal direction is (i/w) × 2 − 1 and the position information in the vertical direction is (j/h) × 2 − 1, where h and w are the number of rows and columns of the table, respectively; the position values lie between −1 and 1. The horizontal position information of all cells forms the h × w × 1-dimensional horizontal position information of the table, and the vertical position information of all cells forms the h × w × 1-dimensional vertical position information of the table. The h × w × 1-dimensional horizontal position information and the h × w × 1-dimensional vertical position information are concatenated to the h × w × 1536-dimensional tensor along the third dimension, generating the first tensor feature of h × w × 1538 dimensions.
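Assembling the first tensor feature from these pieces can be sketched as below, following the position formulas quoted above verbatim (including the use of the row coordinate i for the horizontal channel); numpy is used purely for illustration.

```python
import numpy as np

def first_tensor_feature(table_code, attr_code, h, w):
    # Concatenate table code and attribute code along the channel axis:
    # (h, w, 768) + (h, w, 768) -> (h, w, 1536).
    feat = np.concatenate([table_code, attr_code], axis=-1)

    # Position channels: for the cell at (i, j), horizontal position
    # (i / w) * 2 - 1 and vertical position (j / h) * 2 - 1, both in [-1, 1].
    i = np.arange(h).reshape(h, 1, 1).repeat(w, axis=1)   # row coordinates
    j = np.arange(w).reshape(1, w, 1).repeat(h, axis=0)   # column coordinates
    pos_h = (i / w) * 2 - 1
    pos_v = (j / h) * 2 - 1

    # (h, w, 1536) + (h, w, 1) + (h, w, 1) -> (h, w, 1538).
    return np.concatenate([feat, pos_h, pos_v], axis=-1)
```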
The first neural network may be obtained by pre-training through a first training step.
Referring to fig. 3, fig. 3 illustrates a flow 300 of one embodiment of a first training step to pre-train a first neural network according to the present disclosure. The first training step may comprise the steps of:
step 301, a training sample set is obtained, wherein the training sample includes a sample table and a pre-obtained sample table code.
In some embodiments, the execution subject of the first training step may be the same as or different from the execution subject of the method of extracting tuple data in a table (e.g., the terminal device shown in fig. 1). If the two are the same, the execution subject of the first training step can store the network structure information and the parameter values of the trained first neural network locally after training. If the two are different, the execution subject of the first training step may send the network structure information and the parameter values of the trained first neural network to the execution subject of the method of extracting tuple data in a table after training.
In some embodiments, the executing agent of the first training step may obtain the training sample set locally or remotely from other terminal devices networked with the executing agent. The training samples in the training sample set comprise sample tables and sample table codes obtained in advance.
In some optional implementations of some embodiments, the sample table and the pre-obtained sample table code included in a training sample are obtained through the following steps. In the first step, the value corresponding to each attribute name in the attribute list of a sample table is labeled, generating a labeled sample table data set. Optionally, assume the table is T = {(i, j) | 0 ≤ i ≤ h − 1, 0 ≤ j ≤ w − 1}, where (i, j) is a cell coordinate, i is the row coordinate, j is the column coordinate, and h and w are the numbers of table rows and columns, respectively; T represents the table. If there are merged cells in the table, T contains only the merged cells. For an attribute name A in the attribute list of table T, the cell coordinate V of the first value corresponding to A is labeled; if no value corresponding to the attribute name exists in the table, the label is null. Optionally, the first value is the corresponding value that appears first when scanning the table from top to bottom and left to right, starting from the first row and first column. In this way one piece of label data (T, A, V) is obtained, where T denotes the table, A the attribute name, and V the cell coordinate of the first value corresponding to A.
In the second step, optionally, the labeled data obtained for Table 1 is (T, project name, (1,1)), (T, total investment, (1,2)), (T, proposed investment of raised funds, (1,3)), (T, project filing number, null), (T, approval document number, null), (T, construction period, (1,4)). Labeling all tables yields the labeled data set D = {(T1, A1, V1), (T2, A2, V2), …, (Tn, An, Vn)}, where n is the number of labeled records and each record consists of a table T, an attribute name A, and the cell coordinate V of the first value corresponding to A. D denotes the labeled data set after annotation.
In the third step, optionally, the training sample includes a sample table code corresponding to each table in the labeled data set. Each table in the labeled data set is input into a feature extraction model, and the code extracted for the table is taken as the sample table code. Optionally, the feature extraction model may be a convolutional neural network, an elastic network, or the like.
The sample table from the labeled data set and its sample table code together form a training sample.
Step 302, determining a network structure of the first initial neural network and initializing network parameters of the first initial neural network.
In some embodiments, the performing agent of the first training step may first determine the network structure of the initial first neural network. For example, it is necessary to determine which layers the initial first neural network includes, the connection order relationship between layers, and which neurons each layer includes, the weight (weight) and bias term (bias) corresponding to each neuron, the activation function of each layer, and so on.
In some alternative implementations of some embodiments, the network structure of the first neural network may comprise two parts, an encoder and a decoder. The encoder part is a multi-layered neural network, each layer consisting of one or more two-dimensional planes, and each plane consisting of a plurality of individual neurons. The input of the encoder is summed directly through the embedding layer and the positional encoding layer, enters the multi-head attention layer, is processed by a residual connection layer and a batch normalization layer, enters the feed-forward layer, and, after processing by another residual connection layer and batch normalization layer, is passed to the decoder part. The decoder part receives the output of the encoder part; it first passes through a multi-head self-attention layer, is processed by a residual connection layer and a batch normalization layer, then passes through a multi-head contextual attention layer, is processed by a residual connection layer and a batch normalization layer, and finally passes through a feed-forward layer followed by a residual connection layer and a batch normalization layer, whose output is the output of the decoder part.
The execution subject of the first training step may then initialize the network parameters of the initial first neural network. In practice, the network parameters (e.g., weight parameters and bias parameters) of the first neural network may be initialized with different small random numbers. The small random numbers ensure that the network does not enter a saturated state due to excessive weights, which would cause training to fail, and the differing values ensure that the network can learn normally.
Step 303, using a machine learning method, using a sample table included in a training sample set as an input of the initial first neural network, using a pre-obtained sample table code corresponding to the input sample table as an expected output of the initial first neural network, and training to obtain the first neural network.
In some embodiments, the performing subject of the first training step may train the first neural network by using a machine learning method, using a sample table included in the training samples in the training sample set as an input of the first initial neural network, and encoding a pre-obtained sample table corresponding to the input sample table as an expected output of the first initial neural network.
Specifically, the difference between the obtained table code and the sample table code in the training sample may first be calculated using a preset loss function; optionally, the L2 norm may be used as the loss function to calculate this difference. Alternatively, the loss function may take the following form:
loss = −(∑_{i,j} t_{ij_cell} log y_{ij_cell} + ∑_i t_{i_row} log y_{i_row} + ∑_j t_{j_column} log y_{j_column})

where loss denotes the loss function, composed of three cross-entropy terms: the cell cross entropy −∑_{i,j} t_{ij_cell} log y_{ij_cell}, the row cross entropy −∑_i t_{i_row} log y_{i_row}, and the column cross entropy −∑_j t_{j_column} log y_{j_column}. Here t denotes the ground truth: t_{ij_cell} is the ground truth of a cell, t_{i_row} of a row, and t_{j_column} of a column; i indexes rows, j indexes columns, and ij indexes cells. y denotes the prediction: y_{ij_cell} is the cell prediction, y_{i_row} the row prediction, and y_{j_column} the column prediction.
The cells, rows, and columns in the table are labeled. The target cell, target row, and target column are represented by (1,0), and all other cells, rows, and columns by (0,1). Let the height and width of the table be h and w and the target cell coordinate be (ty, tx). The ground truth of the cells has shape (h, w, 2), where the value at position (ty, tx) is (1,0) and all other positions are (0,1). The ground truth of the rows has shape (h, 2), where position ty is (1,0) and the remaining positions are (0,1). The ground truth of the columns has shape (w, 2), where position tx is (1,0) and the remaining positions are (0,1). The predicted output has the same shape as the ground truth. Training proceeds by minimizing the difference between the predicted output and the ground truth, i.e., by minimizing the cross entropy.
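The ground-truth construction and the three-part cross entropy above can be illustrated with the following sketch; it is a minimal numpy rendering of the formula, and the function names are assumptions.

```python
import numpy as np

def make_ground_truth(h, w, ty, tx):
    # Cells: shape (h, w, 2); the target cell (ty, tx) is (1, 0), the rest (0, 1).
    cell = np.tile([0.0, 1.0], (h, w, 1)); cell[ty, tx] = [1.0, 0.0]
    # Rows: shape (h, 2); the target row ty is (1, 0), the rest (0, 1).
    row = np.tile([0.0, 1.0], (h, 1)); row[ty] = [1.0, 0.0]
    # Columns: shape (w, 2); the target column tx is (1, 0), the rest (0, 1).
    col = np.tile([0.0, 1.0], (w, 1)); col[tx] = [1.0, 0.0]
    return cell, row, col

def loss(y_cell, y_row, y_col, t_cell, t_row, t_col, eps=1e-9):
    # Sum of the cell, row, and column cross entropies from the formula above.
    return -(np.sum(t_cell * np.log(y_cell + eps))
             + np.sum(t_row * np.log(y_row + eps))
             + np.sum(t_col * np.log(y_col + eps)))
```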
Optionally, the network parameters of the initial first neural network may be adjusted based on the calculated difference, and the training is ended when a preset training end condition is satisfied. For example, the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds the preset time; the training times exceed the preset times; the calculated difference is less than a preset difference threshold.
Here, various implementations may be employed to adjust network parameters of the initial first neural network based on differences between the generated table code and the sample table codes in the training sample. For example, an Adam algorithm, a BP (Back Propagation) algorithm, or an SGD (Stochastic Gradient Descent) algorithm may be used to adjust the network parameters of the initial first neural network. Optionally, the Adam algorithm may be used to adjust the network parameters of the initial neural network, and independent adaptive learning rates may be designed for different parameters by calculating the first moment estimate and the second moment estimate of the gradient, so as to achieve iterative updating of the network parameters of the initial first neural network.
In some embodiments, the performing subject of the first training step determines the trained first neural network as a pre-trained first neural network.
The first training step provided by the above embodiments trains the first neural network using the sample tables and sample table codes as training samples, so that the trained first neural network learns how to generate the codes. Extracting the table code of the table and the code of each attribute name in the predetermined attribute list with the pre-trained first neural network model makes effective use of the context information in the table, improves the robustness of the model, and improves the coding quality.
In some embodiments, the execution body inputs the first tensor feature into the pre-trained second neural network and obtains, for each cell, a probability characterizing the likelihood that the cell corresponds to the attribute name. The probability is compared with a preset threshold, and for a cell whose comparison result is "may be the target cell", the coordinates of the cell are output and the cell corresponding to the attribute name is determined from the table.
The second neural network is obtained by training in advance through the following second training step: a training sample set is acquired, where each training sample comprises sample tensor features and pre-obtained probabilities; the network structure of the initial second neural network is determined and its network parameters are initialized; and, using a machine learning method, the second neural network is trained with the sample tensor features included in the training samples as the input of the initial second neural network and the pre-obtained probabilities corresponding to the input sample tensor features as its expected output.
Step 203, the type of the extension mechanism of the table is determined.
The extension mechanism is a rule for obtaining the tuple data in the table corresponding to each attribute name in the attribute list based on the first tuple data.
In some embodiments, the execution body determines the type of the extension mechanism of the table by performing the following steps.
First, the table type is determined. Table types comprise the non-extensible group type and the extensible group type, where the non-extensible group type is one of: relational, entity, matrix. The row headers of a relational table are in the first row; optionally, Table 1 is a relational table. The column headers of an entity-type table are in the first column; optionally, Table 2 is an entity-type table. The row and column headers of a matrix-type table are typically in the first two rows and the first two columns, respectively. The extensible group type refers to a table type in which cells may span multiple rows or columns. In a table of the extensible group type, cells at corresponding positions of different cell groups have the same content type, and the content type of the cells of each cell group located in the first two rows or columns of the table is usually the time type. A table of the extensible group type is a table containing expandable groups, where an expandable group is a combination of cell groups whose cell contents all share a type; optionally, the cell contents are all of the time type or all of the numeric type. Tables of the extensible group type come in two kinds: horizontally expandable groups and vertically expandable groups. A horizontally expandable group table comprises expandable groups whose relative positions are horizontal, with each expandable group occupying the same number of columns and the same number of rows. A vertically expandable group table comprises expandable groups whose relative positions are vertical, with each expandable group occupying the same number of columns and the same number of rows.
Table 2
(Table 2 is rendered as an image in the original publication.)
In some alternative implementations of some embodiments, tables of the extensible group type comprise horizontally expandable group tables and vertically expandable group tables. Specifically, Table 3 below is a horizontally expandable group table.
Table 3
Item 31/12/2018 31/12/2017 31/12/2016
House and building 17,829.55 - -
Machine equipment 11,455.07 1,428.62 1,409.92
Transport device 414.11 346.20 108.61
Electronic devices and others 322.81 55.89 55.29
Total up to 30,021.54 1,830.72 1,573.82
Specifically, the following table 4 is a vertical extensible group table.
Table 4
Unit: ten thousand yuan
(Table 4 is rendered as an image in the original publication.)
In some optional implementations of some embodiments, if the relative positional relationship of the coordinates of the cells holding the first tuple data is vertical, and the table has only two columns, the table is classified as an entity-type table. Specifically, given Table 5 below and the predetermined attribute list (company name, legal representative, registered capital, establishment time, registered address, main production and operation address, actual controller), the attribute list corresponds to a first group of cells [(0,1), (2,1), (3,1), (4,1), (6,1), (7,1), (9,1)] in the table; the first group of cells comprises seven cells in the second column of the table, their relative positional relationship is vertical, and the table has two columns, so the table is classified as an entity-type table.
Table 5
(Table 5 is rendered as an image in the original publication.)
In some alternative implementations of some embodiments, the table type is determined to be relational or matrix in response to the table being neither an extensible group type table nor an entity-type table. Specifically, Table 1 is a relational table.
Alternatively, the expandable groups in a table of the extensible group type can be extracted by the following algorithm.
Algorithm 1: extract all expandable groups in a table.
Input: a table T.
Output: the cell coordinates G of all expandable groups in T.
(The body of Algorithm 1 is rendered as images in the original publication.)
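Because the body of Algorithm 1 survives only as an image, the following Python sketch is a reconstruction from the definitions given above (cell groups of a shared content type, headed by time-type cells, occupying equal extents); it is an assumption, not the patented pseudocode, and the crude content-type detector is likewise assumed.

```python
import re

TIME_RE = re.compile(r"\d{4}")   # crude time-type detector, an assumption

def content_type(text):
    if TIME_RE.search(text):
        return "time"
    if re.fullmatch(r"[-\d,.\s%]*", text):
        return "numeric"
    return "text"

def extract_horizontal_groups(rows):
    """Reconstruction for horizontally expandable groups: each time-type cell
    in the first row starts a new cell group; consecutive groups of equal
    extent form the expandable groups G."""
    header = rows[0]
    starts = [j for j, cell in enumerate(header) if content_type(cell) == "time"]
    groups = []
    for k, j in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else len(header)
        # every (row, column) coordinate belonging to this cell group
        groups.append([(i, c) for i in range(len(rows)) for c in range(j, end)])
    widths = {len(g) for g in groups}
    return groups if len(widths) == 1 else []   # groups must occupy equal extents

# For Table 3, the three date columns each head a one-column cell group,
# so three expandable groups of equal extent are returned.
```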
Second, the execution body determines the type of the extension mechanism of the table. In response to the table type being the non-extensible group type, the type of the extension mechanism is determined to be the non-extensible group mechanism type. In response to the table type being the extensible group type, the type of the extension mechanism is determined to be the extensible group mechanism type.
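A compact sketch of this two-step determination, under the same assumptions and reusing the Algorithm 1 reconstruction above, might read:

```python
def determine_extension_mechanism(rows, table_w, first_tuple_cells):
    # Step one: determine the table type.
    if extract_horizontal_groups(rows):          # expandable groups found
        table_type = "extensible group"
    elif len({c for _, c in first_tuple_cells}) == 1 and table_w == 2:
        table_type = "entity"                    # vertically aligned, two columns
    else:
        table_type = "relational or matrix"

    # Step two: map the table type to the extension mechanism type.
    if table_type == "extensible group":
        return "extensible group mechanism"
    return "non-extensible group mechanism"
```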
Step 204, determining new tuple data according to the table, the type of the extension mechanism and the first tuple data.
In some embodiments, the execution body determines new tuple data according to the table, the type of the extension mechanism, and the first tuple data.
In some optional embodiments, in response to the type of the extension mechanism being the extensible group mechanism type, an intra-group extension operation is performed in the table based on the first tuple data, extending downward within one cell group, and an inter-group extension operation is performed in the table based on the first tuple data, extending between different cell groups.
In response to the type of the extension mechanism being the non-extensible group mechanism type, an intra-group extension operation is performed in the table based on the first tuple data, extending downward within one cell group, to generate new tuple data.
Step 205, based on the first tuple data and the new tuple data, the tuple data in the table corresponding to the attribute list is determined as the result tuple data.
In some embodiments, the execution subject concatenates the first tuple data and the new tuple data, and the concatenated tuple data is taken as the result tuple data corresponding to the attribute list.
Alternatively, a new blank table may be generated as the extended table, and the result tuple data placed in the extended table. Specifically, for Table 1, given the attribute list "serial number, project name, total investment amount, use amount of raised funds, construction period", the extended table output is Table 6.
Table 6
(Table 6 is rendered as an image in the original publication.)
Outputting Table 6 yields the tuple data in Table 1 corresponding to the attribute list "serial number, project name, total investment amount, use amount of raised funds, construction period".
Optionally, in response to the result tuple data corresponding to (for example, being equal to) the target tuple data of the attribute list, where the result tuple data is the final result of extracting the target tuple data of the predetermined attribute list, the execution body places the result tuple data in the extended table, sends the extended table to a device supporting a display function and to a storage device, controls the device to display the extended table, and controls the storage device to store the extended table. The extended table and the relational data therein may be used to support various subsequent table-processing applications. The device supporting the display function and the storage device may be devices communicatively connected to the execution body. The display-capable device may directly display the received extended table. Displaying the extended table on such a device and completing table-processing work with the extended table can simplify the table-processing flow, improve its efficiency and accuracy, and facilitate subsequent table-processing applications. Storing the extended table in the storage device saves storage space and improves storage efficiency.
One embodiment presented in fig. 2 has the following beneficial effects: the table code of the table and the code of each attribute name in the predetermined attribute list are extracted with the pre-trained first neural network model, so that the context information in the table can be used effectively, improving the robustness of the model and the coding quality. Based on the table code and the attribute codes, the cell holding the first target value corresponding to each attribute in the predetermined attribute list, and the first tuple data corresponding to the predetermined attribute list, are predicted; the prediction works on tables of any form without manually defined rule templates. Tables are classified according to their structure, and different types of extension mechanisms are used to determine new tuple data for the different table types. Based on the first tuple data and the new tuple data, the result tuple data is determined.
With continued reference to FIG. 4, FIG. 4 illustrates a flow 400 of one embodiment of a method of determining the cell in a table corresponding to an attribute name in accordance with the present disclosure. The method may include the following steps:
step 401, generating a first sheet feature based on the table code and the attribute code.
In some embodiments, the execution subject of the method of extracting tuple data in a table (e.g., the terminal device shown in fig. 1) concatenates the table code and the attribute code for each attribute name in the attribute list, yielding a tensor of shape h × w × 1536, where h and w are the number of rows and columns of the table, respectively. The position information of the table is added to this tensor to generate the first tensor feature of h × w × 1538 dimensions, where the position information comprises the position information of each cell in the horizontal direction and in the vertical direction. Specifically, for the cell at coordinates (i, j), where i is the row coordinate and j is the column coordinate, the position information in the horizontal direction is (i/w) × 2 − 1 and in the vertical direction is (j/h) × 2 − 1, with values between −1 and 1. The horizontal position information of all cells forms the h × w × 1-dimensional horizontal position information of the table, and the vertical position information of all cells forms the h × w × 1-dimensional vertical position information of the table.
Step 402, inputting the first tensor feature into the pre-trained second neural network to obtain the likelihood that each cell in the table corresponds to the attribute name.
In some embodiments, the execution body inputs the first tensor feature, a tensor of h × w × 1538 dimensions (h and w being the number of rows and columns of the table, respectively), into the pre-trained second neural network. The network structure of the second neural network and the network parameters involved are as follows.
The second neural network consists of three parts. The first part is a 9-layer neural network, each layer consisting of a plurality of two-dimensional planes, and each plane consisting of a plurality of individual neurons. Each layer is composed of a convolution layer, a batch normalization layer, and an activation layer. Optionally, the convolution layers are used to extract features. For each convolution layer one can determine how many convolution kernels it has, where each kernel extracts one feature, as well as the size of each kernel, the weight of each neuron in each kernel, the bias term of each kernel, the stride between two adjacent convolutions, and so on. After each convolution, batch normalization is performed, followed by activation with an activation function. Optionally, the activation function may be the rectified linear unit (ReLU) or one of its variants. The input to the first part is a tensor of h × w × 1538 dimensions and its output is a tensor of h × w × 512 dimensions, where h and w are the number of rows and columns of the table, respectively.
The second part consists of a pooling layer and fully connected layers. Its input is the h × w × 512-dimensional tensor, where h and w are the number of rows and columns of the table, respectively. The tensor is fed into the pooling layer by rows and by columns, yielding outputs of h × 1 × 512 and 1 × w × 512 dimensions, respectively. The role of the pooling layer is to make the extracted features more compact and reduce the number of neurons. The h × w × 512-dimensional tensor output by the first part and the h × 1 × 512- and 1 × w × 512-dimensional outputs of the pooling layer are each fed into a fully connected layer, yielding tensors of shapes h × w × 2, h × 1 × 2, and 1 × w × 2, respectively, where h and w are the number of rows and columns of the table; these three tensors are the output of the second part.
The third part is a Softmax classifier. Its input is the output of the second part. After processing by the classifier, the probability that each cell matches the attribute name is output, characterizing the likelihood that each cell in the table corresponds to the attribute name.
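An abridged PyTorch sketch of this three-part structure is given below; only two of the nine convolution layers are shown, and all hyperparameters beyond the channel sizes stated above (kernel size, pooling choice, and so on) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(   # part one (abridged: 2 of 9 layers)
            nn.Conv2d(1538, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
        )
        self.cell_head = nn.Linear(512, 2)   # part two: (h, w, 2) output
        self.row_head = nn.Linear(512, 2)    # (h, 2) output
        self.col_head = nn.Linear(512, 2)    # (w, 2) output

    def forward(self, x):                    # x: (1, h, w, 1538) first tensor feature
        f = self.convs(x.permute(0, 3, 1, 2))   # -> (1, 512, h, w)
        f = f.permute(0, 2, 3, 1)               # -> (1, h, w, 512)
        row_feat = f.max(dim=2).values          # pool along columns -> (1, h, 512)
        col_feat = f.max(dim=1).values          # pool along rows -> (1, w, 512)
        return (F.softmax(self.cell_head(f), dim=-1),       # part three: classifier
                F.softmax(self.row_head(row_feat), dim=-1),
                F.softmax(self.col_head(col_feat), dim=-1))
```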
Step 403, a cell corresponding to the attribute name is determined from the table based on the determined likelihood.
In some embodiments, the execution body compares the likelihood that each cell in the table corresponds to the attribute name with a predetermined threshold. The comparison result of a cell is "may be the target cell" or "non-target cell", the comparison result of each row is "may be the target row" or "non-target row", and the comparison result of each column is "may be the target column" or "non-target column". Then, among the cells whose comparison result is "may be the target cell", the coordinates of the cell with the highest probability are selected as the output target cell coordinates. Among the rows (columns) whose comparison result is "may be the target row (column)", the row (column) coordinate with the highest probability is selected as the output target row (column) coordinate.
Optionally, the comparison result of the cells may be used as the basis for judging which cell in the table corresponds to the attribute name, while the comparison results of the rows and columns may be used as an aid that accelerates model convergence. Optionally, in response to the comparison results of all cells being "non-target cell", it is determined that no target cell exists in the table. In response to the comparison results of all rows (columns) being "non-target row (column)", it is determined that no target row (column) exists in the table.
In some embodiments, the execution subject selects a cell whose comparison result is "may be the target cell", outputs the coordinates of the cell, and thereby determines the cell corresponding to the attribute name from the table.
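The threshold comparison and selection can be sketched as follows; the threshold value is an assumption.

```python
import numpy as np

def select_target_cell(cell_probs, threshold=0.5):
    """cell_probs: (h, w, 2) softmax output; channel 0 is the (1, 0) target class."""
    p_target = cell_probs[..., 0]
    candidates = p_target >= threshold           # cells that "may be the target cell"
    if not candidates.any():
        return None                              # no target cell exists in the table
    masked = np.where(candidates, p_target, -1.0)
    return np.unravel_index(np.argmax(masked), p_target.shape)   # (row, column)
```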
One embodiment presented in fig. 4 has the following beneficial effects: by using the second neural network and adding position information in the horizontal and vertical directions, the relative position information between different cells in the table is fully preserved. The weight-sharing property of the second neural network reduces the number of model parameters. Through multiple convolution layers, the vector of each cell carries not only the text information of that cell but also the text information of surrounding cells, which improves model accuracy and strengthens robustness and generalization.
Referring to fig. 5, fig. 5 illustrates a flow 500 of one embodiment of a method for generating new tuple data according to the present disclosure. The method for generating new tuple data may comprise the steps of:
step 501, in response to the type of the extension mechanism being the type of the extension group mechanism, based on the first piece of tuple data, performing an intra-group extension operation and an inter-group extension operation in the table, and generating new tuple data.
In response to the type of the extension mechanism being the extensible group mechanism type, new tuple data is generated as follows.
In the first step, the intra-group extension operation is performed in the table based on the first tuple data, extending downward within one cell group. The coordinates of the cells in the first row of the table and of the cells with time-type contents in the first group of cells remain unchanged, while the row coordinate of each remaining value in the first group of cells is incremented by 1, until the extension exceeds the range of the table or a cell at the shifted coordinates does not exist in the table or the cell group, at which point the extension cannot continue.
Specifically, for Table 7, intra-group extension is performed based on the first tuple data (new generation memory interface chip development and industrialization project, 101,785.00, 101,785.00, 3 years).
Table 7
(Table 7 is rendered as an image in the original publication.)
Based on the coordinates [(1,1), (1,2), (1,3), (1,4)] corresponding to the first tuple data (new generation memory interface chip development and industrialization project, 101,785.00, 101,785.00, 3 years), the intra-group extension operation is executed, extending to the 3rd and 4th rows. Since the cell (4,1), obtained by extending the 4th-row cell (3,1) downward, does not exist, the intra-group extension stops. The new tuple data after intra-group extension is obtained; see Table 8.
Table 8
(Table 8 is rendered as an image in the original publication.)
Specifically, for Table 9, intra-group extension is performed based on the first tuple data (31/12/2018, house and building, 17,829.55).
Table 9
Item | 31/12/2018 | 31/12/2017 | 31/12/2016
House and building | 17,829.55 | - | -
Machine equipment | 11,455.07 | 1,428.62 | 1,409.92
Transport device | 414.11 | 346.20 | 108.61
Electronic devices and others | 322.81 | 55.89 | 55.29
Total | 30,021.54 | 1,830.72 | 1,573.82
Based on the coordinates [(0,1), (1,0), (1,1)] corresponding to the first piece of tuple data (31/12/2018, house and building, 17,829.55), the cell "31/12/2018" lies in the first row of the table and remains unchanged, while the remaining cells extend downward to the 3rd, 4th and 5th rows. Since cells (5,0) and (5,1) are in the last row of the table, downward expansion cannot continue and the intra-group expansion stops. The new tuple data after intra-group expansion are shown in table 10.
Table 10
Time | Item name | Amount
31/12/2018 | House and building | 17,829.55
31/12/2018 | Machine equipment | 11,455.07
31/12/2018 | Transport device | 414.11
31/12/2018 | Electronic devices and others | 322.81
31/12/2018 | Total | 30,021.54
Specifically, for table 11, intra-group expansion is performed based on the first piece of tuple data (2018, Shandan County Tianma technology LLC, 130.00, 20.88%).
Table 11
[Table 11 is rendered as an image in the original publication.]
Based on the coordinates [(1,0), (1,2), (1,3), (1,4)] corresponding to the first piece of tuple data (2018, Shandan County Tianma technology LLC, 130.00, 20.88%), since the content type of "2018" is the time type, the cell containing "2018" is kept unchanged, and the remaining cells are expanded downwards to the 3rd, 4th, 5th and 6th rows. Since the cell (6,2), which would result from expanding the cell (5,2) downward, does not exist, the intra-group expansion stops. The new tuple data after intra-group expansion are shown in table 12.
Table 12
[Table 12 is rendered as an image in the original publication.]
Specifically, for table 13, intra-group expansion is performed based on the first piece of tuple data (2018, carbon nanotube powder, 168.44, 1.27%).
Table 13
[Table 13 is rendered as an image in the original publication.]
Based on the coordinates [(0,1), (2,0), (2,1), (2,2)] corresponding to the first piece of tuple data (2018, carbon nanotube powder, 168.44, 1.27%), since the content type of "2018" is the time type, the cell containing "2018" is kept unchanged, and the remaining cells are expanded downwards to the 4th, 5th and 6th rows. Since the cell (6,0), which would result from expanding the cell (5,0) downward, does not exist, the intra-group expansion stops. The new tuple data after intra-group expansion are shown in table 14.
Table 14
Time | Item name | Amount | Proportion
2018 | Carbon nanotube powder | 168.44 | 1.27%
2018 | Carbon nanotube conductive paste | 13,048.14 | 98.71%
2018 | Other products | 2.18 | 0.02%
2018 | Total | 13,218.75 | 100.00%
Optionally, the algorithm for intra-group expansion is as follows:
Algorithm 2: intra-group expansion
Input: a table T and the first piece of tuple data containing N values, D1 = {(y1, x1), ..., (yN, xN)}
Output: all tuple data Y in the first expandable group of the table T
[The body of Algorithm 2 is rendered as images in the original publication; a hedged reconstruction follows.]
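Since the algorithm body survives only as an image, the following Python sketch reconstructs Algorithm 2 from the prose description above. It is a hedged reconstruction under stated assumptions: the table is a list of rows of cell contents (None for an absent cell), coordinates are zero-indexed (row, column) pairs, `is_time` is a caller-supplied predicate for time-type content, and the group-boundary stopping condition of the source is simplified to a table-boundary check.

```python
from typing import Callable, List, Optional, Tuple

Coord = Tuple[int, int]

def intra_group_expand(table: List[List[Optional[str]]],
                       first_tuple: List[Coord],
                       is_time: Callable[[Optional[str]], bool]) -> List[List[Coord]]:
    """Expand the first tuple downward within one cell group.

    Returns the coordinate tuples of all tuple data in the first group,
    including the first tuple itself.
    """
    results = [list(first_tuple)]
    step = 0
    while True:
        step += 1
        candidate = []
        for (r, c) in first_tuple:
            # First-row cells and time-type cells keep their coordinates.
            if r == 0 or is_time(table[r][c]):
                candidate.append((r, c))
            else:
                candidate.append((r + step, c))
        if candidate == results[-1]:
            break  # nothing moved: every cell is fixed, no further expansion
        # Stop once any shifted coordinate leaves the table or hits a missing cell.
        if any(r >= len(table) or c >= len(table[r]) or table[r][c] is None
               for (r, c) in candidate):
            break
        results.append(candidate)
    return results
```

On the table 9 example above, this sketch would yield five coordinate tuples, matching the five rows of table 10.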
In the second step, inter-group expansion is performed in the table based on the first piece of tuple data. For each piece of new tuple data obtained by the intra-group expansion, cell coordinates not in the expandable group are kept unchanged, while cell coordinates in the expandable group are mapped to the coordinates of the corresponding positions in the other cell groups. In response to the expandable group being a horizontally expandable group, the abscissa of every in-group cell of the tuple data is increased by the horizontal distance between the two groups, yielding the cell coordinates of the tuple data in the new group after inter-group expansion. In response to the expandable group being a vertically expandable group, the ordinate of every in-group cell of the tuple data is increased by the vertical distance between the two groups, yielding the cell coordinates of the tuple data in the new group after inter-group expansion.
Specifically, for table 15, intra-group expansion is performed based on the first piece of tuple data (new generation memory interface chip development and industrialization project, 101,785.00, 101,785.00, 3 years).
Table 15
[Table 15 is rendered as an image in the original publication.]
Based on the coordinates [(1,1), (1,2), (1,3), (1,4)] corresponding to the first piece of tuple data, the intra-group expansion operation is performed to obtain the new tuple data after intra-group expansion, shown in table 16.
Table 16
[Table 16 is rendered as an image in the original publication.]
Since the 2nd, 3rd and 4th columns are horizontally expandable groups, inter-group expansion is performed; expanding the cell group to the 3rd column yields table 17.
Table 17
Time | Item name | Amount
31/12/2017 | House and building | -
31/12/2017 | Machine equipment | 1,428.62
31/12/2017 | Transport device | 346.20
31/12/2017 | Electronic devices and others | 55.89
31/12/2017 | Total | 1,830.72
Expanding the cell group to the 4th column yields table 18.
Table 18
Time | Item name | Amount
31/12/2016 | House and building | -
31/12/2016 | Machine equipment | 1,409.92
31/12/2016 | Transport device | 108.61
31/12/2016 | Electronic devices and others | 55.29
31/12/2016 | Total | 1,573.82
The value of "item name" is not in the expandable group and remains unchanged. The values of "time" and "amount" lie within the horizontally expandable group and are spread to the other groups by increasing the abscissa. The final result combines the expansion results over all expandable groups, as shown in table 19.
Table 19
Time | Item name | Amount
31/12/2018 | House and building | 17,829.55
31/12/2018 | Machine equipment | 11,455.07
31/12/2018 | Transport device | 414.11
31/12/2018 | Electronic devices and others | 322.81
31/12/2018 | Total | 30,021.54
31/12/2017 | House and building | -
31/12/2017 | Machine equipment | 1,428.62
31/12/2017 | Transport device | 346.20
31/12/2017 | Electronic devices and others | 55.89
31/12/2017 | Total | 1,830.72
31/12/2016 | House and building | -
31/12/2016 | Machine equipment | 1,409.92
31/12/2016 | Transport device | 108.61
31/12/2016 | Electronic devices and others | 55.29
31/12/2016 | Total | 1,573.82
Specifically, for table 20, intra-group expansion is performed based on the first piece of tuple data (2018, Shandan County Tianma technology LLC, 130.00, 20.88%).
Table 20
[Table 20 is rendered as an image in the original publication.]
Based on the coordinates [(1,0), (1,2), (1,3), (1,4)] corresponding to the first piece of tuple data, the new tuple data after intra-group expansion are obtained, shown in table 21.
Table 21
Time | Supplier name | Purchase amount | Proportion
2018 | Shandan County Tianma technology LLC | 130.00 | 20.88%
2018 | Shanghai Bai Xuan Biotech Co., Ltd | 108.25 | 17.39%
2018 | Zhu Xingming | 69.48 | 11.16%
2018 | Huang Ping | 64.66 | 10.39%
2018 | SINOPHARM CHEMICAL REAGENT Co.,Ltd. | 62.19 | 9.99%
Inter-group expansion is then performed. The cell groups spanning rows 2 to 7, rows 8 to 13, and rows 14 to 19 are vertically expandable groups, so the results obtained in the previous step can be expanded across the groups; the result of expanding into the group spanning rows 8 to 13 is shown in table 22.
Table 22
[Table 22 is rendered as images in the original publication.]
Expanding into the group spanning rows 14 to 19 yields table 23.
Table 23
Time | Supplier name | Purchase amount | Proportion
2016 | Shandan County Tianma technology LLC | 325.00 | 31.32%
2016 | Zhu Xingming | 148.03 | 14.26%
2016 | SCHOTT XINKANG DRUGS PACKAGING Co.,Ltd. | 144.17 | 13.89%
2016 | Huang Ping | 105.25 | 10.14%
2016 | Shanghai Bai Xuan Biotech Co., Ltd | 82.97 | 8.00%
All cells of the first piece of tuple data lie in the vertically expandable group and are expanded to the other groups by increasing the ordinate. The final result combines the expansion results over all expandable groups, as shown in table 24.
Table 24
Time | Supplier name | Purchase amount | Proportion
2018 | Shandan County Tianma technology LLC | 130.00 | 20.88%
2018 | Shanghai Bai Xuan Biotech Co., Ltd | 108.25 | 17.39%
2018 | Zhu Xingming | 69.48 | 11.16%
2018 | Huang Ping | 64.66 | 10.39%
2018 | SINOPHARM CHEMICAL REAGENT Co.,Ltd. | 62.19 | 9.99%
2017 | Shandan County Tianma technology LLC | 260.00 | 29.80%
2017 | Zhu Xingming | 210.10 | 24.08%
2017 | SCHOTT XINKANG DRUGS PACKAGING Co.,Ltd. | 189.88 | 21.76%
2017 | Huang Ping | 88.39 | 10.13%
2017 | Shanghai Bai Xuan Biotech Co., Ltd | 61.36 | 7.03%
2016 | Shandan County Tianma technology LLC | 325.00 | 31.32%
2016 | Zhu Xingming | 148.03 | 14.26%
2016 | SCHOTT XINKANG DRUGS PACKAGING Co.,Ltd. | 144.17 | 13.89%
2016 | Huang Ping | 105.25 | 10.14%
2016 | Shanghai Bai Xuan Biotech Co., Ltd | 82.97 | 8.00%
Optionally, the algorithm for inter-group expansion is as follows:
Algorithm 3: inter-group expansion
Input: a table T and the M pieces of tuple data obtained by intra-group expansion, Y = {D1, ..., DM}
Output: all tuple data Z in the table T
[The body of Algorithm 3 is rendered as images in the original publication; a hedged reconstruction follows.]
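As with Algorithm 2, the body of Algorithm 3 survives only as an image, so the sketch below reconstructs it from the prose. Assumptions: `in_group` is a caller-supplied predicate telling whether a coordinate lies in the expandable group, and `group_offsets` lists the horizontal (or vertical) distances from the first expandable group to each of the other groups.

```python
from typing import Callable, List, Tuple

Coord = Tuple[int, int]

def inter_group_expand(tuples: List[List[Coord]],
                       in_group: Callable[[Coord], bool],
                       group_offsets: List[int],
                       horizontal: bool) -> List[List[Coord]]:
    """Copy every tuple obtained by intra-group expansion into each other group."""
    results = [list(t) for t in tuples]
    for offset in group_offsets:
        for t in tuples:
            shifted = []
            for (r, c) in t:
                if not in_group((r, c)):
                    shifted.append((r, c))            # out-of-group cells stay put
                elif horizontal:
                    shifted.append((r, c + offset))   # shift the abscissa
                else:
                    shifted.append((r + offset, c))   # shift the ordinate
            results.append(shifted)
    return results
```

In the table 15 example, the horizontally expandable groups occupy the 2nd, 3rd and 4th columns, so `horizontal=True` with `group_offsets=[1, 2]` would shift the first group's cells into the 3rd and 4th columns, as in tables 17 and 18.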
Step 502: in response to the type of the extension mechanism being the non-extensible group mechanism type, perform an intra-group expansion operation in the table based on the first piece of tuple data, generating new tuple data.
In response to the type of the extension mechanism being the non-extensible group mechanism type and the table type being an entity-type table, no intra-group expansion is performed.
In response to the type of the extension mechanism being the non-extensible group mechanism type and the type of the table being a relational table or a matrix-type table, an intra-group expansion operation is performed in the table based on the first piece of tuple data, expanding downwards within one cell group. The coordinates of cells in the first row of the table, and of cells in the first piece of tuple data whose content is of the time type, are kept unchanged; 1 is added to the ordinate of each of the remaining cells, repeatedly, until the expansion exceeds the range of the table, or a cell corresponding to the shifted coordinates does not exist in the table or the cell group, at which point the expansion cannot continue.
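Putting steps 501 and 502 together, a hedged dispatch sketch follows, reusing the `intra_group_expand` and `inter_group_expand` helpers sketched earlier; the string type labels are assumptions for illustration, not identifiers from the source.

```python
def generate_new_tuples(table, first_tuple, mechanism_type, table_type,
                        is_time, in_group, group_offsets, horizontal):
    # Extensible group mechanism: expand within the group, then across groups.
    if mechanism_type == "extensible_group":
        tuples = intra_group_expand(table, first_tuple, is_time)
        return inter_group_expand(tuples, in_group, group_offsets, horizontal)
    # Non-extensible group mechanism: entity-type tables are not expanded;
    # relational and matrix-type tables get intra-group expansion only.
    if table_type == "entity":
        return [list(first_tuple)]
    return intra_group_expand(table, first_tuple, is_time)
```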
The embodiment shown in fig. 5 has the following advantages: tables are classified according to their structure, different types of expansion mechanisms are adopted to determine new tuple data for different types of tables, and tables of any form can be handled without manually defined rule templates.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing a terminal device of an embodiment of the present disclosure. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An Input/Output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: a storage section 606 including a hard disk and the like; and a communication section 607 including a network interface card such as a LAN (Local Area Network) card, a modem, and the like. The communication section 607 performs communication processing via a network such as the Internet. A drive 608 is also connected to the I/O interface 605 as needed. A removable medium 609, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 608 as necessary, so that a computer program read therefrom is installed into the storage section 606 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 607 and/or installed from the removable medium 609. The above-described functions defined in the method of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the C language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description presents only preferred embodiments of the disclosure and illustrates the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above features, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.

Claims (10)

1. A method of extracting tuple data in a table, comprising:
acquiring a table and a predetermined attribute list;
determining a first piece of tuple data corresponding to the attribute list based on the table and the attribute list, wherein the first piece of tuple data consists of the first attribute value in the table corresponding to each attribute name in the attribute list;
determining the type of an extension mechanism of the table, wherein the extension mechanism refers to a rule for obtaining the tuple data in the table corresponding to each attribute name in the attribute list based on the first tuple data;
determining new tuple data according to the table, the type of the extension mechanism and the first tuple data;
determining tuple data in the table corresponding to the attribute list as result tuple data based on the first piece of tuple data and the new tuple data.
2. The method of claim 1, wherein the method further comprises:
in response to the result tuple data corresponding to target tuple data of the attribute list, sending the target tuple data to a device supporting a display function and to a storage device, controlling the device to display the target tuple data, and controlling the storage device to store the target tuple data.
3. The method of claim 1, wherein the determining the type of extension mechanism for the table comprises:
determining a table type of the table, wherein the table type comprises a non-extensible group type and an extensible group type;
determining the type of the extension mechanism as a non-extensible group mechanism type in response to the table type being the non-extensible group type;
and determining the type of the extension mechanism as an extensible group mechanism type in response to the table type being an extensible group type.
4. The method of claim 3, wherein the extensible group type is the table type of a table in which cells may span multiple rows or multiple columns.
5. The method of claim 1, wherein determining, based on the table and the attribute list, a first piece of tuple data corresponding to the attribute list comprises:
inputting the table into a pre-trained first neural network to generate a table code;
for each attribute name in the attribute list, inputting the attribute name into the first neural network to generate an attribute code;
for each attribute name in the attribute list, determining a cell in the table corresponding to the attribute name based on the table code and the attribute code of the attribute name;
and determining the first piece of tuple data corresponding to the attribute list according to the cells in the table corresponding to each attribute name in the attribute list.
6. The method of claim 5, wherein the determining, for each attribute name in the attribute list, a cell in the table corresponding to the attribute name based on the table code and the attribute code of the attribute name comprises:
generating a first tensor feature based on the table code and the attribute code;
inputting the first tensor feature into a pre-trained second neural network to obtain the likelihood that each cell in the table corresponds to the attribute name;
determining, based on the obtained likelihoods, the cell corresponding to the attribute name from the table.
7. The method of claim 6, wherein the generating a first tensor feature based on the table code and the attribute code comprises:
obtaining a tensor representation based on the table code and the attribute code;
adding position information of the table to the tensor representation to generate the first tensor feature, wherein the position information of the table comprises position information in the horizontal direction and position information in the vertical direction.
8. The method of claim 3, wherein the determining new tuple data according to the table, the type of the extension mechanism and the first piece of tuple data comprises:
in response to the type of the extension mechanism being the extensible group mechanism type, executing an intra-group extension operation and an inter-group extension operation in the table based on the first piece of tuple data to generate new tuple data;
and in response to the type of the extension mechanism being the non-extensible group mechanism type, executing an intra-group extension operation in the table based on the first piece of tuple data to generate new tuple data.
9. A first terminal device comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-8.
CN202010199516.9A 2020-03-20 2020-03-20 Method, electronic device and computer readable medium for extracting metadata in table Active CN113496119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010199516.9A CN113496119B (en) 2020-03-20 2020-03-20 Method, electronic device and computer readable medium for extracting metadata in table

Publications (2)

Publication Number Publication Date
CN113496119A true CN113496119A (en) 2021-10-12
CN113496119B CN113496119B (en) 2024-06-21

Family

ID=77993534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010199516.9A Active CN113496119B (en) 2020-03-20 2020-03-20 Method, electronic device and computer readable medium for extracting metadata in table

Country Status (1)

Country Link
CN (1) CN113496119B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024093253A1 (en) * 2022-11-03 2024-05-10 华为云计算技术有限公司 Data sampling method and related device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2005201763A1 (en) * 1999-12-21 2005-05-19 Gannett Satellite Information Network, Inc Information distribution system for use in an elevator
US20060161814A1 (en) * 2003-07-09 2006-07-20 Carl Wocke Method and system of data analysis using neural networks
CN101719166A (en) * 2010-01-20 2010-06-02 中国人民解放军国防科学技术大学 Method for visualizing multi-dimensional time sequence information
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN109800337A (en) * 2018-12-06 2019-05-24 成都网安科技发展有限公司 A kind of multi-mode canonical matching algorithm suitable for big alphabet
CN110188107A (en) * 2019-06-05 2019-08-30 北京神州泰岳软件股份有限公司 A kind of method and device of the Extracting Information from table
CN110489423A (en) * 2019-08-26 2019-11-22 北京香侬慧语科技有限责任公司 A kind of method, apparatus of information extraction, storage medium and electronic equipment
CN110532834A (en) * 2018-05-24 2019-12-03 北京庖丁科技有限公司 Table extracting method, device, equipment and medium based on rich text format document
CN110659527A (en) * 2018-06-29 2020-01-07 微软技术许可有限责任公司 Form detection in electronic forms
CN110728124A (en) * 2019-10-15 2020-01-24 深圳逻辑汇科技有限公司 Method, apparatus, device and storage medium for visualizing electronic forms
CN110888980A (en) * 2019-10-10 2020-03-17 天津大学 Implicit discourse relation identification method based on knowledge-enhanced attention neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LONG FEI et al.: "Sentiment analysis of text based on bidirectional LSTM with multi-head attention", IEEE Access, vol. 7, 20 September 2019 (2019-09-20), pages 141960-141969, XP011749518, DOI: 10.1109/ACCESS.2019.2942614 *
LIU Yan: "Research on Methods for Extracting Entity Table Information from Web Pages", China Master's Theses Full-text Database, Information Science and Technology, no. 03, 15 March 2017 (2017-03-15), pages 138-6314 *
YAO Pengwei: "Table Recognition Based on Digital Image Processing", China Master's Theses Full-text Database, Information Science and Technology, no. 07, 15 July 2019 (2019-07-15), pages 138-1277 *
YANG Haitao: "Relational Schema Representation of Tables with Complex Headers", Computer Engineering, vol. 37, no. 04, pages 49-51 *

Also Published As

Publication number Publication date
CN113496119B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN107293296B (en) Voice recognition result correction method, device, equipment and storage medium
CN112633419B (en) Small sample learning method and device, electronic equipment and storage medium
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN113434683B (en) Text classification method, device, medium and electronic equipment
CN110046356B (en) Label-embedded microblog text emotion multi-label classification method
CN110929802A (en) Information entropy-based subdivision identification model training and image identification method and device
CN112418292A (en) Image quality evaluation method and device, computer equipment and storage medium
CN111428470B (en) Text continuity judgment method, text continuity judgment model training method, electronic device and readable medium
CN112560504B (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
CN113496119B (en) Method, electronic device and computer readable medium for extracting metadata in table
CN113111971A (en) Intelligent processing method and device for classification model, electronic equipment and medium
CN117251777A (en) Data processing method, device, computer equipment and storage medium
CN115206421B (en) Drug repositioning method, and repositioning model training method and device
CN115994524A (en) Training method, device, equipment and medium for form pre-training model
CN115098707A (en) Cross-modal Hash retrieval method and system based on zero sample learning
CN112434889A (en) Expert industry analysis method, device, equipment and storage medium
CN114692715A (en) Sample labeling method and device
CN113515920B (en) Method, electronic device and computer readable medium for extracting formulas from tables
CN117172232B (en) Audit report generation method, audit report generation device, audit report generation equipment and audit report storage medium
CN117010799A (en) Warehouse management system and method for electronic commerce
CN114241500A (en) Table structure identification method, device, equipment and storage medium
CN117493591A (en) Video and text cross-modal hash retrieval method based on prompt embedding
CN113496117A (en) Method and electronic equipment for cross checking cell digital content in table

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant