CN111259631A

CN111259631A - Referee document structuring method and device

Info

Publication number: CN111259631A
Application number: CN202010041736.9A
Authority: CN
Inventors: 席丽娜; 王文军; 晋耀红
Original assignee: Dinfo Beijing Science Development Co ltd
Current assignee: Dinfo Beijing Science Development Co ltd
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2020-06-09
Anticipated expiration: 2040-01-15
Also published as: CN111259631B

Abstract

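The application provides a referee document structuring method and device. A first extraction template is used to extract block texts from a referee document to be processed to obtain a first structured text; a second extraction template is used to extract from a specified block text of the first structured text to obtain a first sub-structured text; and the sub-block text of the first sub-structured text is converted into text with a preset feature expression format to obtain a second sub-structured text. Finally, the corresponding content in the first structured text is updated with the second sub-structured text to obtain a second structured text. The method thus further extracts from the first structured text and converts the extracted text into a format better suited to display, so that a user can quickly locate the required content while browsing.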

Description

Referee document structuring method and device

Technical Field

The application relates to the technical field of text processing, in particular to a referee document structuring method and device.

Background

Legal documents such as referee documents are usually lengthy and use obscure terminology, which makes it difficult to quickly locate the content of interest within the whole document. Moreover, while browsing a referee document, the user usually also needs to browse similar cases, that is, referee documents for cases similar to the current one, to help understand and compare the current document. For some more specialized referee documents, such as civil referee documents, implicit information must additionally be extracted from particular parts of the text after the whole text has been read. For such documents, browsing a single referee document is already difficult, and finding documents similar to the current one among a large number of referee documents is harder still; this wastes a great deal of time and may still fail to find the most similar document.

For example, a user who needs to find evidence-related content in a referee document has to read from the first character, understand what each part of the document describes, judge which parts may contain evidence, and then extract the evidence-related content from those parts. Manually analyzing the structure of a referee document in this way is time-consuming, and the result depends on uncertain factors such as the reader's knowledge and reasoning, so it is prone to low accuracy and may have little reference value. The existing way of browsing referee documents is therefore low in both efficiency and quality.

Disclosure of Invention

The application provides a referee document structuring method and a referee document structuring device, which are used for improving the format standardization of a referee document and facilitating browsing of a user.

In a first aspect, the present application provides a method for structuring a referee document, the method comprising:

extracting block texts in a referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block texts in the referee document to be processed;

extracting from the specified block text of the first structured text by using a second extraction template to obtain a first sub-structured text, wherein the first sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-block text in the specified block text;

converting the sub-block text of the first sub-structured text into a text with a preset characteristic expression format to obtain a second sub-structured text;

and updating the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

In a second aspect, the present application provides an apparatus for structuring a referee document, the apparatus comprising:

a first extraction unit, configured to extract block texts from a referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block texts in the referee document to be processed;

a second extraction unit, configured to extract, by using a second extraction template, from a specified block text of the first structured text to obtain a first sub-structured text, where the sub-structured text is composed of each extraction node in the second extraction template and a corresponding sub-block text in the specified block text;

the conversion unit is used for converting the sub-block text of the first sub-structured text into a text with a preset characteristic expression format to obtain a second sub-structured text;

and the updating unit is used for updating the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

With the above technical solutions, the application provides a method and an apparatus for structuring a referee document, in which a first extraction template is used to extract block texts from a referee document to be processed to obtain a first structured text, a second extraction template is used to extract from a specified block text of the first structured text to obtain a first sub-structured text, and the sub-block text of the first sub-structured text is converted into text with a preset feature expression format to obtain a second sub-structured text. Finally, the corresponding content in the first structured text is updated with the second sub-structured text to obtain a second structured text. The referee document structuring method provided by the application thus further extracts from the first structured text and converts the extracted text into a format better suited to display, so that a user can quickly locate the required content while browsing.

Drawings

In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.

Fig. 1 is a flowchart of a method for structuring a referee document according to an embodiment of the present application;

Fig. 2 is a flowchart of a method for extracting a first structured text according to an embodiment of the present application;

Fig. 3 is a flowchart of a method for generating a first sub-structured text according to an embodiment of the present application;

Fig. 4 is a flowchart of a method for converting a text feature expression format according to an embodiment of the present application;

Fig. 5 is a flowchart of a method for converting a text feature expression format according to an embodiment of the present application;

Fig. 6 is a flowchart of a method for converting a text feature expression format according to an embodiment of the present application;

Fig. 7 is a flowchart of a method for converting a text feature expression format according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of an apparatus for structuring a referee document according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to solve the above problems, the present application provides a method and an apparatus for structuring a referee document, so as to form a structured document from a referee text, so that a user can quickly determine the content required by the user in the referee document.

Fig. 1 is a flowchart of a method for structuring a referee document according to an embodiment of the present invention, and as shown in fig. 1, the method includes:

S1, extracting block texts in the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text is composed of each extraction node in the first extraction template and the corresponding block texts in the referee document to be processed.

The referee document to be processed is input into a referee document structuralization device, which may be a server, a personal computer (PC), a tablet computer, a mobile phone or other text processing equipment. The referee documents to be processed may be, for example, all the trial and judgment documents of a civil case. After receiving the referee document to be processed, the referee document structuralization device needs to preprocess it and determine the text to be structured; for example, the input may include a criminal first-instance judgment, a criminal second-instance judgment and a criminal final-instance judgment. A block text is the text content in the referee document to be processed that corresponds to one extraction node in the first extraction template. For example, if the referee document to be processed contains "Party × … Found upon trial × …" and the first extraction template includes the extraction nodes "party information" and "trial finding", then "Party × …" is the block text corresponding to "party information" and "Found upon trial × …" is the block text corresponding to "trial finding".

The first extraction template may be an extraction model that is established before the referee document to be processed is structured. Specifically:

S001, obtaining referee document samples, wherein the referee document samples belong to the same category;

S002, dividing each referee document sample into sample block texts according to a preset text division rule;

S003, setting a node title for each sample block text;

S004, combining all the node titles of the same referee document sample to generate a corresponding extraction template sample;

S005, combining the extraction template samples to generate an extraction template.

A referee document is a text with standardized content; that is, whatever the variation in format, referee documents of the same category cover substantially the same types of content. For example, a referee document generally covers content types such as party information, trial proceedings, plaintiff's claim, defendant's defense, trial finding, court opinion and judgment result. An extraction template can therefore be generated by training on a large number of referee document samples.

Generally, different categories of referee documents have different extraction templates. The category refers to the case field, trial level and the like of the referee document; for example, a criminal first-instance judgment, a criminal second-instance judgment and a civil first-instance judgment belong to three different categories.

Before an extraction template is trained for a category of referee document, a large number of referee document samples of that category must be obtained, preferably in a format in which each title is paired with its text content, such as "party information - Party × …; trial finding - Found upon trial × …". Such samples are closest in format to the extraction template to be generated, which can effectively improve training efficiency.

If the selected referee document samples do not have this format, each sample can first be divided into sample block texts according to a preset text division rule, where the sample block texts are the block texts contained in that sample. The text division rule includes, for example, division by paragraph, division by subtitles in the text, and division by the starting characters of specified paragraphs. A node title is then set for each sample block text. The node title is usually a character string that summarizes the meaning of the sample block text; for example, if the sample block text is "Party × …", the node title can be set to "party information". Further, if node titles with the same meaning appear within one referee document sample, the corresponding sample block texts can be merged and one of those node titles selected as the title of the merged sample block text.

After the node titles corresponding to the sample block texts of one referee document sample are obtained, they can be combined to generate the extraction template sample corresponding to that referee document sample. An extraction template can be obtained by training on a large number of such extraction template samples. Further, by continuously enriching the referee document samples, the generated extraction template can be continuously optimized.
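Steps S001 to S005 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the application: the paragraph-based division rule, the `TITLE_RULES` table, the function names, and the `min_share` parameter that stands in for "training" by keeping node titles that occur in a sufficient share of the samples.

```python
import re
from collections import Counter
from typing import List, Optional

# Hypothetical title rules: (pattern on the opening of a block, node title).
TITLE_RULES = [
    (re.compile(r"^(Plaintiff|Defendant)\b"), "party information"),
    (re.compile(r"^Upon trial, it is found"), "trial finding"),
    (re.compile(r"^This court holds"), "court opinion"),
]

def split_into_blocks(sample: str) -> List[str]:
    """S002: assumed division rule, one sample block text per paragraph."""
    return [p.strip() for p in sample.split("\n") if p.strip()]

def node_title(block: str) -> Optional[str]:
    """S003: pick the node title whose rule matches the block opening."""
    for pattern, title in TITLE_RULES:
        if pattern.match(block):
            return title
    return None

def template_sample(sample: str) -> List[str]:
    """S004: node titles of one sample; a repeated title is kept only once."""
    titles: List[str] = []
    for block in split_into_blocks(sample):
        title = node_title(block)
        if title is not None and title not in titles:
            titles.append(title)
    return titles

def build_template(samples: List[str], min_share: float = 0.5) -> List[str]:
    """S005: keep titles found in at least min_share of the samples (assumed rule)."""
    per_sample = [template_sample(s) for s in samples]
    counts = Counter(t for titles in per_sample for t in titles)
    ordered: List[str] = []
    for titles in per_sample:
        for title in titles:
            if title not in ordered:
                ordered.append(title)
    return [t for t in ordered if counts[t] / len(samples) >= min_share]

samples = [
    "Plaintiff A, ...\nDefendant B, ...\nUpon trial, it is found that ...\nThis court holds that ...",
    "Plaintiff C, ...\nUpon trial, it is found that ...",
]
print(build_template(samples))
```

Raising `min_share` makes the template stricter, since a node title that appears in only a few samples is then dropped.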

For different categories of referee documents, the method can be adopted to generate corresponding extraction templates.

The various extraction templates generated by the method can be used by the referee document structuralization device at any time without regeneration, so that when the referee document structuralization device uses the extraction templates, a first extraction template suitable for the referee document to be processed needs to be selected from all the extraction templates.

Specifically,

S011, extracting target keywords matched with the words in the keyword library from the referee document to be processed;

S012, calculating semantic similarity between each target keyword and the template title of each extraction template in all the extraction templates;

S013, calculating the matching degree of the referee document to be processed and each extraction template by combining the weight and the semantic similarity corresponding to each target keyword;

S014, determining a first extraction template, wherein the first extraction template is the extraction template with the highest matching degree.

Words indicating the category of the referee document to be processed usually appear in its title or body. Such words may differ in wording while having the same meaning, for example two different expressions that both mean "first trial". The word segments of the referee document to be processed can therefore be matched against the words in the keyword library to determine the target keywords, that is, the word segments whose semantic similarity to a library word is higher than a threshold; these target keywords represent the category of the referee document to be processed.

Each extraction template usually has a corresponding template title. By matching the target keywords of the referee document to be processed against the template titles, the template title with the highest matching degree can be found, and the extraction template corresponding to that title is the first extraction template applicable to the referee document to be processed.
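The selection in steps S011 to S014 can be sketched in Python as follows. This is only an illustration under stated assumptions: the keyword library and its weights are invented, the "semantic similarity" is replaced by a simple stand-in (1.0 when the keyword occurs in the title, otherwise a character-level ratio from the standard library's `difflib.SequenceMatcher`), and the matching degree is assumed to be the weighted sum of similarities, which the text does not specify.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# Hypothetical keyword library: keyword -> weight.
KEYWORD_WEIGHTS: Dict[str, float] = {
    "civil": 1.0,
    "criminal": 1.0,
    "first instance": 0.8,
    "second instance": 0.8,
}

def target_keywords(document: str) -> List[str]:
    """S011: library keywords that occur in the document (exact match only)."""
    text = document.lower()
    return [kw for kw in KEYWORD_WEIGHTS if kw in text]

def similarity(keyword: str, title: str) -> float:
    """S012 stand-in: not a real semantic model, just string similarity."""
    title = title.lower()
    if keyword in title:
        return 1.0
    return SequenceMatcher(None, keyword, title).ratio()

def matching_degree(document: str, title: str) -> float:
    """S013: assumed to be the weighted sum of keyword similarities."""
    return sum(KEYWORD_WEIGHTS[kw] * similarity(kw, title)
               for kw in target_keywords(document))

def select_template(document: str, titles: List[str]) -> str:
    """S014: the template title with the highest matching degree."""
    return max(titles, key=lambda t: matching_degree(document, t))

titles = [
    "civil first instance judgment",
    "criminal first instance judgment",
    "civil second instance judgment",
]
document = "Civil judgment of first instance: Plaintiff A ..."
print(select_template(document, titles))
```

In practice the stand-in similarity would be replaced by whatever semantic similarity measure the device actually uses; the selection logic around it stays the same.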

After the first extraction template is determined, it is used to determine node characters in the referee document to be processed. Specifically, Fig. 2 is a flowchart of a method for extracting a first structured text according to an embodiment of the present application, and the method includes:

S101, according to each extraction node in a first extraction template, determining node characters in a referee document to be processed, wherein the extraction nodes are character strings corresponding to contents of all parts in the referee document to be processed, and the node characters are initial characters of the contents of the parts, corresponding to the extraction nodes, in the referee document to be processed;

S102, determining a block text corresponding to each extraction node, wherein the block text is all characters from the node character corresponding to the extraction node to the next node character;

S103, corresponding each extraction node to the block text to generate a first structured text.

Specifically, the first extraction template is composed of a plurality of extraction nodes, each representing text to be extracted. For example, if the extraction nodes in the first extraction template are "head, party information, trial finding", the corresponding texts can be extracted from the referee document to be processed according to these nodes. If the referee document to be processed contains "×× Court …, Party ×× …, Found upon trial × …" and so on, then by correspondence the part extracted for the extraction node "head" is "×× Court …", the part extracted for "party information" is "Party × …", and the part extracted for "trial finding" is "Found upon trial × …".

Specifically, the node character may be determined as follows.

S1011, obtaining an extraction expression corresponding to each extraction node;

S1012, sequentially matching each extraction expression with the first-line characters of each unmatched paragraph in the referee document to be processed to obtain a matched paragraph, wherein an unmatched paragraph is a paragraph that has not yet been matched by any extraction expression;

S1013, extracting the first-line characters of the corresponding matched paragraph by using the extraction expression to obtain the node characters.

By writing convention, the characters within one paragraph usually express the smallest unit of complete meaning, so the paragraph can be taken as the search unit and a node character searched for in each paragraph. Because node characters are the key to dividing the referee document to be processed, they must contain words or phrases corresponding to the extraction nodes; node characters can therefore be determined by recognizing those words or phrases, usually by means of extraction expressions. For example, for the extraction node "trial finding", the corresponding extraction expressions are patterns delimited by @ that start at a line break and match the opening wording of the paragraph, such as an optional "upon trial" and an optional "this court" followed by "finds:", or "upon" with an optional "trial" followed by "finding". One extraction node generally corresponds to several extraction expressions so as to cover the different ways in which the node may be worded. The first-line characters of each paragraph are matched against the extraction expressions, and the matched first-line characters are extracted as the node characters. For example, if a paragraph of the referee document to be processed reads "Upon trial, it is found that ×× has a debt relationship with ×× …", the node characters "Upon trial, it is found" can be extracted with the extraction expression.

It should be noted that, when matching with the extraction expressions, paragraphs are matched one by one and only paragraphs that have not yet been matched are examined. This preserves the order of extraction and prevents omissions, and it also prevents a paragraph whose node characters have already been determined from being extracted again, which would waste time and cause extraction errors.
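The matching in steps S1011 to S1013 can be sketched in Python. The English extraction expressions below are hypothetical stand-ins for the expressions of the application, and Python's `re` module is used in place of the @-delimited expression syntax; `pattern.match` anchors each expression at the start of the paragraph, which plays the role of matching the first-line characters.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical extraction expressions; one node may have several (S1011).
EXTRACTION_EXPRESSIONS = {
    "party information": [re.compile(r"Plaintiff"), re.compile(r"Party")],
    "trial finding": [
        re.compile(r"Upon trial, it is found"),
        re.compile(r"It is found upon trial"),
    ],
}

def find_node_characters(paragraphs: List[str]) -> Dict[str, Tuple[int, str]]:
    """Return {extraction node: (paragraph index, node characters)}."""
    matched = set()  # indexes of paragraphs that already yielded node characters
    found: Dict[str, Tuple[int, str]] = {}
    for node, expressions in EXTRACTION_EXPRESSIONS.items():
        for index, paragraph in enumerate(paragraphs):
            if index in matched:
                continue  # S1012: only unmatched paragraphs are examined
            hit = None
            for expression in expressions:
                hit = expression.match(paragraph)  # anchored at paragraph start
                if hit:
                    break
            if hit:
                matched.add(index)
                found[node] = (index, hit.group(0))  # S1013: node characters
                break
    return found

paragraphs = [
    "XX People's Court civil judgment",
    "Plaintiff A, male, ...",
    "Defendant B, female, ...",
    "Upon trial, it is found that A lent B money ...",
]
print(find_node_characters(paragraphs))
```

Keeping the `matched` set is what prevents a paragraph from being assigned to two extraction nodes.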

After the node characters are determined, the corresponding block texts can be determined from them. A block text is the part of the referee document to be processed that lies between two adjacent node characters and begins with the earlier one. For example, if the referee document to be processed contains "Party × … Found upon trial × …", the above process determines that "Party" and "Found upon trial" are node characters; since these two node characters are adjacent, "Party × …" is the block text corresponding to the extraction node "party information".

Once the block text of each extraction node is determined, the name of the extraction node can be used as a title and a correspondence established between each title and its block text, so that the referee document to be processed is structured into a first structured text consisting of extraction nodes and block texts. For example, for a civil judgment, a first extraction template consisting of the extraction nodes "head, party information, trial proceedings, plaintiff's claim, defendant's defense, trial finding, court opinion, judgment result, and tail" may be selected, the block text corresponding to each extraction node extracted, and a first structured text generated.
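Steps S102 and S103 amount to slicing the document between consecutive node characters. A minimal Python sketch follows, assuming the character offset of each node character is already known; the function name and the dictionary representation of the first structured text are illustrative choices, not part of the application.

```python
from typing import Dict

def build_first_structured_text(document: str,
                                node_positions: Dict[str, int]) -> Dict[str, str]:
    """Map each extraction node to the text from its node character
    up to the next node character (or the end of the document)."""
    ordered = sorted(node_positions.items(), key=lambda item: item[1])
    structured: Dict[str, str] = {}
    for i, (node, start) in enumerate(ordered):
        end = ordered[i + 1][1] if i + 1 < len(ordered) else len(document)
        structured[node] = document[start:end].strip()
    return structured

document = (
    "XX Court civil judgment\n"
    "Plaintiff A ...\n"
    "Upon trial, it is found that ..."
)
positions = {
    "head": 0,
    "party information": document.index("Plaintiff"),
    "trial finding": document.index("Upon trial"),
}
print(build_first_structured_text(document, positions))
```

Because each block ends where the next node character begins, no text between the first node character and the end of the document is lost or assigned twice.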

S2, extracting from the specified block text of the first structured text by using a second extraction template to obtain a first sub-structured text, wherein the first sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-block text in the specified block text.

Some block texts in the first structured text may also contain implicit information, that is, content the user cares about that is scattered across block texts and can only be obtained by further browsing and extraction. For example, a user may want to obtain the evidence list of the referee document to be processed directly from the structured text, while the evidence making up that list is scattered across block texts such as the plaintiff's claim and the defendant's defense. These block texts are then the specified block texts, and they must be further structured to refine and complete the first structured text.

After the first structured text is obtained, extraction continues on the specified block texts in the first structured text. The specified block texts can be determined as follows:

S211, obtaining first reference samples, wherein each first reference sample has the same text structure as the first structured text;

S212, obtaining the feature to be extracted corresponding to the feature model;

S213, determining a feature block text corresponding to the feature to be extracted in each first reference sample;

S214, counting the number of the feature block texts corresponding to the same feature to be extracted;

S215, determining a specified block text, wherein the specified block text is a feature block text for which the ratio of its number to the total number of the first reference samples is greater than or equal to a preset threshold.

In this embodiment, a feature block text is a block text in a first reference sample that contains the feature to be extracted, and the specified block text corresponding to a feature model can be determined by learning from a large number of first reference samples. A feature model is a model for extracting a specific feature from block texts, and a given feature usually appears in relatively fixed block texts. For example, if the feature to be extracted is "evidence", it usually appears in the block texts corresponding to the plaintiff's claim, the defendant's defense and the like, but not in the block texts corresponding to the head, the tail and the like. To determine the specified block text more accurately, a large number of first reference samples may be used. Each first reference sample has the same text structure as the first structured text; that is, since the first structured text consists of extraction nodes and their corresponding block texts, each first reference sample must also have this structure. By locating the feature to be extracted in each first reference sample, the proportion in which the feature appears in each block text can be obtained, namely the ratio of the number of feature block texts for that feature to the total number of first reference samples. To exclude cases where the feature appears in a block text only accidentally, for example because of an abnormal document, a preset threshold is used for screening: the feature block texts whose ratio is greater than or equal to the preset threshold are taken as the specified block texts.
For example, if the total number of first reference samples is 100, the feature to be extracted is "evidence", and the feature block text corresponding to "defendant's defense" occurs in 80 of them, the ratio is 0.8; assuming the preset threshold is 0.75, the block text corresponding to "defendant's defense" is a specified block text.
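The threshold screening of steps S211 to S215 can be sketched in Python with the same numbers as the example above (100 reference samples, 80 hits, threshold 0.75). The sample contents, the use of a regular expression as the "feature model", and the function name are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, List

def specified_block_nodes(reference_samples: List[Dict[str, str]],
                          feature_pattern: "re.Pattern",
                          threshold: float = 0.75) -> List[str]:
    """Return the nodes whose block text contains the feature in at least
    `threshold` of the reference samples (S213 to S215)."""
    counts: Counter = Counter()
    for sample in reference_samples:          # sample: {node: block text}
        for node, block in sample.items():
            if feature_pattern.search(block):  # this block is a feature block text
                counts[node] += 1
    total = len(reference_samples)
    return sorted(node for node, n in counts.items() if n / total >= threshold)

with_evidence = {"head": "XX Court", "defendant's defense": "The defendant submits evidence ..."}
without_evidence = {"head": "XX Court", "defendant's defense": "The defendant argues ..."}
reference_samples = [with_evidence] * 80 + [without_evidence] * 20

print(specified_block_nodes(reference_samples, re.compile(r"evidence")))
```

With these inputs the ratio for "defendant's defense" is 80 / 100 = 0.8, which meets the 0.75 threshold, while "head" never contains the feature and is excluded.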

After the specified block text is determined, extraction needs to be performed on it. Specifically, Fig. 3 is a flowchart of a method for generating a first sub-structured text provided by an embodiment of the present application, and the method includes:

S221, determining a feature extraction model corresponding to each extraction node in the second extraction template;

S222, determining a target character string and a target terminator from the specified block text by using the feature extraction model, wherein the target character string is a character string matched with an extraction expression in the feature extraction model, and the target terminator is a preset symbol representing the end of the sub-block text;

S223, determining a sub-block text, wherein the sub-block text consists of the characters, corresponding to the same extraction node, from the target character string to the target terminator;

S224, corresponding each extraction node in the second extraction template to the sub-block text to generate a first sub-structured text.

Typically, the second extraction template consists of several extraction nodes, each corresponding to content to be extracted from the specified block text. For example, the second extraction template consists of extraction nodes such as "plaintiff's evidence", "defendant's cross-examination" and "court's authentication", and the text corresponding to each of these nodes must be extracted from the specified block text. Each extraction node generally has a corresponding feature extraction model, which extracts matching character strings from the specified block text by matching feature words. For example, suppose the specified block text is "Plaintiff × claims …. In support of its claims, the plaintiff provides this court with the following evidence: 1. …; 2. …; 3. …." and the feature extraction model of the extraction node "plaintiff's evidence" is an expression delimited by @ that starts at a line break and matches "plaintiff" followed by "provide" or "submit", with at most ten other characters allowed before "plaintiff" and between the two words ({0,10}). The target character string is then determined to be "the plaintiff provides this court with the following evidence". The preset terminator may be a specified punctuation mark, word, phrase, sentence, text format or the like. Since by writing convention related content is generally grouped into one sentence ending in a period, the sentence-ending period can be set as the terminator. In the above example, the sub-block text is "the plaintiff provides this court with the following evidence: 1. …; 2. …; 3. ….".

If the second extraction template contains several extraction nodes, each extraction node needs to be paired with its sub-block text to obtain a first sub-structured text with this correspondence, for example "plaintiff's evidence - the plaintiff provides this court with the following evidence: 1. …; 2. …; 3. ….".
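Steps S222 to S224 can be sketched in Python as a search for the target character string followed by a scan to the terminator. The expression below is a simplified English stand-in for the feature extraction model, the terminator is assumed to be a period, and the sample text numbers its evidence as "(1)", "(2)" so that the numbering itself does not contain the terminator; all of these are illustrative assumptions.

```python
import re
from typing import Dict, Optional

def extract_sub_block(block_text: str, expression: "re.Pattern",
                      terminator: str = ".") -> Optional[str]:
    """Return the text from the target character string up to and
    including the first terminator after it, or None if nothing matches."""
    match = expression.search(block_text)            # target character string
    if match is None:
        return None
    end = block_text.find(terminator, match.end())   # target terminator
    end = len(block_text) if end == -1 else end + len(terminator)
    return block_text[match.start():end]

# Hypothetical feature extraction model for the node "plaintiff's evidence".
PLAINTIFF_EVIDENCE = re.compile(r"the plaintiff[^\n.;]{0,10}(provides|submits)")

block_text = (
    "Plaintiff A claims that B owes him money. In support of the claim, "
    "the plaintiff provides this court with the following evidence: "
    "(1) a loan contract; (2) a bank transfer record; (3) a written reminder. "
    "The defendant raised no objection."
)

first_sub_structured: Dict[str, Optional[str]] = {
    "plaintiff's evidence": extract_sub_block(block_text, PLAINTIFF_EVIDENCE),
}
print(first_sub_structured)
```

If the terminator can also occur inside the content, as a period does after "1." in some numbering styles, a different terminator or a more specific rule would be needed.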

S3, converting the sub-block text of the first sub-structured text into a text with a preset feature expression format to obtain a second sub-structured text.

As can be seen from the foregoing, the sub-block text of the first sub-structured text obtained so far still mixes several pieces of fine-grained information, for example several pieces of evidence, which is not convenient for browsing. The feature expression format of the first sub-structured text therefore needs to be converted.

In one implementation manner, as shown in fig. 4, a flowchart of a method for converting a text feature expression format provided in an embodiment of the present application is provided, where the method includes:

S311, determining a first type of sub-block text from the sub-block texts of the first sub-structured text, wherein the first type of sub-block text is the sub-block text which is matched with a first type of keywords and corresponds to an extraction node of the specified block text;

S312, determining target category keywords from the first type of sub-block text, wherein the target category keywords are word segments whose matching degree with preset category keywords is greater than or equal to a preset matching threshold;

S313, determining a classified text, wherein the classified text is the text in the sub-block text that has the same target category keyword;

S314, determining a first sequence number identifier from each classified text;

S315, dividing the classified texts by taking the first sequence number identifier as a separation node to obtain first sub-texts;

S316, adding a line feed character between two adjacent first sub-texts so that one first sub-text corresponds to one paragraph;

S317, generating a second sub-structured text by combining the target category keywords, the first sequence number identifiers and the corresponding first sub-texts.
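Steps S314 to S317 above can be sketched in Python as a split on sequence number identifiers followed by joining the pieces with line feeds. The identifier pattern "(1)", "(2)", the function names and the output layout (category keyword on the first line, one first sub-text per following line) are assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical form of the first sequence number identifier: "(1)", "(2)", ...
SEQUENCE_ID = re.compile(r"\(\d+\)")

def split_by_sequence(classified_text: str) -> List[str]:
    """S314 and S315: cut the classified text at each sequence number identifier."""
    starts = [m.start() for m in SEQUENCE_ID.finditer(classified_text)]
    if not starts:
        return [classified_text.strip()]
    bounds = starts + [len(classified_text)]
    return [classified_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

def to_second_sub_structured(category: str, classified_text: str) -> str:
    """S316 and S317: one first sub-text per paragraph, under its category keyword."""
    return category + ":\n" + "\n".join(split_by_sequence(classified_text))

classified_text = (
    "the plaintiff provides this court with the following evidence: "
    "(1) a loan contract; (2) a bank transfer record; (3) a written reminder."
)
print(to_second_sub_structured("plaintiff", classified_text))
```

The introductory wording before the first identifier is dropped here because the category keyword already records who presented the evidence; keeping it would be an equally valid design.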

The first sub-structured text typically contains multiple types of sub-block text, and the results converted will be different for different types of sub-block text. Usually, a corresponding keyword library can be established for different types of sub-block texts, for example, for a first type of sub-block text, the first type of sub-block text is a simple evidence presentation, and therefore proof information, evidence presentation and the like can be generally used as the first type of keywords. Thus, as shown in the above example, the first sub-structured text provides evidence to the courier for "provenance testimony-provenance: 1.…, respectively; 2.…, respectively; 3.… are provided. "the corresponding extraction node is" original declaration ", and the sub-block text can be determined to be the first type of sub-block text by matching with the first type of keywords. At this time, the sub-block text needs to be converted into a corresponding feature expression format.

Continuing the above example, the first type of sub-block text is "the plaintiff provides the following evidence to this court: 1. …; 2. …; 3. …". In a referee document, the party that performs an action is generally very important, so that party can be defined as the target, with different targets forming different categories, such as the plaintiff, the defendant, and the court. Different types of sub-block texts also correspond to different targets, so corresponding category keywords can be set for the sub-block texts. Word matching can then be applied to "the plaintiff provides the following evidence to this court: 1. …; 2. …; 3. …". If the target category keyword is determined to be "plaintiff", the party submitting the evidence is the plaintiff.

Further, if the first type of sub-block text contains a plurality of target category keywords, it needs to be divided into a plurality of classified texts with the target category keywords as division points, for example, "the plaintiff provides … to this court", "the defendant provides … to this court", and so on.
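The keyword matching and division described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the matching degree is approximated with `difflib.SequenceMatcher`, the keyword list, the threshold value, and the function names are hypothetical, and the list is limited to two parties so that the phrase "this court" inside a sentence does not act as a division point. The source does not specify how the matching degree is computed.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical subset of preset category keywords (parties that perform actions).
CATEGORY_KEYWORDS = ["plaintiff", "defendant"]
MATCH_THRESHOLD = 0.8  # assumed preset matching threshold


def match_category(token: str) -> Optional[str]:
    """Return the preset keyword whose similarity to `token` meets the threshold."""
    best_kw, best_score = None, 0.0
    for kw in CATEGORY_KEYWORDS:
        score = SequenceMatcher(None, token.lower(), kw).ratio()
        if score > best_score:
            best_kw, best_score = kw, score
    return best_kw if best_score >= MATCH_THRESHOLD else None


def split_classified_texts(sub_block: str) -> List[Tuple[str, str]]:
    """Split a sub-block text at each matched keyword into (keyword, classified text)."""
    cuts = []
    for m in re.finditer(r"[A-Za-z]+", sub_block):
        kw = match_category(m.group())
        if kw is not None:
            cuts.append((m.start(), kw))
    result = []
    for i, (start, kw) in enumerate(cuts):
        end = cuts[i + 1][0] if i + 1 < len(cuts) else len(sub_block)
        result.append((kw, sub_block[start:end].strip()))
    return result
```

Each returned pair holds one target category keyword and the classified text that starts at it; text before the first matched keyword is dropped in this sketch.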

Each classified text is then further refined and split. First serial number identifiers can be determined from the classified text, for example "1", "2" and "3" in "the plaintiff provides the following evidence to this court: 1. …; 2. …; 3. …". Using these first serial number identifiers as separation points, the classified text is divided into the first sub-texts "1. …", "2. …" and "3. …". A line break is then added between each two adjacent first sub-texts, so that each first sub-text occupies its own paragraph; each first sub-text may be a character string of one or more lines. Specifically, the following expression format is obtained:

1、…；

2、…；

3、…。

Meanwhile, to make the feature expression clearer, it needs to be displayed together with the target category keyword, that is:

The plaintiff provides the following evidence to this court:

1、…；

2、…；

3、…。

In this way, the evidence is displayed in the block text as a list, so that the user can take it in at a glance while browsing.
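Steps S314 to S317 can be sketched as below. This is a minimal sketch, not the patented implementation: the serial number pattern (digits followed by "." or "、"), the function name, and the sample strings are assumptions made for illustration, and numbers that appear inside an item (such as dates) would need a stricter pattern.

```python
import re

# Assumed form of a first serial number identifier: digits followed by "." or "、".
SERIAL_RE = re.compile(r"(\d+)[.、]\s*")


def to_list_format(classified_text: str) -> str:
    """Rewrite 'header: 1. a; 2. b.' as a header line plus one line per item."""
    first = SERIAL_RE.search(classified_text)
    if first is None:
        return classified_text  # no serial number identifiers: leave unchanged
    header = classified_text[: first.start()].strip()
    body = classified_text[first.start():]
    # With one capture group, re.split yields ['', '1', 'text', '2', 'text', ...].
    parts = SERIAL_RE.split(body)
    items = [f"{num}. {text.strip()}" for num, text in zip(parts[1::2], parts[2::2])]
    # A line feed between adjacent first sub-texts gives each its own paragraph.
    return "\n".join([header, *items])
```

The header that precedes the first identifier carries the target category keyword, so keeping it as the first line corresponds to displaying the list together with that keyword.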

In one implementation, Fig. 5 is a flowchart of a method for converting a text feature expression format provided in an embodiment of the present application, where the method includes:

S321, determining a second type of sub-block text from the sub-block texts of the first sub-structured text, wherein the second type of sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a second type of keyword;

S322, dividing the second type of sub-block text by taking preset separators as nodes to obtain second sub-texts;

S323, extracting a third sub-text from each second sub-text by using a first feature extraction model;

S324, acquiring a second serial number identifier from each third sub-text;

S325, determining a target first sub-text corresponding to the third sub-text, wherein the target first sub-text is the first sub-text corresponding to the first serial number identifier that is the same as the second serial number identifier;

S326, extracting a first label keyword from each second sub-text, wherein the first label keyword is a word segment that matches a preset label keyword;

S327, combining the third sub-text, the target first sub-text and the first label keyword to generate a second sub-structured text.

In this implementation, sub-block texts that express opinions are set as the second type of sub-block text, on the same principle as the first type of keywords in the previous implementation. The second type of keywords may be "opinion", "attitude", and the like. The second type of sub-block text in the first sub-structured text can be determined by matching.

For example, the second type of sub-block text is "the plaintiff has no objection to evidence 1-3 provided by the defendant; the plaintiff objects to evidence 4 provided by the defendant, considering that the debt of forty thousand is the defendant's personal debt." The second type of sub-block text can be divided using a preset separator. For example, if the preset separator is ";", the second sub-texts "the plaintiff has no objection to evidence 1-3 provided by the defendant" and "the plaintiff objects to evidence 4 provided by the defendant, considering that the debt of forty thousand is the defendant's personal debt." are obtained.

The corresponding third sub-text can then be extracted from each second sub-text using the first feature extraction model. Specifically, the first feature extraction model extracts the third sub-text from the second sub-text by matching a feature extraction expression. For example, if the first feature extraction model is "target + to + evidence + serial number", the third sub-texts "to evidence 1-3" and "to evidence 4" can be extracted from the second sub-texts.

A second serial number identifier, such as "1-3" or "4", can be determined from the third sub-text. These identifiers can be associated with the first serial number identifiers determined in the previous implementation. Both kinds of serial number identifier denote pieces of evidence, that is, first sub-texts, and the same number or character can be considered to correspond to the same first sub-text. The target first sub-text corresponding to a second serial number identifier can therefore be determined by comparing the first serial number identifiers with the second serial number identifier, and the evidence referred to in the third sub-text can then be presented as concrete text.

For the second type of sub-block text, what matters most is showing the opinions and attitudes held toward the evidence. These opinions and attitudes can be used as label keywords, which are the first label keywords in this implementation. They can be determined by matching the second sub-texts against preset label keywords. For example, if the preset label keywords are "objects" and "no objection", they are matched against the second sub-texts "the plaintiff has no objection to evidence 1-3 provided by the defendant" and "the plaintiff objects to evidence 4 provided by the defendant, considering that the debt of forty thousand is the defendant's personal debt.", from which the first label keyword corresponding to each second sub-text can be determined. Each first label keyword also has a correspondence with a third sub-text.

The third sub-text, the target first sub-text and the first label keyword are then combined to obtain a clearly displayed second sub-structured text.

For example: the plaintiff has no objection to evidence 1 …, evidence 2 …, evidence 3 …;

the plaintiff objects to evidence 4 ….
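Steps S321 to S327 can be sketched roughly as below. The evidence table, the label keywords, the regular expression standing in for the first feature extraction model, and the function name are all hypothetical; the source does not give the actual expressions, so this only illustrates how serial number identifiers could link opinions back to the evidence list.

```python
import re

# Hypothetical first sub-texts from the evidence list, keyed by first serial number identifier.
FIRST_SUB_TEXTS = {1: "loan contract", 2: "transfer receipt", 3: "chat records", 4: "IOU"}
# Assumed preset label keywords, checked in order.
LABEL_KEYWORDS = ["no objection", "objects"]
# Assumed expression for "evidence + serial number" (a single number or a range).
EVIDENCE_RE = re.compile(r"evidence (\d+)(?:-(\d+))?")


def structure_opinions(sub_block, separator=";"):
    """Return (first label keyword, [(serial number, target first sub-text), ...]) per second sub-text."""
    results = []
    for second in sub_block.split(separator):      # S322: second sub-texts
        match = EVIDENCE_RE.search(second)         # S323: third sub-text
        label = next((k for k in LABEL_KEYWORDS if k in second), None)  # S326
        if match is None or label is None:
            continue
        start = int(match.group(1))                # S324: second serial number identifier
        end = int(match.group(2) or start)
        # S325: look up the first sub-texts that carry the same serial numbers.
        cited = [(n, FIRST_SUB_TEXTS[n]) for n in range(start, end + 1) if n in FIRST_SUB_TEXTS]
        results.append((label, cited))             # S327: combine
    return results
```

Returning structured tuples rather than a finished sentence keeps the sketch independent of how the final display text is worded.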

In one implementation, Fig. 6 is a flowchart of a method for converting a text feature expression format provided in an embodiment of the present application, where the method includes:

S331, determining a second type of sub-block text from the sub-block texts of the first sub-structured text, wherein the second type of sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a second type of keyword;

S332, dividing the second type of sub-block text by taking preset separators as nodes to obtain fourth sub-texts;

S333, extracting a fifth sub-text from each fourth sub-text by using a second feature extraction model;

S334, combining all the fifth sub-texts to generate a second sub-structured text.

Compared with the previous implementation, the second feature extraction model takes the form "target + to + evidence + label keyword". When the second type of sub-block text has a text format that conforms to the second feature extraction model, a fifth sub-text such as "the plaintiff objects to evidence 4 …" can be extracted directly, so the fifth sub-texts can be used directly as the second sub-structured text.
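A direct extraction of this kind could look like the sketch below. The regular expression is an assumed stand-in for the second feature extraction model; its element order is adapted to English word order, and the party names, label phrases, and function name are hypothetical.

```python
import re

# Assumed stand-in for the second feature extraction model:
# target, label keyword, and evidence with a serial number or range.
SECOND_MODEL = re.compile(
    r"(?P<target>plaintiff|defendant)\s+"
    r"(?P<label>has no objection|objects)\s+to\s+"
    r"(?P<evidence>evidence \d+(?:-\d+)?)",
    re.IGNORECASE,
)


def extract_fifth_sub_texts(sub_block: str, separator: str = ";") -> str:
    """Split into fourth sub-texts, keep the matching span of each, and join them."""
    fifth = []
    for fourth in sub_block.split(separator):   # S332
        m = SECOND_MODEL.search(fourth)         # S333
        if m:
            fifth.append(m.group(0))
    return "\n".join(fifth)                     # S334
```

Because the whole matched span already contains the target, the evidence reference, and the label keyword, no lookup against the evidence list is needed in this variant.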

In one implementation, Fig. 7 is a flowchart of a method for converting a text feature expression format provided in an embodiment of the present application, where the method includes:

S341, determining a third type of sub-block text from the sub-block texts of the first sub-structured text, wherein the third type of sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a third type of keyword;

S342, dividing the third type of sub-block text by using preset separators to obtain sixth sub-texts;

S343, extracting a seventh sub-text from each sixth sub-text by using a third feature extraction model;

S344, acquiring a third serial number identifier from each seventh sub-text;

S345, determining a target first sub-text corresponding to the seventh sub-text, wherein the target first sub-text is the first sub-text corresponding to the first serial number identifier that is the same as the third serial number identifier;

S346, extracting a result text from each sixth sub-text by using a feature matching formula;

S347, combining the seventh sub-text, the target first sub-text and the result text to generate a second sub-structured text.

In this implementation, sub-block texts that express recognition or determination in argumentative texts are set as the third type of sub-block text, on the same principle as the first type and second type of keywords in the above implementations. The third type of keywords may be "recognition", "judgment", and the like. The third type of sub-block text in the first sub-structured text can be determined by matching.

Compared with the structuring of the second type of sub-block text in the above implementation, this implementation additionally extracts a result text from the sixth sub-text after the seventh sub-text and the corresponding target first sub-text have been determined. For example, the sixth sub-text is "this court recognizes evidence 1, which can serve as a factual basis", and the feature matching formula matches the corresponding characters from the sixth sub-text, for example "@ (identification" | determination) \ n [ ", n. (ii) a The fact (I as I contract I) is according to @" and the like. The result text "recognized as a factual basis" can be extracted from the sixth sub-text. The seventh sub-text, the target first sub-text and the result text can then be combined to obtain a second sub-structured text.

For example: this court recognizes evidence 1 … as a factual basis.
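The additional result-text step can be sketched as below. The feature matching formula printed above did not survive extraction intact, so the regular expression here is an assumed substitute rather than a reconstruction of it; the evidence table, verbs, and function name are likewise hypothetical.

```python
import re

# Hypothetical first sub-texts keyed by first serial number identifier.
FIRST_SUB_TEXTS = {1: "loan contract", 4: "IOU"}
# Assumed third feature extraction expression: "evidence + serial number".
EVIDENCE_RE = re.compile(r"evidence (\d+)")
# Assumed feature matching formula: a recognition verb followed later by a stated basis.
RESULT_RE = re.compile(
    r"(recognizes|determines)\b.*?\bas (a factual basis|a contractual basis)"
)


def structure_findings(sub_block, separator=";"):
    """Return (serial number, target first sub-text, result text) per sixth sub-text."""
    rows = []
    for sixth in sub_block.split(separator):    # S342
        seventh = EVIDENCE_RE.search(sixth)     # S343
        result = RESULT_RE.search(sixth)        # S346
        if seventh is None or result is None:
            continue
        serial = int(seventh.group(1))          # S344: third serial number identifier
        result_text = f"{result.group(1)} as {result.group(2)}"
        # S345 and S347: look up the target first sub-text and combine.
        rows.append((serial, FIRST_SUB_TEXTS.get(serial), result_text))
    return rows
```

Sixth sub-texts that mention evidence but contain no recognition expression are skipped in this sketch.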

It should be noted that the feature extraction model provided in the foregoing implementation may be adjusted according to actual requirements to extract different objects.

S4, updating the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

As can be seen from the foregoing, in the referee document structuring method provided by the present application, the first sub-structured text covers only some of the sub-block texts in the specified block text, not all of the text in the specified block text. Therefore, after the second sub-structured text is obtained, it only needs to replace the corresponding content in the first structured text to obtain the second structured text.

For example, the second sub-structured text is:

The plaintiff provides the following evidence to this court:

1、…；

2、…；

3、…。

The corresponding content in the first structured text is "plaintiff's claim - plaintiff XX claims …. To support its litigation request, the plaintiff provides the following evidence to this court: 1. …; 2. …; 3. …", in which "the plaintiff provides the following evidence to this court: 1. …; 2. …; 3. …" is the content corresponding to the second sub-structured text. It needs to be replaced by the second sub-structured text, that is:

Plaintiff's claim - plaintiff XX claims …. To support its litigation request,

the plaintiff provides the following evidence to this court:

1、…；

2、…；

3、…。

In this way, the structured referee document presents the text information to the user in finer detail, so that the user can quickly locate the required content.
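The replacement in step S4 can be sketched as a plain substring substitution. Representing the first structured text as a dictionary from extraction node to block text is an assumption made for this sketch, as are the function and parameter names.

```python
def update_structured_text(first_structured, node, original_span, second_sub_structured):
    """Return a copy in which `original_span` inside the block at `node`
    is replaced by the second sub-structured text, starting on a new line."""
    updated = dict(first_structured)  # leave the input unchanged
    block = first_structured[node]
    if original_span in block:
        updated[node] = block.replace(original_span, "\n" + second_sub_structured, 1)
    return updated
```

Only the span that was converted is replaced; the rest of the block text, which the second extraction template did not cover, is carried over unchanged.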

Fig. 8 is a schematic structural diagram of an apparatus for structuring a referee document according to an embodiment of the present application. The apparatus includes: a first extraction unit 1, configured to extract block texts from a referee document to be processed by using a first extraction template to obtain a first structured text, where the first structured text is composed of the extraction nodes in the first extraction template and the corresponding block texts in the referee document to be processed; a second extraction unit 2, configured to extract, by using a second extraction template, from a specified block text of the first structured text to obtain a first sub-structured text, where the first sub-structured text is composed of each extraction node in the second extraction template and the corresponding sub-block text in the specified block text; a conversion unit 3, configured to convert the sub-block texts of the first sub-structured text into text with a preset feature expression format to obtain a second sub-structured text; and an updating unit 4, configured to update the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

Optionally, the first extraction unit includes: a node character determining unit, configured to determine node characters in the referee document to be processed according to each extraction node in the first extraction template, where an extraction node is a character string that corresponds to a part of the content of the referee document to be processed, and a node character is the start character of the part of the content corresponding to that extraction node; a block text determining unit, configured to determine the block text corresponding to each extraction node, where the block text is all characters from the node character corresponding to the extraction node to the next node character; and a first structured text generation unit, configured to associate each extraction node with its block text to generate the first structured text.
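The behavior of the first extraction unit can be sketched as follows. Here each extraction node is paired with a marker string that opens its part of the document; that pairing, the node names, and the function name are assumptions for illustration, since the text only states that node characters mark where each part starts.

```python
def extract_blocks(document, node_markers):
    """node_markers maps each extraction node to the marker that opens its part.
    Returns extraction node -> block text (from its marker up to the next marker)."""
    positions = []
    for node, marker in node_markers.items():
        idx = document.find(marker)       # position of the node character
        if idx != -1:
            positions.append((idx, node))
    positions.sort()
    structured = {}
    for i, (start, node) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(document)
        structured[node] = document[start:end].strip()
    return structured
```

Nodes whose marker does not occur in the document are simply left out of the result in this sketch.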

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method for structuring referee documents, the method comprising:

2. The method according to claim 1, wherein the extracting the block text in the referee document to be processed by using the first extraction template to obtain the first structured text comprises:

determining node characters in a referee document to be processed according to each extraction node in a first extraction template, wherein the extraction nodes are character strings which have corresponding relations with contents of all parts in the referee document to be processed, and the node characters are initial characters of the contents of the parts, corresponding to the extraction nodes, in the referee document to be processed;

determining a block text corresponding to each extraction node, wherein the block text is all characters from the node character corresponding to the extraction node to the next node character;

and corresponding each extraction node to the block text to generate a first structured text.

3. The method of claim 1, wherein extracting from the specified block text of the first structured text using the second extraction template to obtain a first sub-structured text comprises:

determining a feature extraction model corresponding to each extraction node in the second extraction template;

determining a target character string and a target terminator from the specified block text by using the feature extraction model, wherein the target character string is a character string matched with an extraction expression in the feature extraction model, and the target terminator is a preset symbol representing the end of the sub-block text;

determining a sub-block text, wherein the sub-block text is a character which corresponds to the same extraction node and is from the target character string to the target terminator;

and corresponding each extraction node in the second extraction template to the sub-block text to generate a first sub-structured text.

4. The method of claim 3, wherein converting the sub-block text of the first sub-structured text into text having a predetermined feature expression format to obtain a second sub-structured text comprises:

determining a first type of sub-block text from the sub-block texts of the first sub-structured text, wherein the first type of sub-block text is the sub-block text of which the extraction node corresponding to the specified block text is matched with the first type of key words;

determining a target category keyword from the first category sub-block text, wherein the target category keyword is a participle with a matching degree with a preset category keyword being greater than or equal to a preset matching threshold;

determining a classified text, wherein the classified text is a text with the same target category key word in the sub-block text;

determining a first ordinal identifier from each of said classified texts;

dividing the classified texts by taking the first serial number identifier as a separation node to obtain a first sub-text;

adding a line feed character between two adjacent first sub-texts so that one first sub-text corresponds to one paragraph;

and generating a second sub-structured text by combining the target category key words, the serial number identifiers and the corresponding first sub-text.

5. The method of claim 4, wherein converting the sub-block text of the first sub-structured text into text having a predetermined feature expression format to obtain a second sub-structured text comprises:

determining a second type of sub-block texts from the sub-block texts of the first sub-structured texts, wherein the second type of sub-block texts are sub-block texts of which the extraction nodes corresponding to the specified block texts are matched with second type keywords;

dividing the second type of sub-block texts by taking preset separators as nodes to obtain second sub-texts;

extracting a third sub-text from the second sub-text by using a first feature extraction model;

acquiring a second serial number identifier from each third sub text;

determining a target first sub-text corresponding to the third sub-text, wherein the target first sub-text is a first sub-text corresponding to the first serial number identifier which is the same as the second serial number identifier;

extracting a first label keyword from each second sub-text, wherein the first label keyword is a participle matched with a preset label keyword;

and generating a second sub-structured text by combining the third sub-text, the target first sub-text and the first label key word.

6. The method of claim 3, wherein converting the sub-block text of the first sub-structured text into text having a predetermined feature expression format to obtain a second sub-structured text comprises:

dividing the second type of sub-block texts by taking preset separators as nodes to obtain fourth sub-texts;

extracting a fifth sub-text from each fourth sub-text by using a second feature extraction model;

and combining all the fifth sub-texts to generate a second sub-structured text.

7. The method of claim 5, wherein converting the sub-block text of the first sub-structured text into text having a predetermined feature expression format to obtain a second sub-structured text comprises:

determining a third type of sub-block texts from the sub-block texts of the first sub-structured text, wherein the third type of sub-block texts are sub-block texts of which the extraction nodes corresponding to the specified block texts are matched with third type keywords;

dividing the third type of sub-block texts by using a preset separator to obtain a sixth sub-text;

extracting a seventh sub-text from the sixth sub-text by using a third feature extraction model;

acquiring a third serial number identifier from each seventh sub-text;

determining a target first sub-text corresponding to the seventh sub-text, wherein the target first sub-text is a first sub-text corresponding to the first serial number identifier which is the same as the second serial number identifier;

extracting a result text from each sixth sub-text by using a feature matching formula;

and combining the seventh sub-text, the target first sub-text and the result text to generate a second sub-structured text.

8. The method of claim 1, wherein the updating the corresponding content in the first structured text with the second sub-structured text to obtain a second structured text comprises:

and replacing the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

9. An apparatus for structuring referee documents, comprising:

10. The apparatus of claim 9, wherein the first extraction unit comprises:

a node character determining unit, configured to determine a node character in a referee document to be processed according to each extraction node in a first extraction template, where the extraction node is a character string having a corresponding relationship with each part of content in the referee document to be processed, and the node character is a start character of a part of content corresponding to the extraction node in the referee document to be processed;

a block text determining unit, configured to determine a block text corresponding to each extracted node, where the block text is all characters from a node character corresponding to the extracted node to a next node character;

and the first structured text generation unit is used for corresponding each extraction node to the block text to generate a first structured text.