CN112541373A - Judicial text recognition method, text recognition model obtaining method and related equipment - Google Patents


Info

Publication number
CN112541373A
CN112541373A
Authority
CN
China
Prior art keywords
text
judicial
preset
probability vector
recognition model
Prior art date
Legal status (the status is an assumption, not a legal conclusion; Google has not performed a legal analysis)
Granted
Application number
CN201910891596.1A
Other languages
Chinese (zh)
Other versions
CN112541373B (en)
Inventor
曾祥辉
冯鸳鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201910891596.1A priority Critical patent/CN112541373B/en
Priority to PCT/CN2020/099851 priority patent/WO2021051957A1/en
Publication of CN112541373A publication Critical patent/CN112541373A/en
Application granted granted Critical
Publication of CN112541373B publication Critical patent/CN112541373B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions


Abstract

The invention discloses a judicial text recognition method, a text recognition model obtaining method and related equipment. The method obtains text content from a judicial text; inputs the text content into a preset first judicial text recognition model to obtain a first probability vector that the text content is a preset component of the judicial text; compares the text content with preset knowledge graph characteristics and obtains, from the comparison result, a second probability vector that the text content is the preset component of the judicial text; concatenates the first probability vector and the second probability vector into a third probability vector; and inputs the third probability vector into a preset second judicial text recognition model to obtain the recognition result that the text content is the preset component of the judicial text. The invention can effectively improve recognition accuracy.

Description

Judicial text recognition method, text recognition model obtaining method and related equipment
Technical Field
The invention relates to the technical field of text processing, in particular to a judicial text recognition method, a text recognition model obtaining method and related equipment.
Background
In the judicial domain, users may deal with a large number of judicial texts, such as official documents, prosecution books and notes. A user can search for a certain component in a judicial text by browsing through it, for example: the litigation request portion, the fact portion, or the judgment result portion. However, judicial texts are usually long, and when the number of judicial texts the user needs to browse is large, the user must spend considerable time browsing them, which reduces the efficiency of obtaining components of judicial texts.
To improve the efficiency with which users obtain components of judicial texts, the prior art can automatically identify the text content belonging to each component through regular expressions. However, a regular expression can only identify texts with uniform textual characteristics; because judicial texts take many forms and the textual characteristics of their content vary widely, the accuracy of identifying components of judicial texts with regular expressions is low.
Disclosure of Invention
In view of the above problems, the present invention provides a judicial text recognition method, a text recognition model obtaining method and related devices, which overcome or at least partially solve the above problems, and the technical solutions are as follows:
A judicial text recognition method, comprising:
acquiring text content in a judicial text;
inputting the obtained text content into a preset first judicial text recognition model, and obtaining a first probability vector of the text content output by the first judicial text recognition model as a preset component of a judicial text;
comparing the text content with preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
concatenating the first probability vector and the second probability vector into a third probability vector;
and inputting the third probability vector into a preset second judicial text recognition model, and obtaining a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text.
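For illustration only, the claimed pipeline can be sketched as follows. The two models and the knowledge-graph comparison are hypothetical stand-in functions, not the trained models the disclosure describes:

```python
import re

# Hypothetical stand-ins for the trained models and the knowledge-graph
# comparison; a real system would use trained classifiers here.
def first_model(text):
    # e.g. a trained text classifier returning [p, 1 - p]
    p = 0.8 if "request" in text else 0.2
    return [p, 1 - p]

def compare_with_kg(text, kg_regexes):
    # second probability vector from knowledge-graph features
    hit = any(re.search(r, text) for r in kg_regexes)
    return [1.0, 0.0] if hit else [0.0, 1.0]

def second_model(vec):
    # stand-in for the trained second model (e.g. a logistic-regression layer)
    return "preset component" if vec[0] + vec[2] > 1.0 else "other"

def recognize(text, kg_regexes):
    p1 = first_model(text)                  # first probability vector
    p2 = compare_with_kg(text, kg_regexes)  # second probability vector
    p3 = p1 + p2                            # third probability vector (concatenation)
    return second_model(p3)                 # recognition result
```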
Optionally, the preset knowledge graph characteristics include at least one of a regular expression, a template of the preset component, an entity vocabulary, and a concept vocabulary,
the comparing the text content with the preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to the comparison result, includes at least one of the following judgment processes:
judging whether the text content conforms to the regular expression or not, and obtaining a first result vector according to a judgment result;
judging whether the text content conforms to the template of the preset component part or not, and obtaining a second result vector according to a judgment result;
judging whether the text content contains the entity vocabulary or not, and obtaining a third result vector according to a judgment result;
judging whether the text content contains the concept vocabulary or not, and obtaining a fourth result vector according to a judgment result;
the comparing the text content with the preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to the comparison result, further includes:
and obtaining a second probability vector of the text content being the preset component of the judicial text according to at least one vector of the first result vector, the second result vector, the third result vector and the fourth result vector.
Optionally, after the third probability vector is input into a preset second judicial text recognition model, and a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text is obtained, the method further includes:
and checking whether the text content meets a preset check rule, and if not, determining that the text content is the preset component of the judicial text.
Optionally, the weight of the first probability vector is a first preset weight, and the weight of the second probability vector is a second preset weight, where a sum of the first preset weight and the second preset weight is 1.
A text recognition model obtaining method, comprising:
obtaining a plurality of training texts, wherein the training texts are text contents in judicial texts, the training texts correspond to preset identifications, and the preset identifications are: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text;
inputting the obtained training text into a preset first judicial text recognition model, and obtaining a fourth probability vector of the training text output by the first judicial text recognition model as a preset component of the judicial text;
comparing the training text with preset knowledge graph characteristics, and obtaining a fifth probability vector of the training text which is the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
splicing the fourth probability vector and the fifth probability vector into a sixth probability vector;
performing machine learning on the sixth probability vector according to the preset identification corresponding to the training text to obtain a second judicial text recognition model, wherein the input of the second judicial text recognition model is: the probability vector output by the first judicial text recognition model concatenated with the probability vector obtained from the comparison of the text content with the preset knowledge graph characteristics, and the output of the second judicial text recognition model is: the recognition result of whether the text content is the preset component of the judicial text.
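The assembly of training data for the second model can be sketched as follows, assuming hypothetical stand-ins for the first model and the knowledge-graph comparison; the resulting rows and labels would then be fed to any classifier (e.g. logistic regression) to obtain the second judicial text recognition model:

```python
def build_training_row(training_text, first_model, kg_compare):
    # Fourth and fifth probability vectors, concatenated into the sixth
    p4 = first_model(training_text)
    p5 = kg_compare(training_text)
    return p4 + p5  # sixth probability vector (list concatenation)

# Hypothetical labelled corpus: 1 = preset component, 0 = not
corpus = [("request the court to order payment", 1),
          ("the weather was fine that day", 0)]

# Stand-in model and comparison functions (illustrative only)
toy_model = lambda s: [0.9, 0.1] if "request" in s else [0.1, 0.9]
toy_kg = lambda s: [1.0, 0.0] if "court" in s else [0.0, 1.0]

X = [build_training_row(t, toy_model, toy_kg) for t, _ in corpus]
y = [label for _, label in corpus]
# X and y can now be fed to a classifier to train the second model.
```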
A judicial text recognition device comprising: a text content obtaining unit, a first probability vector obtaining unit, a second probability vector obtaining unit, a third probability vector obtaining unit and a recognition result obtaining unit,
the text content obtaining unit is used for obtaining text content in a judicial text;
the first probability vector obtaining unit is configured to input the obtained text content into a preset first judicial text recognition model, and obtain a first probability vector that the text content output by the first judicial text recognition model is a preset component of a judicial text;
the second probability vector obtaining unit is used for comparing the text content with preset knowledge graph characteristics and obtaining a second probability vector of the text content being the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
the third probability vector obtaining unit is configured to splice the first probability vector and the second probability vector into a third probability vector;
and the recognition result obtaining unit is used for inputting the third probability vector into a preset second judicial text recognition model and obtaining the recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text.
Optionally, the apparatus further comprises: the rule checking unit is used for checking the rule,
the rule checking unit is configured to check whether the text content meets a preset checking rule after the recognition result obtaining unit obtains the recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text, and if not, determine that the text content is the preset component of the judicial text.
A text recognition model obtaining apparatus comprising: a training text obtaining unit, a fourth probability vector obtaining unit, a fifth probability vector obtaining unit, a sixth probability vector obtaining unit and a judicial text recognition model obtaining unit,
the training text obtaining unit is used for obtaining a plurality of training texts, wherein the training texts are text contents in judicial texts, the training texts correspond to preset identifications, and the preset identifications are: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text;
the fourth probability vector obtaining unit is configured to input the obtained training text into a preset first judicial text recognition model, and obtain a fourth probability vector that the training text output by the first judicial text recognition model is a preset component of the judicial text;
the fifth probability vector obtaining unit is configured to compare the training text with a preset knowledge graph feature, and obtain a fifth probability vector that the training text is the preset component of the judicial text according to a comparison result, where the preset knowledge graph feature corresponds to the preset component;
the sixth probability vector obtaining unit is configured to splice the fourth probability vector and the fifth probability vector into a sixth probability vector;
the judicial text recognition model obtaining unit is configured to perform machine learning on the sixth probability vector according to the preset identifier corresponding to the training text to obtain a second judicial text recognition model, where the input of the second judicial text recognition model is: the probability vector output by the first judicial text recognition model concatenated with the probability vector obtained from the comparison of the text content with the preset knowledge graph characteristics, and the output of the second judicial text recognition model is: the recognition result of whether the text content is the preset component of the judicial text.
A storage medium having stored thereon a program which, when executed by a processor, implements any of the above-described judicial text recognition methods, and/or the above-described text recognition model obtaining methods.
An electronic device comprising at least one processor, at least one memory connected to the processor, and a bus; the processor and the memory communicate with each other through the bus; the processor is configured to call program instructions in the memory to perform any one of the above-mentioned judicial text recognition methods, and/or the above-mentioned text recognition model obtaining method.
By the above technical solutions, the judicial text recognition method, the text recognition model obtaining method and the related equipment can obtain the text content in a judicial text; input the obtained text content into a preset first judicial text recognition model and obtain a first probability vector that the text content is a preset component of the judicial text; compare the text content with preset knowledge graph characteristics and obtain, from the comparison result, a second probability vector that the text content is the preset component, the preset knowledge graph characteristics corresponding to the preset component; concatenate the first probability vector and the second probability vector into a third probability vector; and input the third probability vector into a preset second judicial text recognition model to obtain the recognition result that the text content is the preset component of the judicial text. By concatenating the first and second probability vectors of the text content into the third probability vector and inputting it into the second judicial text recognition model to obtain the recognition result, the embodiment of the invention overcomes the low recognition accuracy of the prior art.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flow chart illustrating a judicial text recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a text recognition model obtaining method according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating another judicial text recognition method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram illustrating a judicial text recognition apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram illustrating a text recognition model obtaining apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of another judicial text recognition device provided by the embodiment of the invention;
fig. 7 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, a judicial text recognition method provided in the embodiment of the present invention may include:
s100, obtaining text contents in a judicial text;
specifically, the judicial texts may include official documents, prosecution books, notes, and the like. The text content obtained by S100 may be all or part of the content in the judicial text. When the obtained text content is part of the content in the judicial text, the part of the content may be the text content of one or more natural paragraphs in the judicial text, or may be the text content of one or more sentences in the judicial text. The present invention may perform segmentation and/or sentence segmentation processing by paragraph identifiers, periods, etc., to obtain one or more natural paragraphs, or to obtain one or more sentences.
S200, inputting the obtained text content into a preset first judicial text recognition model, and obtaining a first probability vector of the text content output by the first judicial text recognition model as a preset component of a judicial text;
specifically, the preset first judicial text recognition model can be obtained by performing machine learning on a plurality of training texts. The training text used by the invention can be corresponding to a preset identifier, and the preset identifier can be: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text. Specifically, the input of the first judicial text recognition model may be a character, or may be a vector matrix obtained by converting text content, and the present invention is not limited herein. The training text with the identifier of the preset component of the judicial text may be referred to as a positive sample, and the training text with the identifier of the non-preset component of the judicial text may be referred to as a negative sample. The method can perform machine learning through the positive sample and the negative sample, so as to obtain the first judicial text recognition model with higher recognition accuracy. The output of the first judicial text recognition model may be a probability vector with text content being a preset component of the judicial text. Specifically, the probability vector may be obtained according to probability transformation. For example, if the first judicial text recognition model determines that the probability that the text content is a preset component of the judicial text is 0.8, the first probability vector output by the first judicial text recognition model may be [ 0.80.2 ].
It is noted that the first probability vector may comprise both the probability that the text content is a preset component of the judicial text and the probability that it is not. For example, the first probability vector may be [0.93, 0.07], where 0.93 is the probability that the text content is a preset component of the judicial text and 0.07 is the probability that it is not.
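The two-element form of the first probability vector follows directly from the model's scalar confidence:

```python
def to_probability_vector(p):
    # [probability it is the preset component, probability it is not]
    return [round(p, 6), round(1.0 - p, 6)]
```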
Of course, in practical application, the invention can use a plurality of different machine learning algorithms to perform machine learning on the training text, thereby obtaining a plurality of different first judicial text recognition models. Furthermore, the invention can also test each first judicial text recognition model through a plurality of test texts. The test sample used in the present invention may also correspond to a preset identifier. Through testing, the method can obtain the first judicial text recognition model with the highest recognition accuracy from a plurality of different first judicial text recognition models, and perform the processing of the step S200 by taking the model as the preset first judicial text recognition model.
For ease of understanding, an example is given here: assuming that the training text is a sentence and the preset component of the judicial text is the litigation request part of a referee document, the invention can extract the sentences in the litigation request parts of a plurality of referee documents and add a litigation request tag to each sentence. The invention can also extract sentences from non-litigation-request parts of a plurality of referee documents and add a non-litigation-request tag to each sentence. All sentences with litigation request tags may be taken as positive samples and all sentences with non-litigation-request tags as negative samples. Then, machine learning algorithms such as TextRCNN, TextCNN and LR can each be applied to the positive and negative samples, yielding a plurality of first judicial text recognition models corresponding to the different algorithms. These models can then be tested on the test texts to obtain the first judicial text recognition model with the highest recognition accuracy, which is used as the preset first judicial text recognition model in step S200. Of course, if the recognition accuracy of the best model is still lower than a preset accuracy, more training texts can be added (for example, the test texts can be moved into the training set) and machine learning performed again, until the recognition accuracy is no longer below the preset accuracy.
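The model-selection step can be sketched as follows; the candidate models are represented as hypothetical predict functions rather than real TextRCNN/TextCNN/LR implementations:

```python
def select_best_model(candidates, test_set):
    # candidates: name -> predict(text) -> label; test_set: (text, label) pairs
    def accuracy(model):
        return sum(model(t) == y for t, y in test_set) / len(test_set)
    return max(candidates, key=lambda name: accuracy(candidates[name]))
```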
The step S200 of inputting the obtained text content into a preset first judicial text recognition model may include: and converting the obtained text content into a vector matrix, and inputting the vector matrix into a preset first judicial text recognition model.
Because the vector matrix is a vector expression of text content formed by characters, the vector matrix is input into the first judicial text recognition model, namely the obtained text content is input into the preset first judicial text recognition model.
Specifically, the process of converting the text content into the vector matrix may include:
performing word segmentation on the text content to obtain a plurality of words;
determining the word vector corresponding to each of the plurality of words;
and combining the word vectors into a vector matrix according to the order in which the words appear in the text content.
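The three conversion steps can be sketched as follows; whitespace tokenisation stands in for a real Chinese word segmenter, and `word_vectors` is a hypothetical pretrained embedding table:

```python
def text_to_matrix(text, word_vectors):
    words = text.split()  # word segmentation (simplified to whitespace)
    dim = len(next(iter(word_vectors.values())))
    unk = [0.0] * dim     # zero vector for out-of-vocabulary words
    # stack word vectors in the order the words appear in the text
    return [word_vectors.get(w, unk) for w in words]
```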
S300, comparing the text content with preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
Optionally, the preset knowledge graph characteristics may include at least one of a regular expression, a template of the preset component, an entity vocabulary, and a concept vocabulary. These embody the features of the preset component of the judicial text at different scales: the entity vocabulary and the concept vocabulary capture smaller-scale features, while the regular expression and the template of the preset component capture larger-scale features.
The embodiment of the invention can compare the text content with the preset knowledge graph characteristics and judge whether the text content conforms to them. For example, when the preset component of the judicial text is the litigation request part of a referee document, the preset knowledge graph features may include "request & story", "law & property", and so on. The second probability vector may be a probability vector of whether the text content conforms to the preset knowledge graph features. When the text content conforms to them, the second probability vector may be [1, 0], indicating complete conformance; when it does not, the second probability vector may be [0, 1], indicating complete non-conformance. By comparing the text content with the preset knowledge graph characteristics, the resulting second probability vector better matches a judicial professional's judgment of whether the text content is the preset component of the judicial text.
Optionally, the preset knowledge graph feature includes at least one of a regular expression, a template of the preset component, an entity vocabulary, and a concept vocabulary, and therefore, the step S300 may include at least one of the following determination processes:
judging whether the text content conforms to the regular expression or not, and obtaining a first result vector according to a judgment result;
judging whether the text content conforms to the template of the preset component part or not, and obtaining a second result vector according to a judgment result;
judging whether the text content contains the entity vocabulary or not, and obtaining a third result vector according to a judgment result;
and judging whether the text content contains the concept vocabulary or not, and obtaining a fourth result vector according to a judgment result.
Wherein, the regular expression may include: "request & notice", "law & property", "request | please court", and the like. It is understood that a regular expression may combine relationships between multiple phrases; the embodiments of the present invention do not illustrate them one by one. The first result vector may be [1, 0] when the text content completely conforms to the regular expression, and [0, 1] when it does not conform at all. Of course, the elements of the first result vector may also be decimals indicating the extent to which the text content conforms to the regular expression, for example: [0.3, 0.7].
Specifically, the template of the preset component part may be a template commonly used by persons related to the judicial field when writing a certain judicial text part. For example, if the preset component of the judicial text is a litigation request part, the template of the preset component may include at least one sentence structure of the preset component, which may be represented by a character string, for example: the character string "judge was Company | Person pay" represents a sentence structure in the litigation request section, and Company and Person in the character string are the entity vocabulary of the Company and the entity vocabulary of the Person, respectively. The templates of the preset components of the present application may include entity vocabulary, concept vocabulary, and the like.
The embodiment of the invention can judge whether the text content conforms to the format in the template of the preset component, and determine accordingly whether the text content conforms to the template. The second result vector may be [1, 0] when the text content conforms to the template of the preset component, and [0, 1] when it does not. Of course, the elements of the second result vector may also be decimals indicating the degree to which the text content conforms to the template.
Specifically, the entity vocabulary may be a vocabulary of objectively existing, mutually distinguishable things, such as "trademark office", "trademark review board", and the like. One or more entity vocabularies can be preset in the embodiment of the invention. The third result vector may be [1, 0] when the text content includes an entity vocabulary and [0, 1] when it does not. Of course, the elements of the third result vector may be decimals indicating the ratio of the preset entity vocabularies contained in the text content to all preset entity vocabularies.
Specifically, a concept vocabulary may be a vocabulary of relatively existing things, such as: "plaintiff", "defendant", "private prosecutor", "public prosecutor", and "appellant". One or more concept vocabularies can be preset in the embodiment of the invention. The fourth result vector may be [1, 0] when the text content includes a concept vocabulary and [0, 1] when it does not. Of course, the elements of the fourth result vector may be decimals indicating the ratio of the preset concept vocabularies contained in the text content to all preset concept vocabularies.
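The four judgment processes can be sketched as follows. Template matching is simplified to substring matching here, which is an assumption; the disclosure's templates can encode richer sentence structures:

```python
import re

def hit(flag):
    # [1, 0] when the check passes, [0, 1] otherwise
    return [1.0, 0.0] if flag else [0.0, 1.0]

def kg_result_vectors(text, regexes, templates, entities, concepts):
    r1 = hit(any(re.search(r, text) for r in regexes))  # regular expressions
    r2 = hit(any(t in text for t in templates))         # component templates (simplified)
    r3 = hit(any(e in text for e in entities))          # entity vocabulary
    r4 = hit(any(c in text for c in concepts))          # concept vocabulary
    return r1, r2, r3, r4
```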
Further, step S300 may further include:
and obtaining a second probability vector of the text content being the preset component of the judicial text according to at least one vector of the first result vector, the second result vector, the third result vector and the fourth result vector.
For ease of understanding, an example is given here: in an alternative embodiment of the present invention, if the preset knowledge-graph features include a regular expression, a template of a preset component, an entity vocabulary and a concept vocabulary, the second probability vector may be [1 0 1 0 1 0 0 1] when the text content conforms to the regular expression, conforms to the template of the preset component, contains the entity vocabulary and does not contain the concept vocabulary. In another optional embodiment of the present invention, if the preset knowledge-graph features include a regular expression and a template of a preset component, the second probability vector may be [1 0 0 1] when the text content conforms to the regular expression and does not conform to the template of the preset component.
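The combination step above can be sketched as a simple concatenation of whichever result vectors were computed; the fixed ordering (regex, template, entity, concept) is an assumption consistent with the examples:

```python
# Concatenate the available result vectors into the second probability
# vector, in the fixed order in which the knowledge-graph features apply.
def second_probability_vector(*result_vectors: list) -> list:
    combined = []
    for vec in result_vectors:
        combined.extend(vec)
    return combined
```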
According to the embodiment of the invention, weights can be assigned to the first probability vector and the second probability vector according to their respective accuracy. A person skilled in the art can measure the accuracy of the first probability vector and the second probability vector using training texts; for example, text content already determined to be the preset component of the judicial text can be used as training text, and the first probability vector and the second probability vector indicating that the training text is the preset component of the judicial text can be checked separately. Optionally, the weight of the first probability vector is a first preset weight, and the weight of the second probability vector is a second preset weight, where the sum of the first preset weight and the second preset weight is 1. For example, the first preset weight may be 0.85 and the second preset weight may be 0.15.
S400, splicing the first probability vector and the second probability vector into a third probability vector;
for ease of understanding, an example is given here: if the first probability vector is [0.86 0.14] and the second probability vector is [1 0 0 1 1 0 0 1], then the third probability vector may be [0.86 0.14 1 0 0 1 1 0 0 1]. It will be appreciated that the third probability vector may also be [1 0 0 1 1 0 0 1 0.86 0.14], that is, the two vectors spliced in the opposite order.
Optionally, step S400 may specifically include:
multiplying the first probability vector by a first preset weight to obtain a first weight vector;
multiplying the second probability vector by a second preset weight to obtain a second weight vector;
and splicing the first weight vector and the second weight vector into a third probability vector.
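The three sub-steps above can be sketched as follows; the default weights 0.85 and 0.15 are the example values from the text, not values mandated by the method:

```python
# Weighted splice of step S400: scale each probability vector by its
# preset weight, then concatenate the two weighted vectors.
def weighted_splice(first_vec, second_vec, w1=0.85, w2=0.15):
    first_weighted = [w1 * x for x in first_vec]    # first weight vector
    second_weighted = [w2 * x for x in second_vec]  # second weight vector
    return first_weighted + second_weighted         # third probability vector
```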
S500, inputting the third probability vector into a preset second judicial text recognition model, and obtaining a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text.
The preset second judicial text recognition model can be obtained by performing machine learning on a plurality of probability vectors. The second judicial text recognition model can also use positive samples and negative samples to perform machine learning, or can be trained through a plurality of probability vectors first and then subjected to accuracy testing. The invention provides a method for obtaining a text recognition model, and a preset second judicial text recognition model can be obtained according to the method. As shown in fig. 2, a method for obtaining a text recognition model according to an embodiment of the present invention may include:
s010, obtaining a plurality of training texts, wherein the training texts are text contents in judicial texts, the training texts correspond to preset identifications, and the preset identifications are: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text;
s020, inputting the obtained training text into a preset first judicial text recognition model, and obtaining a fourth probability vector of the training text output by the first judicial text recognition model as a preset component of the judicial text;
the preset first judicial text recognition model in the step S020 is the same as the preset first judicial text recognition model used in the step S200.
S030, comparing the training text with preset knowledge graph characteristics, and obtaining a fifth probability vector of the training text which is the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
the preset knowledge graph characteristics used in step S030 are the same as the preset knowledge graph characteristics used in step S300.
S040, the fourth probability vector and the fifth probability vector are spliced into a sixth probability vector;
s050, machine learning is carried out on the sixth probability vector according to the preset identification corresponding to the training text, and a second judicial text recognition model is obtained, wherein the input of the second judicial text recognition model is: the probability vector output by the first judicial text recognition model spliced with the probability vector obtained from the comparison of the text content with the preset knowledge graph characteristics, and the output of the second judicial text recognition model is: the recognition result of whether the text content is the preset component of the judicial text.
Specifically, the method shown in fig. 2 performs machine learning on the sixth probability vector obtained from the training text with its corresponding preset identifier; since the sixth probability vector is obtained by splicing the fourth probability vector and the fifth probability vector, the second judicial text recognition model trained by the method shown in fig. 2 integrates the knowledge-graph feature comparison result with the recognition result of the first judicial text recognition model, and therefore has high recognition accuracy.
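The patent does not name a learning algorithm for the second model, so the training flow S010–S050 can be sketched with a plain perceptron standing in for it; the sixth probability vectors and labels used here are toy data, not values from the patent:

```python
# Train a stand-in second model on spliced (sixth) probability vectors.
# labels: 1 = preset component of the judicial text, 0 = not.
def train_second_model(sixth_vectors, labels, epochs=50, lr=0.1):
    n = len(sixth_vectors[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(sixth_vectors, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Recognition step S500: score a third probability vector with the model.
def recognize(model, third_vector):
    w, b = model
    score = sum(wi * xi for wi, xi in zip(w, third_vector)) + b
    return "preset component" if score > 0 else "not preset component"
```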
Specifically, the second judicial text recognition model may recognize whether the text content is the preset component of the judicial text. For example: when the third probability vector is [0.68 0.32 1 0 1 0 1 0 1 0], the preset second judicial text recognition model may output the recognition result "the text content is the preset component of the judicial text".
It can be understood that, in the embodiment of the present invention, the recognition result that the text content output by the second judicial text recognition model is not the preset component of the judicial text may also be obtained, for example: when the third probability vector input into the second judicial text recognition model is [0.32 0.68 1 0 1 0 1 0 1 0], the preset second judicial text recognition model may output the recognition result "the text content is not the preset component of the judicial text".
The judicial text identification method provided by the embodiment of the invention can obtain the text content in the judicial text; inputting the obtained text content into a preset first judicial text recognition model, and obtaining a first probability vector of the text content output by the first judicial text recognition model as a preset component of a judicial text; comparing the text content with preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component; concatenating the first probability vector and the second probability vector into a third probability vector; and inputting the third probability vector into a preset second judicial text recognition model, and obtaining a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text. The embodiment of the invention overcomes the technical problem of low accuracy in the prior art by splicing the first probability vector and the second probability vector of the text content into the third probability vector and inputting the third probability vector into the second judicial text recognition model to obtain the recognition result.
Optionally, based on the method shown in fig. 1, as shown in fig. 3, another judicial text recognition method provided in the embodiment of the present invention may further include, after step S500:
s600, checking whether the text content meets a preset checking rule, and if not, determining that the text content is the preset component of the judicial text.
Specifically, the preset verification rule may describe text that is clearly not the preset component of the judicial text. When the text content meets the preset verification rule, the text content is clearly not the preset component of the judicial text; if the preset verification rule is not met, the text content is determined to be the preset component of the judicial text.
The embodiment of the invention can provide a one-vote-veto preset check rule, which effectively avoids errors made by the machine-learned second judicial text recognition model when recognizing the preset component of the judicial text, and further improves the accuracy of recognizing whether the text content is the preset component of the judicial text.
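The one-vote-veto check of step S600 can be sketched as follows; the veto patterns below are hypothetical examples of text that is clearly not, say, a litigation request:

```python
import re

# Hypothetical veto rules: if any pattern fires, the model's positive
# recognition result is overturned regardless of the third probability vector.
VETO_PATTERNS = [re.compile(p) for p in (r"^Appendix", r"^Table of contents")]

def confirm_recognition(text: str) -> bool:
    # True only when no veto rule matches, i.e. the text content is
    # confirmed as the preset component of the judicial text.
    return not any(p.search(text) for p in VETO_PATTERNS)
```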
Corresponding to the above method embodiment, this embodiment further provides a judicial text recognition device, whose structure is shown in fig. 4, and may include: a text content obtaining unit 100, a first probability vector obtaining unit 200, a second probability vector obtaining unit 300, a third probability vector obtaining unit 400, and a recognition result obtaining unit 500.
The text content obtaining unit 100 is configured to obtain text content in a judicial text.
Specifically, the judicial texts may include judgment documents, indictments, court transcripts, and the like. The text content obtained by the text content obtaining unit 100 may be all or part of the content of the judicial text. When the obtained text content is part of the judicial text, that part may be the text content of one or more natural paragraphs in the judicial text, or of one or more sentences in the judicial text. The present invention may perform paragraph segmentation and/or sentence segmentation by paragraph identifiers, periods, etc., so as to obtain one or more natural paragraphs or one or more sentences.
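A minimal sketch of the paragraph and sentence splitting; splitting on both the ASCII period and the Chinese full stop is an assumption, since the judicial texts in question are Chinese:

```python
import re

# Split a judicial text into natural paragraphs by newline marks, then
# into sentences by period characters (ASCII "." and Chinese "。").
def split_sentences(judicial_text: str) -> list:
    paragraphs = [p for p in judicial_text.split("\n") if p.strip()]
    sentences = []
    for p in paragraphs:
        sentences.extend(s.strip() for s in re.split(r"[。.]", p) if s.strip())
    return sentences
```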
The first probability vector obtaining unit 200 is configured to input the obtained text content into a preset first judicial text recognition model, and obtain a first probability vector that the text content output by the first judicial text recognition model is a preset component of a judicial text.
Specifically, the preset first judicial text recognition model can be obtained by performing machine learning on a plurality of training texts. The training text used by the invention may correspond to a preset identifier, and the preset identifier may be: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text. Specifically, the input of the first judicial text recognition model may be characters, or may be a vector matrix converted from the text content; the present invention is not limited herein. Training text carrying the identifier of the preset component of the judicial text may be referred to as a positive sample, and training text carrying the identifier of a non-preset component may be referred to as a negative sample. The method can perform machine learning on the positive samples and the negative samples, so as to obtain a first judicial text recognition model with high recognition accuracy. The output of the first judicial text recognition model may be a probability vector indicating that the text content is a preset component of the judicial text. Specifically, the probability vector may be obtained by a probability transformation.
It is noted that the first probability vector may comprise a probability that the text content is a preset component of the judicial text, or may comprise a probability that the text content is not a preset component of the judicial text.
Of course, in practical application, the invention can use a plurality of different machine learning algorithms to perform machine learning on the training text, thereby obtaining a plurality of different first judicial text recognition models. Furthermore, the invention can also test each first judicial text recognition model through a plurality of test texts. The test sample used in the present invention may also correspond to a preset identifier. Through testing, the method and the device can obtain the first judicial text recognition model with the highest recognition accuracy from a plurality of different first judicial text recognition models, and take the model as the preset first judicial text recognition model.
The first probability vector obtaining unit 200 may be configured to convert the obtained text content into a vector matrix, and input the vector matrix into a preset first judicial text recognition model.
Because the vector matrix is a vector expression of text content formed by characters, the vector matrix is input into the first judicial text recognition model, namely the obtained text content is input into the preset first judicial text recognition model.
Specifically, the first probability vector obtaining unit 200 may include: the device comprises a word segmentation processing subunit, a word vector determining subunit and a converting subunit.
The word segmentation processing subunit is used for carrying out word segmentation processing on the text content to obtain a plurality of words;
the word vector determining subunit is used for determining word vectors corresponding to all the vocabularies in the plurality of vocabularies;
and the conversion subunit is used for converting the word vectors into a vector matrix according to the arrangement sequence of the vocabularies in the text content.
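The three subunits can be sketched as below; whitespace tokenisation and the tiny word-vector table stand in for a real word segmenter and a trained embedding model (both are assumptions for illustration):

```python
# Toy embedding table; unknown words map to a zero vector.
WORD_VECTORS = {"court": [0.2, 0.7], "order": [0.9, 0.1], "payment": [0.4, 0.4]}
UNK = [0.0, 0.0]

def text_to_matrix(text: str) -> list:
    words = text.lower().split()                      # word segmentation subunit
    return [WORD_VECTORS.get(w, UNK) for w in words]  # word vectors, in text order
```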
The second probability vector obtaining unit 300 is configured to compare the text content with a preset knowledge graph feature, and obtain a second probability vector that the text content is the preset component of the judicial text according to a comparison result, where the preset knowledge graph feature corresponds to the preset component.
Optionally, the preset knowledge graph characteristics may include at least one of a regular expression, a template of the preset component, an entity vocabulary, and a concept vocabulary. The regular expression, the template of the preset component, the entity vocabulary and the concept vocabulary can embody the characteristics of the preset component of the judicial text from different scales. Specifically, the entity vocabulary and the concept vocabulary can embody the characteristics of smaller scale, and the regular expression and the template of the preset component part can embody the characteristics of larger scale.
The embodiment of the invention can compare the text content with the preset knowledge graph characteristics and judge whether the text content accords with the preset knowledge graph characteristics. The second probability vector may be a probability vector of whether the text content conforms to a preset knowledge-graph feature. According to the embodiment of the invention, the text content is compared with the preset knowledge graph characteristics, so that the obtained second probability vector can better accord with the judgment of a judicial field person on whether the text content is the preset component of the judicial text.
Optionally, the preset knowledge graph features include at least one of a regular expression, a template of the preset component, an entity vocabulary, and a concept vocabulary. The second probability vector obtaining unit 300 may specifically be configured to perform at least one of the following judgment processes:
and judging whether the text content conforms to the regular expression or not, and obtaining a first result vector according to a judgment result.
And judging whether the text content conforms to the template of the preset component part, and obtaining a second result vector according to the judgment result.
And judging whether the text content contains the entity vocabulary or not, and obtaining a third result vector according to a judgment result.
And judging whether the text content contains the concept vocabulary or not, and obtaining a fourth result vector according to a judgment result.
Wherein, the regular expression may include: "request & notice", "law & property", "request | please court", and the like. It is understood that a regular expression may be a combination of relationships among a plurality of phrases; the embodiments of the present invention do not enumerate them one by one.
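Reading "&" as "both phrases present" and "|" as "either phrase present" (an assumption about the notation in the examples above), the regular-expression judgment can be sketched as:

```python
import re

# Hypothetical compiled forms of the example patterns "request & notice"
# and "request | please court"; a match yields the first result vector.
def first_result_vector(text: str) -> list:
    conforms = (("request" in text and "notice" in text)
                or re.search(r"request|please court", text) is not None)
    return [1, 0] if conforms else [0, 1]
```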
Specifically, the template of the preset component part may be a template commonly used by persons related to the judicial field when writing a certain judicial text part.
In particular, the entity vocabulary may be a vocabulary of objectively existing, mutually distinguishable things.
In particular, a concept vocabulary may be a vocabulary of relatively existing things.
The second probability vector obtaining unit 300 may be further configured to obtain, according to at least one vector of the first result vector, the second result vector, the third result vector, and the fourth result vector, a second probability vector that the text content is the preset component of the judicial text.
According to the embodiment of the invention, the weights can be respectively distributed to the first probability vector and the second probability vector according to the accuracy of the first probability vector and the second probability vector. The accuracy of the first probability vector and the second probability vector may be detected by a person skilled in the art through training text. Optionally, the weight of the first probability vector is a first preset weight, and the weight of the second probability vector is a second preset weight, where a sum of the first preset weight and the second preset weight is 1.
The third probability vector obtaining unit 400 is configured to concatenate the first probability vector and the second probability vector into a third probability vector.
Specifically, the third probability vector obtaining unit 400 may include: a first weight vector obtaining subunit, a second weight vector obtaining subunit and a third probability vector obtaining subunit.
And the first weight vector obtaining subunit is used for multiplying the first probability vector by a first preset weight to obtain a first weight vector.
And the second weight vector obtaining subunit is used for multiplying the second probability vector by a second preset weight to obtain a second weight vector.
And the third probability vector obtaining subunit is used for splicing the first weight vector and the second weight vector into a third probability vector.
The recognition result obtaining unit 500 is configured to input the third probability vector into a preset second judicial text recognition model, and obtain a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text.
The preset second judicial text recognition model can be obtained by performing machine learning on a plurality of probability vectors. The second judicial text recognition model can also use positive samples and negative samples to perform machine learning, or can be trained through a plurality of probability vectors first and then subjected to accuracy testing.
The structure of the text recognition model obtaining apparatus provided by this embodiment is shown in fig. 5, and may include: a training text obtaining unit 10, a fourth probability vector obtaining unit 20, a fifth probability vector obtaining unit 30, a sixth probability vector obtaining unit 40, and a judicial text recognition model obtaining unit 50.
The training text obtaining unit 10 is configured to obtain a plurality of training texts, where the training texts are text contents in judicial texts, the training texts correspond to preset identifiers, and the preset identifiers are: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text.
The fourth probability vector obtaining unit 20 is configured to input the obtained training text into a preset first judicial text recognition model, and obtain a fourth probability vector that the training text output by the first judicial text recognition model is a preset component of the judicial text.
The fifth probability vector obtaining unit 30 is configured to compare the training text with a preset knowledge graph feature, and obtain a fifth probability vector that the training text is the preset component of the judicial text according to a comparison result, where the preset knowledge graph feature corresponds to the preset component.
The sixth probability vector obtaining unit 40 is configured to splice the fourth probability vector and the fifth probability vector into a sixth probability vector.
The judicial text recognition model obtaining unit 50 is configured to perform machine learning on the sixth probability vector according to the preset identifier corresponding to the training text to obtain a second judicial text recognition model, where the input of the second judicial text recognition model is: the probability vector output by the first judicial text recognition model spliced with the probability vector obtained from the comparison of the text content with the preset knowledge graph characteristics, and the output of the second judicial text recognition model is: the recognition result of whether the text content is the preset component of the judicial text.
Specifically, the second judicial text recognition model may recognize whether the text content is the preset component of the judicial text.
It can be understood that, in the embodiment of the present invention, the recognition result that the text content output by the second judicial text recognition model is not the preset component of the judicial text can also be obtained.
The judicial text recognition device provided by the embodiment of the invention can obtain the text content in the judicial text; inputting the obtained text content into a preset first judicial text recognition model, and obtaining a first probability vector of the text content output by the first judicial text recognition model as a preset component of a judicial text; comparing the text content with preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component; concatenating the first probability vector and the second probability vector into a third probability vector; and inputting the third probability vector into a preset second judicial text recognition model, and obtaining a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text. The embodiment of the invention overcomes the technical problem of low accuracy in the prior art by splicing the first probability vector and the second probability vector of the text content into the third probability vector and inputting the third probability vector into the second judicial text recognition model to obtain the recognition result.
Optionally, as shown in fig. 6, another judicial text recognition apparatus provided in this embodiment may further include a rule checking unit 600.
The rule checking unit 600 is configured to, after the recognition result obtaining unit 500 obtains the recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text, check whether the text content meets a preset checking rule, and if not, determine that the text content is the preset component of the judicial text.
Specifically, the preset verification rule may describe text that is clearly not the preset component of the judicial text. When the text content meets the preset verification rule, the text content is clearly not the preset component of the judicial text; if the preset verification rule is not met, the text content is determined to be the preset component of the judicial text.
The embodiment of the invention can provide a one-vote-veto preset check rule, which effectively avoids errors made by the machine-learned second judicial text recognition model when recognizing the preset component of the judicial text, and further improves the accuracy of recognizing whether the text content is the preset component of the judicial text.
The present embodiment provides a storage medium on which a program is stored, the program implementing the judicial text recognition method and/or the text recognition model obtaining method as described in any one of the above when executed by a processor.
As shown in fig. 7, the electronic device 700 provided in this embodiment includes at least one processor 701, at least one memory 702 connected to the processor 701, and a bus 703; the processor 701 and the memory 702 complete communication with each other through the bus 703; the processor 701 is configured to invoke program instructions in the memory 702 to perform the judicial text recognition method, and/or the text recognition model obtaining method, as described in any of the above.
The electronic device herein may be a server, a PC, a tablet (PAD), a mobile phone, etc.
The judicial text recognition apparatus includes a processor 701 and a memory 702, the text content obtaining unit 100, the first probability vector obtaining unit 200, the second probability vector obtaining unit 300, the third probability vector obtaining unit 400, the recognition result obtaining unit 500, and the like are stored in the memory 702 as program units, and the processor 701 executes the program units stored in the memory 702 to implement corresponding functions.
The processor 701 includes a kernel, and the kernel calls the corresponding program unit from the memory 702. One or more kernels can be provided, and the recognition result that the text content is the preset component of the judicial text is obtained by adjusting kernel parameters.
The memory 702 may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory 702 includes at least one memory chip.
The embodiment of the invention provides a processor 701, wherein the processor 701 is used for running a program, and the judicial text recognition method and/or the text recognition model obtaining method are/is executed when the program runs.
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:
acquiring text content in a judicial text;
inputting the obtained text content into a preset first judicial text recognition model, and obtaining a first probability vector of the text content output by the first judicial text recognition model as a preset component of a judicial text;
comparing the text content with preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
concatenating the first probability vector and the second probability vector into a third probability vector;
and inputting the third probability vector into a preset second judicial text recognition model, and obtaining a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text.
Optionally, the preset knowledge graph characteristics include at least one of a regular expression, a template of the preset component, an entity vocabulary, and a concept vocabulary,
the comparing the text content with the preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to the comparison result, includes at least one of the following judgment processes:
judging whether the text content conforms to the regular expression or not, and obtaining a first result vector according to a judgment result;
judging whether the text content conforms to the template of the preset component part or not, and obtaining a second result vector according to a judgment result;
judging whether the text content contains the entity vocabulary or not, and obtaining a third result vector according to a judgment result;
judging whether the text content contains the concept vocabulary or not, and obtaining a fourth result vector according to a judgment result;
the comparing the text content with the preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to the comparison result, further includes:
and obtaining a second probability vector of the text content being the preset component of the judicial text according to at least one vector of the first result vector, the second result vector, the third result vector and the fourth result vector.
Optionally, after the third probability vector is input into a preset second judicial text recognition model, and a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text is obtained, the method further includes:
and checking whether the text content meets a preset check rule, and if not, determining that the text content is the preset component of the judicial text.
Optionally, the weight of the first probability vector is a first preset weight, and the weight of the second probability vector is a second preset weight, where a sum of the first preset weight and the second preset weight is 1.
The present application also provides another computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:
obtaining a plurality of training texts, wherein the training texts are text contents in judicial texts, the training texts correspond to preset identifications, and the preset identifications are: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text;
inputting the obtained training text into a preset first judicial text recognition model, and obtaining a fourth probability vector of the training text output by the first judicial text recognition model as a preset component of the judicial text;
comparing the training text with preset knowledge graph characteristics, and obtaining a fifth probability vector of the training text which is the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
splicing the fourth probability vector and the fifth probability vector into a sixth probability vector;
performing machine learning on the sixth probability vector according to the preset identification corresponding to the training text to obtain a second judicial text recognition model, wherein the input of the second judicial text recognition model is a probability vector formed by splicing the probability vector output by the first judicial text recognition model with the probability vector obtained from the comparison result between the text content and the preset knowledge graph characteristics, and the output of the second judicial text recognition model is a recognition result of whether the text content is the preset component of the judicial text.
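The application does not name a specific learner for this machine-learning step. As a self-contained sketch, a logistic-regression classifier trained by plain gradient descent can stand in for the second judicial text recognition model: it takes a spliced probability vector as input and outputs a binary recognition result. The function names and training data below are fabricated for illustration:

```python
import math

# Hypothetical stand-in for the second judicial text recognition model:
# logistic regression over the spliced (sixth) probability vectors.
# Labels: 1 = preset component of the judicial text, 0 = not.
def train_second_model(vectors, labels, lr=0.5, epochs=300):
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            g = p - y                       # log-loss gradient w.r.t. z
            weights = [w - lr * g * xi for w, xi in zip(weights, x)]
            bias -= lr * g
    return weights, bias

def recognize(model, spliced_vector):
    """Return 1 if the text content is recognized as the preset component."""
    weights, bias = model
    z = sum(w * xi for w, xi in zip(weights, spliced_vector)) + bias
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0
```

In the application's terms, each training row is a sixth probability vector and each label is the preset identification of the corresponding training text.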
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, the computing device includes one or more processors (CPUs) 701, input/output interfaces, network interfaces, and memory.
The memory 702 may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory 702 is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for recognizing judicial texts, comprising:
acquiring text content in a judicial text;
inputting the obtained text content into a preset first judicial text recognition model, and obtaining a first probability vector of the text content output by the first judicial text recognition model as a preset component of a judicial text;
comparing the text content with preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
concatenating the first probability vector and the second probability vector into a third probability vector;
and inputting the third probability vector into a preset second judicial text recognition model, and obtaining a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text.
2. The method of claim 1, wherein the preset knowledge graph characteristics comprise at least one of a regular expression, a template of the preset component, an entity vocabulary, and a concept vocabulary,
the comparing the text content with the preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to the comparison result, includes at least one of the following judgment processes:
judging whether the text content conforms to the regular expression or not, and obtaining a first result vector according to a judgment result;
judging whether the text content conforms to the template of the preset component part or not, and obtaining a second result vector according to a judgment result;
judging whether the text content contains the entity vocabulary or not, and obtaining a third result vector according to a judgment result;
judging whether the text content contains the concept vocabulary or not, and obtaining a fourth result vector according to a judgment result;
the comparing the text content with the preset knowledge graph characteristics, and obtaining a second probability vector of the text content being the preset component of the judicial text according to the comparison result, further includes:
obtaining a second probability vector of the text content being the preset component of the judicial text according to at least one of the first result vector, the second result vector, the third result vector, and the fourth result vector.
3. The method according to claim 1, wherein after the inputting the third probability vector into a preset second judicial text recognition model and obtaining a recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text, the method further comprises:
checking whether the text content meets a preset check rule, and if not, determining that the text content is the preset component of the judicial text.
4. The method according to claim 1, wherein the weight of the first probability vector is a first preset weight, and the weight of the second probability vector is a second preset weight, wherein the sum of the first preset weight and the second preset weight is 1.
5. A text recognition model obtaining method is characterized by comprising the following steps:
obtaining a plurality of training texts, wherein the training texts are text contents in judicial texts, the training texts correspond to preset identifications, and the preset identifications are: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text;
inputting the obtained training text into a preset first judicial text recognition model, and obtaining a fourth probability vector of the training text output by the first judicial text recognition model as a preset component of the judicial text;
comparing the training text with preset knowledge graph characteristics, and obtaining a fifth probability vector of the training text which is the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
splicing the fourth probability vector and the fifth probability vector into a sixth probability vector;
performing machine learning on the sixth probability vector according to the preset identification corresponding to the training text to obtain a second judicial text recognition model, wherein the input of the second judicial text recognition model is a probability vector formed by splicing the probability vector output by the first judicial text recognition model with the probability vector obtained from the comparison result between the text content and the preset knowledge graph characteristics, and the output of the second judicial text recognition model is a recognition result of whether the text content is the preset component of the judicial text.
6. A judicial text recognition device, comprising: a text content obtaining unit, a first probability vector obtaining unit, a second probability vector obtaining unit, a third probability vector obtaining unit and a recognition result obtaining unit,
the text content obtaining unit is used for obtaining text content in a judicial text;
the first probability vector obtaining unit is configured to input the obtained text content into a preset first judicial text recognition model, and obtain a first probability vector that the text content output by the first judicial text recognition model is a preset component of a judicial text;
the second probability vector obtaining unit is used for comparing the text content with preset knowledge graph characteristics and obtaining a second probability vector of the text content being the preset component of the judicial text according to a comparison result, wherein the preset knowledge graph characteristics correspond to the preset component;
the third probability vector obtaining unit is configured to splice the first probability vector and the second probability vector into a third probability vector;
and the recognition result obtaining unit is used for inputting the third probability vector into a preset second judicial text recognition model and obtaining the recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text.
7. The apparatus of claim 6, further comprising: a rule checking unit,
the rule checking unit is configured to check whether the text content meets a preset checking rule after the recognition result obtaining unit obtains the recognition result that the text content output by the second judicial text recognition model is the preset component of the judicial text, and if not, determine that the text content is the preset component of the judicial text.
8. A text recognition model obtaining apparatus, comprising: a training text obtaining unit, a fourth probability vector obtaining unit, a fifth probability vector obtaining unit, a sixth probability vector obtaining unit and a judicial text recognition model obtaining unit,
the training text obtaining unit is used for obtaining a plurality of training texts, wherein the training texts are text contents in judicial texts, the training texts correspond to preset identifications, and the preset identifications are: an identification of a preset component of the judicial text or an identification of a non-preset component of the judicial text;
the fourth probability vector obtaining unit is configured to input the obtained training text into a preset first judicial text recognition model, and obtain a fourth probability vector that the training text output by the first judicial text recognition model is a preset component of the judicial text;
the fifth probability vector obtaining unit is configured to compare the training text with a preset knowledge graph feature, and obtain a fifth probability vector that the training text is the preset component of the judicial text according to a comparison result, where the preset knowledge graph feature corresponds to the preset component;
the sixth probability vector obtaining unit is configured to splice the fourth probability vector and the fifth probability vector into a sixth probability vector;
the judicial text recognition model obtaining unit is configured to perform machine learning on the sixth probability vector according to the preset identifier corresponding to the training text to obtain a second judicial text recognition model, wherein the input of the second judicial text recognition model is a probability vector formed by splicing the probability vector output by the first judicial text recognition model with the probability vector obtained from the comparison result between the text content and the preset knowledge graph characteristics, and the output of the second judicial text recognition model is a recognition result of whether the text content is the preset component of the judicial text.
9. A storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the judicial text recognition method according to any one of claims 1 to 4 and/or the text recognition model obtaining method according to claim 5.
10. An electronic device, comprising at least one processor, at least one memory connected to the processor, and a bus, wherein the processor and the memory communicate with each other through the bus, and the processor is configured to invoke program instructions in the memory to perform the judicial text recognition method of any one of claims 1 to 4 and/or the text recognition model obtaining method of claim 5.
CN201910891596.1A 2019-09-20 2019-09-20 Judicial text recognition method, text recognition model obtaining method and related equipment Active CN112541373B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910891596.1A CN112541373B (en) 2019-09-20 2019-09-20 Judicial text recognition method, text recognition model obtaining method and related equipment
PCT/CN2020/099851 WO2021051957A1 (en) 2019-09-20 2020-07-02 Judicial text recognition method, text recognition model obtaining method, and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910891596.1A CN112541373B (en) 2019-09-20 2019-09-20 Judicial text recognition method, text recognition model obtaining method and related equipment

Publications (2)

Publication Number Publication Date
CN112541373A true CN112541373A (en) 2021-03-23
CN112541373B CN112541373B (en) 2023-10-31

Family

ID=74883972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910891596.1A Active CN112541373B (en) 2019-09-20 2019-09-20 Judicial text recognition method, text recognition model obtaining method and related equipment

Country Status (2)

Country Link
CN (1) CN112541373B (en)
WO (1) WO2021051957A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511668A (en) * 2022-10-12 2022-12-23 金华智扬信息技术有限公司 Case supervision method, device, equipment and medium based on artificial intelligence

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664081B (en) * 2023-07-25 2023-10-24 杭州威灿科技有限公司 Case data fixed certificate processing method, device and equipment based on quick-handling identification

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6606620B1 (en) * 2000-07-24 2003-08-12 International Business Machines Corporation Method and system for classifying semi-structured documents
CN109815500A (en) * 2019-01-25 2019-05-28 杭州绿湾网络科技有限公司 Management method, device, computer equipment and the storage medium of unstructured official document
CN109992782A (en) * 2019-04-02 2019-07-09 深圳市华云中盛科技有限公司 Legal documents name entity recognition method, device and computer equipment
CN110147545A (en) * 2018-09-18 2019-08-20 腾讯科技(深圳)有限公司 The structuring output method and system of text, storage medium and computer equipment
CN110162628A (en) * 2019-05-06 2019-08-23 腾讯科技(深圳)有限公司 A kind of content identification method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511668A (en) * 2022-10-12 2022-12-23 金华智扬信息技术有限公司 Case supervision method, device, equipment and medium based on artificial intelligence
CN115511668B (en) * 2022-10-12 2023-09-08 金华智扬信息技术有限公司 Case supervision method, device, equipment and medium based on artificial intelligence

Also Published As

Publication number Publication date
CN112541373B (en) 2023-10-31
WO2021051957A1 (en) 2021-03-25

Similar Documents

Publication Publication Date Title
CN109344406B (en) Part-of-speech tagging method and device and electronic equipment
CN111898643B (en) Semantic matching method and device
CN107437417B (en) Voice data enhancement method and device based on recurrent neural network voice recognition
CN111160032B (en) Named entity extraction method and device, electronic equipment and storage medium
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN108399157B (en) Dynamic extraction method of entity and attribute relationship, server and readable storage medium
CN110941951A (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN112214576B (en) Public opinion analysis method, public opinion analysis device, terminal equipment and computer readable storage medium
CN112541373B (en) Judicial text recognition method, text recognition model obtaining method and related equipment
CN114782054A (en) Customer service quality detection method based on deep learning algorithm and related equipment
CN111291551B (en) Text processing method and device, electronic equipment and computer readable storage medium
CN115344699A (en) Training method and device of text classification model, computer equipment and medium
CN112464927B (en) Information extraction method, device and system
CN114090792A (en) Document relation extraction method based on comparison learning and related equipment thereof
CN113723077A (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN112669850A (en) Voice quality detection method and device, computer equipment and storage medium
CN109766527B (en) Text similarity calculation method and related equipment
CN114691907B (en) Cross-modal retrieval method, device and medium
CN115168575A (en) Subject supplement method applied to audit field and related equipment
CN114999450A (en) Homomorphic and heteromorphic word recognition method and device, electronic equipment and storage medium
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN110895924B (en) Method and device for reading document content aloud, electronic equipment and readable storage medium
CN112951274A (en) Voice similarity determination method and device, and program product
CN112632232A (en) Text matching method, device, equipment and medium
CN112784593B (en) Document processing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant