CN112507968A - Method and device for identifying official document text based on feature association - Google Patents

Method and device for identifying official document text based on feature association

Info

Publication number
CN112507968A
Authority
CN
China
Prior art keywords
text
identification
vector
recognition
identifying
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011551817.XA
Other languages
Chinese (zh)
Other versions
CN112507968B (en)
Inventor
李巧
朱永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Wangan Technology Development Co ltd
Original Assignee
Chengdu Wangan Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Wangan Technology Development Co ltd
Priority to CN202011551817.XA
Publication of CN112507968A
Application granted
Publication of CN112507968B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/416: Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

The application provides a method and a device for identifying official document text based on feature association, and relates to the technical field of text recognition. In the application, the text to be recognized is first recognized based on the recognition elements of official document text, yielding a recognition result for each recognition element. Second, a target text vector is constructed from the obtained recognition results. Then, the target text vector is updated separately based on target position information and on a weight coefficient to obtain a first text vector and a second text vector, wherein the target position information comprises, for each first recognition value in the target text vector, the position of the corresponding recognition element in the text to be recognized, and the weight coefficient is obtained by processing official document text samples. Finally, whether the text to be recognized is an official document text is determined based on the first text vector, the second text vector and a text probability threshold. This addresses the difficulty of effectively recognizing official document text with the prior art.

Description

Method and device for identifying official document text based on feature association
Technical Field
The application relates to the technical field of text recognition, and in particular to a method and a device for identifying official document text based on feature association.
Background
An official document is a document used by state organs, enterprises, public institutions and people's organizations to handle official business. It is an important tool for conveying and implementing guidelines and policies, issuing regulations, requesting instructions and answering inquiries, guiding and coordinating work, reporting on situations, exchanging experience and the like. Official documents come in many types and in very large quantities.
In existing text recognition technology, most neural networks can perform text classification, for example into categories such as finance, sports, entertainment and games. However, the inventors found through research that such neural networks do not judge official document text well and lack interpretability, so official document text is difficult to recognize effectively.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method and a device for identifying official document text based on feature association, so as to solve the problem that official document text is difficult to recognize effectively with the prior art.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
a method for identifying official document texts based on feature association comprises the following steps:
identifying a text to be identified based on a plurality of identification elements of a document text to obtain an identification result corresponding to each identification element, wherein the identification result comprises a first identification value or a second identification value, the first identification value is used for representing that the text to be identified has the corresponding identification element, and the second identification value is used for representing that the text to be identified does not have the corresponding identification element;
constructing a target text vector based on the obtained multiple recognition results, wherein the dimension number of the target text vector is the number of the multiple recognition elements;
updating the target text vector based on pre-obtained target position information and a weighting coefficient respectively to obtain a corresponding first text vector and a corresponding second text vector, wherein the target position information comprises position information of an identification element corresponding to each first identification value in the target text vector in the text to be identified, and the weighting coefficient is obtained based on processing of a document text sample;
and determining whether the text to be recognized belongs to an official document text or not based on the first text vector, the second text vector and a predetermined text probability threshold.
In a preferred option of the embodiment of the present application, in the method for identifying a document text based on feature association, the step of identifying the text to be identified based on a plurality of identification elements included in the document text to obtain an identification result corresponding to each identification element includes:
aiming at each identification element in a plurality of identification elements of the official document text, at least one corresponding text identification thread is created for the identification element;
and aiming at each text recognition thread, carrying out recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread to obtain a recognition result corresponding to the recognition element.
In a preferred option of the embodiment of the present application, in the method for identifying a document based on feature association, the step of obtaining an identification result corresponding to the identification element by identifying, for each of the text identification threads, the corresponding identification element in the document to be identified by the text identification thread includes:
identifying the share number in the line head area of each line of the text to be identified according to a predetermined first regular expression through the text identification thread corresponding to the share number to obtain an identification result corresponding to the share number;
identifying the security level in the line head area of each line of the text to be identified according to a predetermined second regular expression through the text identification thread corresponding to the security level to obtain an identification result corresponding to the security level;
identifying the confidentiality term in the line head area of each line of the text to be identified according to a predetermined third regular expression through the text identification thread corresponding to the confidentiality term to obtain an identification result corresponding to the confidentiality term;
identifying the emergency degree in the line head area of each line of the text to be identified according to a predetermined fourth regular expression through the text identification thread corresponding to the emergency degree to obtain an identification result corresponding to the emergency degree;
identifying the sending organ mark in the line head area of each line of the text to be identified according to a predetermined fifth regular expression through the text identification thread corresponding to the sending organ mark to obtain an identification result corresponding to the sending organ mark;
identifying the text sending number in the line head area of each line of the text to be identified according to a predetermined sixth regular expression through the text identification thread corresponding to the text sending number to obtain an identification result corresponding to the text sending number;
identifying the title in the line head area of each line of the text to be identified according to a predetermined seventh regular expression through the text identification thread corresponding to the title to obtain an identification result corresponding to the title;
and identifying the accessory description in the line head area of each line of the text to be identified according to a predetermined eighth regular expression through the text identification thread corresponding to the accessory description to obtain an identification result corresponding to the accessory description.
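The per-element matching described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not disclose its regular expressions, so the three patterns, the element keys, the 20-character width of the "line head area", and the use of 1 and 0 as the first and second identification values are all hypothetical.

```python
import re

# Hypothetical patterns; the source does not disclose its regular expressions.
ELEMENT_PATTERNS = {
    "share_number": re.compile(r"^\s*\d{6}(?!\d)"),
    "security_level": re.compile(r"(绝密|机密|秘密)"),
    "emergency_degree": re.compile(r"(特急|加急)"),
}

HEAD_WIDTH = 20  # assumed size of the "line head area", in characters


def recognize_element(text: str, pattern: re.Pattern) -> int:
    """Return 1 if the head of any line matches the pattern, else 0."""
    for line in text.splitlines():
        if pattern.search(line[:HEAD_WIDTH]):
            return 1
    return 0


def recognize_all(text: str) -> dict:
    """Run every hypothetical pattern and collect one result per element."""
    return {name: recognize_element(text, p) for name, p in ELEMENT_PATTERNS.items()}
```

Restricting the search to the first characters of each line keeps a keyword that appears deep inside body text from counting as a header element, which seems to be the intent of the "line head area" wording.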
In a preferred option of the embodiment of the present application, in the method for identifying official document text based on feature association, the plurality of identification elements includes a sending office sign, a main sending organ, a sender signature, a copying mechanism, an issuer, a text-forming date and a printing date, and the step of performing, for each text identification thread, identification processing on the corresponding identification element in the text to be identified through the text identification thread to obtain an identification result corresponding to the identification element includes:
identifying the sending office sign in the text to be identified according to the organization name through the text identification thread corresponding to the sending office sign, and generating the identification result corresponding to the sending office sign in response to the user's identification operation on the result of the identification processing;
identifying the main sending organ in the text to be identified according to the organization name through the text identification thread corresponding to the main sending organ, and generating the identification result corresponding to the main sending organ in response to the user's identification operation on the result of the identification processing;
identifying the sender signature in the text to be identified according to the organization name through the text identification thread corresponding to the sender signature, and generating the identification result corresponding to the sender signature in response to the user's identification operation on the result of the identification processing;
identifying the copying mechanism in the text to be identified according to the organization name through the text identification thread corresponding to the copying mechanism, and generating the identification result corresponding to the copying mechanism in response to the user's identification operation on the result of the identification processing;
identifying the issuer in the text to be identified according to the name of the person through the text identification thread corresponding to the issuer to obtain an identification result corresponding to the issuer;
identifying the text-forming date in the text to be identified according to the date through the text identification thread corresponding to the text-forming date, and responding to the identification operation of the user on the identification processing result to generate the identification result corresponding to the text-forming date;
and identifying the printing date in the text to be identified according to the date through the text identification thread corresponding to the printing date, and responding to the identification operation of the user on the identification processing result to generate the identification result corresponding to the printing date.
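For the date-based elements, a minimal sketch of the automatic part of the identification might look like the following. The source does not say how dates are detected, so the pattern (Arabic-numeral dates of the form 2020年12月24日), the function names, and the returned line index are assumptions; dates written in Chinese numerals and the user confirmation step are not covered.

```python
import re
from datetime import date

# Assumed date format; dates written in Chinese numerals are not handled.
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def find_dates(text):
    """Return (line_index, matched_text) for every valid calendar date found."""
    hits = []
    for line_index, line in enumerate(text.splitlines()):
        for match in DATE_PATTERN.finditer(line):
            year, month, day = (int(g) for g in match.groups())
            try:
                date(year, month, day)  # rejects impossible dates such as month 13
            except ValueError:
                continue
            hits.append((line_index, match.group(0)))
    return hits


def recognize_date(text):
    """Return 1 if the text contains at least one valid date, else 0."""
    return 1 if find_dates(text) else 0
```

Keeping the line index alongside each match makes the position of the element available for any later step that needs it.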
In a preferred option of the embodiment of the present application, in the method for identifying a document based on feature association, the step of updating the target text vector based on the pre-obtained target location information and the pre-obtained weight coefficient to obtain a corresponding first text vector and a corresponding second text vector includes:
for each first identification value in the target text vector, obtaining the position information of the identification element corresponding to the first identification value in the text to be identified;
aiming at the position information of each identification element, obtaining a corresponding Gaussian distribution value based on the position information and a Gaussian distribution formula corresponding to the identification element, wherein a mean value parameter and a standard deviation parameter of the Gaussian distribution formula are determined based on the position information of the identification element in a plurality of official document text samples;
and aiming at each obtained Gaussian distribution value, updating the first identification value corresponding to the Gaussian distribution value based on the Gaussian distribution value to obtain a corresponding first text vector.
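The position-based update can be illustrated with the normal density, whose mean and standard deviation are estimated from element positions in sample documents. The sketch below makes several assumptions not stated in the source: position is a single number such as a line index, the first identification value is 1, the update replaces that value with the density, and elements that are absent map to 0.

```python
import math
from statistics import mean, pstdev


def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma (sigma > 0)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def fit_position_model(sample_positions):
    """Estimate (mean, standard deviation) from positions seen in sample documents."""
    return mean(sample_positions), pstdev(sample_positions)


def first_text_vector(target_vector, positions, models):
    """Replace each 1 in the target vector with the density of its position."""
    result = []
    for value, pos, (mu, sigma) in zip(target_vector, positions, models):
        if value == 1 and pos is not None:
            result.append(gaussian_pdf(pos, mu, sigma))
        else:
            result.append(0.0)  # absent elements have no position information
    return result
```

An element found close to where it usually appears in the samples receives a larger value than one found far from its usual place. If every sample has the element at the same position, the estimated standard deviation is 0 and the density is undefined, so a real implementation would need a floor on sigma.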
In a preferred option of the embodiment of the present application, in the method for identifying a document based on feature association, the step of updating the target text vector based on the pre-obtained target location information and the pre-obtained weight coefficient to obtain a corresponding first text vector and a corresponding second text vector includes:
processing a plurality of official document text samples to obtain a weight coefficient;
and updating the target text vector based on the weight coefficient to obtain a corresponding second text vector, wherein the updating comprises multiplying the weight coefficient by the target text vector.
In a preferred option of the embodiment of the present application, in the method for identifying a document text based on feature association, the step of processing the plurality of document text samples to obtain a weight coefficient includes:
for each official document text sample, constructing an element list corresponding to the official document text sample based on the identification elements included in the official document text sample, wherein the number of the official document text samples is multiple;
constructing frequent n-item sets based on the plurality of identification elements included in the plurality of constructed element lists to obtain a plurality of frequent n-item sets, wherein n includes each integer between 1 and the number of the plurality of identification elements;
for each frequent n item set, obtaining the support degree of the frequent n item set based on the number of times of the frequent n item set appearing in a plurality of element lists and the number of the element lists;
determining a target frequent n-item set based on each first identification value in the target text vector in the plurality of frequent n-item sets;
and summing the support degrees of the target frequent n item sets to obtain a weight coefficient.
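The weight-coefficient steps above can be sketched with a brute-force enumeration of item sets. Several points are assumptions rather than statements of the source: the minimum support of 0.5 used to call an item set "frequent", the reading that a target frequent n-item set is one whose items are all present in the text to be identified, and all function names. The enumeration is exponential in the number of elements, which is tolerable only because the element list is short.

```python
from itertools import combinations


def frequent_itemsets(element_lists, min_support=0.5):
    """Map each frequent item set (a frozenset) to its support."""
    transactions = [frozenset(lst) for lst in element_lists]
    universe = sorted(set().union(*transactions))
    supports = {}
    for n in range(1, len(universe) + 1):
        for combo in combinations(universe, n):
            items = frozenset(combo)
            count = sum(1 for t in transactions if items <= t)
            support = count / len(transactions)  # occurrences / number of lists
            if support >= min_support:
                supports[items] = support
    return supports


def weight_coefficient(present_elements, supports):
    """Sum the supports of item sets fully contained in the present elements."""
    present = frozenset(present_elements)
    return sum(s for items, s in supports.items() if items <= present)


def second_text_vector(target_vector, weight):
    """Multiply the target text vector by the scalar weight coefficient."""
    return [weight * v for v in target_vector]
```

For example, with four sample lists in which "title" always appears and "number" appears in half of them, a text containing both elements gets a weight of 1.0 + 0.5 + 0.5 = 2.0 (the two single items plus the pair).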
In a preferred option of the embodiment of the present application, in the method for identifying a document text based on feature association, the step of determining whether the text to be identified belongs to the document text based on the first text vector, the second text vector and a predetermined text probability threshold includes:
carrying out vector combination processing on the basis of the first text vector and the second text vector to obtain a third text vector;
obtaining a probability value of the text to be recognized belonging to the official document text based on the third text vector;
and judging whether the text to be recognized belongs to the official document text or not based on the probability value and a predetermined text probability threshold, wherein if the probability value is greater than or equal to the text probability threshold, the text to be recognized is judged to belong to the official document text.
In a preferred selection of the embodiment of the present application, in the method for identifying an official document text based on feature association, the step of obtaining a probability value that the text to be identified belongs to the official document text based on the third text vector includes:
determining whether each vector value in the third text vector is smaller than a preset threshold value;
updating each vector value smaller than the preset threshold value to be 0, and updating each vector value larger than or equal to the preset threshold value to be the preset threshold value;
and calculating the sum of the updated vector values, and taking the sum as the probability value of the text to be recognized belonging to the official document text.
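The thresholding and summation described above, together with the final comparison against the text probability threshold, can be sketched as follows. The function names are hypothetical, and the third text vector is taken as given because the source does not specify how the first and second text vectors are combined.

```python
def document_probability(third_vector, preset_threshold):
    """Set values below the preset threshold to 0 and the rest to the threshold, then sum."""
    clipped = [0.0 if v < preset_threshold else preset_threshold for v in third_vector]
    return sum(clipped)


def is_official_document(third_vector, preset_threshold, probability_threshold):
    """Judge the text as official document text when the probability reaches the threshold."""
    return document_probability(third_vector, preset_threshold) >= probability_threshold
```

With a vector of n values the sum is at most n times the preset threshold, so the preset threshold would need to be no larger than 1/n for the result to stay within [0, 1]; the source does not state how the value is chosen.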
The embodiment of the present application further provides a device for identifying a document text based on feature association, including:
the identification module of the text to be identified is used for identifying the text to be identified based on a plurality of identification elements of the official document text to obtain an identification result corresponding to each identification element, wherein the identification result comprises a first identification value or a second identification value, the first identification value is used for representing that the text to be identified has the corresponding identification element, and the second identification value is used for representing that the text to be identified does not have the corresponding identification element;
the text vector construction module is used for constructing a target text vector based on the obtained multiple identification results, wherein the dimension number of the target text vector is the number of the multiple identification elements;
the text vector updating module is used for respectively updating the target text vector based on pre-obtained target position information and a weighting coefficient to obtain a corresponding first text vector and a corresponding second text vector, wherein the target position information comprises position information of an identification element corresponding to each first identification value in the target text vector in the text to be identified, and the weighting coefficient is obtained by processing a document text sample;
and the official document text determination module is used for determining whether the text to be recognized belongs to the official document text or not based on the first text vector, the second text vector and a predetermined text probability threshold.
According to the method and the device for identifying the official document based on the feature association, the corresponding identification result is obtained by identifying the text to be identified through the identification elements of the official document, so that a target text vector can be constructed based on the identification result, the target text vector is processed based on the position information of the identification elements in the text to be identified and the weighting coefficient obtained based on the official document sample, a first text vector and a second text vector are obtained, and then whether the text to be identified belongs to the official document can be determined based on the first text vector and the second text vector and in combination with the predetermined text probability threshold. Based on the method, the characteristic association between the text to be recognized and the official document text sample can be realized due to the adoption of the weight coefficient obtained based on the official document text sample, so that the official document text is effectively recognized, the problem that the official document text is difficult to effectively recognize based on the prior art is solved, and the method has high practical value.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of a method for identifying a document text based on feature association according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating sub-steps included in step S110 in fig. 2.
Fig. 4 is a flowchart illustrating sub-steps included in step S130 in fig. 2.
Fig. 5 is a flowchart illustrating other sub-steps included in step S130 in fig. 2.
Fig. 6 is a flowchart illustrating sub-steps included in step S140 in fig. 2.
Fig. 7 is a block diagram illustrating an apparatus for identifying a document text based on feature association according to an embodiment of the present application.
Reference numerals: 10-electronic device; 12-memory; 14-processor; 100-device for identifying official document text based on feature association; 110-recognition module for the text to be recognized; 120-text vector construction module; 130-text vector update module; 140-official document text determination module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, an embodiment of the present application provides an electronic device 10 that may include a memory 12, a processor 14, and a device 100 for feature association-based official document text recognition.
The memory 12 and the processor 14 are electrically connected, directly or indirectly, to enable data transmission or interaction. For example, they may be electrically connected to each other via one or more communication buses or signal lines. The device 100 for identifying official document text based on feature association includes at least one software function module that can be stored in the memory 12 in the form of software or firmware. The processor 14 is configured to execute executable computer programs stored in the memory 12, such as the software function modules and computer programs included in the device 100 for identifying official document text based on feature association, so as to implement the method for identifying official document text based on feature association provided by the embodiment of the present application.
Alternatively, the memory 12 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The Processor 14 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), a System on Chip (SoC), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.
It is understood that the structure shown in fig. 1 is only an illustration, and the electronic device 10 may further include more or fewer components than those shown in fig. 1, or have a different configuration from that shown in fig. 1, for example, a communication unit for performing information interaction with other devices (for example, when the electronic device 10 is a server, the other devices may be terminal devices such as a mobile phone and a computer).
With reference to fig. 2, an embodiment of the present application further provides a method for identifying a document text based on feature association, which is applicable to the electronic device 10. Wherein the method steps defined by the flow related to the method for identifying the official document text based on the feature association can be realized by the electronic device 10.
The specific process shown in FIG. 2 will be described in detail below.
Step S110, identifying the text to be identified based on the plurality of identification elements of the official document text to obtain the identification result corresponding to each identification element.
In this embodiment, after obtaining the text to be recognized, the electronic device may perform recognition processing on the text to be recognized based on a plurality of recognition elements included in the document text, to obtain a recognition result corresponding to each of the plurality of recognition elements, and thus, may obtain a plurality of recognition results.
The recognition result comprises a first recognition value or a second recognition value, the first recognition value is used for representing that the text to be recognized has corresponding recognition elements, and the second recognition value is used for representing that the text to be recognized does not have corresponding recognition elements.
That is, for each of the plurality of recognition elements, if the recognition element is present in the text to be recognized, the corresponding recognition result may be a first recognition value, and if the recognition element is not present in the text to be recognized, the corresponding recognition result may be a second recognition value.
And step S120, constructing a target text vector based on the obtained multiple recognition results.
In this embodiment, after obtaining the plurality of recognition results based on step S110, the electronic device may construct a target text vector based on the plurality of recognition results.
Wherein the dimension number of the target text vector is the number of the plurality of identification elements.
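A minimal sketch of this construction is given below. The element names, their order, and the use of 1 and 0 as the first and second recognition values are assumptions made for illustration; the source only fixes that the vector has one dimension per recognition element.

```python
# Hypothetical subset of recognition elements, in a fixed order.
ELEMENTS = ["share_number", "security_level", "title"]


def build_target_vector(results, elements=ELEMENTS):
    """Build one vector entry per element: 1 if recognized, 0 otherwise."""
    return [1 if results.get(name) else 0 for name in elements]
```

Fixing the element order once means that the same index always refers to the same recognition element in every later update of the vector.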
Step S130, updating the target text vector based on the pre-obtained target position information and the weighting coefficient respectively to obtain a corresponding first text vector and a corresponding second text vector.
In this embodiment, after obtaining the target text vector based on step S120, the target text vector may be updated based on the pre-obtained target position information and the weighting coefficient, so that the corresponding first text vector and the corresponding second text vector may be obtained.
That is, the target text vector may be updated based on the target location information, so as to obtain a first text vector. The target text vector may be updated based on the weight coefficient to obtain a second text vector.
The target position information includes, for each first identification value in the target text vector, the position information in the text to be identified of the corresponding identification element, that is, of an identification element that the text to be identified actually contains (an identification element that the text does not contain has no position information). The weight coefficient is obtained by processing official document text samples.
Step S140, determining whether the text to be recognized belongs to the official document text or not based on the first text vector, the second text vector and a predetermined text probability threshold.
In this embodiment, after obtaining the first text vector and the second text vector based on step S130, it may be determined whether the text to be recognized belongs to an official document text based on the first text vector and the second text vector and based on a predetermined text probability threshold.
Based on the method, the weight coefficient obtained based on the official document text sample is adopted, so that the characteristic association between the text to be recognized and the official document text sample can be realized, the official document text can be effectively recognized, and the problem that the official document text is difficult to effectively recognize based on the prior art is solved.
In the first aspect, it should be noted that, in step S110, a specific manner of performing recognition processing on the text to be recognized is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, each of the recognition elements may be sequentially queried (traversed) in the text to be recognized, so as to implement the recognition processing of the text to be recognized.
For another example, in another alternative example, in order to improve the efficiency of the identification process, in conjunction with fig. 3, step S110 may include step S111 and step S112, which are described in detail below.
Step S111 is to create at least one corresponding text recognition thread for each of a plurality of recognition elements included in the official document text.
In this embodiment, before performing the recognition processing based on the recognition element, at least one corresponding text recognition thread may be created for each of a plurality of recognition elements of the official document text.
Thus, for multiple recognition elements, multiple text recognition threads (a thread is the smallest unit that an operating system can perform operation scheduling, is included in a process, and is the actual operation unit in the process) can be obtained.
Step S112, for each text recognition thread, performing recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread, and obtaining a recognition result corresponding to the recognition element.
In this embodiment, after the text recognition thread is created based on step S111, for each text recognition thread, the text recognition thread may perform recognition processing on the recognition element corresponding to the text recognition thread in the text to be recognized, so as to obtain a recognition result corresponding to the recognition element. In this way, a plurality of recognition results can be obtained for a plurality of recognition elements.
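Steps S111 and S112 can be illustrated with a minimal Python sketch, assuming one worker thread per recognition element. The two recognizer functions and the name `recognize_all` are hypothetical placeholders introduced only for this illustration; they are not taken from the application.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical recognizers: each returns 1 (element present) or 0 (absent).
def has_title(text: str) -> int:
    return 1 if "Notice" in text else 0

def has_attachment(text: str) -> int:
    return 1 if "Attachment" in text else 0

# One entry per recognition element (only two are shown here).
RECOGNIZERS = {"title": has_title, "attachment_description": has_attachment}

def recognize_all(text: str) -> dict[str, int]:
    """Step S111: create one worker per element; step S112: run them and collect results."""
    with ThreadPoolExecutor(max_workers=len(RECOGNIZERS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in RECOGNIZERS.items()}
        return {name: future.result() for name, future in futures.items()}

results = recognize_all("Notice on safety inspections\nAttachment: 1")
```

A thread pool is used here only for brevity; explicitly created threads would serve the same purpose.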
Optionally, in the above example, the specific manner of creating the text recognition thread based on step S111 is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, for each of the recognition elements, a text recognition thread may be created for that recognition element. That is, there is a one-to-one correspondence between the recognition elements and the text recognition threads.
For another example, in another alternative example, one or more text recognition threads may be created for each of the recognition elements.
That is, one recognition element corresponds to one or more text recognition threads. When one recognition element corresponds to a plurality of text recognition threads, a plurality of recognition results are obtained for that recognition element, so that a comprehensive determination can be made based on these results to obtain a final recognition result. For example, when each recognition result is 0 or 1 and two recognition results are obtained for one recognition element, the XOR of the two recognition results can be used as the final recognition result.
Optionally, in the above example, the specific way of performing the recognition processing on the recognition element in the text to be recognized based on step S112 is not limited, and may be selected according to the actual application requirement.
For example, in an alternative example, the plurality of recognition elements include a share number, a security level, a confidentiality period, an urgency level, an issuing authority mark, a document issue number, a title, and an attachment description (the share number, the security level, the confidentiality period, the urgency level, the issuing authority mark and the document issue number are terms known to those skilled in the art for the head section of an official document, and the title and the attachment description are terms known to those skilled in the art for the body section of an official document).
Based on this, step S112 may include the steps of:
recognizing the share number in the line-head region of each line of the text to be recognized according to a predetermined first regular expression, through the text recognition thread corresponding to the share number, to obtain a recognition result corresponding to the share number;
recognizing the security level in the line-head region of each line of the text to be recognized according to a predetermined second regular expression, through the text recognition thread corresponding to the security level, to obtain a recognition result corresponding to the security level;
recognizing the confidentiality period in the line-head region of each line of the text to be recognized according to a predetermined third regular expression, through the text recognition thread corresponding to the confidentiality period, to obtain a recognition result corresponding to the confidentiality period;
recognizing the urgency level in the line-head region of each line of the text to be recognized according to a predetermined fourth regular expression, through the text recognition thread corresponding to the urgency level, to obtain a recognition result corresponding to the urgency level;
recognizing the issuing authority mark in the line-head region of each line of the text to be recognized according to a predetermined fifth regular expression, through the text recognition thread corresponding to the issuing authority mark, to obtain a recognition result corresponding to the issuing authority mark;
recognizing the document issue number in the line-head region of each line of the text to be recognized according to a predetermined sixth regular expression, through the text recognition thread corresponding to the document issue number, to obtain a recognition result corresponding to the document issue number;
recognizing the title in the line-head region of each line of the text to be recognized according to a predetermined seventh regular expression, through the text recognition thread corresponding to the title, to obtain a recognition result corresponding to the title; and
recognizing the attachment description in the line-head region of each line of the text to be recognized according to a predetermined eighth regular expression, through the text recognition thread corresponding to the attachment description, to obtain a recognition result corresponding to the attachment description.
It is understood that, in the above example, the eight text recognition threads corresponding to the share number, the security level, the confidentiality period, the urgency level, the issuing authority mark, the document issue number, the title, and the attachment description can be executed in parallel, so that the efficiency of the recognition processing can be substantially improved.
In addition, in the above example, specific contents of the first regular expression, the second regular expression, the third regular expression, the fourth regular expression, the fifth regular expression, the sixth regular expression, the seventh regular expression, and the eighth regular expression are not limited, and may be configured according to actual application requirements.
In a specific application example, the first regular expression (the regular expression corresponding to the share number) is anchored at the start of the line, begins with ^(NO)? and ends with \d{6}, and is used to recognize six-digit Arabic numerals such as "No 123456" or "123456". The second regular expression (corresponding to the security level) is ^ secret, and is used to recognize "secret" and the like. The third regular expression (corresponding to the confidentiality period) is ^ secret [ secret by machine ] secret. The fourth regular expression (corresponding to the urgency level) is ^ Tejiaping, and is used to recognize "extra urgent" and the like. The fifth regular expression (corresponding to the issuing authority mark) is file $, which is anchored at the end of the line, and is used to recognize "common central office file" and the like. The sixth regular expression (corresponding to the document issue number) is {,9} [ ({ ] \d{4} [ ] }, and is used to recognize "Guo Fa (2012) No. 12" and the like. The seventh regular expression (corresponding to the title) is ^ about {,80} (order | decision | announcement | notice | circular | proposal | report | reply | opinion | letter | resolution | communique), and is used to recognize titles such as "Notice on Carrying Out Safe Production Month Activities". The eighth regular expression (corresponding to the attachment description) is ^ annex [: d, and is used to recognize "Attachment: 1" and the like.
In the above example, the line-head region refers to a certain number of elements (such as characters, numbers, and letters) at the beginning of each line. Because each text recognition thread only performs recognition processing on the line-head region of each line of the text to be recognized, rather than on the whole line, the recognition efficiency can be further improved.
In addition, in order to further improve both the efficiency and the accuracy of recognition, when the line-head region is recognized, the recognition can start from the very beginning of each line (the first character, number, letter, or other element of the line). That is, only when the element at the first position of the line matches the recognition element is it then checked whether the element at the second position matches the recognition element, and so on.
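The line-head recognition described above can be sketched as follows. The two patterns, the element names and the value of `HEAD_LEN` are assumptions made for illustration and are not the regular expressions of the application; the point is that each pattern is anchored at the line start and applied only to the first few characters of each line.

```python
import re

# Hypothetical patterns, anchored at the start of the line.
PATTERNS = {
    "share_number": re.compile(r"^(NO)?\s?\d{6}"),
    "attachment_description": re.compile(r"^Attachment[:：]\s?\d"),
}
HEAD_LEN = 20  # assumed size of the line-head region, in characters

def recognize(text: str, element: str) -> int:
    """Return 1 if the element is found in the line-head region of any line, else 0."""
    pattern = PATTERNS[element]
    for line in text.splitlines():
        head = line[:HEAD_LEN]   # only the line-head region is examined
        if pattern.match(head):  # matching starts at the first character of the line
            return 1
    return 0
```

Because the match is anchored, a line whose first character cannot start the pattern is rejected immediately, which corresponds to checking the first position before the second.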
For another example, in another alternative example, the plurality of recognition elements include an issuing authority mark, an issuing authority signature, a copy-to authority, a signer, a date of writing, and a printing date (the issuing authority mark and the signer are terms known to those skilled in the art for the head section of an official document, the issuing authority signature and the date of writing are terms known to those skilled in the art for the body section of an official document, and the copy-to authority and the printing date are terms known to those skilled in the art for the imprint section of an official document).
Based on this, step S112 may include the steps of:
recognizing the issuing authority mark in the text to be recognized according to organization names, through the text recognition thread corresponding to the issuing authority mark, and generating the recognition result corresponding to the issuing authority mark in response to a labeling operation performed by the user on the result of the recognition processing;
recognizing the main recipient authority in the text to be recognized according to organization names, through the text recognition thread corresponding to the main recipient authority, and generating the recognition result corresponding to the main recipient authority in response to a labeling operation performed by the user on the result of the recognition processing;
recognizing the issuing authority signature in the text to be recognized according to organization names, through the text recognition thread corresponding to the issuing authority signature, and generating the recognition result corresponding to the issuing authority signature in response to a labeling operation performed by the user on the result of the recognition processing;
recognizing the copy-to authority in the text to be recognized according to organization names, through the text recognition thread corresponding to the copy-to authority, and generating the recognition result corresponding to the copy-to authority in response to a labeling operation performed by the user on the result of the recognition processing;
recognizing the signer in the text to be recognized according to personal names, through the text recognition thread corresponding to the signer, to obtain the recognition result corresponding to the signer;
recognizing the date of writing in the text to be recognized according to dates, through the text recognition thread corresponding to the date of writing, and generating the recognition result corresponding to the date of writing in response to a labeling operation performed by the user on the result of the recognition processing; and
recognizing the printing date in the text to be recognized according to dates, through the text recognition thread corresponding to the printing date, and generating the recognition result corresponding to the printing date in response to a labeling operation performed by the user on the result of the recognition processing.
It is to be understood that, in the above example, performing the recognition processing according to organization names may mean performing recognition processing on the text to be recognized to determine whether it contains an organization name, such as "national sports central office", "city and residential building", and the like. Performing the recognition processing according to personal names may mean performing recognition processing on the text to be recognized to determine whether it contains a personal name, such as "Zhang San". Performing the recognition processing according to dates may mean performing recognition processing on the text to be recognized to determine whether it contains a date, such as "October 5, 2019".
When an organization name is recognized, whether the recognized organization name belongs to the issuing authority mark, the main recipient authority, the issuing authority signature or the copy-to authority can be further determined based on the labeling operation of the user. When a date is recognized, whether the recognized date belongs to the date of writing or the printing date can be further determined based on the labeling operation of the user.
It is understood that the two example ways of performing the recognition processing described above may be adopted at the same time. In that case, two recognition results are obtained for the issuing authority mark, and the two recognition results need to be further processed, for example by the XOR processing described above, to obtain the final recognition result for the issuing authority mark.
Moreover, because the issuing authority mark then corresponds to two text recognition threads, a mutual exclusion lock (mutex) can be set for the two text recognition threads, so that only one of them is allowed to run at a time, and the other thread runs only after the first has finished.
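The combination of two recognition threads for one recognition element, guarded by a mutual exclusion lock and merged with XOR, might look like the sketch below. Both recognizer rules (`by_pattern`, `by_org_name`) and the function name `recognize_mark` are hypothetical and chosen only to make the example runnable.

```python
import threading

def by_pattern(text: str) -> int:
    # Hypothetical rule: some line ends with the word "Document".
    return int(any(line.rstrip().endswith("Document") for line in text.splitlines()))

def by_org_name(text: str) -> int:
    # Hypothetical rule: a known organization name occurs in the text.
    return int("General Office" in text)

def recognize_mark(text: str) -> int:
    lock = threading.Lock()  # mutex shared by the two recognition threads
    results = {}

    def worker(name, fn):
        with lock:  # only one thread performs its recognition at a time
            results[name] = fn(text)

    threads = [
        threading.Thread(target=worker, args=(name, fn))
        for name, fn in (("pattern", by_pattern), ("org_name", by_org_name))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # XOR of the two 0/1 results is used as the final recognition result.
    return results["pattern"] ^ results["org_name"]
```

Note that under XOR two positive results combine to 0; the text gives XOR only as one example of a comprehensive determination, so another combination rule could be substituted in the last line.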
In the second aspect, it should be noted that, in step S120, a specific manner for constructing the target text vector is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, the first identification value may be 1, the second identification value may be 0, and thus, the constructed target text vector may be a one-dimensional vector, such as [1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1 ].
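As a minimal sketch of step S120, the target text vector can be built by placing the 0/1 recognition results in a fixed element order. The element names and their order in `ELEMENTS` are assumptions for illustration; only three elements are listed to keep the example short.

```python
# Hypothetical, fixed order of recognition elements; the vector has one
# dimension per recognition element.
ELEMENTS = ["share_number", "security_level", "title"]

def build_target_vector(results: dict[str, int]) -> list[int]:
    """Map each element to 1 (first recognition value) or 0 (second recognition value)."""
    return [1 if results.get(name, 0) else 0 for name in ELEMENTS]

vector = build_target_vector({"share_number": 1, "title": 1})
```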
In the third aspect, it should be noted that, in step S130, a specific manner of performing the update processing is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, when performing an update process based on the target position information to obtain the first text vector, in conjunction with fig. 4, step S130 may include step S131, step S132, and step S133, which are described in detail below.
Step S131, for each first identification value in the target text vector, obtaining position information of an identification element corresponding to the first identification value in the text to be identified.
In this embodiment, after obtaining the target text vector based on step S120, for each first recognition value (1 in the above example) in the target text vector, the position information in the text to be recognized of the recognition element corresponding to that first recognition value (i.e. a recognition element present in the text to be recognized) may be obtained. For example, in an alternative example, the position information may be the ordinal position, within the text to be recognized, of the last character (word or element) of the recognition element.
In step S132, for the position information of each identification element, a corresponding gaussian distribution value is obtained based on the position information and the gaussian distribution formula corresponding to the identification element.
In this embodiment, after the position information is obtained based on step S131, for the position information of each identification element, a corresponding gaussian distribution value may be calculated based on the position information and a gaussian distribution formula corresponding to the identification element.
The mean parameter and the standard deviation parameter of the Gaussian distribution formula are determined based on the position information of the recognition element in a plurality of official document text samples. For example, for the recognition element "share number", the position information of the share number in each of the official document text samples may be determined first to obtain a plurality of pieces of position information, and the mean and the standard deviation of these pieces of position information are then calculated to obtain the mean parameter and the standard deviation parameter of the Gaussian distribution formula corresponding to the share number.
That is, the identification element and the gaussian distribution formula may have a one-to-one correspondence relationship. And, the gaussian distribution formula may be:
$$p(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(X-\mu)^{2}}{2\sigma^{2}}\right)$$
where p(X) represents the Gaussian distribution value of the position information X of the recognition element, μ represents the mean parameter corresponding to the recognition element, and σ represents the standard deviation parameter corresponding to the recognition element. It is understood that one recognition element may have a plurality of pieces of position information in the text to be recognized, so that a plurality of Gaussian distribution values can be obtained; in that case, the maximum of these Gaussian distribution values can be selected for the update processing.
Step S133, for each obtained gaussian distribution value, updating the first identification value corresponding to the gaussian distribution value based on the gaussian distribution value, so as to obtain a corresponding first text vector.
In this embodiment, after obtaining the Gaussian distribution values based on step S132, for each Gaussian distribution value, the first recognition value corresponding to the Gaussian distribution value may be updated based on the Gaussian distribution value. For example, the Gaussian distribution value may be multiplied by the corresponding first recognition value to obtain an updated first recognition value. A recognition element corresponding to a second recognition value is not present in the text to be recognized and therefore has no position information, so the second recognition value is not updated. In this way, the first text vector corresponding to the target text vector may be obtained.
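Steps S131 to S133 can be sketched in Python as below, under stated assumptions: the standard deviation is taken as the population standard deviation (the text does not say which estimator is used), and the function names `gaussian`, `fit` and `first_text_vector` are hypothetical.

```python
import math
from statistics import mean, pstdev

def gaussian(x: float, mu: float, sigma: float) -> float:
    """Gaussian density p(x) with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit(sample_positions: list[float]) -> tuple[float, float]:
    """Mean and (population) standard deviation of an element's positions in the samples."""
    return mean(sample_positions), pstdev(sample_positions)

def first_text_vector(target, positions, params):
    """target: 0/1 values; positions: per-element position lists; params: per-element (mu, sigma)."""
    updated = []
    for value, pos_list, (mu, sigma) in zip(target, positions, params):
        if value == 0 or not pos_list:
            updated.append(value)  # absent element: no position, left unchanged
        else:
            # Several positions may exist; the maximum Gaussian value is used.
            updated.append(value * max(gaussian(x, mu, sigma) for x in pos_list))
    return updated
```

For instance, `first_text_vector([1, 0], [[10, 50], []], [(10, 5), (100, 10)])` keeps the second entry at 0 and replaces the first with the density at position 10, which is the larger of the two candidate values.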
For another example, in another alternative example, when performing the update process based on the weight coefficient to obtain the second text vector, in conjunction with fig. 5, step S130 may include step S134 and step S135, which are described in detail below.
Step S134, processing a plurality of official document text samples to obtain a weight coefficient.
In this embodiment, when the target text vector obtained in step S120 needs to be updated based on the weight coefficient, a plurality of document text samples may be processed to obtain the weight coefficient.
Step S135, performing update processing on the target text vector based on the weight coefficient to obtain a corresponding second text vector.
In this embodiment, after obtaining the weighting factor based on step S134, the target text vector may be updated based on the weighting factor to obtain a corresponding second text vector.
Wherein the updating process includes multiplying the weight coefficient and the target text vector.
Optionally, in the above example, the specific way of obtaining the weight coefficient based on step S134 is not limited, and may be selected according to the actual application requirement.
For example, in an alternative example, in order to make the weight coefficient more reliable, step S134 may include the following steps:
First, for each official document text sample, an element list corresponding to the sample is constructed based on the recognition elements included in the sample, where there are a plurality of official document text samples. Second, frequent n-itemsets are constructed based on the recognition elements included in the constructed element lists, so as to obtain a plurality of frequent n-itemsets, where n takes each integer from 1 to the number of the recognition elements. Then, for each frequent n-itemset, the support of the frequent n-itemset is obtained based on the number of element lists in which the itemset appears and the total number of element lists. Then, among the plurality of frequent n-itemsets, the target frequent n-itemsets are determined based on the first recognition values in the target text vector. Finally, the supports of the target frequent n-itemsets are summed to obtain the weight coefficient.
For the above steps, the present application provides a specific application example, which is described in detail as follows. In the application example, 4 official document text samples, namely a first official document text sample, a second official document text sample, a third official document text sample and a fourth official document text sample are included, and the element list of each official document text sample is shown in the following table.
Official document text sample | Element list
First official document text sample | t1, t3, t7, t9, t13, t14
Second official document text sample | t1, t2, t3, t4, t6, t8, t9, t10, t11, t12, t13, t14
Third official document text sample | t3, t4, t7, t8, t10, t11, t12, t14
Fourth official document text sample | t1, t2, t3, t4, t5, t7, t8, t12, t13, t14
Based on this, the frequent 1-itemsets can be obtained, and the support of each frequent 1-itemset can be calculated (for example, the recognition element t1 appears in 3 of the 4 element lists, so its support is 3/4), as shown in the following table:
Frequent 1-itemset | Support
{t1} | 3/4 = 0.75
{t2} | 2/4 = 0.5
... | ...
{t14} | 4/4 = 1
Furthermore, the frequent 2-itemsets can be obtained, and the support of each frequent 2-itemset can be calculated (for example, the recognition elements t1 and t2 appear together in 2 of the 4 element lists, so the support is 2/4), as shown in the following table:
Frequent 2-itemset | Support
{t1, t2} | 2/4 = 0.5
{t1, t3} | 3/4 = 0.75
... | ...
{t13, t14} | 3/4 = 0.75
In the same way, the frequent 3-itemsets through the frequent 14-itemsets, together with the support of each of them, can be obtained.
Then, based on each first recognition value in the target text vector, the target frequent n-itemsets are determined. For example, if the target text vector is [1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], the determined target frequent n-itemsets (the target frequent n-itemsets are the non-empty subsets formed from the non-zero elements of the target text vector, where ti refers to the i-th element of the target text vector) may include: {t1, t3, t4, t5, t6, t7}, {t1, t3, t4, t5, t6}, {t1, t3, t4, t5, t7}, {t1, t3, t4, t6, t7}, {t1, t3, t5, t6, t7}, {t1, t4, t5, t6, t7}, {t3, t4, t5, t6, t7}, {t1, t3, t4, t5}, and so on, down to the single-element sets {t1}, {t3}, {t4}, {t5}, {t6} and {t7}.
Therefore, the support degree corresponding to each target frequent n item set can be obtained, and then summation processing is carried out to obtain the weight coefficient.
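The worked example above can be reproduced with a short Python sketch that uses the four element lists from the table. It assumes that no minimum-support cutoff is applied (the text states none), so the weight coefficient is the sum of the supports of all non-empty subsets of the elements present in the text to be recognized; the function names are hypothetical.

```python
from itertools import combinations

# Element lists of the four official document text samples (t1..t14 as integers).
SAMPLES = [
    {1, 3, 7, 9, 13, 14},
    {1, 2, 3, 4, 6, 8, 9, 10, 11, 12, 13, 14},
    {3, 4, 7, 8, 10, 11, 12, 14},
    {1, 2, 3, 4, 5, 7, 8, 12, 13, 14},
]

def support(itemset, samples=SAMPLES) -> float:
    """Fraction of element lists that contain every element of the itemset."""
    hits = sum(1 for sample in samples if set(itemset) <= sample)
    return hits / len(samples)

def weight_coefficient(target_vector, samples=SAMPLES) -> float:
    """Sum of supports over all non-empty subsets of the elements marked 1."""
    present = [i + 1 for i, value in enumerate(target_vector) if value]
    total = 0.0
    for n in range(1, len(present) + 1):
        for subset in combinations(present, n):
            total += support(subset, samples)
    return total

def second_text_vector(target_vector, samples=SAMPLES):
    """Multiply the target text vector by the weight coefficient."""
    w = weight_coefficient(target_vector, samples)
    return [w * value for value in target_vector]
```

For a target text vector in which only t1 and t2 are present, the target itemsets are {t1}, {t2} and {t1, t2}, so the weight coefficient is 0.75 + 0.5 + 0.5 = 1.75.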
In the fourth aspect, it should be noted that, in step S140, a specific manner for determining whether the text to be recognized belongs to the official document text is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, in order to improve the reliability of the recognition of the text to be recognized, in conjunction with fig. 6, step S140 may include step S141, step S142, and step S143, which is described in detail below.
Step S141, performing vector merging processing based on the first text vector and the second text vector to obtain a third text vector.
In this embodiment, after obtaining the first text vector and the second text vector based on step S130, the first text vector and the second text vector may be subjected to vector merging processing (e.g., adding the two vectors), and thus, a third text vector may be obtained.
Step S142, obtaining a probability value that the text to be recognized belongs to the official document text based on the third text vector.
In this embodiment, after obtaining the third text vector based on step S141, a probability value that the text to be recognized belongs to the official document text may be obtained based on the third text vector.
Step S143, determining whether the text to be recognized belongs to the official document text based on the probability value and a predetermined text probability threshold.
In this embodiment, after obtaining the probability value based on step S142, it may be determined whether the text to be recognized belongs to the official document text based on the probability value and a predetermined text probability threshold (a specific value of the text probability threshold is not limited, for example, in an alternative example, the text probability threshold may be 3).
And if the probability value is greater than or equal to the text probability threshold value, judging that the text to be recognized belongs to the official document text. And if the probability value is smaller than the text probability threshold value, judging that the text to be recognized does not belong to the official document text.
Optionally, the specific manner of obtaining the probability value that the text to be recognized belongs to the official document text based on the step S142 is not limited, and may be selected according to the actual application requirements.
For example, in an alternative example, step S142 may include the steps of:
firstly, determining whether each vector value in the third text vector is smaller than a preset threshold value; secondly, updating each vector value smaller than the preset threshold value to be 0, and updating each vector value larger than or equal to the preset threshold value to be the preset threshold value; then, the sum of the updated vector values is calculated and used as the probability value of the text to be recognized belonging to the official document text.
The specific value of the preset threshold is not limited, and for example, in an alternative example, the preset threshold may be 1. That is, if the vector value is less than 1, the vector value is updated to 0; and if the vector value is greater than or equal to 1, updating the vector value to 1.
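Steps S141 to S143 can be put together in a minimal sketch, assuming that the vector merging is element-wise addition, that the preset threshold is 1 and that the text probability threshold is 3, as in the alternative examples above. The function names and the sample vectors in the usage note are hypothetical.

```python
def merge(first, second):
    """Step S141: element-wise sum of the first and second text vectors."""
    return [a + b for a, b in zip(first, second)]

def probability_value(third, preset=1.0):
    """Step S142: values below the preset threshold become 0, others become the preset value; then sum."""
    return sum(preset if value >= preset else 0.0 for value in third)

def is_official_document(first, second, text_threshold=3.0, preset=1.0):
    """Step S143: compare the probability value with the text probability threshold."""
    return probability_value(merge(first, second), preset) >= text_threshold
```

With `first = [0.05, 0, 0.02, 0.08]` and `second = [1.75, 0, 1.75, 1.75]`, three merged values reach the preset threshold, the probability value is 3, and the text is judged to be an official document text.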
With reference to fig. 7, an embodiment of the present application further provides an apparatus 100 for identifying official document text based on feature association, which is applicable to the electronic device 10. The apparatus 100 for identifying official document text based on feature association includes a to-be-recognized text recognition module 110, a text vector construction module 120, a text vector update module 130 and an official document text determination module 140.
The to-be-recognized text recognition module 110 may be configured to perform recognition processing on a to-be-recognized text based on a plurality of recognition elements included in the document text to obtain a recognition result corresponding to each recognition element, where the recognition result includes a first recognition value or a second recognition value, the first recognition value is used to represent that the to-be-recognized text has a corresponding recognition element, and the second recognition value is used to represent that the to-be-recognized text does not have a corresponding recognition element. In this embodiment, the to-be-recognized text recognition module 110 may be configured to execute step S110 shown in fig. 2, and reference may be made to the foregoing description of step S110 regarding relevant contents of the to-be-recognized text recognition module 110.
The text vector constructing module 120 may be configured to construct a target text vector based on the obtained multiple recognition results, where the dimension number of the target text vector is the number of the multiple recognition elements. In this embodiment, the text vector construction module 120 may be configured to perform step S120 shown in fig. 2, and reference may be made to the foregoing description of step S120 for relevant contents of the text vector construction module 120.
The text vector updating module 130 may be configured to update the target text vector based on pre-obtained target position information and a weighting coefficient, respectively, to obtain a corresponding first text vector and a corresponding second text vector, where the target position information includes position information of an identification element corresponding to each first identification value in the target text vector in the text to be identified, and the weighting coefficient is obtained based on processing a document text sample. In this embodiment, the text vector updating module 130 may be configured to execute step S130 shown in fig. 2, and reference may be made to the foregoing description of step S130 for relevant contents of the text vector updating module 130.
The official document text determination module 140 may be configured to determine whether the text to be recognized belongs to an official document text based on the first text vector, the second text vector and a predetermined text probability threshold. In this embodiment, the official document text determination module 140 may be configured to execute step S140 shown in fig. 2, and reference may be made to the description of step S140 for relevant contents of the official document text determination module 140.
In summary, according to the method and the device for identifying the official document based on the feature association provided by the application, the identification element of the official document is used for identifying the text to be identified to obtain the identification result, so that the target text vector can be constructed based on the identification result, and the target text vector is processed based on the position information of the identification element in the text to be identified and the weighting coefficient obtained based on the official document sample, so as to obtain the first text vector and the second text vector. Based on the method, the characteristic association between the text to be recognized and the official document text sample can be realized due to the adoption of the weight coefficient obtained based on the official document text sample, so that the official document text is effectively recognized, the problem that the official document text is difficult to effectively recognize based on the prior art is solved, and the method has high practical value.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for identifying official document text based on feature association, characterized by comprising the following steps:
identifying a text to be identified based on a plurality of identification elements of official document text to obtain an identification result corresponding to each identification element, wherein the identification result comprises a first identification value or a second identification value, the first identification value is used for representing that the text to be identified has the corresponding identification element, and the second identification value is used for representing that the text to be identified does not have the corresponding identification element;
constructing a target text vector based on the obtained plurality of identification results, wherein the number of dimensions of the target text vector is the number of the plurality of identification elements;
updating the target text vector based on pre-obtained target position information and on a weight coefficient, respectively, to obtain a corresponding first text vector and a corresponding second text vector, wherein the target position information comprises the position information, in the text to be identified, of the identification element corresponding to each first identification value in the target text vector, and the weight coefficient is obtained by processing official document text samples;
and determining whether the text to be identified belongs to an official document text based on the first text vector, the second text vector and a predetermined text probability threshold.
2. The method for identifying official document text based on feature association as claimed in claim 1, wherein the step of identifying the text to be identified based on the plurality of identification elements of official document text to obtain the identification result corresponding to each identification element comprises:
for each identification element in the plurality of identification elements of official document text, creating at least one corresponding text identification thread for the identification element;
and for each text identification thread, performing identification processing on the corresponding identification element in the text to be identified through the text identification thread to obtain the identification result corresponding to the identification element.
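As an illustration of the per-element threading in claim 2, the following is a minimal sketch rather than the patented implementation. The `RECOGNIZERS` table, its three patterns, and the encodings `FIRST_VALUE = 1` and `SECOND_VALUE = 0` are assumptions made for the example. The worker pool is sized to the number of elements so that each identification task can run on its own thread.

```python
import re
from concurrent.futures import ThreadPoolExecutor

FIRST_VALUE, SECOND_VALUE = 1, 0  # assumed encodings of "present" / "absent"

# Hypothetical per-element recognizers; the claims do not disclose the real rules.
RECOGNIZERS = {
    "copy_number": lambda text: bool(re.search(r"^\d{6}$", text, re.MULTILINE)),
    "urgency": lambda text: bool(re.search(r"^(特急|加急)", text, re.MULTILINE)),
    "title": lambda text: bool(re.search(r"关于.+的(通知|决定|意见)", text)),
}


def recognize_all(text):
    """Submit one identification task per element and collect 1/0 results by element name."""
    with ThreadPoolExecutor(max_workers=len(RECOGNIZERS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in RECOGNIZERS.items()}
        return {
            name: FIRST_VALUE if future.result() else SECOND_VALUE
            for name, future in futures.items()
        }


def target_text_vector(results):
    """Order the results by element name so the vector has one dimension per element."""
    return [results[name] for name in RECOGNIZERS]
```

The dictionary order of `RECOGNIZERS` fixes the dimension order of the target text vector, which keeps later per-dimension processing aligned with the elements.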
3. The method for identifying official document text based on feature association as claimed in claim 2, wherein the plurality of identification elements include a copy number, a security level, a confidentiality period, an urgency level, an issuing authority mark, a document issuance number, a title, and an attachment description, and the step of, for each text identification thread, performing identification processing on the corresponding identification element in the text to be identified through the text identification thread to obtain the identification result corresponding to the identification element comprises:
identifying the copy number in the line head area of each line of the text to be identified according to a predetermined first regular expression through the text identification thread corresponding to the copy number, to obtain an identification result corresponding to the copy number;
identifying the security level in the line head area of each line of the text to be identified according to a predetermined second regular expression through the text identification thread corresponding to the security level, to obtain an identification result corresponding to the security level;
identifying the confidentiality period in the line head area of each line of the text to be identified according to a predetermined third regular expression through the text identification thread corresponding to the confidentiality period, to obtain an identification result corresponding to the confidentiality period;
identifying the urgency level in the line head area of each line of the text to be identified according to a predetermined fourth regular expression through the text identification thread corresponding to the urgency level, to obtain an identification result corresponding to the urgency level;
identifying the issuing authority mark in the line head area of each line of the text to be identified according to a predetermined fifth regular expression through the text identification thread corresponding to the issuing authority mark, to obtain an identification result corresponding to the issuing authority mark;
identifying the document issuance number in the line head area of each line of the text to be identified according to a predetermined sixth regular expression through the text identification thread corresponding to the document issuance number, to obtain an identification result corresponding to the document issuance number;
identifying the title in the line head area of each line of the text to be identified according to a predetermined seventh regular expression through the text identification thread corresponding to the title, to obtain an identification result corresponding to the title;
and identifying the attachment description in the line head area of each line of the text to be identified according to a predetermined eighth regular expression through the text identification thread corresponding to the attachment description, to obtain an identification result corresponding to the attachment description.
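The line-head matching in claim 3 can be sketched as follows. This is only an illustration: the claims do not disclose the actual regular expressions or how wide the line head area is, so the four patterns in `PATTERNS` and the 20-character `HEAD_LEN` are hypothetical. The function also returns the index of the matching line, since later steps use the position of each identified element.

```python
import re

HEAD_LEN = 20  # assumed width of the "line head area", in characters

# Hypothetical patterns; the real first to eighth regular expressions are not disclosed.
PATTERNS = {
    "copy_number": re.compile(r"^\d{6}\b"),
    "security_level": re.compile(r"^(绝密|机密|秘密)"),
    "urgency": re.compile(r"^(特急|加急)"),
    "issuance_number": re.compile(r"^\S+〔\d{4}〕\d+号"),
}


def match_in_line_heads(text, pattern):
    """Return (1, line_index) for the first line whose head matches, else (0, None)."""
    for index, line in enumerate(text.splitlines()):
        head = line.strip()[:HEAD_LEN]  # only the start of each line is examined
        if pattern.search(head):
            return 1, index
    return 0, None


def identify_elements(text):
    """Apply every pattern and map element name -> (identification value, line index)."""
    return {name: match_in_line_heads(text, pat) for name, pat in PATTERNS.items()}
```

Restricting the search to the head of each line keeps body text that merely mentions, for example, a security level from being counted as the element itself.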
4. The method for identifying official document text based on feature association as claimed in claim 2 or 3, wherein the plurality of identification elements include an issuing authority mark, an issuing authority signature, a copy-recipient authority, an issuer, a date of writing and a printing date, and the step of, for each text identification thread, performing identification processing on the corresponding identification element in the text to be identified through the text identification thread to obtain the identification result corresponding to the identification element comprises:
performing identification processing on the issuing authority mark in the text to be identified according to organization names through the text identification thread corresponding to the issuing authority mark, and generating an identification result corresponding to the issuing authority mark in response to a user's identification operation on the result of the identification processing;
performing identification processing on the main recipient authority in the text to be identified according to organization names through the text identification thread corresponding to the main recipient authority, and generating an identification result corresponding to the main recipient authority in response to a user's identification operation on the result of the identification processing;
performing identification processing on the issuing authority signature in the text to be identified according to organization names through the text identification thread corresponding to the issuing authority signature, and generating an identification result corresponding to the issuing authority signature in response to a user's identification operation on the result of the identification processing;
performing identification processing on the copy-recipient authority in the text to be identified according to organization names through the text identification thread corresponding to the copy-recipient authority, and generating an identification result corresponding to the copy-recipient authority in response to a user's identification operation on the result of the identification processing;
identifying the issuer in the text to be identified according to personal names through the text identification thread corresponding to the issuer, to obtain an identification result corresponding to the issuer;
performing identification processing on the date of writing in the text to be identified according to dates through the text identification thread corresponding to the date of writing, and generating an identification result corresponding to the date of writing in response to a user's identification operation on the result of the identification processing;
and performing identification processing on the printing date in the text to be identified according to dates through the text identification thread corresponding to the printing date, and generating an identification result corresponding to the printing date in response to a user's identification operation on the result of the identification processing.
5. The method according to claim 1, wherein the step of updating the target text vector based on the pre-obtained target position information and the weight coefficient to obtain the corresponding first text vector and second text vector comprises:
for each first identification value in the target text vector, obtaining the position information of the identification element corresponding to the first identification value in the text to be identified;
for the position information of each identification element, obtaining a corresponding Gaussian distribution value based on the position information and the Gaussian distribution formula corresponding to that identification element, wherein the mean parameter and the standard deviation parameter of the Gaussian distribution formula are determined based on the position information of the identification element in a plurality of official document text samples;
and for each obtained Gaussian distribution value, updating the first identification value corresponding to the Gaussian distribution value based on that Gaussian distribution value, to obtain the corresponding first text vector.
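The position-based update in claim 5 can be sketched as below, under stated assumptions. The claim does not say how position is measured or exactly how the Gaussian value updates the first identification value, so this example assumes a numeric position per element (for instance a line index), fits the mean and population standard deviation from sample positions, and replaces each first identification value (assumed to be 1) by the Gaussian density at the observed position. All function names are hypothetical.

```python
import math


def gaussian(x, mean, std):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))


def fit_position_params(sample_positions):
    """Mean and population standard deviation of one element's positions in the samples."""
    n = len(sample_positions)
    mean = sum(sample_positions) / n
    variance = sum((p - mean) ** 2 for p in sample_positions) / n
    return mean, math.sqrt(variance)


def first_text_vector(target_vector, positions, params):
    """positions[i] is element i's position (None when absent); params[i] is (mean, std)."""
    updated = []
    for value, pos, (mean, std) in zip(target_vector, positions, params):
        # Absent elements (second identification value, assumed 0) stay at 0.
        updated.append(value * gaussian(pos, mean, std) if value == 1 else 0.0)
    return updated
```

An element found close to where it usually sits in the samples keeps a high value, while one found far from its typical position is scaled down.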
6. The method for identifying official document text based on feature association as claimed in claim 1 or 5, wherein the step of updating the target text vector based on the pre-obtained target position information and the weight coefficient to obtain the corresponding first text vector and second text vector comprises:
processing a plurality of official document text samples to obtain a weight coefficient;
and updating the target text vector based on the weight coefficient to obtain a corresponding second text vector, wherein the updating comprises multiplying the weight coefficient by the target text vector.
7. The method of claim 6, wherein the step of processing the plurality of official document text samples to obtain the weight coefficients comprises:
for each official document text sample, constructing an element list corresponding to the official document text sample based on the identification elements included in the official document text sample, wherein the number of the official document text samples is multiple;
constructing frequent n-item sets based on the identification elements included in the plurality of constructed element lists to obtain a plurality of frequent n-item sets, wherein n takes each integer from 1 to the number of the plurality of identification elements;
for each frequent n-item set, obtaining the support of the frequent n-item set based on the number of times the frequent n-item set appears in the plurality of element lists and the number of element lists;
determining, among the plurality of frequent n-item sets, target frequent n-item sets based on the first identification values in the target text vector;
and summing the supports of the target frequent n-item sets to obtain the weight coefficient.
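The weight-coefficient computation of claims 6 and 7 can be sketched as follows. Several points are assumptions rather than statements of the claims: support is taken as occurrences divided by the number of element lists, an item set counts as frequent when its support reaches a hypothetical `min_support`, and a frequent item set is a target item set when all of its elements were found in the text to be identified. The enumeration is brute force and exponential in the number of elements, which is acceptable only for a small element vocabulary.

```python
from itertools import combinations


def frequent_itemsets(element_lists, min_support=0.5):
    """Return {itemset: support} for every n-item set meeting the assumed minimum support."""
    transactions = [frozenset(lst) for lst in element_lists]
    items = sorted(set().union(*transactions))
    result = {}
    for n in range(1, len(items) + 1):  # n runs from 1 to the number of elements
        for combo in combinations(items, n):
            itemset = frozenset(combo)
            # Support: share of element lists that contain the whole item set.
            support = sum(itemset <= t for t in transactions) / len(transactions)
            if support >= min_support:
                result[itemset] = support
    return result


def weight_coefficient(present_elements, itemsets):
    """Sum the supports of frequent item sets fully contained in the present elements."""
    present = frozenset(present_elements)
    return sum(support for itemset, support in itemsets.items() if itemset <= present)


def second_text_vector(target_vector, weight):
    """Multiply the weight coefficient by the target text vector, as in claim 6."""
    return [weight * value for value in target_vector]
```

Elements that frequently co-occur in the samples raise the weight more than isolated ones, which is one way to read the feature association between the text and the samples.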
8. The method for identifying official document texts based on feature association as claimed in claim 1, wherein the step of determining whether the texts to be identified belong to official document texts based on the first text vector, the second text vector and a predetermined text probability threshold comprises:
performing vector combination processing based on the first text vector and the second text vector to obtain a third text vector;
obtaining a probability value of the text to be recognized belonging to the official document text based on the third text vector;
and judging whether the text to be recognized belongs to the official document text or not based on the probability value and a predetermined text probability threshold, wherein if the probability value is greater than or equal to the text probability threshold, the text to be recognized is judged to belong to the official document text.
9. The method for identifying official document texts based on feature association as claimed in claim 8, wherein the step of obtaining probability values of the texts to be identified belonging to the official document texts based on the third text vector comprises:
determining whether each vector value in the third text vector is smaller than a preset threshold value;
updating each vector value smaller than the preset threshold value to be 0, and updating each vector value larger than or equal to the preset threshold value to be the preset threshold value;
and calculating the sum of the updated vector values, and taking the sum as the probability value of the text to be recognized belonging to the official document text.
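The decision steps of claims 8 and 9 can be sketched as below. The claims do not define the vector combination processing, so an element-wise sum is assumed here purely for illustration. The clipping rule follows claim 9: values below the preset threshold become 0, values at or above it become the preset threshold, and the sum is used as the probability value. The function and parameter names are hypothetical.

```python
def probability_value(third_vector, preset_threshold):
    """Clip each vector value to 0 or the preset threshold, then sum the clipped values."""
    return sum(
        preset_threshold if value >= preset_threshold else 0.0
        for value in third_vector
    )


def is_official_document(first_vector, second_vector, preset_threshold, text_probability_threshold):
    """Return (decision, probability value) for the text to be recognized."""
    # Assumed combination: element-wise sum of the first and second text vectors.
    third_vector = [a + b for a, b in zip(first_vector, second_vector)]
    probability = probability_value(third_vector, preset_threshold)
    return probability >= text_probability_threshold, probability
```

With d dimensions the sum is at most d times the preset threshold, so the preset threshold would need to be no larger than 1/d for the result to stay within the range of a probability.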
10. An apparatus for identifying official document text based on feature association, comprising:
a to-be-identified text identification module, configured to identify a text to be identified based on a plurality of identification elements of official document text to obtain an identification result corresponding to each identification element, wherein the identification result comprises a first identification value or a second identification value, the first identification value is used for representing that the text to be identified has the corresponding identification element, and the second identification value is used for representing that the text to be identified does not have the corresponding identification element;
a text vector construction module, configured to construct a target text vector based on the obtained plurality of identification results, wherein the number of dimensions of the target text vector is the number of the plurality of identification elements;
a text vector updating module, configured to update the target text vector based on pre-obtained target position information and on a weight coefficient, respectively, to obtain a corresponding first text vector and a corresponding second text vector, wherein the target position information comprises the position information, in the text to be identified, of the identification element corresponding to each first identification value in the target text vector, and the weight coefficient is obtained by processing official document text samples;
and an official document text determination module, configured to determine whether the text to be identified belongs to an official document text based on the first text vector, the second text vector and a predetermined text probability threshold.
CN202011551817.XA 2020-12-24 2020-12-24 Document text recognition method and device based on feature association Active CN112507968B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011551817.XA CN112507968B (en) 2020-12-24 2020-12-24 Document text recognition method and device based on feature association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011551817.XA CN112507968B (en) 2020-12-24 2020-12-24 Document text recognition method and device based on feature association

Publications (2)

Publication Number Publication Date
CN112507968A true CN112507968A (en) 2021-03-16
CN112507968B CN112507968B (en) 2024-03-05

Family

ID=74923389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011551817.XA Active CN112507968B (en) 2020-12-24 2020-12-24 Document text recognition method and device based on feature association

Country Status (1)

Country Link
CN (1) CN112507968B (en)

Also Published As

Publication number Publication date
CN112507968B (en) 2024-03-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant