CN115098629A - File processing method and device, server and readable storage medium - Google Patents

File processing method and device, server and readable storage medium

Info

Publication number
CN115098629A
Authority
CN
China
Prior art keywords
file
paragraph
audited
type
types
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210709132.6A
Other languages
Chinese (zh)
Inventor
李宽
蒋宁
王洪斌
吴海英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202210709132.6A
Publication of CN115098629A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Databases & Information Systems (AREA)
  • Technology Law (AREA)
  • Primary Health Care (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a file processing method, a file processing device, a server and a readable storage medium, and belongs to the technical field of file processing. The file processing method comprises the following steps: acquiring a file to be audited, and determining the file type of the file to be audited; acquiring a file specification corresponding to the file type and a plurality of first paragraph types of the file specification; classifying the paragraphs of the file to be audited to obtain a plurality of second paragraph types of the file to be audited; and determining the auditing result of the file to be audited according to the plurality of first paragraph types and the plurality of second paragraph types. By adopting the method and the device, the file auditing efficiency can be improved.

Description

File processing method and device, server and readable storage medium
Technical Field
The application belongs to the technical field of file processing, and particularly relates to a file processing method, a file processing device, a server and a readable storage medium.
Background
With the continuous development of smart devices (such as smartphones, notebook computers, and desktop computers), files that must be signed for different services (such as contracts) can be completed on smart devices, and the business volume of these services is growing rapidly. To ensure that the files are compliant, they must be audited. At present, auditing usually means that a staff member checks the contents of each file item by item, so audit efficiency is low.
Disclosure of Invention
The embodiment of the application provides a file processing method, a file processing device, a server and a readable storage medium, which are used for improving the efficiency of file auditing.
In a first aspect, a file processing method is provided, including:
acquiring a file to be audited, and determining the file type of the file to be audited;
acquiring a file specification corresponding to the file type and a plurality of first paragraph types of the file specification;
classifying paragraphs of the file to be audited to obtain a plurality of second paragraph types of the file to be audited;
and determining the auditing result of the file to be audited according to the plurality of first paragraph types and the plurality of second paragraph types.
In a second aspect, a file processing device is provided, the device comprising:
the determining module is used for acquiring a file to be audited and determining the file type of the file to be audited;
the acquisition module is used for acquiring a file specification corresponding to the file type and a plurality of first paragraph types of the file specification;
the classification module is used for classifying the paragraphs of the file to be audited to obtain a plurality of second paragraph types of the file to be audited;
and the auditing module is used for determining the auditing result of the file to be audited according to the plurality of first paragraph types and the plurality of second paragraph types.
In a third aspect, a server is provided, comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions, when executed by the processor, implementing the steps of the method according to the first aspect.
In a fourth aspect, a readable storage medium is provided, on which a program or instructions are stored, which when executed by a processor, implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the steps of the method according to the first aspect.
In the embodiments of the application, a file to be audited is first obtained. Because there are multiple file types and each file type has a different file specification, the file type of the file to be audited must be determined. The file specification corresponding to that file type and a plurality of first paragraph types of the file specification are then obtained, the paragraphs of the file to be audited are classified to obtain a plurality of second paragraph types, and finally the audit result of the file to be audited is determined according to the first paragraph types and the second paragraph types. Classifying the paragraphs of the file specification and of the file to be audited yields a first paragraph type for each paragraph of the file specification and a second paragraph type for each paragraph of the file to be audited. The file can then be audited simply by determining whether the paragraph type of each paragraph of the file to be audited matches a paragraph type in the file specification. No word-by-word matching is needed, which improves audit efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a document processing method provided by an embodiment of the present application;
FIG. 2 is a schematic view of a document processing apparatus provided in one embodiment of the present application;
FIG. 3 is a schematic structural diagram of a server provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar objects and do not necessarily describe a particular order or sequence. It should be understood that terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can be practiced in sequences other than those illustrated or described herein. Objects distinguished by "first", "second", and the like are usually of one class, and their number is not limited; for example, there may be one first object or more than one. In addition, "and/or" in the description and claims means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.
To ensure the quality of files, the files need to be audited. However, because a file may run from dozens to hundreds of pages and the number of files is large, manual auditing is time-consuming and labor-intensive, inefficient, and prone to omissions. To solve these problems, this scheme provides a file processing method in which files are audited by computer, which automates the audit and improves audit efficiency.
The following describes in detail a file processing method, an apparatus, a server and a readable storage medium provided in the embodiments of the present application with specific embodiments and application scenarios thereof in conjunction with the accompanying drawings.
As shown in FIG. 1, an embodiment of the present application provides a file processing method, which may include steps S101 to S104.
In S101, a file to be checked is obtained, and a file type of the file to be checked is determined.
The file to be audited may be a contract, and the file types include, for example, a purchase and sale contract, a supply and use contract (for example, for water, gas, or heat), a gift contract, a loan contract, a lease contract, a construction project contract, a transportation contract, a technology contract, a storage contract, an entrustment contract, and the like. Each type of contract has a corresponding contract specification, which is the standard against which that type of contract is audited. A specification consists of written requirements, for example: "the contract must state that each party has delivered an original of the transaction contract that it has duly signed and that bears its official seal", or "the contract must state that the proportion of investment in XX products is not higher than 10%".
The file type of the file to be audited may be determined in various ways. For example, the file may be classified according to file type information carried by the file itself, or the file type may be determined from the file title. Alternatively, keywords and paragraph information may be extracted from the file and passed to a file classification model; this approach is described in detail in the following embodiments and is not elaborated here. The paragraph information includes a head paragraph, a middle paragraph, and an end paragraph.
In S102, a file specification corresponding to the file type and a plurality of first paragraph types of the file specification are acquired.
After the file type of the file to be audited is determined, the file specification corresponding to that file type is obtained. The file specification includes a plurality of paragraphs, each with a corresponding paragraph type. The paragraph types of different paragraphs may be the same or different, depending on the specific content of each paragraph.
It should be noted that the plurality of first paragraph types of the file specification may be classified in advance and stored in a database, from which they can be retrieved directly when needed. Alternatively, they may be determined by classification as part of the present method and then stored in the database for direct use later. The present application does not limit this.
In S103, the paragraphs of the file to be audited are classified to obtain a plurality of second paragraph types of the file to be audited.
In the embodiments of the application, the file to be audited includes a plurality of paragraphs. Each paragraph can be classified and then matched against the paragraph types in the file specification to determine whether the file passes the audit. Classifying paragraphs allows the file to be audited quickly without word-by-word matching, which improves audit efficiency. The file to be audited is therefore divided into a plurality of paragraphs, and each paragraph is classified according to a preset method to obtain the plurality of second paragraph types of the file to be audited.
The preset method may take several forms: paragraphs may be classified according to the keywords in each paragraph, or according to the semantics of each paragraph. This is described in detail in the following embodiments and is not limited here.
In S104, an audit result of the file to be audited is determined according to the plurality of first paragraph types and the plurality of second paragraph types.
It should be noted that the plurality of second paragraph types of the file to be audited are matched against the plurality of first paragraph types of the file specification. If the two sets of paragraph types match in one-to-one correspondence, or if every first paragraph type of the file specification has a matching second paragraph type in the file to be audited, the file passes the audit. Otherwise, the paragraphs that cannot be matched are marked or flagged, and a manual review is then performed.
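The matching rule in S104 can be illustrated with a minimal Python sketch. It assumes that paragraph types are represented as plain strings; the function name and the example labels are hypothetical and do not come from the application.

```python
from collections import Counter


def audit_by_paragraph_types(first_types, second_types):
    """Compare specification paragraph types (first) with file paragraph types (second).

    Returns (passed, missing), where `missing` lists the specification
    paragraph types that have no match in the file to be audited.
    """
    spec_counts = Counter(first_types)
    file_counts = Counter(second_types)

    # Condition 1: one-to-one correspondence (same types, same multiplicity).
    if spec_counts == file_counts:
        return True, []

    # Condition 2: every specification type is found somewhere in the file.
    missing = sorted(t for t in spec_counts if t not in file_counts)
    return (not missing), missing
```

When `passed` is false, the `missing` list identifies what would be marked for manual review.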
In the embodiments of the application, a file to be audited is first obtained. Because there are multiple file types and each file type has a different file specification, the file type of the file to be audited must be determined. The file specification corresponding to that file type and a plurality of first paragraph types of the file specification are then obtained, the paragraphs of the file to be audited are classified to obtain a plurality of second paragraph types, and finally the audit result of the file to be audited is determined according to the first paragraph types and the second paragraph types. Classifying the paragraphs of the file specification and of the file to be audited yields a first paragraph type for each paragraph of the file specification and a second paragraph type for each paragraph of the file to be audited. The file can then be audited simply by determining whether the paragraph type of each paragraph of the file to be audited matches a paragraph type in the file specification. No word-by-word matching is needed, which improves audit efficiency.
In one possible embodiment of the present application, determining the file type of the file to be audited may include: extracting file text data in a file to be checked; and determining the file type of the file to be checked according to the characteristic field information in the file text data.
That is, before the file type is determined, the file to be audited may be preprocessed to obtain its file text data. For example, a file in PDF, Word, PPT, or a similar format is converted into plain text so that keywords or paragraph information can be extracted later. After the file text data is extracted, characteristic field information such as keywords and structural feature information can be obtained from it, and the file type of the file to be audited can be determined from this information. The file specification corresponding to that file type can then be identified, so that the computer determines the audit result more accurately and audit efficiency is improved.
The file type can be judged quickly from certain keywords in the file text data, such as the file title. For example, if the title is "house lease contract", the file to be audited can be identified as a lease file. Lease files, however, come in many kinds, such as movable property leases, real property leases, general leases, special leases, fixed-term leases, and non-fixed-term leases.
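A title-based judgment of this kind might look like the following sketch. The keyword table is a hypothetical illustration, not a list taken from the application, and the function only yields a coarse file type such as "lease contract".

```python
# Hypothetical mapping from title keywords to coarse file types.
TITLE_KEYWORDS = {
    "lease": "lease contract",
    "gift": "gift contract",
    "transportation": "transportation contract",
}


def file_type_from_title(title):
    """Return a coarse file type based on the title, or None if no keyword matches."""
    lowered = title.lower()
    for keyword, file_type in TITLE_KEYWORDS.items():
        if keyword in lowered:
            return file_type
    return None
```

A title such as "House Lease Contract" maps to the coarse type only; it does not distinguish the kinds of lease listed above.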
In a possible embodiment of the present application, determining the file type of the file to be audited according to the characteristic field information in the file text data may include: acquiring keywords and structural feature information from the file text data; and inputting the keywords and the structural feature information into a file classification model for classification to obtain the file type of the file to be audited. The structural feature information includes the title, a head paragraph, a middle paragraph, and an end paragraph of the file.
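As a rough illustration of the structural feature information, the sketch below pulls a title, head paragraph, middle paragraph, and end paragraph out of plain text. It assumes that the first non-empty line is the title and that every later non-empty line is one paragraph; both assumptions, and the function name, are illustrative only.

```python
def structural_features(text):
    """Extract title, head, middle, and end paragraphs from plain file text.

    Assumption: the first non-empty line is the title, and each remaining
    non-empty line is one paragraph.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    empty = {"title": "", "head": "", "middle": "", "end": ""}
    if not lines:
        return empty
    title, paragraphs = lines[0], lines[1:]
    if not paragraphs:
        return {**empty, "title": title}
    return {
        "title": title,
        "head": paragraphs[0],
        "middle": paragraphs[len(paragraphs) // 2],
        "end": paragraphs[-1],
    }
```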
In the embodiments of the application, the file to be audited can be classified by the file classification model to determine its file type. Because the model processes both keywords and structural feature information, such as the title, head paragraph, middle paragraph, and end paragraph of the file, the classification is more accurate, which further improves audit efficiency.
The file classification model may be built on an existing, previously obtained model. That is, the file classification model can be trained on an existing model structure without constructing a new one, which saves resources. The file classification model may be any suitable neural network model.
Training the file classification model may include: obtaining keywords and structural feature information from a plurality of files carrying file type labels; inputting the keywords and structural feature information into an initial classification model to obtain predicted file types; and adjusting the parameters of the initial classification model according to the loss between the predicted file types and the file type labels carried by the files, until that loss is smaller than a first preset value, at which point the file classification model is obtained.
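The stopping rule described here (adjust parameters until the loss falls below a first preset value) can be sketched as follows. The application calls for a neural network model; this sketch deliberately substitutes a tiny binary logistic regression in pure Python so that it stays self-contained. The feature vectors, the threshold value, and all names are hypothetical.

```python
import math


def train_until_loss_below(samples, labels, threshold=0.1, lr=0.5, max_epochs=10000):
    """Train a binary logistic regression by gradient descent.

    samples: list of equal-length numeric feature vectors
    labels:  list of 0/1 file type labels
    Training stops once the mean cross-entropy loss is below `threshold`
    (standing in for the "first preset value") or after `max_epochs`.
    """
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
            loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        loss /= n
        if loss < threshold:
            break  # loss between predictions and labels is small enough
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias, loss


def predict(weights, bias, x):
    """Return the predicted label (1 or 0) for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0 else 0
```

In this toy setup each feature could be a keyword indicator, for example whether the title contains "lease"; a real system would use richer features and a multi-class model.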
In the application, the obtained keywords and structural feature information are input into the file classification model for classification to obtain the file type of the file to be audited. Compared with judging the file type directly from a keyword such as the title, classification based on both keywords and structural feature information is more accurate, which further improves audit efficiency.
In one possible embodiment of the present application, obtaining a plurality of first paragraph types of the file specification may include: classifying the paragraphs of the file specification according to a preset method to obtain a plurality of first paragraph types of the file specification.
That is, if the database holds no paragraph types for the file specification corresponding to the file type of the file to be audited, the paragraphs of the file specification can be classified by the preset method to obtain the plurality of first paragraph types. These first paragraph types can then be stored in the database for later use, which improves audit efficiency.
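This look-up-or-classify behavior can be sketched as below. A plain dictionary stands in for the database, and `classify` is any paragraph classifier supplied by the caller; both are assumptions made for illustration.

```python
def get_first_paragraph_types(file_type, store, spec_paragraphs, classify):
    """Return the first paragraph types for a file specification.

    store:           dict standing in for the database, keyed by file type
    spec_paragraphs: paragraphs of the file specification for `file_type`
    classify:        callable mapping one paragraph to its paragraph type
    """
    if file_type in store:
        return store[file_type]  # classified earlier, reuse directly
    types = [classify(paragraph) for paragraph in spec_paragraphs]
    store[file_type] = types  # save for subsequent audits
    return types
```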
The preset method may take several forms: paragraphs may be classified according to the keywords in each paragraph, or according to the semantics of each paragraph. In this embodiment, the file specification is classified in the same way as the paragraphs of the file to be audited, as described below.
The preset method may include: dividing a file into a plurality of paragraphs and acquiring the keywords in each paragraph; determining the tag type and the specification type of each paragraph according to the keywords; and obtaining a plurality of paragraph types of the file according to the tag type and the specification type of each paragraph. The tag type characterizes the attribute of the paragraph, and the specification type is either a quantization specification type or a descriptive specification type.
The file may be a file to be checked or an approved file, or a file specification.
The file specification may include a plurality of tag types; for example, a tag type indicates whether a paragraph concerns rights, obligations, amounts, or another aspect of the file. The quantization specification type means that the paragraph contains definite entities and relationships between them. For example, in "the file must state that the proportion of investment in XX product is not higher than 10%", the entities are "XX product" and "10%", and the relationship between them is "not higher than". The descriptive specification type means that the paragraph contains no specific entity; it is a descriptive requirement, and the file must comply with it semantically. An example is "the file must state that each party has delivered an original of the transaction file that it has duly signed and that bears its official seal". Classifying the file to be audited and the file specification by the preset method makes the type of every paragraph clear, which facilitates the subsequent audit and further improves audit efficiency.
In the embodiments of the present application, the paragraph type of a paragraph is the combination of its tag type and its specification type. For example, if the tag type of a paragraph is the right attribute and its specification type is the quantization specification type, the paragraph type is "right + quantization specification type", and the paragraph must be matched against a "right + quantization specification type" paragraph in the file specification. Dividing paragraphs by tag type and specification type makes the paragraph types more precise, so matching the file to be audited against the file specification is more accurate and audit efficiency is improved.
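The combination of tag type and specification type can be sketched with simple keyword rules. Everything specific here is an assumption made for illustration: the keyword lists, the rule that a paragraph containing a number is of the quantization specification type, and the one-paragraph-per-line splitting.

```python
import re

# Hypothetical keywords for each tag type; checked in this order.
TAG_KEYWORDS = {
    "right": ("has the right", "entitled to"),
    "obligation": ("shall", "must"),
    "amount": ("amount", "price", "fee"),
}

# Assumption: a paragraph mentioning a number (optionally a percentage)
# is treated as a quantization specification.
QUANTITY = re.compile(r"\d+(?:\.\d+)?\s*%?")


def tag_type(paragraph):
    lowered = paragraph.lower()
    for tag, keywords in TAG_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return tag
    return "other"


def specification_type(paragraph):
    return "quantization" if QUANTITY.search(paragraph) else "descriptive"


def paragraph_type(paragraph):
    """Paragraph type = tag type + specification type."""
    return f"{tag_type(paragraph)}+{specification_type(paragraph)}"


def classify_file(text):
    """Split text into paragraphs (one per non-empty line) and classify each."""
    paragraphs = [line.strip() for line in text.split("\n") if line.strip()]
    return [paragraph_type(paragraph) for paragraph in paragraphs]
```

Applying `classify_file` to both the file specification and the file to be audited gives the first and second paragraph types in a common format. Classification by semantics, which the text also allows, would replace the keyword rules with a semantic model.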
In a possible embodiment of the present application, determining a result of a review of a to-be-reviewed file according to a plurality of first paragraph types and a plurality of second paragraph types may include:
if at least one of the plurality of second paragraph types does not match any of the plurality of first paragraph types, determining that the audit result is that the file to be audited fails the audit.
That is, the paragraph types of the two are matched by checking whether the tag type of each paragraph of the file to be audited matches the tag type of a paragraph in the file specification. If at least one of the second paragraph types matches none of the first paragraph types, the file to be audited contains content beyond what the file specification provides for. In that case the audit result is that the file fails the audit, that is, it cannot pass the automatic audit. The file can then be checked manually to judge whether the excess content is compliant, which reduces audit loopholes and improves audit quality.
Because each paragraph of a file has both a tag type and a specification type, once the tag types of the paragraphs of the file to be audited are found to match the tag types of the paragraphs in the file specification, the specification types must be examined further to determine whether the file to be audited complies with the file specification, which further improves audit quality.
In one possible embodiment of the present application, the document processing method may further include the following steps.
Step one: if the plurality of second paragraph types match the plurality of first paragraph types, the specification type of each paragraph of the file to be audited is further judged.
The specification types include the quantization specification type and the descriptive specification type. The quantization specification type means that the paragraph contains definite entities and relationships between them. The descriptive specification type means that the paragraph contains no specific entity; it is a descriptive requirement that must be complied with semantically. These types have been described in detail in the above embodiments and are not repeated here.
Step two: if the specification type of at least one paragraph of the file to be audited is the quantization specification type, the entities and entity relationships in the paragraphs of the file to be audited are obtained. Because a paragraph of the quantization specification type contains definite entities whose relationships must be specified, the entities in the paragraph and the relationships between them need to be obtained once the specification type is determined to be the quantization specification type.
Step three: if the entities and entity relationships in each paragraph of the file to be audited match those in the file specification, it is judged whether the file to be audited contains a paragraph whose specification type is the descriptive specification type.
The entities and entity relationships of the file specification may be stored in the database and obtained together with the file specification, or they may be extracted from the paragraphs of the file specification once the specification type of a paragraph of the file to be audited is determined to be the quantization specification type. This depends on the actual application and is not limited in this embodiment.
Matching the entities and entity relationships in each paragraph of the file to be audited against those in the file specification may specifically mean judging whether the entities are consistent and whether the relationships between the entities are consistent. For example, a paragraph of the file to be audited reads "the proportion of the money invested in XX is lower than 10%", where the entities are "money", "XX", and "10%", and the entity relationship is "lower than". The corresponding paragraph in the file specification reads "the file must state that the proportion of investment in XX is not higher than 10%", where the entities are "XX" and "10%", and the entity relationship is "not higher than". If the entities and entity relationships do not match, step four is performed. If they match, it is judged whether the file to be audited contains a paragraph of the descriptive specification type; if it contains no such paragraph, the audit result is that the file to be audited passes the audit.
Step four: if the entities and entity relationships in at least one paragraph of the file to be audited do not match the entities and entity relationships in the file specification, determine that the audit result is that the file to be audited does not pass the audit.
It should be noted that the fact that the entity and the entity relationship in the paragraph of the document to be audited do not match the entity and the entity relationship in the document specification may mean that there is no corresponding paragraph in the document specification, or that the entity and the entity relationship in the paragraph of the document to be audited are different from the entity and the entity relationship in the paragraph of the document specification.
For example, the paragraph of the file to be audited is "the proportion of the money invested in XX is lower than 10%", where the entities are: money, XX and 10%, and the entity relationship is: lower than. If the file specification has no corresponding paragraph, the audit result is determined to be that the file to be audited does not pass the audit. For another example, the paragraph of the file to be audited is "the proportion of the money invested in XX is not lower than 10%", where the entities are: money, XX and 10%, and the entity relationship is: not lower than; the paragraph of the file specification is "the file must state that the proportion invested in XX is not higher than 10%", where the entities are: XX and 10%, and the entity relationship is: not higher than. The entities in the two paragraphs match but the entity relationships do not, so the audit result is determined to be that the file to be audited does not pass the audit.
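The two failure cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the application: the `Clause` type and the `audit_quantitative` function are hypothetical names, entities are assumed to match when every entity of the specification paragraph appears in the audited paragraph, and entity relationships are assumed to match only when they are the same string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clause:
    """Entities and entity relationship extracted from one paragraph (hypothetical type)."""
    entities: frozenset
    relation: str


def audit_quantitative(paragraph: Clause, spec: Optional[Clause]) -> bool:
    """Return True if the audited paragraph matches the specification paragraph.

    Assumptions of this sketch: entities match when every entity required by
    the specification appears in the paragraph; relationships match only when
    they are the same string.
    """
    if spec is None:
        # No corresponding paragraph in the file specification: not matched.
        return False
    entities_match = spec.entities <= paragraph.entities
    relation_match = spec.relation == paragraph.relation
    return entities_match and relation_match


spec = Clause(frozenset({"XX", "10%"}), "not higher than")
para = Clause(frozenset({"money", "XX", "10%"}), "not lower than")

print(audit_quantitative(para, spec))  # entities match, relationship differs
print(audit_quantitative(para, None))  # no corresponding specification paragraph
```

Both calls report a mismatch, which corresponds to the audit not being passed in step four. A real system would likely need a richer comparison of relationships than string equality.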
In this embodiment of the application, when the plurality of second paragraph types of the file to be audited match the plurality of first paragraph types in the file specification, the specification type of each paragraph of the file to be audited is further examined to determine whether the file to be audited conforms to the file specification. The audit thus proceeds in stages, which makes the audit process clearer and can improve audit quality.
In one possible embodiment of the present application, the file processing method may further include the following steps.
Step one: if the specification type of at least one paragraph of the file to be audited is the descriptive specification type, obtain the keywords of each paragraph of the file to be audited. Because a paragraph of the descriptive specification type has no definite entity and must be judged by its semantics, the semantics of the paragraph can be determined from the keywords in the paragraph in order to determine whether the paragraph matches the file specification.
The keywords may be extracted by looking up a keyword library: if a word in the paragraph appears in the keyword library, that word is extracted as a keyword of the paragraph. The keywords may also be obtained through a sequence labeling model. The sequence labeling model may reuse the structure of an existing, pre-obtained model, so that it only needs to be trained on that structure; the model structure does not have to be rebuilt, which saves resources. The sequence labeling model may be a Hidden Markov Model (HMM), a Maximum Entropy Model, or a Conditional Random Field (CRF), and is not limited in this embodiment.
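The keyword-library variant can be illustrated as follows. This is a minimal sketch: the function name `extract_keywords`, the sample library, and the use of case-insensitive substring matching are assumptions made for the example, and the sequence labeling variant is not shown.

```python
def extract_keywords(paragraph: str, keyword_library: set) -> list:
    """Return the library keywords that occur in the paragraph, in order of first appearance."""
    text = paragraph.lower()
    # Record the position of each library keyword found in the paragraph.
    hits = [(text.find(kw.lower()), kw) for kw in keyword_library if kw.lower() in text]
    return [kw for _, kw in sorted(hits)]


library = {"proportion", "invested", "liability"}  # hypothetical keyword library
paragraph = "The proportion of the money invested in XX is lower than 10%"
print(extract_keywords(paragraph, library))
```

Substring matching is used here because it does not depend on word segmentation; a production system would probably tokenize the text first.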
Step two: input the keywords of each paragraph of the file to be audited into a pre-obtained semantic model for semantic matching, and detect whether the description of each paragraph of the file to be audited semantically meets the requirements of the file specification.
The obtained keywords are input into the pre-trained semantic model, which outputs the semantics of the keywords; for example, it can determine whether a word is a place name, a date, a book, a song, and so on.
The semantic model may reuse the structure of an existing, pre-obtained model: it only needs to be trained on the basis of the existing model structure, so the model structure does not have to be rebuilt, which saves resources.
Training the semantic model may include: inputting a plurality of words carrying semantic labels into an initial semantic model to obtain predicted semantics for the words, and adjusting the parameters of the initial semantic model according to the loss between the predicted semantics and the semantic labels, until that loss is less than a second preset value, at which point the semantic model is obtained.
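The stopping rule described above (train until the loss falls below a preset value) can be sketched with a toy classifier. The text does not specify the model architecture, so the softmax classifier, the hand-made feature vectors, the learning rate, and the threshold of 0.05 used below are all assumptions chosen only to make the loop concrete.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_semantic_model(samples, num_labels, threshold, lr=0.5, max_epochs=10000):
    """Train a tiny softmax classifier until the mean loss is below `threshold`.

    `samples` is a list of (feature_vector, label_index) pairs; `threshold`
    plays the role of the second preset value.
    """
    dim = len(samples[0][0])
    weights = [[0.0] * dim for _ in range(num_labels)]
    loss = float("inf")
    for _ in range(max_epochs):
        grads = [[0.0] * dim for _ in range(num_labels)]
        loss = 0.0
        for features, label in samples:
            scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
            probs = softmax(scores)
            loss -= math.log(probs[label])  # cross-entropy against the semantic label
            for k in range(num_labels):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(dim):
                    grads[k][j] += err * features[j]
        loss /= len(samples)
        if loss < threshold:
            break  # loss is below the preset value: training is finished
        for k in range(num_labels):
            for j in range(dim):
                weights[k][j] -= lr * grads[k][j] / len(samples)
    return weights, loss


def predict(weights, features):
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return scores.index(max(scores))


# Hypothetical labels: 0 = place name, 1 = date, 2 = song.
samples = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1), ([0.0, 0.0, 1.0], 2)]
weights, final_loss = train_semantic_model(samples, num_labels=3, threshold=0.05)
print(final_loss < 0.05, [predict(weights, f) for f, _ in samples])
```

The `max_epochs` cap is a safeguard added for the sketch so that the loop terminates even if the threshold is never reached.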
In this method, the keywords of each paragraph of the file to be audited are input into the trained semantic model to obtain the semantics corresponding to the keywords. These semantics are then matched against the semantics of the corresponding paragraph in the file specification to detect whether the description of each paragraph of the file to be audited semantically meets the requirements of the file specification.
Step three: if the description of each paragraph of the file to be audited semantically meets the requirements of the file specification, determine that the audit result is that the file to be audited passes the audit.
That is, if the semantics of the keywords obtained through the semantic model match the requirements of the file specification, for example with a matching degree greater than 90%, the file to be audited meets the requirements of the file specification and passes the audit.
Step four: if the description of at least one paragraph of the file to be audited does not semantically meet the requirements of the file specification, determine that the audit result is that the file to be audited does not pass the audit.
Correspondingly, if the semantics of the keywords obtained through the semantic model do not match the requirements of the file specification, the file to be audited does not meet the requirements of the file specification and does not pass the audit.
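Steps three and four amount to a threshold test on a matching degree. The text does not define how the matching degree is computed, so the sketch below assumes one simple possibility: the fraction of semantic categories required by the specification paragraph that also appear among the categories output for the audited paragraph. The function names and sample categories are hypothetical.

```python
def matching_degree(paragraph_semantics: set, spec_semantics: set) -> float:
    """Fraction of the specification's semantic categories found in the paragraph (assumed measure)."""
    if not spec_semantics:
        return 1.0
    return len(paragraph_semantics & spec_semantics) / len(spec_semantics)


def audit_descriptive(pairs, threshold=0.9) -> bool:
    """Pass only if every (paragraph, specification) pair exceeds the threshold."""
    return all(matching_degree(p, s) > threshold for p, s in pairs)


spec = {"place name", "date"}
print(audit_descriptive([({"place name", "date", "song"}, spec)]))  # degree 1.0
print(audit_descriptive([({"place name"}, spec)]))                  # degree 0.5
```

The 0.9 default mirrors the 90% example in the text; any other measure of semantic similarity could be substituted for `matching_degree`.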
In this embodiment of the application, when the label types of the paragraphs of the file to be audited match the label types of the paragraphs in the file specification, the specification type of each paragraph of the file to be audited is further examined to determine whether the file to be audited conforms to the file specification. The audit thus proceeds in stages, which makes the audit process clearer and can improve audit quality.
It should be noted that the specification type of a paragraph discussed above refers to a paragraph in the file to be audited. Because the file to be audited may include a plurality of paragraphs, each paragraph must be checked individually. For example, a paragraph may have the label type "rights" and the descriptive specification type; this pair is taken as the classification result of the paragraph and matched against the file specification to determine whether the paragraph meets the requirements of the file specification. After all paragraphs have been checked, the file to be audited passes the audit if every paragraph meets the requirements of the file specification, and does not pass if one or more paragraphs do not. Because the paragraphs of the file to be audited are classified and then matched against the paragraph types of the corresponding file specification, no manual audit is needed, which improves audit efficiency.
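The aggregation rule in this paragraph (the file passes only if every paragraph passes) can be written down directly. The `ParagraphResult` type and the `audit_file` function below are hypothetical names introduced for illustration; returning the indices of the failing paragraphs is an addition of the sketch, not something the text requires.

```python
from typing import List, NamedTuple, Tuple


class ParagraphResult(NamedTuple):
    label_type: str   # attribute of the paragraph, e.g. "rights"
    spec_type: str    # "quantitative" or "descriptive"
    meets_spec: bool  # outcome of the per-paragraph check


def audit_file(results: List[ParagraphResult]) -> Tuple[bool, List[int]]:
    """Return (passed, indices of the paragraphs that fail the specification)."""
    failing = [i for i, r in enumerate(results) if not r.meets_spec]
    return (not failing, failing)


results = [
    ParagraphResult("rights", "descriptive", True),
    ParagraphResult("investment", "quantitative", False),
]
print(audit_file(results))
```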
Further, corresponding to the method shown in fig. 1, based on the same technical concept, an embodiment of the present application further provides a document processing apparatus, as shown in fig. 2, the document processing apparatus may include: a determination module 201, an acquisition module 202, a classification module 203, and an audit module 204.
The determining module 201 is configured to obtain a file to be audited and determine the file type of the file to be audited; the obtaining module 202 is configured to obtain a file specification corresponding to the file type and a plurality of first paragraph types of the file specification; the classification module 203 is configured to classify the paragraphs of the file to be audited to obtain a plurality of second paragraph types of the file to be audited; the auditing module 204 is configured to determine an audit result of the file to be audited according to the plurality of first paragraph types and the plurality of second paragraph types.
In this embodiment, the determining module 201 first obtains a file to be audited. Because there are many file types and each file type has a different file specification, the file type of the file to be audited must be determined. The obtaining module 202 then obtains the file specification corresponding to that file type and a plurality of first paragraph types of the file specification, the classification module 203 classifies the paragraphs of the file to be audited to obtain a plurality of second paragraph types of the file to be audited, and finally the auditing module 204 determines the audit result of the file to be audited according to the plurality of first paragraph types and the plurality of second paragraph types. By classifying the paragraphs of the file to be audited and the paragraphs of the corresponding file specification, the second paragraph type of each paragraph in the file to be audited and the first paragraph type of each paragraph in the file specification can be determined. Because each paragraph of both the file to be audited and the file specification is assigned a paragraph type, it suffices to determine whether the paragraph type of each paragraph in the file to be audited matches a paragraph type in the file specification; word-by-word matching is not needed, which improves audit efficiency.
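The comparison performed by the auditing module, which replaces word-by-word matching, can be sketched as a membership test on paragraph types. The representation of a paragraph type as a (label type, specification type) pair, the exact-equality notion of "match", and the sample values are assumptions of this example.

```python
from typing import Iterable, List, Set, Tuple

ParagraphType = Tuple[str, str]  # (label type, specification type)


def unmatched_types(first_types: Set[ParagraphType],
                    second_types: Iterable[ParagraphType]) -> List[ParagraphType]:
    """Return the second paragraph types that have no match among the first paragraph types."""
    return [t for t in second_types if t not in first_types]


first_types = {("rights", "descriptive"), ("investment", "quantitative")}
second_types = [("rights", "descriptive"), ("liability", "descriptive")]

missing = unmatched_types(first_types, second_types)
print("types match" if not missing else f"audit not passed: {missing}")
```

An empty result only means that the paragraph types agree; the per-paragraph checks on the specification type would still follow.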
In one possible embodiment of the present application, the determining module 201 is configured to: extract file text data from the file to be audited; and determine the file type of the file to be audited according to characteristic field information in the file text data.
In one possible embodiment of the present application, the determining module 201 is configured to: obtain keywords and structural feature information from the file text data, where the structural feature information includes the title, first paragraph, middle paragraph and end paragraph of the file; and input the keywords and the structural feature information into the file classification model for classification to obtain the file type of the file to be audited.
In one possible implementation of the present application, the obtaining module 202 is configured to: classify the paragraphs of the file specification according to a preset method to obtain a plurality of first paragraph types of the file specification. The preset method includes: dividing a file into a plurality of paragraphs and obtaining the keywords in each paragraph; determining a label type and a specification type of each paragraph according to the keywords, where the label type characterizes the attribute of the paragraph and the specification types include a quantitative specification type and a descriptive specification type; and obtaining a plurality of paragraph types of the file according to the label type and the specification type of each paragraph.
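The preset method can be sketched as a small rule-based classifier. Everything specific below is an assumption for illustration: the `LABEL_KEYWORDS` table, splitting paragraphs on line breaks, and treating a paragraph as quantitative when it contains a digit. The text itself only says that both types are determined from the keywords.

```python
import re
from typing import List, Tuple

# Hypothetical mapping from label type to indicative keywords.
LABEL_KEYWORDS = {
    "rights": ["right", "entitled"],
    "obligations": ["must", "shall"],
    "investment": ["invest", "proportion"],
}


def split_paragraphs(text: str) -> List[str]:
    """Divide a file into paragraphs, assuming one paragraph per non-empty line."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def classify_paragraph(paragraph: str) -> Tuple[str, str]:
    """Return (label type, specification type) for one paragraph."""
    lower = paragraph.lower()
    label = next(
        (name for name, kws in LABEL_KEYWORDS.items() if any(k in lower for k in kws)),
        "other",
    )
    # Assumed heuristic: a paragraph containing a number is quantitative.
    spec_type = "quantitative" if re.search(r"\d", paragraph) else "descriptive"
    return (label, spec_type)


text = (
    "The proportion of the money invested in XX is lower than 10%\n"
    "Party A has the right to terminate the agreement"
)
paragraph_types = [classify_paragraph(p) for p in split_paragraphs(text)]
print(paragraph_types)
```

The same function can be applied to the file specification to obtain the first paragraph types and to the file to be audited to obtain the second paragraph types, so that the two sets are directly comparable.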
In one possible embodiment of the present application, the auditing module 204 is configured to: if at least one of the plurality of second paragraph types does not match the plurality of first paragraph types, determine that the audit result is that the file to be audited does not pass the audit.
In one possible embodiment of the present application, the auditing module 204 is configured to: if the plurality of second paragraph types match the plurality of first paragraph types, determine the specification type of each paragraph of the file to be audited; if the specification type of at least one paragraph of the file to be audited is the quantitative specification type, obtain the entities and entity relationships in each paragraph of the file to be audited; if the entities and entity relationships in each paragraph of the file to be audited match the entities and entity relationships in the file specification, determine whether any paragraph of the file to be audited has the descriptive specification type; and if the entities and entity relationships in at least one paragraph of the file to be audited do not match the entities and entity relationships in the file specification, determine that the audit result is that the file to be audited does not pass the audit.
In one possible embodiment of the present application, the auditing module 204 is configured to: if the specification type of at least one paragraph of the file to be audited is the descriptive specification type, obtain the keywords of each paragraph of the file to be audited; input the keywords of each paragraph of the file to be audited into a pre-obtained semantic model for semantic matching, and detect whether the description of each paragraph of the file to be audited semantically meets the requirements of the file specification; if the description of each paragraph of the file to be audited semantically meets the requirements of the file specification, determine that the audit result is that the file to be audited passes the audit; and if the description of at least one paragraph of the file to be audited does not semantically meet the requirements of the file specification, determine that the audit result is that the file to be audited does not pass the audit.
The document processing apparatus in the embodiment of the present application may be an apparatus, and may also be a component, an integrated circuit, or a chip in a server.
The file processing apparatus provided in this embodiment of the present application can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Optionally, as shown in fig. 3, an embodiment of the present application further provides a server 300, which includes a processor 301, a memory 302, and a program or an instruction stored in the memory 302 and capable of running on the processor 301, where the program or the instruction is executed by the processor 301 to implement each process of the embodiment of the file processing method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiment of the present application further provides a readable storage medium on which a program or an instruction is stored. When executed by a processor, the program or instruction implements the processes of the file processing method provided in any one of the above embodiments and achieves the same technical effect; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the embodiment of the file processing method provided in the foregoing embodiment, and the same technical effect can be achieved, and in order to avoid repetition, details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-on-chip or the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A file processing method, comprising:
acquiring a file to be audited, and determining the file type of the file to be audited;
acquiring a file specification corresponding to the file type and a plurality of first paragraph types of the file specification;
classifying the paragraphs of the file to be audited to obtain a plurality of second paragraph types of the file to be audited;
and determining the auditing result of the file to be audited according to the plurality of first paragraph types and the plurality of second paragraph types.
2. The method according to claim 1, wherein the determining the file type of the file to be audited includes:
extracting file text data in the file to be audited;
and determining the file type of the file to be audited according to the characteristic field information in the file text data.
3. The method according to claim 2, wherein the determining the file type of the file to be audited according to the characteristic field information in the file text data comprises:
acquiring keywords and structural feature information in the text data of the file, wherein the structural feature information comprises a title, a first paragraph, a middle paragraph and an end paragraph of the file;
and inputting the keywords and the structural feature information into a file classification model, and performing classification processing to obtain the file type of the file to be audited.
4. The method of claim 1, wherein obtaining a plurality of first paragraph types of the file specification comprises:
classifying the paragraphs of the file specification according to a preset method to obtain a plurality of first paragraph types of the file specification;
the preset method comprises the following steps: dividing a file into a plurality of paragraphs, and acquiring keywords in each paragraph; determining a label type and a specification type of each paragraph according to the keywords, wherein the label type is used for characterizing the attribute of the paragraph, and the specification types comprise a quantitative specification type and a descriptive specification type; and obtaining a plurality of paragraph types of the file according to the label type and the specification type of each paragraph.
5. The method according to claim 4, wherein the determining the auditing result of the file to be audited according to the plurality of first paragraph types and the plurality of second paragraph types includes:
and if at least one of the second paragraph types is not matched with the first paragraph types, determining that the audit result is that the audit of the file to be audited is not passed.
6. The method of claim 5, further comprising:
if the plurality of second paragraph types are matched with the plurality of first paragraph types, judging the specification type of each paragraph of the file to be audited;
if the specification type of at least one paragraph of the file to be audited is a quantitative specification type, acquiring an entity and an entity relationship in each paragraph of the file to be audited;
if the entity and the entity relationship in each paragraph of the file to be audited are matched with the entity and the entity relationship in the file specification, judging whether a paragraph of which the specification type is a descriptive specification type exists in the paragraphs of the file to be audited;
and if the entity and entity relationship in at least one paragraph in the file to be audited does not match the entity and entity relationship in the file specification, determining that the audit result is that the file to be audited is not approved.
7. The method of claim 6, further comprising:
if the specification type of at least one paragraph of the file to be audited is a descriptive specification type, acquiring a keyword of each paragraph of the file to be audited;
inputting the keywords of each paragraph of the file to be audited into a pre-acquired semantic model for semantic matching, and detecting whether the description of each paragraph of the file to be audited semantically meets the requirement of the file specification;
if the description of each paragraph of the file to be audited is semantically in accordance with the requirement of the file specification, determining that the audit result is that the file to be audited passes the audit;
and if the description of at least one paragraph in the file to be audited does not meet the requirement of the file specification semantically, determining that the audit result is that the file to be audited is not approved.
8. A document processing apparatus, characterized in that the apparatus comprises:
the determining module is used for acquiring a file to be audited and determining the file type of the file to be audited;
the acquisition module is used for acquiring a file specification corresponding to the file type and a plurality of first paragraph types of the file specification;
the classification module is used for classifying the paragraphs of the file to be audited to obtain a plurality of second paragraph types of the file to be audited;
and the auditing module is used for determining the auditing result of the file to be audited according to the plurality of first paragraph types and the plurality of second paragraph types.
9. A server, characterized in that the server comprises a processor, a memory and a program or instructions stored on the memory and executable on the processor, which program or instructions, when executed by the processor, implement the steps of the method according to any of claims 1-7.
10. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210709132.6A CN115098629A (en) 2022-06-22 2022-06-22 File processing method and device, server and readable storage medium

Publications (1)

Publication Number Publication Date
CN115098629A true CN115098629A (en) 2022-09-23

Family

ID=83292707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210709132.6A Pending CN115098629A (en) 2022-06-22 2022-06-22 File processing method and device, server and readable storage medium

Country Status (1)

Country Link
CN (1) CN115098629A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination