CN113505579A - Document processing method and device, electronic device and storage medium - Google Patents

Document processing method and device, electronic device and storage medium

Info

Publication number
CN113505579A
Authority
CN
China
Prior art keywords
document
sample
type
similarity
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110622442.XA
Other languages
Chinese (zh)
Inventor
胡万丰 (Hu Wanfeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110622442.XA
Publication of CN113505579A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent


Abstract

The disclosure provides a document processing method, a document processing apparatus, an electronic device and a storage medium. The method comprises: acquiring a first document, and predicting a target document type of the first document according to the title of the first document; determining a degree of repetition between the first document and a second document, the second document being any document of the target document type other than the first document; and merging the first document and the second document if the degree of repetition between them is greater than or equal to a first preset threshold. By merging repeated documents within the target document type, the merged document becomes more comprehensive and concentrated, the number of repeated documents of that type is reduced, users' query efficiency is improved, and users are helped to obtain the required content more quickly and accurately. In addition, because the target document type of the first document is predicted from its title, the degree of repetition only needs to be calculated between the first document and documents of the target document type, which saves computing resources and reduces computation cost.

Description

Document processing method and device, electronic device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a document processing method and apparatus, an electronic device, and a storage medium.
Background
In a large team, knowledge documents created by different individuals or teams may be stored together, for example on a file server, for others to review and use. For some common problems, multiple users may have written similar knowledge documents, creating redundant information.
In the related art, when a user searches for a document by keyword, the returned results are generally documents whose title or body contains the keyword. Searching in this way suffers from low accuracy: documents related to the searched content may be missed because their keywords do not match, a returned document may contain none or only part of the required information, or a returned document may contain redundant information.
Disclosure of Invention
The disclosure provides a document processing method, a document processing apparatus, an electronic device and a storage medium, which at least solve the problem of low accuracy of information search in the related art. The technical solution of the disclosure is as follows:
according to a first aspect of the present disclosure, there is provided a document processing method, the method comprising:
acquiring a first document;
predicting a target document type of the first document according to the title of the first document;
determining a degree of repetition between the first document and a second document, wherein the second document is any document of the target document type other than the first document;
and if the degree of repetition between the first document and the second document is greater than or equal to a first preset threshold, merging the first document and the second document.
In an alternative implementation manner, the step of predicting the target document type of the first document according to the title of the first document includes:
inputting the title of the first document to a document classification model obtained by pre-training to obtain the probability that the first document belongs to each document type, wherein the document classification model is a multi-classification model and is used for obtaining the probability that the input document belongs to each document type according to the title of the input document;
and selecting a target document type of the first document from the document types according to the probability that the first document belongs to each document type.
In an optional implementation manner, before the step of inputting the title of the first document to a document classification model obtained through pre-training and obtaining the probability that the first document belongs to each document type, the method further includes:
obtaining a first sample set, wherein the first sample set comprises a plurality of first sample documents, and the document type of each first sample document is an annotated document type;
inputting the title of the first sample document to a first neural network model, and outputting the predicted document type of the first sample document;
calculating a first loss value of the first sample document according to the annotated document type and the predicted document type;
and optimizing the model parameters of the first neural network model by taking the minimum first loss value as a training target, wherein the document classification model is the first neural network model for completing model parameter optimization.
In an optional implementation manner, the step of selecting a target document type of the first document from the document types according to the probability that the first document belongs to each document type includes:
selecting, from all document types, a document type meeting a preset condition as a candidate document type, wherein the preset condition is that the document type has not yet been determined as a candidate document type and has the highest probability among the remaining document types;
calculating the similarity between the first document and a third document, wherein the third document is any one of the documents of the candidate document type;
if the similarity is larger than or equal to a second preset threshold, determining the candidate document type as the target document type;
if the similarity is smaller than the second preset threshold, repeatedly executing the steps of selecting a document type meeting the preset condition as a candidate document type and calculating the similarity between the first document and the third document, until the recalculated similarity is greater than or equal to the second preset threshold.
In an optional implementation manner, the step of calculating the similarity between the first document and the third document includes:
and inputting the content of the first document and the content of the third document to a similarity calculation model obtained by pre-training to obtain the similarity between the first document and the third document, wherein the similarity calculation model is used for calculating the similarity between the two input documents according to the document contents of the two input documents.
In an optional implementation manner, before the step of inputting the content of the first document and the content of the third document to a similarity calculation model obtained through pre-training, the method further includes:
obtaining a second sample set, wherein the second sample set comprises a plurality of first sample document pairs, each first sample document pair has an annotated similarity, the first sample document pair comprises a second sample document and a third sample document, and the annotated similarity of the first sample document pair is determined based on the similarity between the second sample document and the third sample document;
inputting the content of the second sample document and the content of the third sample document into a second neural network model to obtain the feature information of the second sample document and the feature information of the third sample document;
inputting the feature information of the second sample document and the feature information of the third sample document into a third neural network model to obtain the predicted similarity between the second sample document and the third sample document;
calculating a second loss value of the first sample document pair according to the predicted similarity and the annotated similarity;
and optimizing the model parameters of the second neural network model and the third neural network model by taking minimization of the second loss value as the training target, wherein the similarity calculation model comprises the second neural network model and the third neural network model after model parameter optimization is completed.
In an optional implementation manner, the step of calculating the similarity between the first document and the third document includes:
when the first document comprises a first original text and a first hyperlink text, calculating a first similarity between the content of the first original text and the content of the third document and a second similarity between the content associated with the first hyperlink text and the content of the third document, and taking the average of the first similarity and the second similarity as the similarity between the first document and the third document;
when the third document comprises a second original text and a second hyperlink text, calculating a third similarity between the content of the second original text and the content of the first document and a fourth similarity between the content associated with the second hyperlink text and the content of the first document, and taking the average of the third similarity and the fourth similarity as the similarity between the first document and the third document.
In an optional implementation manner, the step of determining the degree of duplication between the first document and the second document includes:
and inputting the content of the first document and the content of the second document into a repetition degree calculation model obtained by pre-training to obtain the repetition degree between the first document and the second document, wherein the repetition degree calculation model is used for calculating the repetition degree between the two input documents according to the document contents of the two input documents.
In an optional implementation manner, before the step of inputting the content of the first document and the content of the second document to a pre-trained repetition degree calculation model, the method further includes:
obtaining a third sample set, wherein the third sample set comprises a plurality of second sample document pairs, each second sample document pair has an annotated repetition degree, the second sample document pair comprises a fourth sample document and a fifth sample document, and the annotated repetition degree of the second sample document pair is determined based on the degree of repetition between the fourth sample document and the fifth sample document;
inputting the content of the fourth sample document and the content of the fifth sample document into a fourth neural network model to obtain the feature information of the fourth sample document and the feature information of the fifth sample document;
inputting the feature information of the fourth sample document and the feature information of the fifth sample document into a fifth neural network model to obtain the predicted repetition degree between the fourth sample document and the fifth sample document;
calculating a third loss value of the second sample document pair according to the predicted repetition degree and the annotated repetition degree;
and optimizing the model parameters of the fourth neural network model and the fifth neural network model by taking minimization of the third loss value as the training target, wherein the repetition degree calculation model comprises the fourth neural network model and the fifth neural network model after model parameter optimization is completed.
In an optional implementation manner, the step of merging the first document and the second document includes:
merging the first document and the second document to obtain a merged document;
and carrying out deduplication processing on the merged document.
According to a second aspect of the present disclosure, there is provided a document processing apparatus, the apparatus comprising:
a document acquisition module configured to acquire a first document;
a type prediction module configured to predict a target document type of the first document according to a title of the first document;
a duplication degree calculation module configured to determine a duplication degree between the first document and a second document, the second document being any one of the documents of the target document type except the first document;
the document merging module is configured to merge the first document and the second document if the degree of repetition between the first document and the second document is greater than or equal to a first preset threshold value.
In an alternative implementation, the type prediction module is specifically configured to:
inputting the title of the first document to a document classification model obtained by pre-training to obtain the probability that the first document belongs to each document type, wherein the document classification model is a multi-classification model and is used for obtaining the probability that the input document belongs to each document type according to the title of the input document;
and selecting a target document type of the first document from the document types according to the probability that the first document belongs to each document type.
In an alternative implementation, the apparatus further includes a first model training module configured to:
obtaining a first sample set, wherein the first sample set comprises a plurality of first sample documents, and the document type of each first sample document is an annotated document type;
inputting the title of the first sample document to a first neural network model, and outputting the predicted document type of the first sample document;
calculating a first loss value of the first sample document according to the annotated document type and the predicted document type;
and optimizing the model parameters of the first neural network model by taking the minimum first loss value as a training target, wherein the document classification model is the first neural network model for completing model parameter optimization.
In an alternative implementation, the type prediction module is specifically configured to:
selecting, from all document types, a document type meeting a preset condition as a candidate document type, wherein the preset condition is that the document type has not yet been determined as a candidate document type and has the highest probability among the remaining document types;
calculating the similarity between the first document and a third document, wherein the third document is any one of the documents of the candidate document type;
if the similarity is larger than or equal to a second preset threshold, determining the candidate document type as the target document type;
if the similarity is smaller than the second preset threshold, repeatedly executing the steps of selecting a document type meeting the preset condition as a candidate document type and calculating the similarity between the first document and the third document, until the recalculated similarity is greater than or equal to the second preset threshold.
In an alternative implementation, the type prediction module is specifically configured to:
and inputting the content of the first document and the content of the third document to a similarity calculation model obtained by pre-training to obtain the similarity between the first document and the third document, wherein the similarity calculation model is used for calculating the similarity between the two input documents according to the document contents of the two input documents.
In an optional implementation, the apparatus further comprises a second model training module configured to:
obtaining a second sample set, wherein the second sample set comprises a plurality of first sample document pairs, each first sample document pair has an annotated similarity, the first sample document pair comprises a second sample document and a third sample document, and the annotated similarity of the first sample document pair is determined based on the similarity between the second sample document and the third sample document;
inputting the content of the second sample document and the content of the third sample document into a second neural network model to obtain the feature information of the second sample document and the feature information of the third sample document;
inputting the feature information of the second sample document and the feature information of the third sample document into a third neural network model to obtain the predicted similarity between the second sample document and the third sample document;
calculating a second loss value of the first sample document pair according to the predicted similarity and the annotated similarity;
and optimizing the model parameters of the second neural network model and the third neural network model by taking minimization of the second loss value as the training target, wherein the similarity calculation model comprises the second neural network model and the third neural network model after model parameter optimization is completed.
In an alternative implementation, the type prediction module is specifically configured to:
when the first document comprises first original text and first hyperlink text, calculating a first similarity between the content of the first original text and the content of the third document and a second similarity between the content associated with the first hyperlink text and the content of the third document; taking an average value of the first similarity and the second similarity as a similarity between the first document and a third document;
when the third document comprises a second original text and a second hyperlink text, calculating a third similarity between the content of the second original text and the content of the first document and a fourth similarity between the content associated with the second hyperlink text and the content of the first document; and taking the average value of the third similarity and the fourth similarity as the similarity between the first document and the third document.
In an alternative implementation, the repetition calculation module is specifically configured to:
and inputting the content of the first document and the content of the second document into a repetition degree calculation model obtained by pre-training to obtain the repetition degree between the first document and the second document, wherein the repetition degree calculation model is used for calculating the repetition degree between the two input documents according to the document contents of the two input documents.
In an optional implementation, the apparatus further comprises a third model training module configured to:
obtaining a third sample set, wherein the third sample set comprises a plurality of second sample document pairs, each second sample document pair has an annotated repetition degree, the second sample document pair comprises a fourth sample document and a fifth sample document, and the annotated repetition degree of the second sample document pair is determined based on the degree of repetition between the fourth sample document and the fifth sample document;
inputting the content of the fourth sample document and the content of the fifth sample document into a fourth neural network model to obtain the feature information of the fourth sample document and the feature information of the fifth sample document;
inputting the feature information of the fourth sample document and the feature information of the fifth sample document into a fifth neural network model to obtain the predicted repetition degree between the fourth sample document and the fifth sample document;
calculating a third loss value of the second sample document pair according to the predicted repetition degree and the annotated repetition degree;
and optimizing the model parameters of the fourth neural network model and the fifth neural network model by taking minimization of the third loss value as the training target, wherein the repetition degree calculation model comprises the fourth neural network model and the fifth neural network model after model parameter optimization is completed.
In an optional implementation, the document merging module is specifically configured to:
merging the first document and the second document to obtain a merged document;
and carrying out deduplication processing on the merged document.
According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the document processing method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the document processing method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product, wherein the instructions of the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the document processing method according to the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the technical scheme of the disclosure provides a document processing method, a document processing device, electronic equipment and a storage medium, wherein a first document is obtained firstly, and a target document type of the first document is predicted according to a title of the first document; then determining the repetition degree between the first document and a second document, wherein the second document is any one of the documents of the target document type except the first document; and if the repetition degree between the first document and the second document is greater than or equal to a first preset threshold value, combining the first document and the second document. According to the technical scheme, the repeated documents in the target document type are combined, so that the content of the combined documents is more comprehensive and concentrated, the number of the repeated documents in the target document type is reduced, the query efficiency of a user can be improved, the query result is more comprehensive and accurate, and the user is helped to obtain the required content more quickly and accurately. In addition, firstly, the target document type of the first document is predicted according to the title of the first document, and then only the repetition degree between the first document and the document of the target document type needs to be calculated.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flowchart illustrating a document processing method according to an exemplary embodiment.
FIG. 2 is a flowchart illustrating a method of obtaining a document classification model according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating determination of a target document type according to an exemplary embodiment.
FIG. 4 is a flow diagram illustrating a method for obtaining a similarity calculation model according to an exemplary embodiment.
FIG. 5 is a flow diagram illustrating a method for obtaining a repetitiveness computation model in accordance with an exemplary embodiment.
FIG. 6 is a block diagram illustrating the structure of a document processing apparatus according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a document processing method according to an exemplary embodiment, and an execution subject of the embodiment may be an electronic device such as a server.
As shown in fig. 1, the method may include the following steps.
In step S11, a first document is acquired.
Documents in this embodiment may take various presentation formats, for example plain text, diagrams, dynamic web pages, tables, or slides. A document may include a document title and document content.
In step S12, the target document type of the first document is predicted from the title of the first document.
In specific implementations, the document type of the first document may be predicted from the document title in multiple ways. In an optional implementation, the title of the first document is input to a pre-trained document classification model to obtain the probability that the first document belongs to each document type; the document classification model is a multi-class model that derives, from the title of an input document, the probability that the input document belongs to each document type. The target document type of the first document is then selected from the document types according to these probabilities.
The document classification model can be trained with an unsupervised learning method, which is low in cost, or with a supervised learning method, for example by training a neural network model on the titles of sample documents and their annotated document types. Supervised learning yields a document classification model with higher accuracy. The following embodiments detail the process of training the document classification model with supervised learning.
The document classification model can, for example, be obtained by deep learning model training. A deep learning model extracts text features more accurately: beyond word counts, it can also capture the spatial and temporal (position and order) features of words in the text, which improves the accuracy of document classification.
There may be multiple ways to select the target document type of the first document according to the probability that it belongs to each document type. In one alternative implementation, the document type with the highest predicted probability is determined as the target document type. In another, the document type with the highest predicted probability is first determined as a candidate document type, the similarity between the first document and other documents of the candidate type is calculated, and the candidate type is determined as the target document type only when that similarity satisfies a condition. The latter implementation is described in detail in the following embodiments.
In this embodiment, the document type of the first document is determined from the document title, coarsely classifying documents before the degree of repetition between document pairs is calculated to judge whether they are duplicates. The degree of repetition then only needs to be calculated between documents of the same document type rather than across different types, which reduces the number of documents involved in the calculation, improves calculation efficiency, saves computing resources, and reduces computation cost.
In step S13, the degree of duplication between the first document and the second document is determined, the second document being any one of the documents of the target document type other than the first document.
The first document and the second document are of the same document type, namely the target document type; the second document is any document of that type other than the first document. Calculating the degree of repetition only between documents of the same document type reduces the amount of computation and improves efficiency.
In an optional implementation manner, the content of the first document and the content of the second document may be input to a repetition degree calculation model obtained by pre-training, so as to obtain the repetition degree between the first document and the second document, where the repetition degree calculation model is used to calculate the repetition degree between two input documents according to the document contents of the two input documents.
The repetition degree calculation model can be obtained by training a neural network model based on the document contents of pairs of sample documents and their annotated repetition degrees; the training process is described in detail in the following embodiments. In this implementation, the repetition degree calculation model may, for example, be obtained by deep learning model training. Such a model extracts text features more accurately: beyond word counts, it can also capture the spatial and temporal (position and order) features of words in the text, which improves the accuracy of the repetition measure.
In step S14, if the degree of duplication between the first document and the second document is greater than or equal to a first preset threshold, the first document and the second document are merged.
In a specific implementation, when the degree of repetition between the first document and the second document is greater than or equal to the first preset threshold, the two documents can be determined to be duplicates and merged. This reduces repeated documents within the target document type and improves users' query efficiency.
In an optional implementation manner, the step S14 may further include: merging the first document and the second document to obtain a merged document; and carrying out deduplication processing on the merged document.
In this implementation, deduplicating the merged document reduces the space it occupies, saves storage resources, and lets users find the required content more efficiently when searching for information.
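As a minimal illustration of steps S13 and S14, the following Python sketch assumes a trained `repetition_model` that scores a document pair, as described above, and represents documents as lists of paragraphs; both names and the paragraph-level deduplication are illustrative assumptions, since the embodiment does not fix the granularity of the deduplication processing.

```python
def merge_if_duplicate(first_doc, second_doc, threshold):
    # Degree of repetition from the (assumed) pre-trained model (step S13).
    degree = repetition_model(first_doc, second_doc)
    if degree < threshold:          # below the first preset threshold
        return None                 # not duplicates; keep documents separate

    # Merge the two documents and deduplicate repeated paragraphs (step S14),
    # preserving the original reading order.
    merged, seen = [], set()
    for paragraph in first_doc + second_doc:
        if paragraph not in seen:
            seen.add(paragraph)
            merged.append(paragraph)
    return merged
```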
According to the document processing method provided by this embodiment of the disclosure, merging the repeated documents within the target document type makes the merged document more comprehensive and concentrated, reduces the repeated documents within each document type, improves users' query efficiency, makes query results more comprehensive and accurate, and helps users obtain the required content more quickly and accurately. In addition, compared with a scheme that calculates the degree of repetition between the first document and all documents at once, this embodiment first predicts the target document type of the first document and then calculates the degree of repetition only against documents of that type, saving computing resources and reducing computation cost.
In order to obtain the document classification model, in an alternative implementation, referring to fig. 2, before the step of inputting the title of the first document into the pre-trained document classification model, steps S21 to S24 may be further included.
Step S21, a first sample set is obtained, where the first sample set includes a plurality of first sample documents, and the document type of each first sample document is an annotated document type.
The document types of the first sample documents may be selected according to factors such as usage scenario, service direction, and function, which this embodiment does not limit. It should be noted that, when selecting sample documents, the number of sample documents for each document type should be kept as uniform as possible, which reduces the bias introduced when training the model.
Step S22, inputting the title of the first sample document to the first neural network model, and outputting the predicted document type of the first sample document.
Step S23, calculating a first loss value of the first sample document according to the annotated document type and the predicted document type.
Step S24, optimizing the model parameters of the first neural network model by taking minimization of the first loss value as the training target, wherein the document classification model is the first neural network model after model parameter optimization is completed.
In specific implementation, iterative optimization can be performed on model parameters of the first neural network model through a back propagation algorithm with the goal of minimizing the first loss value, and finally a document classification model is obtained.
The first neural network model may be a Long Short-Term Memory (LSTM) model, a deep learning model for processing sequence data. Because document titles contain relatively few words, the LSTM model can achieve high prediction accuracy.
The first neural network model may also be a Gated Recurrent Unit (GRU) model or BERT (Bidirectional Encoder Representations from Transformers), among others. The GRU model is low in cost, while the BERT model offers higher accuracy. It should be noted that the first neural network model is not limited to the above models; this embodiment does not restrict the choice.
In a specific implementation, the first neural network model is trained with a supervised learning method. After training is finished, the resulting document classification model can be applied to test samples to calculate the model's accuracy; if the accuracy is insufficient, the sample document types in the first sample set can be adjusted, or the hyperparameters of the first neural network model can be adjusted, and the first neural network model retrained.
Because the document classification model in this implementation is obtained with a supervised learning method, its accuracy is higher. Moreover, since it is obtained by deep learning model training, it can extract text features more accurately: beyond word counts, it can also capture the spatial and temporal (position and order) features of words in the text, further improving the accuracy of the document classification model.
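The following PyTorch sketch illustrates steps S21 to S24 under stated assumptions: the vocabulary size, embedding width, number of document types, and the toy batch are all illustrative, and the tokenization of titles is left out. The embodiment itself only requires an LSTM-style classifier trained by backpropagation to minimize the first loss value.

```python
import torch
import torch.nn as nn

class TitleClassifier(nn.Module):
    """First neural network model: LSTM over title tokens (assumed sizes)."""
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=256, num_types=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_types)

    def forward(self, token_ids):              # (batch, seq_len)
        emb = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)           # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])              # logits over document types

model = TitleClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a toy batch of tokenized titles with annotated types.
titles = torch.randint(1, 30000, (32, 20))     # 32 titles, 20 tokens each
labels = torch.randint(0, 10, (32,))           # annotated document types
logits = model(titles)                         # predicted document types (S22)
loss = loss_fn(logits, labels)                 # first loss value (S23)
optimizer.zero_grad()
loss.backward()                                # backpropagation, minimizing loss (S24)
optimizer.step()
```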
In an alternative implementation, referring to fig. 3, the step of selecting the target document type of the first document from the document types according to the probability that the first document belongs to each document type may include steps S31 to S34.
In step S31, a document type that satisfies a preset condition is selected from the document types as a candidate document type, where the preset condition is that the document type is not determined as the candidate document type and has the highest probability.
In a specific implementation, the document types may be determined as candidate document types one by one, in descending order of the probability that the first document belongs to each type, until the target document type is determined.
In step S32, the similarity between the first document and a third document is calculated, the third document being any one of the documents of the candidate document type.
Calculating the similarity between the first document and other documents of the candidate document type verifies whether the first document belongs to that type, and thus whether the candidate document type can serve as the target document type. Only the similarity between the first document and documents of the candidate type needs to be calculated; compared with a scheme that calculates the similarity between the first document and all documents at once, this improves calculation efficiency, saves computing resources, and reduces computation cost.
In an alternative implementation, the content of the first document and the content of the third document may be input to a pre-trained similarity calculation model to obtain the similarity between the first document and the third document; the similarity calculation model calculates the similarity between two input documents according to their document contents.
The similarity calculation model can be obtained by training a neural network model based on the document contents of pairs of sample documents and their annotated similarities; the training process is described in detail in the following embodiments. The similarity calculation model may, for example, be obtained by deep learning model training. Such a model extracts text features more accurately: beyond word counts, it can also capture the spatial and temporal (position and order) features of words in the text, which improves the accuracy of the similarity measure.
In a specific implementation, when the first document or the third document contains hyperlink text, a PageRank algorithm may be adopted in determining the similarity between the first document and the third document, which can improve the accuracy of the similarity calculation.
In an alternative implementation, when the first document comprises a first original text and a first hyperlink text, a first similarity between the content of the first original text and the content of the third document and a second similarity between the content associated with the first hyperlink text and the content of the third document are calculated, and the average of the first similarity and the second similarity is taken as the similarity between the first document and the third document.
Likewise, when the third document comprises a second original text and a second hyperlink text, a third similarity between the content of the second original text and the content of the first document and a fourth similarity between the content associated with the second hyperlink text and the content of the first document are calculated, and the average of the third similarity and the fourth similarity is taken as the similarity between the first document and the third document.
It should be noted that, when there are multiple third documents, the similarity between the first document and the candidate document type may be taken as the average of the similarities between the first document and each third document.
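A sketch of the hyperlink handling just described: documents are assumed to be dictionaries with a `text` field and a `links` list, `base_similarity` stands in for the trained similarity calculation model, and `resolve(link)` for fetching the content a hyperlink points to; all are assumptions. The generalization to several hyperlinks and the symmetric combination of the two sides are also illustrative choices, since the embodiment only specifies averaging the original-text similarity with the linked-content similarity.

```python
def doc_similarity(first_doc, third_doc):
    def one_sided(doc, other_text):
        # Original-text similarity, plus one similarity per hyperlink target,
        # averaged together (reduces to the plain similarity when no links).
        sims = [base_similarity(doc["text"], other_text)]
        for link in doc.get("links", []):
            sims.append(base_similarity(resolve(link), other_text))
        return sum(sims) / len(sims)

    # Apply the rule from each document's side and combine symmetrically.
    return (one_sided(first_doc, third_doc["text"])
            + one_sided(third_doc, first_doc["text"])) / 2
```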
In step S33, if the similarity is greater than or equal to the second preset threshold, the candidate document type is determined as the target document type.
When the similarity between the first document and the third document is greater than or equal to the second preset threshold, the first document belongs to the currently determined candidate document type, and that candidate document type can be taken as the target document type.
In step S34, if the similarity is smaller than the second preset threshold, the step of selecting a document type satisfying the preset condition from the document types as a candidate document type and the step of calculating the similarity between the first document and the third document are repeatedly performed until the recalculated similarity is greater than or equal to the second preset threshold.
In a specific implementation, the document type with the highest probability is first selected as the candidate document type. If the similarity between the first document and a third document of that candidate type is smaller than the second preset threshold, the document type with the second-highest probability becomes the newly determined candidate document type, and so on, until the recalculated similarity is greater than or equal to the second preset threshold, at which point the current candidate document type is taken as the target document type.
It should be noted that if the similarity between the first document and the third documents of every document type is smaller than the second preset threshold, the first document may be classified into an "other" document type, and its target document type is then that "other" type.
In this implementation, the target document type is determined from the probability that the first document belongs to each document type, while also requiring that the similarity between the first document and other documents of that type be greater than or equal to the second preset threshold. This improves the accuracy of the target document type and prevents the first document from being classified into an incorrect type. The candidate document type is first determined from the probabilities, and the similarity between the first document and other documents of the candidate type then verifies whether it can serve as the target type. Because only similarities against documents of the candidate type need to be calculated, rather than against all documents at once, calculation efficiency improves, computing resources are saved, and computation cost is reduced.
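The candidate-selection loop of steps S31 to S34 can be sketched as follows. Here `probs` is assumed to map each document type to the probability output by the document classification model, `docs_by_type` groups existing documents by type, and `similarity(...)` is a placeholder for the similarity calculation model (for example, the hyperlink-aware sketch above); the averaging over the candidate type's documents and the "other" fallback follow the descriptions above.

```python
def select_target_type(first_doc, probs, docs_by_type, threshold):
    # Try document types in descending order of predicted probability (S31).
    for type_id in sorted(probs, key=probs.get, reverse=True):
        third_docs = [d for d in docs_by_type.get(type_id, [])
                      if d is not first_doc]
        if not third_docs:
            continue
        # Average similarity against the candidate type's documents (S32).
        sims = [similarity(first_doc, d) for d in third_docs]
        if sum(sims) / len(sims) >= threshold:   # S33: candidate verified
            return type_id
    return "other"                               # no type verified (fallback)
```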
In order to obtain the similarity calculation model, in an alternative implementation manner, referring to fig. 4, the method may specifically include steps S41 to S45.
Step S41, a second sample set is obtained, where the second sample set includes a plurality of first sample document pairs, each first sample document pair has an annotated similarity, the first sample document pair includes a second sample document and a third sample document, and the annotated similarity of the first sample document pair is determined based on the similarity between the second sample document and the third sample document.
The annotated similarity may be a similarity value between the second sample document and the third sample document contained in the first sample document pair, or a tag derived from that value indicating whether the second sample document is similar to the third sample document.
Step S42, inputting the content of the second sample document and the content of the third sample document into the second neural network model, and obtaining the feature information of the second sample document and the feature information of the third sample document.
The second neural network model may be a Convolutional Neural Network (CNN) model, for example a 1-dimensional CNN model; this embodiment does not limit the choice of model.
Step S43, inputting the feature information of the second sample document and the feature information of the third sample document into a third neural network model, and obtaining a predicted similarity between the second sample document and the third sample document.
Step S44, calculating a second loss value of the first sample document pair according to the predicted similarity and the annotated similarity.
Step S45, optimizing the model parameters of the second neural network model and the third neural network model by taking minimization of the second loss value as the training target, wherein the similarity calculation model comprises the second neural network model and the third neural network model after model parameter optimization is completed.
In a specific implementation, the model parameters of the second neural network model and the third neural network model are iteratively optimized through a backpropagation algorithm with the goal of minimizing the second loss value, finally yielding the similarity calculation model.
For less specialized document content, the third neural network model may be a BERT model or the like; for highly specialized document content, it may be an LSTM model.
Because the similarity calculation model in this implementation is obtained by training with a supervised learning method, the similarity it computes is more accurate. In addition, the similarity calculation model is obtained by deep learning model training; a deep learning model extracts text features more accurately, and beyond word counts it can also capture the spatial and temporal (position and order) features of words in the text, improving the accuracy of the similarity measure.
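The two-stage architecture of steps S41 to S45 might look as follows in PyTorch: a shared 1-dimensional CNN encoder playing the role of the second neural network model, and a small scoring head playing the role of the third neural network model. All dimensions, the max-pooling, and the sigmoid output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DocEncoder(nn.Module):
    """Second neural network model: 1-D CNN over document tokens (assumed sizes)."""
    def __init__(self, vocab_size=30000, embed_dim=128, feat_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))
        return x.max(dim=2).values                 # feature information (S42)

class SimilarityHead(nn.Module):
    """Third neural network model: scores a pair of document features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, 1)

    def forward(self, feat_a, feat_b):
        # Predicted similarity in [0, 1] from the concatenated pair features (S43).
        return torch.sigmoid(self.fc(torch.cat([feat_a, feat_b], dim=1)))
```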
In order to obtain the repeatability calculation model, in an alternative implementation, referring to fig. 5, the method may specifically include steps S51 to S55.
Step S51, a third sample set is obtained, where the third sample set includes a plurality of second sample document pairs, each second sample document pair has an annotated repetition degree, the second sample document pair includes a fourth sample document and a fifth sample document, and the annotated repetition degree of the second sample document pair is determined based on the degree of repetition between the fourth sample document and the fifth sample document.
The annotated repetition degree may be a repetition degree value between the fourth sample document and the fifth sample document contained in the second sample document pair, or a tag derived from that value indicating whether the fourth sample document and the fifth sample document are duplicates.
Step S52, inputting the content of the fourth sample document and the content of the fifth sample document into the fourth neural network model, and obtaining the feature information of the fourth sample document and the feature information of the fifth sample document.
The fourth neural network model may be a Convolutional Neural Network (CNN) model, for example a 1-dimensional CNN model; this embodiment does not limit the choice of model.
Step S53, inputting the feature information of the fourth sample document and the feature information of the fifth sample document into the fifth neural network model, and obtaining the prediction repetition degree between the fourth sample document and the fifth sample document.
Step S54, calculating a third loss value of the second sample document pair according to the predicted repetition degree and the annotated repetition degree.
Step S55, optimizing the model parameters of the fourth neural network model and the fifth neural network model by taking minimization of the third loss value as the training target, wherein the repetition degree calculation model comprises the fourth neural network model and the fifth neural network model after model parameter optimization is completed.
In a specific implementation, the model parameters of the fourth neural network model and the fifth neural network model are iteratively optimized through a backpropagation algorithm with the goal of minimizing the third loss value, finally yielding the repetition degree calculation model.
For less specialized document content, the fifth neural network model may be a BERT model or the like; for highly specialized document content, it may be an LSTM model.
Because the repetition degree calculation model in this implementation is obtained by training with a supervised learning method, the repetition degree it computes is more accurate. In addition, the repetition degree calculation model is obtained by deep learning model training; a deep learning model extracts text features more accurately, and beyond word counts it can also capture the spatial and temporal (position and order) features of words in the text, improving the accuracy of the repetition measure.
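Since the training loops of steps S41 to S45 and S51 to S55 have the same shape, one sketch covers both; it continues the `DocEncoder`/`SimilarityHead` sketch above (with its imports) and backpropagates a loss between the predicted and annotated scores. Using `MSELoss` assumes real-valued annotations; binary similar/duplicate tags would instead call for a binary cross-entropy loss. The toy batch is, again, illustrative.

```python
encoder, head = DocEncoder(), SimilarityHead()   # from the sketch above
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

doc_a = torch.randint(1, 30000, (16, 200))       # tokenized sample pair batch
doc_b = torch.randint(1, 30000, (16, 200))
label = torch.rand(16, 1)                        # annotated similarity / repetition

score = head(encoder(doc_a), encoder(doc_b))     # S42-S43 / S52-S53
loss = loss_fn(score, label)                     # second / third loss value
optimizer.zero_grad()
loss.backward()                                  # minimize the loss (S45 / S55)
optimizer.step()
```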
FIG. 6 is a block diagram illustrating a document processing device according to an exemplary embodiment. Referring to fig. 6, may include:
a document acquisition module 61 configured to acquire a first document;
a type prediction module 62 configured to predict a target document type of the first document according to a title of the first document;
a duplication degree calculation module 63 configured to determine a duplication degree between the first document and a second document, the second document being any one of the documents of the target document type except the first document;
a document merging module 64 configured to merge the first document and the second document if the degree of duplication between the first document and the second document is greater than or equal to a first preset threshold.
In an alternative implementation, the type prediction module is specifically configured to:
inputting the title of the first document to a document classification model obtained by pre-training to obtain the probability that the first document belongs to each document type, wherein the document classification model is a multi-classification model and is used for obtaining the probability that the input document belongs to each document type according to the title of the input document;
and selecting a target document type of the first document from the document types according to the probability that the first document belongs to each document type.
In an alternative implementation, the apparatus further includes a first model training module configured to:
obtaining a first sample set, wherein the first sample set comprises a plurality of first sample documents, and the document types of the first sample documents are marked document types;
inputting the title of the first sample document to a first neural network model, and outputting the predicted document type of the first sample document;
calculating a first loss value of the first sample document according to the type of the annotation document and the type of the prediction document;
and optimizing the model parameters of the first neural network model by taking the minimum first loss value as a training target, wherein the document classification model is the first neural network model that has completed model parameter optimization.
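A minimal training loop matching this description might look as follows; cross-entropy is assumed as the first loss value, since the embodiment does not name a specific loss function, and the data-loading details are hypothetical:

```python
import torch
import torch.nn as nn

def train_classifier(model, first_sample_set, num_epochs=3, lr=1e-3):
    """Optimize the first neural network model so that the first loss value
    between the predicted and annotated document types is minimized."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # assumed multi-class loss
    for _ in range(num_epochs):
        for title_ids, labeled_type in first_sample_set:
            optimizer.zero_grad()
            logits = model(title_ids)             # predicted document type scores
            loss = loss_fn(logits, labeled_type)  # first loss value
            loss.backward()
            optimizer.step()
    return model  # the document classification model
```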
In an alternative implementation, the type prediction module is specifically configured to:
selecting, from all the document types, a document type meeting a preset condition as a candidate document type, wherein the preset condition is that the document type has not yet been determined as a candidate document type and has the maximum probability among the remaining document types;
calculating the similarity between the first document and a third document, wherein the third document is any one of the documents of the candidate document type;
if the similarity is larger than or equal to a second preset threshold, determining the candidate document type as the target document type;
if the similarity is smaller than the second preset threshold, the steps of selecting a document type meeting preset conditions from the document types as a candidate document type and calculating the similarity between the first document and the third document are repeatedly executed until the recalculated similarity is larger than or equal to the second preset threshold.
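The candidate-selection loop just described can be summarized by the following sketch; the threshold value, the choice of representative third document, and the fallback when every candidate fails are assumptions not specified above:

```python
def select_target_type(first_doc, probs_by_type, docs_by_type,
                       similarity_fn, second_threshold=0.6):
    """Walk the document types in descending probability order and accept
    the first candidate whose representative document is similar enough."""
    remaining = dict(probs_by_type)
    while remaining:
        # preset condition: not yet tried, and maximum probability
        candidate = max(remaining, key=remaining.get)
        del remaining[candidate]
        third_doc = docs_by_type[candidate][0]  # any document of that type
        if similarity_fn(first_doc, third_doc) >= second_threshold:
            return candidate                    # the target document type
    return None  # assumption: behaviour when all candidates fail is unspecified
```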
In an alternative implementation, the type prediction module is specifically configured to:
and inputting the content of the first document and the content of the third document to a similarity calculation model obtained by pre-training to obtain the similarity between the first document and the third document, wherein the similarity calculation model is used for calculating the similarity between the two input documents according to the document contents of the two input documents.
In an optional implementation, the apparatus further comprises a second model training module configured to:
obtaining a second sample set, wherein the second sample set comprises a plurality of first sample document pairs, the first sample document pairs have labeling similarity, the first sample document pairs comprise a second sample document and a third sample document, and the labeling similarity of the first sample document pairs is determined based on the similarity between the second sample document and the third sample document;
inputting the content of the second sample document and the content of the third sample document into a second neural network model to obtain the characteristic information of the second sample document and the characteristic information of the third sample document;
inputting the feature information of the second sample document and the feature information of the third sample document into a third neural network model to obtain the prediction similarity between the second sample document and the third sample document;
calculating a second loss value of the first sample document pair according to the prediction similarity and the labeling similarity;
and optimizing model parameters of the second neural network model and the third neural network model by taking the minimum second loss value as a training target, wherein the similarity calculation model comprises the second neural network model and the third neural network model that have completed model parameter optimization.
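For illustration only, the two-stage structure of the similarity calculation model might be organized as in the sketch below, where a shared encoder plays the role of the second neural network model and a scoring head the role of the third; the concrete layer shapes are assumptions:

```python
import torch
import torch.nn as nn

class SimilarityModel(nn.Module):
    """Sketch: an encoder produces feature vectors for both documents, and a
    head maps the concatenated features to a similarity in [0, 1]."""
    def __init__(self, encoder, feature_dim=256):
        super().__init__()
        self.encoder = encoder  # e.g. a BERT- or LSTM-based text encoder
        self.head = nn.Sequential(
            nn.Linear(2 * feature_dim, feature_dim), nn.ReLU(),
            nn.Linear(feature_dim, 1), nn.Sigmoid())

    def forward(self, doc_a_ids, doc_b_ids):
        feat_a = self.encoder(doc_a_ids)  # feature info of the second sample doc
        feat_b = self.encoder(doc_b_ids)  # feature info of the third sample doc
        return self.head(torch.cat([feat_a, feat_b], dim=-1)).squeeze(-1)
```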
In an alternative implementation, the type prediction module is specifically configured to:
when the first document comprises a first original text and a first hyperlink text, calculating a first similarity between the content of the first original text and the content of the third document and a second similarity between the content associated with the first hyperlink text and the content of the third document; and taking the average value of the first similarity and the second similarity as the similarity between the first document and the third document;
when the third document comprises a second original text and a second hyperlink text, calculating a third similarity between the content of the second original text and the content of the first document and a fourth similarity between the content associated with the second hyperlink text and the content of the first document; and taking the average value of the third similarity and the fourth similarity as the similarity between the first document and the third document.
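A compact sketch of the averaging rule for the first document's case follows (the symmetric case in which the third document carries hyperlink text is analogous); the dictionary keys are hypothetical:

```python
def similarity_with_hyperlinks(first_doc, third_doc, sim_fn):
    """If the first document contains both original text and hyperlink text,
    average the two similarities; otherwise compare the contents directly."""
    if first_doc.get("hyperlink_content"):
        first_sim = sim_fn(first_doc["original_text"], third_doc["content"])
        second_sim = sim_fn(first_doc["hyperlink_content"], third_doc["content"])
        return (first_sim + second_sim) / 2  # average of the two similarities
    return sim_fn(first_doc["content"], third_doc["content"])
```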
In an alternative implementation, the repetition degree calculation module is specifically configured to:
and inputting the content of the first document and the content of the second document into a repetition degree calculation model obtained by pre-training to obtain the repetition degree between the first document and the second document, wherein the repetition degree calculation model is used for calculating the repetition degree between the two input documents according to the document contents of the two input documents.
In an optional implementation, the apparatus further comprises a third model training module configured to:
obtaining a third sample set, wherein the third sample set comprises a plurality of second sample document pairs, the second sample document pairs have an annotation repetition, the second sample document pairs comprise a fourth sample document and a fifth sample document, and the annotation repetition of the second sample document pairs is determined based on the repetition between the fourth sample document and the fifth sample document;
inputting the content of the fourth sample document and the content of the fifth sample document into a fourth neural network model to obtain the feature information of the fourth sample document and the feature information of the fifth sample document;
inputting the feature information of the fourth sample document and the feature information of the fifth sample document into a fifth neural network model to obtain the prediction repetition degree between the fourth sample document and the fifth sample document;
calculating a third loss value of the second sample document pair according to the predicted repetition degree and the labeled repetition degree;
and optimizing the model parameters of the fourth neural network model and the fifth neural network model by taking the minimum third loss value as a training target, wherein the repetition degree calculation model comprises the fourth neural network model and the fifth neural network model that have completed model parameter optimization.
In an optional implementation, the document merging module is specifically configured to:
merging the first document and the second document to obtain a merged document;
and carrying out deduplication processing on the merged document.
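As one possible reading of the merging and deduplication steps, the sketch below concatenates the two documents and removes text segments that repeat verbatim; the deduplication granularity (line, paragraph, or sentence) is an assumption, since the embodiment leaves it open:

```python
def merge_and_deduplicate(first_doc: str, second_doc: str) -> str:
    """Merge two documents, then drop segments that repeat verbatim."""
    seen, merged = set(), []
    for segment in first_doc.splitlines() + second_doc.splitlines():
        key = segment.strip()
        if not key or key not in seen:  # keep blank lines, drop repeats
            if key:
                seen.add(key)
            merged.append(segment)
    return "\n".join(merged)
```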
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 7 is a block diagram of an electronic device 800 according to the present disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 7, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the document processing method described in any embodiment. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focus and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the document processing methods described in any of the embodiments.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the document processing method of any of the embodiments is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising readable program code executable by the processor 820 of the device 800 to perform the document processing method of any of the embodiments. Alternatively, the program code may be stored in a storage medium of the apparatus 800, which may be a non-transitory computer readable storage medium, for example, ROM, Random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
Fig. 8 is a block diagram of an electronic device 1900 according to the present disclosure. For example, the electronic device 1900 may be provided as a server.
Referring to fig. 8, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the document processing method of any of the embodiments.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of document processing, the method comprising:
acquiring a first document;
predicting a target document type of the first document according to the title of the first document;
determining the repetition degree between the first document and a second document, wherein the second document is any document except the first document in the documents of the target document type;
and if the repetition degree between the first document and the second document is greater than or equal to a first preset threshold value, combining the first document and the second document.
2. The method of claim 1, wherein the step of predicting the target document type of the first document according to the title of the first document comprises:
inputting the title of the first document to a document classification model obtained by pre-training to obtain the probability that the first document belongs to each document type, wherein the document classification model is a multi-classification model and is used for obtaining the probability that the input document belongs to each document type according to the title of the input document;
and selecting a target document type of the first document from the document types according to the probability that the first document belongs to each document type.
3. The method of claim 2, wherein before the step of inputting the title of the first document into a pre-trained document classification model to obtain the probability that the first document belongs to each document type, the method further comprises:
obtaining a first sample set, wherein the first sample set comprises a plurality of first sample documents, and the document types of the first sample documents are marked document types;
inputting the title of the first sample document to a first neural network model, and outputting the predicted document type of the first sample document;
calculating a first loss value of the first sample document according to the type of the annotation document and the type of the prediction document;
and optimizing the model parameters of the first neural network model by taking the minimum first loss value as a training target, wherein the document classification model is the first neural network model that has completed model parameter optimization.
4. The method according to claim 2, wherein the step of selecting the target document type of the first document from the document types according to the probability that the first document belongs to the document types comprises:
selecting, from all the document types, a document type meeting a preset condition as a candidate document type, wherein the preset condition is that the document type has not yet been determined as a candidate document type and has the maximum probability among the remaining document types;
calculating the similarity between the first document and a third document, wherein the third document is any one of the documents of the candidate document type;
if the similarity is larger than or equal to a second preset threshold, determining the candidate document type as the target document type;
if the similarity is smaller than the second preset threshold, the steps of selecting a document type meeting preset conditions from the document types as a candidate document type and calculating the similarity between the first document and the third document are repeatedly executed until the recalculated similarity is larger than or equal to the second preset threshold.
5. The document processing method according to claim 4, wherein the step of calculating the similarity between the first document and the third document comprises:
and inputting the content of the first document and the content of the third document to a similarity calculation model obtained by pre-training to obtain the similarity between the first document and the third document, wherein the similarity calculation model is used for calculating the similarity between the two input documents according to the document contents of the two input documents.
6. The method of claim 5, wherein before the step of inputting the content of the first document and the content of the third document into a similarity calculation model trained in advance, the method further comprises:
obtaining a second sample set, wherein the second sample set comprises a plurality of first sample document pairs, the first sample document pairs have labeling similarity, the first sample document pairs comprise a second sample document and a third sample document, and the labeling similarity of the first sample document pairs is determined based on the similarity between the second sample document and the third sample document;
inputting the content of the second sample document and the content of the third sample document into a second neural network model to obtain the characteristic information of the second sample document and the characteristic information of the third sample document;
inputting the feature information of the second sample document and the feature information of the third sample document into a third neural network model to obtain the prediction similarity between the second sample document and the third sample document;
calculating a second loss value of the first sample document pair according to the prediction similarity and the labeling similarity;
and optimizing model parameters of the second neural network model and the third neural network model by taking the minimum second loss value as a training target, wherein the similarity calculation model comprises the second neural network model and the third neural network model that have completed model parameter optimization.
7. A document processing apparatus, characterized in that the apparatus comprises:
a document acquisition module configured to acquire a first document;
a type prediction module configured to predict a target document type of the first document according to a title of the first document;
a repetition degree calculation module configured to determine a repetition degree between the first document and a second document, the second document being any document of the target document type other than the first document;
a document merging module configured to merge the first document and the second document if the repetition degree between the first document and the second document is greater than or equal to a first preset threshold.
8. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the document processing method of any one of claims 1 to 6.
9. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the document processing method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the document processing method according to any one of claims 1 to 6 when executed by a processor.
CN202110622442.XA 2021-06-03 2021-06-03 Document processing method and device, electronic equipment and storage medium Pending CN113505579A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622442.XA CN113505579A (en) 2021-06-03 2021-06-03 Document processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113505579A (en) 2021-10-15

Family

ID=78009011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622442.XA Pending CN113505579A (en) 2021-06-03 2021-06-03 Document processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113505579A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134852A (en) * 2019-05-06 2019-08-16 北京四维图新科技股份有限公司 A kind of De-weight method of document, equipment and readable medium
US20210073532A1 (en) * 2019-09-10 2021-03-11 Intuit Inc. Metamodeling for confidence prediction in machine learning based document extraction
CN111177375A (en) * 2019-12-16 2020-05-19 医渡云(北京)技术有限公司 Electronic document classification method and device
CN112183052A (en) * 2020-09-29 2021-01-05 百度(中国)有限公司 Document repetition degree detection method, device, equipment and medium
CN112699923A (en) * 2020-12-21 2021-04-23 深圳壹账通智能科技有限公司 Document classification prediction method and device, computer equipment and storage medium
CN112632907A (en) * 2021-01-04 2021-04-09 北京明略软件***有限公司 Document marking method, device and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU, Ming; LIU, Bingquan; LIU, Yuanchao: "A Fast Clustering Algorithm for Information Retrieval", Journal of Computer Research and Development, no. 07 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination