WO2022134805A1 - Document classification prediction method, apparatus, computer device and storage medium - Google Patents

Document classification prediction method, apparatus, computer device and storage medium

Info

Publication number
WO2022134805A1
WO2022134805A1 PCT/CN2021/125227
Authority
WO
WIPO (PCT)
Prior art keywords
document
sample
training
vector
preset
Prior art date
Application number
PCT/CN2021/125227
Other languages
English (en)
French (fr)
Inventor
刘玉
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022134805A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/11 File system administration, e.g. details of archiving or snapshots
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques

Definitions

  • the present application relates to the technical field of classification models, and in particular, to a document classification prediction method, apparatus, computer equipment and storage medium.
  • Document classification models in the prior art generally require a large amount of labeled data for training in order to achieve considerable classification accuracy, but these models are easily affected by data imbalance: if a certain category has very little training data, the model's classification accuracy for that category will be low, resulting in low document classification accuracy overall. In addition, manually labeling data takes a lot of time, which is not conducive to the deployment and application of such models in various fields.
  • Embodiments of the present application provide a document classification prediction method, apparatus, computer equipment and storage medium, so as to solve the problem of low document classification accuracy caused by scarce manually annotated data.
  • a document classification prediction method comprising:
  • obtaining a sample document vector set; the sample document vector set includes at least one sample document vector, and each sample document vector is associated with a document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • a document classification prediction device comprising:
  • the prediction request instruction receiving module is used to receive the prediction request instruction including the target document;
  • a document parsing module configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • the first vector extraction module is configured to input the text information and the coordinate information into a preset pre-training language model, and to perform vector extraction on the text information and the coordinate information to obtain the document representation vector corresponding to the target document;
  • a document vector set acquisition module configured to acquire a sample document vector set; the sample document vector set includes at least one sample document vector, and each sample document vector is associated with a document category;
  • a document category determination module configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
  • obtaining a sample document vector set; the sample document vector set includes at least one sample document vector, and each sample document vector is associated with a document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • obtaining a sample document vector set; the sample document vector set includes at least one sample document vector, and each sample document vector is associated with a document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • In the above document classification prediction method, apparatus, computer equipment and storage medium, a prediction request instruction containing a target document is received; document parsing is performed on the target document through a preset document parsing model to obtain the text information corresponding to the target document and the coordinate information corresponding to the text information; the text information and the coordinate information are input into a preset pre-training language model, and vector extraction is performed on them to obtain the document representation vector corresponding to the target document; a sample document vector set is obtained, which contains at least one sample document vector, each associated with a document category; the document vector distances between the document representation vector and the sample document vectors are determined, and the document category corresponding to the target document is determined according to those distances.
  • The present application determines the document category of the target document by introducing the text information of the document and the corresponding coordinate information, and according to the document vector distance between the document representation vector (derived from the text and coordinate information) and each sample document vector. In this way, new documents can still be classified even when there are few sample documents; a document that matches no sample document can be treated as a new document category, and new documents are continuously classified. During the classification process, the number of documents in each document category is supplemented without having to constantly replace the preset document parsing model or the preset pre-trained language model to classify new documents, which improves the efficiency and convenience of document classification.
  • FIG. 1 is a schematic diagram of an application environment of a document classification prediction method in an embodiment of the present application
  • FIG. 2 is a flowchart of a document classification prediction method in an embodiment of the present application.
  • a flowchart of step S50 in the document classification prediction method in an embodiment of the present application;
  • FIG. 5 is a schematic block diagram of a document classification prediction device in an embodiment of the present application.
  • FIG. 6 is another principle block diagram of a document classification prediction apparatus in an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a document category determination module in a document category prediction device according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
  • the document classification prediction method provided by the embodiment of the present application can be applied in the application environment shown in FIG. 1 .
  • the document classification prediction method is applied in a document classification prediction system.
  • The document classification prediction system includes a client and a server as shown in FIG. 1 , and the client and the server communicate through a network, so as to solve the problem of low document classification accuracy caused by scarce manually annotated data.
  • The client, also called the user side, refers to the program that corresponds to the server and provides local services for the user.
  • Clients can be installed on, but not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • a document classification prediction method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • the prediction request instruction may be an instruction sent by a preset sender (e.g., the author of the target document, or the document manager).
  • The target document refers to a document that has a regular title and has not yet been classified. A regular title is a title with several fill-in areas, such as a company-name area and a year area; the document creator fills the required content into these areas in combination with the content of the document. An example is a document titled in the style of "Rongsheng Petrochemical (company-name area): 2020 (year area) semi-annual report".
  • S20 Perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • the preset document parsing model is used to extract text information and coordinate information of the target document.
  • the target document is a pdf document
  • The preset document parsing model may be a parsing model based on PyMuPDF (an open-source PDF parsing library).
  • Text information refers to the text content of the first five pages in the target document.
  • the coordinate information refers to the page number of each word in the content of the first five pages and the specific position in the corresponding page number.
  • S30 Input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • the preset pre-trained language model may be a LayoutLM model.
  • Specifically, the text information and the coordinate information are input into the pre-trained language model to generate a target word sequence corresponding to the target document according to the text information and the coordinate information; the target word sequence is each word in the target document sorted according to its coordinate information. Then, through a preset feature representation method, the target high-order features corresponding to the target word sequence are determined, and average pooling is performed on the target high-order features to obtain the document representation vector.
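The average-pooling step can be illustrated with a minimal numpy sketch; here the per-word high-order features are taken as given (the patent obtains them from the pre-trained language model), and `document_vector` is an illustrative name.

```python
# Minimal sketch of the pooling in step S30: given one high-order feature
# vector per word (a [seq_len, hidden] matrix), averaging over word positions
# yields the document representation vector.
import numpy as np

def document_vector(token_features: np.ndarray) -> np.ndarray:
    """Average-pool per-word features of shape [seq_len, hidden] into [hidden]."""
    return token_features.mean(axis=0)
```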
  • S40 Obtain a sample document vector set; the sample document vector set includes at least one sample document vector; each sample document vector is associated with one document category;
  • the sample document vector set is a set of sample document vectors corresponding to each sample document obtained by inputting the sample document into a preset pre-trained language model.
  • All sample documents are input into the preset document parsing model to perform document parsing on each sample document, so as to obtain the sample text information corresponding to each sample document and the sample coordinate information corresponding to that sample text information; the sample text information and sample coordinate information are then input into the preset pre-training language model, and vector extraction is performed on them to obtain the sample document vector corresponding to each sample document.
  • After each sample document is acquired, the classification of each sample document can be determined according to the document title associated with the sample document, and each sample document is then classified so that each sample document is associated with one document category.
  • S50 Determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • The sample document vector is also associated with a sample document; determining the document category corresponding to the target document according to each of the document vector distances includes:
  • S501 Select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;
  • the preset number may be determined according to a specific scenario, and for example, the preset number may be 10, 20, etc.
  • the preset distance threshold can be 0.5, 0.7, etc.
  • a preset number of sample documents whose document vector distance is less than or equal to a preset distance threshold are selected as candidate documents.
  • all sample documents that satisfy the condition that the document vector distance is less than or equal to the preset distance threshold may be used as candidate documents.
  • If all document vector distances are greater than the preset distance threshold, this indicates that none of the document categories currently associated with the sample documents can characterize the document category of the target document. A new document category is then established according to the document title of the target document, and the target document is classified under that new category. The next time a prediction request instruction containing a new target document is received, if the document vector distance between the new target document's vector and this target document's representation vector is less than or equal to the preset distance threshold, the document category of this target document can be used as the document category of the new target document, which improves the efficiency of document classification.
  • S502 Obtain the proportion of candidate documents of the same document category in all the candidate documents, and record the document category with the highest proportion as the document category of the target document.
  • Specifically, the proportion of the candidate documents of each document category among all the candidate documents is obtained, and the document category with the highest proportion is recorded as the document category of the target document.
  • The document category of the target document is determined by introducing the text information of the document and the corresponding coordinate information, and according to the document vector distance between the document representation vector (derived from the text and coordinate information) and each sample document vector. In this way, new documents can still be classified even when there are few sample documents; a document that matches no sample document can be treated as a new document category, and new documents are continuously classified.
  • During the classification process, the number of documents in each document category is supplemented without having to constantly replace the preset document parsing model or the preset pre-trained language model to classify new documents, which improves the efficiency and convenience of document classification.
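The distance-threshold and majority-vote procedure of steps S50/S501/S502 can be sketched as follows; `classify_document` is a hypothetical name, the threshold 0.5 and top_k 10 echo the example values in the text, and a `None` return stands in for "establish a new document category".

```python
# Sketch of step S50: Euclidean document-vector distances, a preset distance
# threshold, and a majority vote over the nearest candidate documents.
import math
from collections import Counter

def classify_document(doc_vec, samples, threshold=0.5, top_k=10):
    """samples: iterable of (sample_vector, category). Returns a category,
    or None when no sample is within the threshold (new category needed)."""
    near = []
    for vec, category in samples:
        d = math.dist(doc_vec, vec)  # Euclidean document vector distance
        if d <= threshold:
            near.append((d, category))
    if not near:
        return None  # no existing category characterizes the target document
    near.sort(key=lambda t: t[0])
    candidates = [cat for _, cat in near[:top_k]]
    # the category with the highest proportion among the candidates wins
    return Counter(candidates).most_common(1)[0][0]
```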
  • Before the text information and the coordinate information are input into the preset pre-trained language model, the method further includes:
  • S01 Acquire a training document triplet;
  • the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
  • Positive sample documents are documents with the same document category as the training document; negative sample documents are documents with a different document category from the training document.
  • S02 Input the training document triplet into an initial language model containing initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
  • the initial language model may be a LayoutLM model.
  • a detailed explanation of this step can be found in the following examples.
  • Inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, includes:
  • S011 Extract the word sequences of the training document, the positive sample document and the negative sample document, respectively, to obtain the training word sequence corresponding to the training document, the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document.
  • the word sequence refers to each word in the training document, the positive sample document, and the negative sample document and the corresponding ranking relationship.
  • Assume the obtained training word sequence is (a_1, a_2, ..., a_x), where a denotes the training document and x is the length of the training document's word sequence. Since the initial language model needs to distinguish the beginning of a document ([CLS] below) from its end ([SEP] below), the final training word sequence is [CLS] a_1 a_2 ... a_x [SEP]. In the same way, assume the obtained positive sample word sequence is (p_1, p_2, ..., p_y), where p denotes the positive sample document and y is the word-sequence length of the positive sample document; the final positive sample word sequence is [CLS] p_1 p_2 ... p_y [SEP]. Likewise, assume the obtained negative sample word sequence is (n_1, n_2, ..., n_s), where n denotes the negative sample document and s is the word-sequence length of the negative sample document; the final negative sample word sequence is [CLS] n_1 n_2 ... n_s [SEP].
  • S012 Determine the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample Negative sample high-order features corresponding to each word in the word sequence;
  • The high-order feature representation corresponding to each word in each word sequence is obtained by feeding the word sequence, together with its coordinate information, through the initial language model, whose encoder outputs a high-order feature for each word.
  • S013 Perform an average pooling process on the training high-order features, the positive sample high-order features, and the negative sample high-order features, respectively, to obtain the first training vector, the second training vector, and the third training vector.
  • Specifically, the average pooling method is used to obtain the first training vector, the second training vector and the third training vector: S_a = MEAN_POOLING_i(h_i^a), S_p = MEAN_POOLING_i(h_i^p), S_n = MEAN_POOLING_i(h_i^n), where MEAN_POOLING_i( ) is the average pooling function over word positions; i denotes the i-th word; h_i^a, h_i^p and h_i^n are the high-order features of the i-th word in the training, positive sample and negative sample word sequences respectively; S_a is the first training vector; S_p is the second training vector; S_n is the third training vector.
  • S03 Determine a total loss value of the language model according to the first training vector, the second training vector and the third training vector.
  • a total loss value of the language model is determined according to the first training vector, the second training vector and the third training vector.
  • In step S03, determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector includes:
  • Specifically, the total loss value is determined through a triplet loss function. The first document distance and the second document distance are both Euclidean distances.
  • Specifically, the total loss value can be determined according to the following triplet loss function: L = max(d(S_a, S_p) - d(S_a, S_n) + α, 0), where S_a is the first training vector; S_p is the second training vector; S_n is the third training vector; d(S_a, S_p) is the first document distance; d(S_a, S_n) is the second document distance; and α is a real number, taken as 1 in this embodiment.
  • The intuitive meaning of the total loss is to pull the positive sample document ever closer to the training document while pushing the negative sample document ever farther away, thereby improving the document classification accuracy of the model.
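The triplet loss just described (Euclidean distances, margin α = 1) can be written as a small numpy function; the names `triplet_loss`, `s_a`, `s_p`, `s_n` are illustrative.

```python
# Triplet loss with Euclidean distances: max(||S_a - S_p|| - ||S_a - S_n|| + margin, 0).
import numpy as np

def triplet_loss(s_a, s_p, s_n, margin=1.0):
    d_pos = np.linalg.norm(s_a - s_p)  # first document distance (anchor to positive)
    d_neg = np.linalg.norm(s_a - s_n)  # second document distance (anchor to negative)
    return max(d_pos - d_neg + margin, 0.0)
```

The loss is zero once the negative sample is at least `margin` farther from the training document than the positive sample, matching the intuition above.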
  • The convergence condition can be that the total loss value is less than a set threshold, that is, training stops when the total loss value falls below the set threshold; the convergence condition can also be that the total loss value becomes small and no longer decreases after 10,000 iterations, that is, training stops when, after 10,000 iterations, the total loss value is small and no longer decreasing, and the converged initial language model is recorded as the preset pre-training language model.
  • Understandably, after the total loss value is determined from the training document, positive sample document and negative sample document in a training document triplet, if the total loss value has not reached the preset convergence condition, the initial parameters of the initial language model are adjusted according to the total loss value, and the training document triplet is re-input into the model with the adjusted parameters. When the total loss value corresponding to that triplet reaches the preset convergence condition, another training document triplet is selected (for example, by replacing the negative sample document or the positive sample document), and steps S01 to S04 are performed to obtain the total loss value corresponding to that triplet; whenever the total loss value has not reached the preset convergence condition, the initial parameters of the initial language model are adjusted again according to the total loss value, until the total loss value corresponding to the training document triplet reaches the preset convergence condition.
  • In this way, the outputs of the initial language model keep moving closer to accurate results and the recognition accuracy keeps improving, until the total loss values corresponding to all training document triplets reach the preset convergence condition, and the converged initial language model is recorded as the preset pre-trained language model.
  • An Adam optimizer, which updates parameters through gradient descent, may also be used to continuously update the initial parameters until the total loss value satisfies the set threshold condition.
  • Before acquiring the training document triplet, the method further includes:
  • acquiring a preset sample document set; the sample document set includes at least one sample document, and each sample document is associated with a document title;
  • The sample documents in the preset sample document set can be crawled from PDF documents on major websites using conventional crawling techniques; the crawled information includes the sample documents and the document titles associated with them.
  • the normalization process for each of the document titles includes:
  • The preset special symbol can be ":". Understandably, although the content of each PDF document differs, the structure of the titles is largely the same. For example, for PDF documents titled like "XXX Company: 2020 Annual Report", the text before ":" merely identifies a particular company, so the preset special symbol and all characters before it can be removed without affecting subsequent document classification.
  • If the remaining title contains a preset year character and/or a preset number-of-times character, the preset year character is replaced with the first preset character and the preset number-of-times character is replaced with the second preset character, which completes the normalization of the document title.
  • the preset year character is the character containing the year in the title;
  • The preset number-of-times character is a character in the title that represents a frequency or ordinal, such as in "XXX Company: 2020 X Quarterly Report".
  • the first preset characters and the second preset characters can be replaced by English characters or other special characters.
  • the first preset characters and the second preset characters are used to eliminate the influence of the year and the number of times on the document classification.
  • For example, if the title is "Announcement on Holding the Eighth Meeting in 2020", then "2020" is replaced with X and "Eighth" with Y, yielding "Announcement on Holding the Yth Meeting in Year X".
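The normalization just described can be sketched with a few regular expressions; `normalize_title` is a hypothetical helper, and the exact patterns (a four-digit year, a "第…次" ordinal) are assumptions about the title formats rather than the patent's specification.

```python
# Title normalization sketch: strip the preset special symbol ":" and all
# characters before it, then mask year and ordinal characters with the
# placeholders X and Y used in the example above.
import re

def normalize_title(title: str) -> str:
    # remove the preset special symbol and all characters before it
    title = re.sub(r"^[^：:]*[：:]", "", title)
    # replace a four-digit year (the preset year character) with "X"
    title = re.sub(r"(?:19|20)\d{2}", "X", title)
    # replace a "第...次" ordinal (the preset number-of-times character) with "Y"
    title = re.sub(r"第.{1,3}次", "第Y次", title)
    return title
```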
  • That is, document classification is performed on each sample document according to each normalized document title: the matching degree between title characters is used for classification, and documents whose matching degree is higher than a preset threshold are classified into one category, thereby obtaining the document category corresponding to each sample document.
  • the preset threshold can be set to 90%, 95% and so on.
  • The top 500 document categories with the most sample documents can be selected, and the remaining document categories removed, to avoid an excessive number of document categories burdening the computer system.
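One possible reading of the matching-degree classification above is a greedy grouping by pairwise string similarity; `difflib.SequenceMatcher` is used here as a stand-in similarity measure, since the patent does not specify how the matching degree is computed, and `group_titles` is an illustrative name.

```python
# Greedy grouping of normalized titles: a title joins an existing category
# when its similarity to that category's first title reaches the preset
# threshold (e.g. 90%); otherwise it opens a new category.
from difflib import SequenceMatcher

def group_titles(titles, threshold=0.9):
    """Single-pass grouping; returns a list of categories (lists of titles)."""
    categories = []
    for title in titles:
        for group in categories:
            if SequenceMatcher(None, title, group[0]).ratio() >= threshold:
                group.append(title)
                break
        else:
            categories.append([title])
    return categories
```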
  • Specifically, one document category is selected from the document categories, and a sample document from it is chosen as the training document; another document from the same category is chosen as the positive sample document. Then a document category other than the selected one is chosen, and a sample document from that category is chosen as the negative sample document.
  • a document classification prediction apparatus is provided, and the document classification prediction apparatus corresponds one-to-one with the document classification prediction method in the above embodiment.
  • the document classification prediction apparatus includes a prediction request instruction receiving module 10 , a document parsing module 20 , a first vector extraction module 30 , a document vector set acquisition module 40 and a document category determination module 50 .
  • the detailed description of each functional module is as follows:
  • a prediction request instruction receiving module 10 configured to receive a prediction request instruction including a target document
  • the document parsing module 20 is configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • the first vector extraction module 30 is configured to input the text information and the coordinate information into a preset pre-trained language model, and to perform vector extraction on the text information and the coordinate information to obtain the document representation vector corresponding to the target document;
  • a document vector set obtaining module 40 configured to obtain a sample document vector set; the sample document vector set includes at least one sample document vector, and each sample document vector is associated with a document category;
  • the document category determination module 50 is configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • the document classification prediction device further includes:
  • Document triplet acquisition module 01 used to acquire training document triples;
  • the training document triplets include training documents, positive sample documents corresponding to the training documents, and negative sample documents corresponding to the training documents;
  • the second vector extraction module 02 is configured to input the triplet of the sample document into an initial language model including initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document, respectively, to obtain the The first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document;
  • a total loss value determination module 03 configured to determine the total loss value of the language model according to the first training vector, the second training vector and the third training vector;
  • Language model training module 04 configured to update and iterate the initial parameters of the initial language model when the total loss value does not reach the preset convergence condition, until the total loss value reaches the preset convergence condition, The initial language model after convergence is recorded as the preset pre-trained language model.
  • the second vector extraction module includes:
  • a word sequence extraction unit configured to extract the word sequences of the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
  • a high-order feature determination unit configured to determine, by a preset feature representation method, the training high-order features corresponding to each word in the training word sequence, the positive sample high-order features corresponding to each word in the positive sample word sequence, and the negative sample high-order features corresponding to each word in the negative sample word sequence;
  • an average pooling processing unit configured to perform average pooling on the training high-order features, the positive sample high-order features, and the negative sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
  • the document classification prediction device further includes:
  • a sample document set acquisition module used for acquiring a preset sample document set;
  • the sample document set includes at least one sample document; each sample document is associated with a document title;
  • a normalization processing module configured to normalize each document title, and to perform document classification on each sample document according to the normalized document titles, to obtain the document category corresponding to each sample document;
  • a document category selection module configured to select one document category from among the document categories as a positive document category, and to select one document category from the other document categories besides the positive document category as a negative document category;
  • a document selection module configured to select a sample document from the positive document category and record it as the training document; meanwhile, to select a sample document other than the training document from the positive document category and record it as the positive sample document; and to select a sample document from the negative document category and record it as the negative sample document;
  • a triplet building module is configured to construct the training document triplet according to the training document, the positive sample document and the negative sample document.
  • the normalization processing module includes:
  • a special symbol detection unit configured to detect whether the document title contains a preset special symbol;
  • a character culling unit configured to, when the document title contains the preset special symbol, remove the preset special symbol and all characters before it, to obtain a culled title;
  • a special character detection unit configured to detect whether the culled title contains a preset year character and/or a preset count character;
  • a character replacement unit configured to, when the culled title contains the preset year character and/or the preset count character, replace the preset year character with a first preset character and replace the preset count character with a second preset character, thereby indicating that normalization of the document title is complete.
  • the document category determination module 50 includes:
  • the sample document selection unit 501 is used to select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;
  • the document category determining unit 502 is configured to obtain the proportion of candidate documents of each document category among all the candidate documents, and to record the document category with the highest proportion as the document category of the target document.
  • Each module in the above-mentioned document classification prediction apparatus may be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8 .
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the readable storage medium.
  • the database of the computer device is used to store the data used in the document classification prediction method in the above embodiment.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions, when executed by a processor, implement a document classification prediction method.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer device is provided, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer readable instructions:
  • a prediction request instruction containing a target document is received; the target document is parsed by a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information; the text information and the coordinate information are input into a preset pre-trained language model and vector extraction is performed on them, to obtain a document representation vector corresponding to the target document;
  • a sample document vector set is obtained; the sample document vector set includes at least one sample document vector; each sample document vector is associated with a document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • one or more readable storage media are provided that store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • a prediction request instruction containing a target document is received; the target document is parsed by a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information; the text information and the coordinate information are input into a preset pre-trained language model and vector extraction is performed on them, to obtain a document representation vector corresponding to the target document;
  • a sample document vector set is obtained; the sample document vector set includes at least one sample document vector; each sample document vector is associated with a document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A document classification prediction method, apparatus, computer device, and storage medium. The method comprises: receiving a prediction request instruction containing a target document (S10); performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information (S20); inputting the text information and the coordinate information into a preset pre-trained language model and performing vector extraction on them, to obtain a document representation vector corresponding to the target document (S30); obtaining a sample document vector set, the sample document vector set containing at least one sample document vector, each sample document vector being associated with a document category (S40); and determining the document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance (S50). The method improves the efficiency of document classification.

Description

Document classification prediction method and apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application No. 202011521171.0, entitled "Document classification prediction method and apparatus, computer device, and storage medium", filed with the Chinese Patent Office on December 21, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of classification models, and in particular to a document classification prediction method and apparatus, a computer device, and a storage medium.
Background
At present, every field contains tens of thousands of PDF documents; for example, there are PDF papers in academia and PDF data reports in professional fields. As more and more PDF documents are produced, effectively classifying them and predicting the category of new documents becomes a challenge.
The inventors realized that existing document classification models generally require a large amount of labeled data for training before they reach respectable classification accuracy, yet they are easily affected by data imbalance: if the training data for a certain category is scarce, the model's accuracy on that category will be low, which in turn lowers the overall document classification accuracy. Moreover, manually labeling data takes a great deal of time, which hinders deploying such models across different fields.
Summary
The embodiments of this application provide a document classification prediction method and apparatus, a computer device, and a storage medium, to solve the problem that scarce manually labeled data leads to low document classification accuracy.
A document classification prediction method includes:
receiving a prediction request instruction containing a target document;
performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
obtaining a sample document vector set, where the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
A document classification prediction apparatus includes:
a prediction request instruction receiving module configured to receive a prediction request instruction containing a target document;
a document parsing module configured to perform document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
a first vector extraction module configured to input the text information and the coordinate information into a preset pre-trained language model and perform vector extraction on them, to obtain a document representation vector corresponding to the target document;
a document vector set obtaining module configured to obtain a sample document vector set, where the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
a document category determination module configured to determine a document vector distance between the document representation vector and each sample document vector, and to determine the document category corresponding to the target document according to each document vector distance.
A computer device includes a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer readable instructions:
receiving a prediction request instruction containing a target document;
performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
obtaining a sample document vector set, where the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
One or more readable storage media storing computer readable instructions, where the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a prediction request instruction containing a target document;
performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
obtaining a sample document vector set, where the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
In the above document classification prediction method and apparatus, computer device, and storage medium, the method receives a prediction request instruction containing a target document; performs document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information; inputs the text information and the coordinate information into a preset pre-trained language model and performs vector extraction on them, to obtain a document representation vector corresponding to the target document; obtains a sample document vector set, where the sample document vector set contains at least one sample document vector and each sample document vector is associated with a document category; and determines the document vector distance between the document representation vector and each sample document vector, determining the document category corresponding to the target document according to each document vector distance.
This application introduces the text information of a document together with the corresponding coordinate information, and determines the document category of the target document according to the document vector distances between the document representation vector derived from that text and coordinate information and the sample document vectors. In this way, new documents can still be classified even when sample documents are scarce; a document that matches none of the sample documents can be treated as a new document category, so that as new documents keep being classified, the number of documents under each document category is gradually replenished, without repeatedly replacing the preset document parsing model or the preset pre-trained language model to classify new documents, which improves the efficiency and convenience of document classification.
Details of one or more embodiments of this application are set forth in the drawings and the description below; other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the document classification prediction method in an embodiment of this application;
FIG. 2 is a flowchart of the document classification prediction method in an embodiment of this application;
FIG. 3 is a flowchart of step S50 of the document classification prediction method in an embodiment of this application;
FIG. 4 is another flowchart of the document classification prediction method in an embodiment of this application;
FIG. 5 is a schematic block diagram of the document classification prediction apparatus in an embodiment of this application;
FIG. 6 is another schematic block diagram of the document classification prediction apparatus in an embodiment of this application;
FIG. 7 is a schematic block diagram of the document category determination module of the document classification prediction apparatus in an embodiment of this application;
FIG. 8 is a schematic diagram of a computer device in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The document classification prediction method provided by the embodiments of this application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in a document classification prediction system, which includes the client and the server shown in FIG. 1; the client communicates with the server over a network, to solve the problem that scarce manually labeled data leads to low document classification accuracy. The client, also called the user side, refers to a program that corresponds to the server and provides local services to the user. The client can be installed on, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a document classification prediction method is provided. Taking the application of the method to the server in FIG. 1 as an example, it includes the following steps:
S10: receiving a prediction request instruction containing a target document;
Understandably, the prediction request instruction may be an instruction sent by a preset sender (such as the author of the target document, or a document administrator). In this embodiment, the target document refers to a document that has a regular title and has not yet been classified, where a regular title is a title containing several fill-in regions, such as a company name region and a year region; the document creator fills these regions with the required content in combination with the document content. An example is a document styled like "Rongsheng Petrochemical (company name region): 2020 (year region) Semi-annual Report".
S20: performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
The preset document parsing model is used to extract the text information and coordinate information of the target document. For example, when the target document is a PDF document, the preset document parsing model may be a parsing model based on PyMuPDF (an open-source PDF parsing package). The text information refers to the textual content of the first five pages of the target document. The coordinate information refers to the page number where each word in those five pages is located and its specific position on that page.
Specifically, the textual content of the first five pages of the target document is extracted by the preset document parsing model to obtain the text information; the page number to which each word in the text information belongs and its position on that page are recorded in association as the coordinate information. Understandably, since the preset pre-trained language model generally only supports inputs up to length 512, the text of the entire PDF cannot be used as input; moreover, the first five pages usually contain the title of the document, and the title is an important clue for judging the category of a PDF.
S30: inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
The preset pre-trained language model may be a LayoutLM model.
Specifically, after the target document has been parsed by the preset document parsing model to obtain the text information corresponding to the target document and the coordinate information corresponding to the text information, the text information and the coordinate information are input into the pre-trained language model to generate a target word sequence corresponding to the target document according to the text information and the coordinate information, the target word sequence representing the words of the target document ordered by their coordinate information; the target high-order features corresponding to the target word sequence are then determined by a preset feature representation method, and average pooling is performed on the target high-order features to obtain the document representation vector.
S40: obtaining a sample document vector set, where the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
The sample document vector set is the set of sample document vectors obtained by inputting the sample documents into the preset pre-trained language model.
Understandably, after the training of the preset pre-trained language model is completed, all sample documents are input into the preset document parsing model to parse each sample document, obtaining the sample text information corresponding to each sample document and the sample coordinate information corresponding to that text information; the sample text information and the sample coordinate information are then input into the preset pre-trained language model and vector extraction is performed on them, obtaining the sample document vector corresponding to each sample document.
Further, after the sample documents are obtained, the classification of each sample document can be determined according to the document title associated with it, and the sample documents are classified accordingly, so that each sample document is associated with one document category.
S50: determining the document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
Specifically, after the sample document vector set is obtained, the document vector distance between the document representation vector and each sample document vector is determined, and the document category corresponding to the target document is determined according to each document vector distance.
In an embodiment, as shown in FIG. 3, each sample document vector is further associated with a sample document, and determining the document category corresponding to the target document according to each document vector distance includes:
S501: selecting a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and recording the selected sample documents as candidate documents;
The preset number can be determined according to the specific scenario; for example, it may be 10, 20, and so on. The preset distance threshold may be 0.5, 0.7, and so on.
Understandably, after the document vector distance between the document representation vector and each sample document vector is determined, a preset number of sample documents whose document vector distance is less than or equal to the preset distance threshold are selected as candidate documents. If the number of sample documents whose document vector distance is less than or equal to the preset distance threshold does not reach the preset number, then all sample documents satisfying this condition are simply taken as candidate documents.
Further, if all document vector distances are greater than the preset distance threshold, this indicates that none of the document categories currently associated with the sample documents can represent the category of the target document; a new document category is then created according to the document title of the target document, and the target document is classified under it. The next time a prediction request instruction containing a new target document is received, if the document vector distance between the new target document's vector and this target document's representation vector is less than or equal to the preset distance threshold, the document category of this target document can be used as the document category of the new target document, which improves the efficiency of document classification.
S502: obtaining the proportion of candidate documents of each document category among all the candidate documents, and recording the document category with the highest proportion as the document category of the target document.
Understandably, after a preset number of sample documents have been selected from the sample documents whose document vector distance is less than or equal to the preset distance threshold and recorded as candidate documents, the proportion of candidate documents of each document category among all the candidate documents is obtained, and the document category with the highest proportion is recorded as the document category of the target document.
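The distance-threshold vote of steps S501 and S502 can be sketched as follows. This is an illustrative implementation: the function name, the use of Euclidean distance, and the `None` return value for the new-category case are choices made for this example.

```python
from collections import Counter

import numpy as np


def predict_category(doc_vec, sample_vecs, categories, k=10, threshold=0.7):
    """Keep samples within the distance threshold, take at most the k nearest,
    and return the majority category; None signals a new document category."""
    diffs = np.asarray(sample_vecs, dtype=float) - np.asarray(doc_vec, dtype=float)
    dists = np.linalg.norm(diffs, axis=1)
    nearest = [i for i in np.argsort(dists) if dists[i] <= threshold][:k]
    if not nearest:
        return None  # no sample document is close enough
    votes = Counter(categories[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Returning `None` corresponds to the case described above in which every document vector distance exceeds the preset distance threshold and a new document category is created.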
In this embodiment, the text information of a document and the corresponding coordinate information are introduced, and the document category of the target document is determined according to the document vector distances between the document representation vector derived from that text and coordinate information and the sample document vectors. In this way, new documents can still be classified even when sample documents are scarce; a document that matches none of the sample documents can be treated as a new document category, so that as new documents keep being classified, the number of documents under each document category is gradually replenished, without repeatedly replacing the preset document parsing model or the preset pre-trained language model to classify new documents, which improves the efficiency and convenience of document classification.
In an embodiment, as shown in FIG. 4, before inputting the text information and the coordinate information into the preset pre-trained language model, the method further includes:
S01: obtaining a training document triplet, where the training document triplet contains a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
A positive sample document is a document with the same document category as the training document. A negative sample document is a document that does not have the same document category as the training document.
S02: inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
For example, the initial language model may be a LayoutLM model. A detailed explanation of this step is given in the following embodiment.
In an embodiment, inputting the training document triplet into the initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, includes:
S011: extracting the word sequences of the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
A word sequence refers to the words of the training document, the positive sample document, or the negative sample document together with their ordering. For example, suppose that after the word sequences are extracted, the training word sequence obtained is {w_1^a, w_2^a, ..., w_x^a} (where a denotes the training document and x is the length of its word sequence); since the initial language model needs to distinguish the beginning ([CLS] below) and the end ([SEP] below) of a document, the final training word sequence is {[CLS], w_1^a, w_2^a, ..., w_x^a, [SEP]}.
Likewise, suppose the positive sample word sequence obtained is {w_1^p, w_2^p, ..., w_y^p} (where p denotes the positive sample document and y is the length of its word sequence); the final positive sample word sequence is {[CLS], w_1^p, w_2^p, ..., w_y^p, [SEP]}.
Likewise, suppose the negative sample word sequence obtained is {w_1^n, w_2^n, ..., w_s^n} (where n denotes the negative sample document and s is the length of its word sequence); the final negative sample word sequence is {[CLS], w_1^n, w_2^n, ..., w_s^n, [SEP]}.
S012: determining, by a preset feature representation method, the training high-order features corresponding to each word in the training word sequence, the positive sample high-order features corresponding to each word in the positive sample word sequence, and the negative sample high-order features corresponding to each word in the negative sample word sequence;
Specifically, the high-order feature representation corresponding to each word in each word sequence can be determined by the following expressions:
h_i^a = LayoutLM(w_i^a)
h_i^p = LayoutLM(w_i^p)
h_i^n = LayoutLM(w_i^n)
where i denotes the i-th word; h_i^a is the training high-order feature; h_i^p is the positive sample high-order feature; h_i^n is the negative sample high-order feature.
S013: performing average pooling on the training high-order features, the positive sample high-order features, and the negative sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
Specifically, after the training high-order features corresponding to each word in the training word sequence, the positive sample high-order features corresponding to each word in the positive sample word sequence, and the negative sample high-order features corresponding to each word in the negative sample word sequence have been determined, average pooling is applied to obtain the first training vector, the second training vector, and the third training vector.
Optionally, they can be determined by the following expressions:
S_a = MEAN_POOLING_i(h_i^a)
S_p = MEAN_POOLING_i(h_i^p)
S_n = MEAN_POOLING_i(h_i^n)
where MEAN_POOLING_i() is the average pooling function; i denotes the i-th word; S_a is the first training vector; S_p is the second training vector; S_n is the third training vector.
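The three MEAN_POOLING expressions above all reduce a (sequence length × dimension) feature matrix to a single vector by averaging over the word axis, which a few lines of NumPy can illustrate; the per-word features here are stand-ins for LayoutLM outputs.

```python
import numpy as np


def mean_pooling(features):
    """Average per-word high-order features (seq_len, dim) into one vector (dim,)."""
    return np.asarray(features, dtype=float).mean(axis=0)
```

Applying it to the training, positive sample, and negative sample feature matrices yields S_a, S_p, and S_n respectively.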
S03: determining the total loss value of the language model according to the first training vector, the second training vector, and the third training vector.
Specifically, after average pooling has been performed on the training high-order features, the positive sample high-order features, and the negative sample high-order features respectively to obtain the first training vector, the second training vector, and the third training vector, the total loss value of the language model is determined according to these three vectors.
In an embodiment, in step S03, determining the total loss value of the language model according to the first training vector, the second training vector, and the third training vector includes:
determining a first document distance between the first training vector and the second training vector, and determining a second document distance between the first training vector and the third training vector;
determining the total loss value by a triplet loss function according to the first document distance and the second document distance.
Both the first document distance and the second document distance are in essence Euclidean distances.
Specifically, the total loss value can be determined by the following triplet loss function:
L = max(||S_a - S_p|| - ||S_a - S_n|| + ε, 0)
where S_a is the first training vector, S_p is the second training vector, and S_n is the third training vector; ||S_a - S_p|| is the first document distance, ||S_a - S_n|| is the second document distance, and ε is a real number, taken as 1 in this embodiment. The intuitive meaning of this total loss is to pull positive sample documents ever closer to the training document and push negative sample documents ever farther from it, thereby improving the document classification accuracy of the model.
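The triplet loss above translates directly into code; the sketch below follows the formula with ε = 1 and Euclidean distances.

```python
import numpy as np


def triplet_loss(s_a, s_p, s_n, eps=1.0):
    """L = max(||S_a - S_p|| - ||S_a - S_n|| + eps, 0)."""
    d_pos = np.linalg.norm(np.asarray(s_a, dtype=float) - np.asarray(s_p, dtype=float))
    d_neg = np.linalg.norm(np.asarray(s_a, dtype=float) - np.asarray(s_n, dtype=float))
    return max(d_pos - d_neg + eps, 0.0)
```

The loss is zero once the negative sample is at least ε farther from the training document than the positive sample, which is exactly the pulling/pushing behavior described above.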
S04: when the total loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
Understandably, the convergence condition may be that the total loss value is smaller than a set threshold, i.e., training stops when the total loss value falls below the set threshold; it may also be that the total loss value has become very small after 10,000 computations and no longer decreases, i.e., training stops when, after 10,000 computations, the total loss value is very small and no longer drops, and the converged initial language model is recorded as the preset pre-trained language model.
Further, after the total loss value has been determined from the training document, positive sample document, and negative sample document of a training document triplet, if it has not reached the preset convergence condition, the initial parameters of the initial language model are adjusted according to the total loss value, and the training document triplet is fed again into the model with the adjusted initial parameters; once the total loss value of this triplet reaches the preset convergence condition, another training document triplet is selected (for example by swapping its negative or positive sample document), and steps S01 to S04 are executed to obtain the total loss value corresponding to that triplet; when that total loss value has not reached the preset convergence condition, the initial parameters of the initial language model are adjusted again according to it, so that the total loss value of that triplet reaches the preset convergence condition.
In this way, after the initial language model has been trained on all the training document triplets, its outputs keep moving toward the accurate results and the recognition accuracy grows higher and higher; once the total loss values of all the training document triplets reach the preset convergence condition, the converged initial language model is recorded as the preset pre-trained language model.
Further, an Adam optimizer can also be used in this embodiment; this optimizer updates the parameters by gradient descent, and keeps updating the initial parameters until the condition that the total loss value is smaller than the set threshold is met.
In an embodiment, before obtaining the training document triplet, the method further includes:
(1) obtaining a preset sample document set, where the sample document set contains at least one sample document, and each sample document is associated with a document title;
The sample documents in the preset sample document set can be obtained by crawling all the PDF documents from major websites with conventional crawler techniques; the crawled information includes the sample documents and the document titles associated with them.
(2) normalizing each document title, and performing document classification on each sample document according to the normalized document titles, to obtain the document category corresponding to each sample document;
Specifically, in an embodiment, normalizing each document title includes:
detecting whether the document title contains a preset special symbol;
when the document title contains the preset special symbol, removing the preset special symbol and all characters before it, to obtain a culled title;
The preset special symbol may be "：". Understandably, although the content of every PDF document differs, the structure of the content is mostly consistent; for example, in PDF documents titled like "XXX Company: 2020 Annual Report", the text before the "：" merely restricts the report to a certain company, so the preset special symbol and all characters before it should be removed, which does not affect subsequent document classification.
detecting whether the culled title contains a preset year character and/or a preset count character;
when the culled title contains the preset year character and/or the preset count character, replacing the preset year character with a first preset character and replacing the preset count character with a second preset character, thereby indicating that normalization of the document title is complete.
Understandably, a preset year character is a character in the title containing a year, and a preset count character is a character in the title conveying a count, as in "XXX Company: 2020 Annual Quarter X Report". The first preset character and the second preset character may be English characters or other special characters; their purpose is to eliminate the influence of years and counts on document classification.
For example, after the preset special symbol and all characters before it are removed, suppose the culled title is "Announcement on Convening the Eighth Meeting of 2020" (《关于召开2020年度第八次会议公告》); then "2020" can be replaced with "X" and "八" (eighth) with "Y", giving "Announcement on Convening the Y-th Meeting of Year X" (《关于召开X年度第Y次会议公告》).
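The normalization just described (strip everything up to the "：", mask the year, mask the count) can be sketched with regular expressions. The patterns below are assumptions tailored to titles like the examples above, not the application's actual rules.

```python
import re


def normalize_title(title, year_char="X", count_char="Y"):
    """Remove the company prefix, then mask year and count characters."""
    title = re.sub(r"^[^：:]*[：:]", "", title)            # drop text before '：'
    title = re.sub(r"\d{4}(?=年)", year_char, title)       # e.g. 2020年度 -> X年度
    title = re.sub(r"第\S{1,3}(?=次|季度)", "第" + count_char, title)  # 第八次 -> 第Y次
    return title
```

Titles normalized this way differ only where the documents genuinely differ, so the year and count no longer influence the grouping step.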
Further, after each document title is normalized, document classification is performed on each sample document according to the normalized titles; that is, documents are classified according to the degree of character match between the normalized titles, and documents whose match degree exceeds a preset threshold are grouped into one category, thereby obtaining the document category corresponding to each sample document. The preset threshold can be set to 90%, 95%, and so on.
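A simple way to illustrate the match-degree grouping is a greedy pass with `difflib.SequenceMatcher` as the similarity measure; the greedy strategy and the representative-title comparison are choices made for this sketch, since the application does not fix a specific matching algorithm.

```python
import difflib


def group_by_title(titles, threshold=0.9):
    """Assign each title to the first group whose representative matches it
    above the threshold; otherwise start a new group (a new document category)."""
    groups = []  # list of (representative title, member titles)
    for title in titles:
        for representative, members in groups:
            if difflib.SequenceMatcher(None, representative, title).ratio() >= threshold:
                members.append(title)
                break
        else:
            groups.append((title, [title]))
    return groups
```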
For example, if the classification result contains a great many categories, the top 500 document categories with the most sample documents can be selected and the remaining categories removed, to avoid an excessive number of categories burdening the computer system.
(3) selecting one document category from among the document categories as a positive document category, and selecting one document category from the other document categories besides the positive document category as a negative document category;
(4) selecting a sample document from the positive document category and recording it as the training document; meanwhile, selecting a sample document other than the training document from the positive document category and recording it as the positive sample document; and selecting a sample document from the negative document category and recording it as the negative sample document;
(5) constructing the training document triplet from the training document, the positive sample document, and the negative sample document.
Understandably, after document classification has been performed on each sample document according to the normalized document titles to obtain the document category corresponding to each sample document, any one document category can be chosen and one sample document selected from it as the training document, then another document selected from the same category as the positive sample document; then, from the document categories other than the already selected one, one category is chosen and one sample document selected from it as the negative sample document.
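Steps (3) to (5) amount to sampling an anchor and a positive from one category and a negative from another; below is a minimal sketch, with an assumed dict-of-lists layout for the classified sample documents.

```python
import random


def build_triplet(categorized, rng=random):
    """Sample (training document, positive sample, negative sample) from a
    mapping of document category -> list of sample documents."""
    eligible = [cat for cat, docs in categorized.items() if len(docs) >= 2]
    pos_cat = rng.choice(eligible)
    anchor, positive = rng.sample(categorized[pos_cat], 2)  # two distinct docs
    neg_cat = rng.choice([cat for cat in categorized if cat != pos_cat])
    negative = rng.choice(categorized[neg_cat])
    return anchor, positive, negative
```

Requiring at least two documents in the positive category guarantees that the training document and the positive sample document are distinct, as step (4) demands.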
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
In an embodiment, a document classification prediction apparatus is provided, and the apparatus corresponds one-to-one to the document classification prediction method in the above embodiments. As shown in FIG. 5, the apparatus includes a prediction request instruction receiving module 10, a document parsing module 20, a first vector extraction module 30, a document vector set obtaining module 40, and a document category determination module 50. The functional modules are described in detail as follows:
a prediction request instruction receiving module 10 configured to receive a prediction request instruction containing a target document;
a document parsing module 20 configured to perform document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
a first vector extraction module 30 configured to input the text information and the coordinate information into a preset pre-trained language model and perform vector extraction on them, to obtain a document representation vector corresponding to the target document;
a document vector set obtaining module 40 configured to obtain a sample document vector set, where the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
a document category determination module 50 configured to determine the document vector distance between the document representation vector and each sample document vector, and to determine the document category corresponding to the target document according to each document vector distance.
Preferably, as shown in FIG. 6, the document classification prediction apparatus further includes:
a document triplet acquisition module 01 configured to acquire a training document triplet, where the training document triplet contains a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
a second vector extraction module 02 configured to input the training document triplet into an initial language model containing initial parameters, and perform vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
a total loss value determination module 03 configured to determine the total loss value of the language model according to the first training vector, the second training vector, and the third training vector;
a language model training module 04 configured to iteratively update the initial parameters of the initial language model when the total loss value does not reach a preset convergence condition, and to record the converged initial language model as the preset pre-trained language model once the total loss value reaches the preset convergence condition.
Preferably, the second vector extraction module includes:
a word sequence extraction unit configured to extract the word sequences of the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
a high-order feature determination unit configured to determine, by a preset feature representation method, the training high-order features corresponding to each word in the training word sequence, the positive sample high-order features corresponding to each word in the positive sample word sequence, and the negative sample high-order features corresponding to each word in the negative sample word sequence;
an average pooling processing unit configured to perform average pooling on the training high-order features, the positive sample high-order features, and the negative sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
Preferably, the document classification prediction apparatus further includes:
a sample document set acquisition module configured to obtain a preset sample document set, where the sample document set contains at least one sample document, and each sample document is associated with a document title;
a normalization processing module configured to normalize each document title and perform document classification on each sample document according to the normalized document titles, to obtain the document category corresponding to each sample document;
a document category selection module configured to select one document category from among the document categories as a positive document category, and to select one document category from the other document categories besides the positive document category as a negative document category;
a document selection module configured to select a sample document from the positive document category and record it as the training document; meanwhile, to select a sample document other than the training document from the positive document category and record it as the positive sample document; and to select a sample document from the negative document category and record it as the negative sample document;
a triplet construction module configured to construct the training document triplet from the training document, the positive sample document, and the negative sample document.
Preferably, the normalization processing module includes:
a special symbol detection unit configured to detect whether the document title contains a preset special symbol;
a character culling unit configured to, when the document title contains the preset special symbol, remove the preset special symbol and all characters before it, to obtain a culled title;
a special character detection unit configured to detect whether the culled title contains a preset year character and/or a preset count character;
a character replacement unit configured to, when the culled title contains the preset year character and/or the preset count character, replace the preset year character with a first preset character and replace the preset count character with a second preset character, thereby indicating that normalization of the document title is complete.
Preferably, as shown in FIG. 7, the document category determination module 50 includes:
a sample document selection unit 501 configured to select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and to record the selected sample documents as candidate documents;
a document category determining unit 502 configured to obtain the proportion of candidate documents of each document category among all the candidate documents, and to record the document category with the highest proportion as the document category of the target document.
For the specific limitations of the document classification prediction apparatus, reference may be made to the limitations of the document classification prediction method above, which are not repeated here. Each module in the above document classification prediction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to each module.
In an embodiment, a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and the computer readable instructions in the readable storage medium. The database of the computer device is used to store the data used by the document classification prediction method in the above embodiments. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer readable instructions, when executed by the processor, implement a document classification prediction method. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
In an embodiment, a computer device is provided, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer readable instructions:
receiving a prediction request instruction containing a target document;
performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
obtaining a sample document vector set, where the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
determining the document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
In an embodiment, one or more readable storage media storing computer readable instructions are provided, where the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a prediction request instruction containing a target document;
performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
obtaining a sample document vector set, where the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
determining the document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be accomplished by instructing the relevant hardware through computer readable instructions, which can be stored in a non-volatile computer readable storage medium or a volatile computer readable storage medium; when executed, the computer readable instructions may include the processes of the embodiments of the above methods. Any reference to memory, storage, databases, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division of the above functional units and modules is used only as an example; in practical applications, the above functions can be assigned to different functional units or modules as needed, i.e., the internal structure of the apparatus can be divided into different functional units or modules to accomplish all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or equivalently substitute some of the technical features therein; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included in the protection scope of this application.

Claims (20)

  1. A document classification prediction method, wherein the method comprises:
    receiving a prediction request instruction containing a target document;
    performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
    inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
    obtaining a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
    determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
  2. The document classification prediction method of claim 1, wherein before inputting the text information and the coordinate information into the preset pre-trained language model, the method further comprises:
    obtaining a training document triplet, wherein the training document triplet comprises a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
    inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
    determining a total loss value of the language model according to the first training vector, the second training vector, and the third training vector;
    when the total loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
  3. The document classification prediction method of claim 2, wherein inputting the training document triplet into the initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, comprises:
    extracting the word sequences of the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
    determining, by a preset feature representation method, training high-order features corresponding to each word in the training word sequence, positive sample high-order features corresponding to each word in the positive sample word sequence, and negative sample high-order features corresponding to each word in the negative sample word sequence;
    performing average pooling on the training high-order features, the positive sample high-order features, and the negative sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
  4. The document classification prediction method of claim 2, wherein determining the total loss value of the language model according to the first training vector, the second training vector, and the third training vector comprises:
    determining a first document distance between the first training vector and the second training vector, and determining a second document distance between the first training vector and the third training vector;
    determining the total loss value by a triplet loss function according to the first document distance and the second document distance.
  5. The document classification prediction method of claim 2, wherein before obtaining the training document triplet, the method further comprises:
    obtaining a preset sample document set, wherein the sample document set contains at least one sample document, and each sample document is associated with a document title;
    normalizing each document title, and performing document classification on each sample document according to the normalized document titles, to obtain the document category corresponding to each sample document;
    selecting one document category from among the document categories as a positive document category, and selecting one document category from the other document categories besides the positive document category as a negative document category;
    selecting a sample document from the positive document category and recording it as the training document; meanwhile, selecting a sample document other than the training document from the positive document category and recording it as the positive sample document; and selecting a sample document from the negative document category and recording it as the negative sample document;
    constructing the training document triplet from the training document, the positive sample document, and the negative sample document.
  6. The document classification prediction method of claim 5, wherein normalizing each document title comprises:
    detecting whether the document title contains a preset special symbol;
    when the document title contains the preset special symbol, removing the preset special symbol and all characters before it, to obtain a culled title;
    detecting whether the culled title contains a preset year character and/or a preset count character;
    when the culled title contains the preset year character and/or the preset count character, replacing the preset year character with a first preset character and replacing the preset count character with a second preset character, thereby indicating that normalization of the document title is complete.
  7. The document classification prediction method of claim 1, wherein each sample document vector is further associated with a sample document, and determining the document category corresponding to the target document according to each document vector distance comprises:
    selecting a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and recording the selected sample documents as candidate documents;
    obtaining the proportion of candidate documents of each document category among all the candidate documents, and recording the document category with the highest proportion as the document category of the target document.
  8. A document classification prediction apparatus, wherein the apparatus comprises:
    a prediction request instruction receiving module configured to receive a prediction request instruction containing a target document;
    a document parsing module configured to perform document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
    a first vector extraction module configured to input the text information and the coordinate information into a preset pre-trained language model and perform vector extraction on them, to obtain a document representation vector corresponding to the target document;
    a document vector set obtaining module configured to obtain a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
    a document category determination module configured to determine a document vector distance between the document representation vector and each sample document vector, and to determine the document category corresponding to the target document according to each document vector distance.
  9. A computer device, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer readable instructions:
    receiving a prediction request instruction containing a target document;
    performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
    inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
    obtaining a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
    determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
  10. The computer device of claim 9, wherein before inputting the text information and the coordinate information into the preset pre-trained language model, the processor further implements the following steps when executing the computer readable instructions:
    obtaining a training document triplet, wherein the training document triplet comprises a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
    inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
    determining a total loss value of the language model according to the first training vector, the second training vector, and the third training vector;
    when the total loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
  11. The computer device of claim 10, wherein inputting the training document triplet into the initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, comprises:
    extracting the word sequences of the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
    determining, by a preset feature representation method, training high-order features corresponding to each word in the training word sequence, positive sample high-order features corresponding to each word in the positive sample word sequence, and negative sample high-order features corresponding to each word in the negative sample word sequence;
    performing average pooling on the training high-order features, the positive sample high-order features, and the negative sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
  12. The computer device of claim 10, wherein determining the total loss value of the language model according to the first training vector, the second training vector, and the third training vector comprises:
    determining a first document distance between the first training vector and the second training vector, and determining a second document distance between the first training vector and the third training vector;
    determining the total loss value by a triplet loss function according to the first document distance and the second document distance.
  13. The computer device of claim 10, wherein before obtaining the training document triplet, the processor further implements the following steps when executing the computer readable instructions:
    obtaining a preset sample document set, wherein the sample document set contains at least one sample document, and each sample document is associated with a document title;
    normalizing each document title, and performing document classification on each sample document according to the normalized document titles, to obtain the document category corresponding to each sample document;
    selecting one document category from among the document categories as a positive document category, and selecting one document category from the other document categories besides the positive document category as a negative document category;
    selecting a sample document from the positive document category and recording it as the training document; meanwhile, selecting a sample document other than the training document from the positive document category and recording it as the positive sample document; and selecting a sample document from the negative document category and recording it as the negative sample document;
    constructing the training document triplet from the training document, the positive sample document, and the negative sample document.
  14. The computer device of claim 13, wherein normalizing each document title comprises:
    detecting whether the document title contains a preset special symbol;
    when the document title contains the preset special symbol, removing the preset special symbol and all characters before it, to obtain a culled title;
    detecting whether the culled title contains a preset year character and/or a preset count character;
    when the culled title contains the preset year character and/or the preset count character, replacing the preset year character with a first preset character and replacing the preset count character with a second preset character, thereby indicating that normalization of the document title is complete.
  15. One or more readable storage media storing computer readable instructions, wherein the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving a prediction request instruction containing a target document;
    performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
    inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
    obtaining a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and each sample document vector is associated with a document category;
    determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to each document vector distance.
  16. The readable storage medium of claim 15, wherein before inputting the text information and the coordinate information into the preset pre-trained language model, the computer readable instructions, when executed by one or more processors, further cause the one or more processors to perform the following steps:
    obtaining a training document triplet, wherein the training document triplet comprises a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
    inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
    determining a total loss value of the language model according to the first training vector, the second training vector, and the third training vector;
    when the total loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
  17. The readable storage medium of claim 16, wherein inputting the training document triplet into the initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, comprises:
    extracting the word sequences of the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
    determining, by a preset feature representation method, training high-order features corresponding to each word in the training word sequence, positive sample high-order features corresponding to each word in the positive sample word sequence, and negative sample high-order features corresponding to each word in the negative sample word sequence;
    performing average pooling on the training high-order features, the positive sample high-order features, and the negative sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
  18. The readable storage medium of claim 16, wherein determining the total loss value of the language model according to the first training vector, the second training vector, and the third training vector comprises:
    determining a first document distance between the first training vector and the second training vector, and determining a second document distance between the first training vector and the third training vector;
    determining the total loss value by a triplet loss function according to the first document distance and the second document distance.
  19. The readable storage medium of claim 16, wherein before obtaining the training document triplet, the computer readable instructions, when executed by one or more processors, further cause the one or more processors to perform the following steps:
    obtaining a preset sample document set, wherein the sample document set contains at least one sample document, and each sample document is associated with a document title;
    normalizing each document title, and performing document classification on each sample document according to the normalized document titles, to obtain the document category corresponding to each sample document;
    selecting one document category from among the document categories as a positive document category, and selecting one document category from the other document categories besides the positive document category as a negative document category;
    selecting a sample document from the positive document category and recording it as the training document; meanwhile, selecting a sample document other than the training document from the positive document category and recording it as the positive sample document; and selecting a sample document from the negative document category and recording it as the negative sample document;
    constructing the training document triplet from the training document, the positive sample document, and the negative sample document.
  20. The readable storage medium of claim 19, wherein normalizing each document title comprises:
    detecting whether the document title contains a preset special symbol;
    when the document title contains the preset special symbol, removing the preset special symbol and all characters before it, to obtain a culled title;
    detecting whether the culled title contains a preset year character and/or a preset count character;
    when the culled title contains the preset year character and/or the preset count character, replacing the preset year character with a first preset character and replacing the preset count character with a second preset character, thereby indicating that normalization of the document title is complete.
PCT/CN2021/125227 2020-12-21 2021-10-21 文档分类预测方法、装置、计算机设备及存储介质 WO2022134805A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011521171.0 2020-12-21
CN202011521171.0A CN112699923A (zh) 2020-12-21 2020-12-21 文档分类预测方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022134805A1 true WO2022134805A1 (zh) 2022-06-30

Family

ID=75509652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/125227 WO2022134805A1 (zh) 2020-12-21 2021-10-21 文档分类预测方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN112699923A (zh)
WO (1) WO2022134805A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115587175A (zh) * 2022-12-08 2023-01-10 阿里巴巴达摩院(杭州)科技有限公司 人机对话及预训练语言模型训练方法、***及电子设备
CN117910980A (zh) * 2024-03-19 2024-04-19 国网山东省电力公司信息通信公司 一种电力档案数据治理方法、***、设备及介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699923A (zh) * 2020-12-21 2021-04-23 深圳壹账通智能科技有限公司 文档分类预测方法、装置、计算机设备及存储介质
CN113505579A (zh) * 2021-06-03 2021-10-15 北京达佳互联信息技术有限公司 文档处理方法、装置、电子设备及存储介质
CN115578739A (zh) * 2022-09-16 2023-01-06 上海来也伯特网络科技有限公司 结合rpa和ai实现ia的分类模型的训练方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090216693A1 (en) * 1999-04-28 2009-08-27 Pal Rujan Classification method and apparatus
CN110298338A (zh) * 2019-06-20 2019-10-01 北京易道博识科技有限公司 一种文档图像分类方法及装置
CN111400499A (zh) * 2020-03-24 2020-07-10 网易(杭州)网络有限公司 文档分类模型的训练方法、文档分类方法、装置及设备
CN112016273A (zh) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 文档目录生成方法、装置、电子设备及可读存储介质
CN112052331A (zh) * 2019-06-06 2020-12-08 武汉Tcl集团工业研究院有限公司 一种处理文本信息的方法及终端
CN112699923A (zh) * 2020-12-21 2021-04-23 深圳壹账通智能科技有限公司 文档分类预测方法、装置、计算机设备及存储介质


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115587175A (zh) * 2022-12-08 2023-01-10 阿里巴巴达摩院(杭州)科技有限公司 人机对话及预训练语言模型训练方法、***及电子设备
CN115587175B (zh) * 2022-12-08 2023-03-14 阿里巴巴达摩院(杭州)科技有限公司 人机对话及预训练语言模型训练方法、***及电子设备
CN117910980A (zh) * 2024-03-19 2024-04-19 国网山东省电力公司信息通信公司 一种电力档案数据治理方法、***、设备及介质
CN117910980B (zh) * 2024-03-19 2024-06-11 国网山东省电力公司信息通信公司 一种电力档案数据治理方法、***、设备及介质

Also Published As

Publication number Publication date
CN112699923A (zh) 2021-04-23

Similar Documents

Publication Publication Date Title
WO2022134805A1 (zh) 文档分类预测方法、装置、计算机设备及存储介质
WO2022142613A1 (zh) 训练语料扩充方法及装置、意图识别模型训练方法及装置
WO2020147238A1 (zh) 关键词的确定方法、自动评分方法、装置、设备及介质
WO2021042503A1 (zh) 信息分类抽取方法、装置、计算机设备和存储介质
WO2018153265A1 (zh) 关键词提取方法、计算机设备和存储介质
WO2020199591A1 (zh) 文本分类模型训练方法、装置、计算机设备及存储介质
WO2019136993A1 (zh) 文本相似度计算方法、装置、计算机设备和存储介质
CN111444723B (zh) 信息抽取方法、计算机设备和存储介质
CN111666401B (zh) 基于图结构的公文推荐方法、装置、计算机设备及介质
CN112926654B (zh) 预标注模型训练、证件预标注方法、装置、设备及介质
WO2022227162A1 (zh) 问答数据处理方法、装置、计算机设备及存储介质
WO2021169423A1 (zh) 客服录音的质检方法、装置、设备及存储介质
CN110427612B (zh) 基于多语言的实体消歧方法、装置、设备和存储介质
WO2022141864A1 (zh) 对话意图识别模型训练方法、装置、计算机设备及介质
WO2022116436A1 (zh) 长短句文本语义匹配方法、装置、计算机设备及存储介质
CN112380837B (zh) 基于翻译模型的相似句子匹配方法、装置、设备及介质
CN110598210B (zh) 实体识别模型训练、实体识别方法、装置、设备及介质
US20200082213A1 (en) Sample processing method and device
WO2022142108A1 (zh) 面试实体识别模型训练、面试信息实体提取方法及装置
CN113051914A (zh) 一种基于多特征动态画像的企业隐藏标签抽取方法及装置
WO2020132933A1 (zh) 短文本过滤方法、装置、介质及计算机设备
CN112100377A (zh) 文本分类方法、装置、计算机设备和存储介质
CN114187595A (zh) 基于视觉特征和语义特征融合的文档布局识别方法及***
CN113806613B (zh) 训练图像集生成方法、装置、计算机设备及存储介质
CN114266252A (zh) 命名实体识别方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908806

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.10.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21908806

Country of ref document: EP

Kind code of ref document: A1