CN113177109A - Text weak labeling method, device, equipment and storage medium - Google Patents

Text weak labeling method, device, equipment and storage medium

Info

Publication number
CN113177109A
CN113177109A
Authority
CN
China
Prior art keywords
text
word
classified
vocabulary
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110587694.3A
Other languages
Chinese (zh)
Inventor
黄海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110587694.3A
Publication of CN113177109A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text weak labeling method, device, equipment and storage medium. The method comprises the following steps: acquiring a text to be classified, and extracting a tag word from the text to be classified; selecting a target sentence containing the tag word from the text to be classified; predicting, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence; selecting a first preset number of target words from the preset lexicon according to those probabilities; counting, for each category, the number of words that the category's preset dictionary has in common with the target words; and taking the target category whose preset dictionary has an overlap count larger than a second preset number as the weak label of the text to be classified. The beneficial effects of the invention are that texts are labeled automatically, the labeling period is shortened, and the investment of human resources is reduced.

Description

Text weak labeling method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a text weak labeling method, device, equipment and storage medium.
Background
Existing deep-learning-based text classification usually requires high-quality labeled data at a certain scale, and such data is labeled manually. Manual labeling, however, is costly and has a long labeling period, so it cannot meet the immediate needs of a service. To address insufficient data, the prior art generally applies text enhancement to existing labeled data to expand the corpus, but text enhancement still requires a significant expenditure of human resources.
Disclosure of Invention
The main purpose of the invention is to provide a text weak labeling method, device, equipment and storage medium, aiming to solve the problems of the high cost and long labeling period of manual labeling.
The invention provides a text weak labeling method, comprising the following steps:
acquiring a text to be classified, and extracting a tag word from the text to be classified;
selecting a target sentence containing the tag word from the text to be classified;
predicting, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence;
selecting a first preset number of target words from the preset lexicon according to the probability of each word replacing the tag word;
counting, for each category, the number of words that the category's preset dictionary has in common with the target words; and
taking the target category whose preset dictionary has an overlap count larger than a second preset number as the weak label of the text to be classified.
The invention also provides a text weak labeling device, comprising:
an extraction module, configured to acquire a text to be classified and extract a tag word from the text to be classified;
a selecting module, configured to select a target sentence containing the tag word from the text to be classified;
a replacing module, configured to predict, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence;
a selection module, configured to select a first preset number of target words from the preset lexicon according to the probability of each word replacing the tag word;
a detection module, configured to count, for each category, the number of words that the category's preset dictionary has in common with the target words; and
a weak labeling module, configured to take the target category whose preset dictionary has an overlap count larger than a second preset number as the weak label of the text to be classified.
The invention also provides a computer device comprising a memory storing a computer program and a processor that implements the steps of any of the above methods when executing the computer program.
The invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of any of the above methods when executed by a processor.
The beneficial effects of the invention are as follows: a tag word is extracted from the text to be classified, a target sentence containing the tag word is selected, the probability of each word in a preset lexicon replacing the tag word is predicted, target words are selected accordingly, the corresponding category is obtained from the target words, and the text to be classified is weakly labeled with that category. The text is thus labeled automatically, the labeling period is shortened, and the investment of human resources is reduced.
Drawings
FIG. 1 is a flowchart illustrating a method for weakly labeling a text according to an embodiment of the present invention;
FIG. 2 is a block diagram schematically illustrating a weak labeling apparatus for text according to an embodiment of the present invention;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all directional indicators (such as up, down, left, right, front, back, etc.) in the embodiments of the present invention are only used to explain the relative position relationship between the components, the motion situation, etc. in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indicator is changed accordingly, and the connection may be a direct connection or an indirect connection.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and B, may mean: a exists alone, A and B exist simultaneously, and B exists alone.
In addition, the descriptions "first", "second", etc. in the present invention are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. Technical solutions of different embodiments may be combined with each other, but only insofar as a person skilled in the art can realize the combination; when technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
Referring to fig. 1, the present invention provides a method for weakly labeling a text, including:
S1: acquiring a text to be classified, and extracting a tag word from the text to be classified;
S2: selecting a target sentence containing the tag word from the text to be classified;
S3: predicting, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence;
S4: selecting a first preset number of target words from the preset lexicon according to the probability of each word replacing the tag word;
S5: counting, for each category, the number of words that the category's preset dictionary has in common with the target words;
S6: taking the target category whose preset dictionary has an overlap count larger than a second preset number as the weak label of the text to be classified.
As described in step S1, the text to be classified is acquired and a tag word is extracted from it. The text to be classified may be obtained directly as input from a user, retrieved from a text database, or acquired from a device. To extract the tag word, the text to be classified may be segmented by a preset word segmentation tool, and the word with the highest frequency is then taken as the tag word according to the count of each word.
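As one illustration of step S1, the sketch below segments a text with the jieba tool named later in this description and takes the most frequent word as the tag word; the function name and the choice of jieba are assumptions for illustration, not part of the filing.

```python
from collections import Counter

import jieba  # any of jieba, SnowNLP, THULAC or NLPIR could fill this role


def extract_tag_word(text: str) -> str:
    """Segment the text and return its most frequent word as the tag word."""
    words = [w for w in jieba.lcut(text) if w.strip()]
    counts = Counter(words)
    tag_word, _ = counts.most_common(1)[0]  # highest-frequency word
    return tag_word
```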
As described in step S2, the target sentence containing the tag word is selected from the text to be classified. Because the tag word has already been determined, the corresponding target sentence is selected directly according to the position of the tag word in the text to be classified.
As described in step S3, the probability that each word in the preset lexicon replaces the tag word in the target sentence is predicted by a prediction model. The prediction model is a BERT model trained on a large number of training texts: it captures the coherence of the tag word with its surrounding context in the sentence, and the corresponding lexicon of similar words is then found. The prediction model may also be a simple category identification model, which obtains the category to which the tag word belongs and then finds the corresponding preset lexicon based on that category. The tag word is then replaced by words from the preset lexicon to facilitate the subsequent detection.
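A hedged sketch of this prediction step follows, using the Hugging Face fill-mask pipeline with a Chinese BERT checkpoint; the model name, the helper name and the restriction to single-token candidates are assumptions, since the filing does not specify them.

```python
from transformers import pipeline

# A masked-language model stands in for the prediction model of step S3.
fill_mask = pipeline("fill-mask", model="bert-base-chinese")


def replacement_probabilities(sentence: str, tag_word: str,
                              lexicon: list[str]) -> dict[str, float]:
    """Score how likely each lexicon word is to replace the tag word."""
    masked = sentence.replace(tag_word, fill_mask.tokenizer.mask_token, 1)
    # `targets` restricts scoring to the preset lexicon; multi-token words
    # would need to be masked and scored token by token instead.
    predictions = fill_mask(masked, targets=lexicon)
    return {p["token_str"]: p["score"] for p in predictions}
```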
As described in step S4, a first preset number of target words is selected from the preset lexicon according to the probability of each word replacing the tag word. That probability can be obtained from the BERT model, which computes the relationship between words. For example, when "apple" replaces "banana" in the text to be classified and the preceding word is "eat", the probability of "eat" and "apple" appearing together can serve as the replacement probability. The words are then ranked by their probabilities, and the first preset number of target words, for example 50, is selected in order.
As described in steps S5 and S6, the number of words that each category's preset dictionary has in common with the target words is counted; category words of each category are stored in its preset dictionary in advance. If the number of target words that also appear in a category's preset dictionary exceeds a second preset number (which is less than or equal to the first preset number, for example 20), the category corresponding to that preset dictionary can be considered the category of the text to be classified, and the text is weakly labeled with it so that a further determination can be made later.
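The overlap test of steps S5 and S6 reduces to a set intersection; the sketch below is a minimal illustration with assumed names.

```python
def weak_labels(target_words: set[str],
                category_dicts: dict[str, set[str]],
                second_preset_number: int) -> list[str]:
    """Return every category whose preset dictionary shares more than
    `second_preset_number` words with the target words (steps S5-S6)."""
    labels = []
    for category, dictionary in category_dicts.items():
        overlap = len(target_words & dictionary)  # words in common
        if overlap > second_preset_number:
            labels.append(category)
    return labels
```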
In one embodiment, the step S1 of extracting the tag word from the text to be classified includes:
S101: inputting the text to be classified into a preset word segmentation tool to obtain each first word and the count corresponding to each first word;
S102: calculating the word frequency of each first word through a word frequency calculation formula, according to each first word and its count;
S103: calculating the reverse document frequency corresponding to each first word according to the formula

    IDF(t_i) = log( D / |{ j : t_i ∈ d_j }| )

wherein D is the total number of sentences in the text to be classified, |{ j : t_i ∈ d_j }| is the number of sentences containing the first word t_i, IDF represents the reverse document frequency, t_i denotes the i-th first word, and d_j is the j-th sentence containing t_i;
S104: calculating the weight of each first word according to the formula W = IDF × TF, wherein TF represents the word frequency;
S105: selecting the first word with the largest weight as the tag word.
As described in step S101, the text to be classified is input into a preset word segmentation tool to obtain each first word and the count corresponding to it. The preset word segmentation tool may be any one of jieba, SnowNLP, THULAC and NLPIR; the text to be classified is input, the corresponding first words are obtained, and the first words are then counted to obtain the count corresponding to each.
As described in step S102, the word frequency of each first word is calculated by the word frequency calculation formula

    F(x_i) = f(x_i) / m

wherein f(x_i) is the count corresponding to the i-th word, m is the total number of words in the document to be classified, and F(x_i) is the word frequency corresponding to the i-th word.
As described in steps S103 to S105, the reverse document frequency corresponding to each first word is then calculated according to the formula. The reverse document frequency measures a word's ability to distinguish between categories: the fewer sentences a word occurs in, the greater its ability to distinguish texts of different categories, that is, the greater its reverse document frequency. By introducing IDF, the weight of each first word is calculated and the first word with the largest weight is selected as the tag word, which improves the accuracy of automatic tag word selection and makes the subsequent text analysis smoother.
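Steps S101 to S105 can be read as sentence-level TF-IDF. The sketch below follows that reading; splitting sentences on the Chinese full stop and the +1 smoothing term in the logarithm are illustrative assumptions.

```python
import math
from collections import Counter

import jieba


def tag_word_by_tfidf(text: str) -> str:
    """Pick the word with the largest weight W = TF * IDF as the tag word."""
    sentences = [s for s in text.split("。") if s.strip()] or [text]
    words = [w for w in jieba.lcut(text) if w.strip()]
    counts = Counter(words)
    m = len(words)  # total word count, the m of F(x_i) = f(x_i) / m
    best_word, best_weight = "", float("-inf")
    for word, f in counts.items():
        tf = f / m
        df = sum(1 for s in sentences if word in s)  # |{j : t_i in d_j}|
        idf = math.log(len(sentences) / (1 + df))  # +1 avoids division by zero
        weight = tf * idf
        if weight > best_weight:
            best_word, best_weight = word, weight
    return best_word
```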
In one embodiment, before the step S3 of predicting, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence, the method further includes:
S201: deleting all annotations in the text to be classified to obtain an intermediate classified text;
S202: screening a plurality of similar words from a knowledge base based on the tag word;
S203: replacing the tag word in the intermediate classified text with each similar word in turn, and calculating the sentence smoothness after each replacement;
S204: taking the similar words whose sentence smoothness is greater than a preset smoothness as attribute words;
S205: constructing the preset lexicon from the attribute words.
As described in step S201, all annotations in the text to be classified are deleted, because some annotations are explanations of nouns, or are otherwise unrelated to the text subject, and would affect the subsequent classification; they are therefore deleted first.
As described in steps S202 to S203, a plurality of similar words is screened from the knowledge base based on the tag word, that is, words with meanings similar to the tag word are screened from the database. The tag word is then replaced by each selected word in turn, and the smoothness of the resulting sentence is calculated; the manner of calculating sentence smoothness is described in detail later and is not repeated here.
As described in steps S204 to S205, the similar words whose sentence smoothness is greater than the preset smoothness are taken as attribute words, and the attribute words are then gathered into the preset lexicon, so that target words meeting the requirements can subsequently be found from the preset lexicon.
In an embodiment, the step S203 of calculating the sentence smoothness after each replacement includes:
S2031: acquiring the probability of each similar word appearing together with each word, other than the tag word, in the intermediate classified text;
S2032: calculating the sentence smoothness from those co-occurrence probabilities according to the smoothness formula, wherein i > 1, i ∈ Z, c denotes one of the similar words, w_1, w_2, …, w_{t-1}, w_t, w_{t+1}, …, w_m denote the words of the text to be classified, w_t denotes the tag word, P(c|w_i) denotes the probability of the similar word c appearing together with the word w_i, and P(c|w_{i-1}) denotes the probability of c appearing together with w_{i-1}.
As described in steps S2031 to S2032, the calculation of the sentence smoothness is carried out. The probability of two words appearing together in one sentence can be obtained from a corresponding BERT model, and the sentence smoothness is then calculated with the smoothness formula. In the formula, the tag word w_t in the intermediate classified text has already been replaced, and the terms involving w_t contain the probability of the tag word and the similar word appearing together; since those two words no longer co-occur after the replacement, the terms containing the tag word must be subtracted to avoid introducing errors. The smoothness calculation is thus realized.
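The filed formula is not reproduced in this text, but one hedged reading of sentence smoothness is a masked-language-model pseudo-log-likelihood: mask each position in turn and accumulate the log-probability BERT assigns to the original token. The sketch below follows that reading only; the checkpoint and all names are illustrative assumptions.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()


def sentence_smoothness(sentence: str) -> float:
    """Average log-probability of each token given its surrounding context."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / (len(ids) - 2)  # length-normalised smoothness score
```

A sentence whose replacement word fits its context scores higher than one where the substitution breaks local coherence, which matches the role the smoothness threshold plays in step S204.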
In an embodiment, the step S4 of selecting the first preset number of target words further includes:
S401: detecting whether the selected target words contain words identical to one another or to the tag word;
S402: if so, deleting the corresponding duplicate words from the selected target words;
S403: after the deletion, selecting from the remaining words, according to the probability of each word replacing the tag word, as many words as were deleted, so that the number of selected target words stays at the first preset number.
As described in steps S401 to S403, the target words are checked: it is detected whether the target words contain a word identical to the tag word or to one another. If so, the selected target words contain a repeated word or a word identical to the tag word, which would make the selection erroneous and could also influence the final category determination. The duplicate words are therefore deleted, and selection continues from the remaining words in order of probability to make up for the deleted duplicates, ensuring that the first preset number of target words is available and that the criterion is consistent for every text to be classified, thereby reducing errors.
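A minimal sketch of this check, with assumed names, simply walks the probability-ranked candidates and skips duplicates and the tag word until the quota is filled.

```python
def select_target_words(ranked_words: list[str], tag_word: str,
                        first_preset_number: int) -> list[str]:
    """Steps S401-S403: keep the top candidates, skipping duplicates
    and any word identical to the tag word, until the quota is met."""
    targets: list[str] = []
    seen: set[str] = set()
    for word in ranked_words:  # assumed sorted by replacement probability
        if word == tag_word or word in seen:
            continue  # drop duplicates and words equal to the tag word
        seen.add(word)
        targets.append(word)
        if len(targets) == first_preset_number:
            break
    return targets
```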
In an embodiment, after the step S6 of taking the target category whose preset dictionary has an overlap count larger than the second preset number as the weak label of the text to be classified, the method further includes:
S701: acquiring a comparison text in the target category that contains the tag words;
S702: calculating the similarity value between the text to be classified and the comparison text according to the similarity calculation formula

    cos(I, R) = Σ_{i=1..n} (w_i·x_i)(w_i·y_i) / ( √(Σ_{i=1..n} (w_i·x_i)²) · √(Σ_{i=1..n} (w_i·y_i)²) )

wherein I denotes the text to be classified, R denotes the comparison text, cos(I, R) denotes the similarity value, x_i denotes the count corresponding to the i-th word of the text to be classified, y_i denotes the count corresponding to the i-th word of the comparison text, n denotes the number of words contained in the comparison text and the text to be classified, and w_i denotes the weight corresponding to the i-th word;
S703: judging whether the similarity value is greater than a preset similarity value;
S704: if the similarity value is greater than the preset similarity value, determining that the weak label is a strong label of the text to be classified.
As described in steps S701 to S704, the detection of the weak label is carried out. A comparison text containing the tag words is acquired from the target category; it should be noted that the tag word of the comparison text is not necessarily the same as the tag word of the text to be classified, the aim being only to find a text with the same vocabulary so that the similarity of the two texts can be better calculated. Because the order of words has little influence on the similarity determination, it can be ignored, and the calculation uses only the count of each word and its corresponding weight. The closer the calculated similarity value is to 1, the more similar the text to be classified is to the comparison text; the closer it is to -1, the more dissimilar they are. A preset similarity value can therefore be set: when the similarity value is greater than the preset similarity value, the categories of the two texts are essentially consistent, the comparison text and the text to be classified can be considered to belong to the same category, and the corresponding weak label is converted into a strong label.
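Under the weighted-cosine reading given above, the promotion check of steps S701 to S704 can be sketched as follows; the names and the example threshold are assumptions.

```python
import math


def weighted_cosine(x: list[float], y: list[float], w: list[float]) -> float:
    """cos(I, R) over word-count vectors x, y with per-word weights w."""
    num = sum((wi * xi) * (wi * yi) for xi, yi, wi in zip(x, y, w))
    norm_x = math.sqrt(sum((wi * xi) ** 2 for xi, wi in zip(x, w)))
    norm_y = math.sqrt(sum((wi * yi) ** 2 for yi, wi in zip(y, w)))
    return num / (norm_x * norm_y) if norm_x and norm_y else 0.0


def promote_weak_label(x, y, w, preset_similarity: float = 0.8) -> bool:
    """Steps S703-S704: the weak label becomes strong above the threshold."""
    return weighted_cosine(x, y, w) > preset_similarity
```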
Referring to fig. 2, the invention further provides a text weak labeling device, including:
an extraction module 10, configured to acquire a text to be classified and extract a tag word from the text to be classified;
a selecting module 20, configured to select a target sentence containing the tag word from the text to be classified;
a replacing module 30, configured to predict, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence;
a selection module 40, configured to select a first preset number of target words from the preset lexicon according to the probability of each word replacing the tag word;
a detection module 50, configured to count, for each category, the number of words that the category's preset dictionary has in common with the target words; and
a weak labeling module 60, configured to take the target category whose preset dictionary has an overlap count larger than a second preset number as the weak label of the text to be classified.
In one embodiment, the selecting module 20 includes:
an input sub-module, configured to input the text to be classified into a preset word segmentation tool to obtain each first word and the count corresponding to each first word;
a first calculation sub-module, configured to calculate the word frequency of each first word through the word frequency calculation formula, according to each first word and its count;
a second calculation sub-module, configured to calculate the reverse document frequency corresponding to each first word according to the formula

    IDF(t_i) = log( D / |{ j : t_i ∈ d_j }| )

wherein D is the total number of sentences in the text to be classified, |{ j : t_i ∈ d_j }| is the number of sentences containing the first word t_i, IDF represents the reverse document frequency, t_i denotes the i-th first word, and d_j is the j-th sentence containing t_i;
a third calculation sub-module, configured to calculate the weight of each first word according to the formula W = IDF × TF, wherein TF represents the word frequency;
a selection sub-module, configured to select the first word with the largest weight as the tag word.
The beneficial effects of the invention are as follows: a tag word is extracted from the text to be classified, a target sentence containing the tag word is selected, the probability of each word in a preset lexicon replacing the tag word is predicted, target words are selected accordingly, the corresponding category is obtained from the target words, and the text to be classified is weakly labeled with that category. The text is thus labeled automatically, the labeling period is shortened, and the investment of human resources is reduced.
Referring to fig. 3, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus, wherein the processor of the computer device provides computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the preset dictionaries of the categories and the like. The network interface of the computer device is used to communicate with an external terminal through a network connection. When executed by the processor, the computer program can implement the text weak labeling method of any of the above embodiments.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for weakly labeling a text as described in any of the above embodiments can be implemented.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by hardware associated with instructions of a computer program, which may be stored on a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes that element.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The block chain underlying platform can comprise processing modules such as user management, basic service, intelligent contract and operation monitoring. The user management module is responsible for identity information management of all blockchain participants, and comprises public and private key generation maintenance (account management), key management, user real identity and blockchain address corresponding relation maintenance (authority management) and the like, and under the authorization condition, the user management module supervises and audits the transaction condition of certain real identities and provides rule configuration (wind control audit) of risk control; the basic service module is deployed on all block chain node equipment and used for verifying the validity of the service request, recording the service request to storage after consensus on the valid request is completed, for a new service request, the basic service firstly performs interface adaptation analysis and authentication processing (interface adaptation), then encrypts service information (consensus management) through a consensus algorithm, transmits the service information to a shared account (network communication) completely and consistently after encryption, and performs recording and storage; the intelligent contract module is responsible for registering and issuing contracts, triggering the contracts and executing the contracts, developers can define contract logics through a certain programming language, issue the contract logics to a block chain (contract registration), call keys or other event triggering and executing according to the logics of contract clauses, complete the contract logics and simultaneously provide the function of upgrading and canceling the contracts; the operation monitoring module is mainly responsible for deployment, configuration modification, contract setting, cloud adaptation in the product release process and visual output of real-time states in product operation, such as: alarm, monitoring network conditions, monitoring node equipment health status, and the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (10)

1. A text weak labeling method, comprising:
acquiring a text to be classified, and extracting a tag word from the text to be classified;
selecting a target sentence containing the tag word from the text to be classified;
predicting, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence;
selecting a first preset number of target words from the preset lexicon according to the probability of each word replacing the tag word;
counting, for each category, the number of words that the category's preset dictionary has in common with the target words; and
taking the target category whose preset dictionary has an overlap count larger than a second preset number as the weak label of the text to be classified.
2. The text weak labeling method according to claim 1, wherein the step of extracting a tag word from the text to be classified comprises:
inputting the text to be classified into a preset word segmentation tool to obtain each first word and the count corresponding to each first word;
calculating the word frequency of each first word through a word frequency calculation formula, according to each first word and its count;
calculating the reverse document frequency corresponding to each first word according to the formula

    IDF(t_i) = log( D / |{ j : t_i ∈ d_j }| )

wherein D is the total number of sentences in the text to be classified, |{ j : t_i ∈ d_j }| is the number of sentences containing the first word t_i, IDF represents the reverse document frequency, t_i denotes the i-th first word, and d_j is the j-th sentence containing t_i;
calculating the weight of each first word according to the formula W = IDF × TF, wherein TF represents the word frequency; and
selecting the first word with the largest weight as the tag word.
3. The text weak labeling method according to claim 1, wherein before the step of predicting, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence, the method further comprises:
deleting all annotations in the text to be classified to obtain an intermediate classified text;
screening a plurality of similar words from a knowledge base based on the tag word;
replacing the tag word in the intermediate classified text with each similar word in turn, and calculating the sentence smoothness after each replacement;
taking the similar words whose sentence smoothness is greater than a preset smoothness as attribute words; and
constructing the preset lexicon from the attribute words.
4. The text weak labeling method according to claim 3, wherein the step of calculating the sentence smoothness after each replacement comprises:
acquiring the probability of each similar word appearing together with each word, other than the tag word, in the intermediate classified text; and
calculating the sentence smoothness from those co-occurrence probabilities according to the smoothness formula, wherein i > 1, i ∈ Z, c denotes one of the similar words, w_1, w_2, …, w_{t-1}, w_t, w_{t+1}, …, w_m denote the words of the text to be classified, w_t denotes the tag word, P(c|w_i) denotes the probability of the similar word c appearing together with the word w_i, and P(c|w_{i-1}) denotes the probability of c appearing together with w_{i-1}.
5. The text weak labeling method according to claim 1, wherein the step of selecting a first preset number of target words further comprises:
detecting whether the selected target words contain words identical to one another or to the tag word;
if so, deleting the corresponding duplicate words from the selected target words; and
after the deletion, selecting from the remaining words, according to the probability of each word replacing the tag word, as many words as were deleted, so that the number of selected target words stays at the first preset number.
6. The text weak labeling method according to claim 1, wherein after the step of taking the target category whose preset dictionary has an overlap count larger than the second preset number as the weak label of the text to be classified, the method further comprises:
acquiring a comparison text in the target category that contains the tag words;
calculating the similarity value between the text to be classified and the comparison text according to the similarity calculation formula

    cos(I, R) = Σ_{i=1..n} (w_i·x_i)(w_i·y_i) / ( √(Σ_{i=1..n} (w_i·x_i)²) · √(Σ_{i=1..n} (w_i·y_i)²) )

wherein I denotes the text to be classified, R denotes the comparison text, cos(I, R) denotes the similarity value, x_i denotes the count corresponding to the i-th word of the text to be classified, y_i denotes the count corresponding to the i-th word of the comparison text, n denotes the number of words contained in the comparison text and the text to be classified, and w_i denotes the weight corresponding to the i-th word;
judging whether the similarity value is greater than a preset similarity value; and
if the similarity value is greater than the preset similarity value, determining that the weak label is a strong label of the text to be classified.
7. A text weak labeling device, comprising:
an extraction module, configured to acquire a text to be classified and extract a tag word from the text to be classified;
a selecting module, configured to select a target sentence containing the tag word from the text to be classified;
a replacing module, configured to predict, through a prediction model, the probability that each word in a preset lexicon replaces the tag word in the target sentence;
a selection module, configured to select a first preset number of target words from the preset lexicon according to the probability of each word replacing the tag word;
a detection module, configured to count, for each category, the number of words that the category's preset dictionary has in common with the target words; and
a weak labeling module, configured to take the target category whose preset dictionary has an overlap count larger than a second preset number as the weak label of the text to be classified.
8. The text weak labeling device according to claim 7, wherein the selecting module comprises:
an input sub-module, configured to input the text to be classified into a preset word segmentation tool to obtain each first word and the count corresponding to each first word;
a first calculation sub-module, configured to calculate the word frequency of each first word through the word frequency calculation formula, according to each first word and its count;
a second calculation sub-module, configured to calculate the reverse document frequency corresponding to each first word according to the formula

    IDF(t_i) = log( D / |{ j : t_i ∈ d_j }| )

wherein D is the total number of sentences in the text to be classified, |{ j : t_i ∈ d_j }| is the number of sentences containing the first word t_i, IDF represents the reverse document frequency, t_i denotes the i-th first word, and d_j is the j-th sentence containing t_i;
a third calculation sub-module, configured to calculate the weight of each first word according to the formula W = IDF × TF, wherein TF represents the word frequency; and
a selection sub-module, configured to select the first word with the largest weight as the tag word.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202110587694.3A 2021-05-27 2021-05-27 Text weak labeling method, device, equipment and storage medium Pending CN113177109A (en)

Priority Applications (1)

Application Number: CN202110587694.3A
Publication: CN113177109A (en)
Priority Date: 2021-05-27
Filing Date: 2021-05-27
Title: Text weak labeling method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202110587694.3A
Publication: CN113177109A (en)
Priority Date: 2021-05-27
Filing Date: 2021-05-27
Title: Text weak labeling method, device, equipment and storage medium

Publications (1)

Publication Number: CN113177109A
Publication Date: 2021-07-27

Family

ID=76927540

Family Applications (1)

Application Number: CN202110587694.3A
Publication: CN113177109A (en)
Priority Date: 2021-05-27
Filing Date: 2021-05-27
Title: Text weak labeling method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113177109A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971677A (en) * 2013-02-01 2014-08-06 腾讯科技(深圳)有限公司 Acoustic language model training method and device
CN107480200A (en) * 2017-07-17 2017-12-15 深圳先进技术研究院 Word mask method, device, server and the storage medium of word-based label
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
US10467339B1 (en) * 2018-06-28 2019-11-05 Sap Se Using machine learning and natural language processing to replace gender biased words within free-form text
CN110188351A (en) * 2019-05-23 2019-08-30 北京神州泰岳软件股份有限公司 The training method and device of sentence smoothness degree and syntactic score model
CN110674319A (en) * 2019-08-15 2020-01-10 中国平安财产保险股份有限公司 Label determination method and device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114065759B (en) * 2021-11-19 2023-10-13 深圳数阔信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114637824A (en) * 2022-03-18 2022-06-17 马上消费金融股份有限公司 Data enhancement processing method and device
CN114637824B (en) * 2022-03-18 2023-12-01 马上消费金融股份有限公司 Data enhancement processing method and device

Similar Documents

Publication Publication Date Title
CN111506722B (en) Knowledge graph question-answering method, device and equipment based on deep learning technology
CN110021439B (en) Medical data classification method and device based on machine learning and computer equipment
CN112612894B (en) Method and device for training intention recognition model, computer equipment and storage medium
CN112347310B (en) Query method, device, computer equipment and storage medium of event processing information
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
CN111651992A (en) Named entity labeling method and device, computer equipment and storage medium
CN113177109A (en) Text weak labeling method, device, equipment and storage medium
CN112016279A (en) Electronic medical record structuring method and device, computer equipment and storage medium
CN111506710B (en) Information sending method and device based on rumor prediction model and computer equipment
US20180307745A1 (en) Determining if an action can be performed based on a dialogue
CN113688221A (en) Model-based dialect recommendation method and device, computer equipment and storage medium
CN113204968A (en) Concept recognition method, device, equipment and storage medium of medical entity
CN112836061A (en) Intelligent recommendation method and device and computer equipment
CN112036172B (en) Entity identification method and device based on abbreviated data of model and computer equipment
CN112347254A (en) News text classification method and device, computer equipment and storage medium
CN113241138A (en) Medical event information extraction method and device, computer equipment and storage medium
CN112364136B (en) Keyword generation method, device, equipment and storage medium
CN109660621A (en) A kind of content delivery method and service equipment
CN111782821A (en) Method and device for predicting medical hotspots based on FM model and computer equipment
CN113360644B (en) Text model retraining method, device, equipment and storage medium
CN113312481A (en) Text classification method, device and equipment based on block chain and storage medium
CN113052487A (en) Evaluation text processing method and device and computer equipment
CN114398183A (en) Block chain-based task allocation method, device, equipment and storage medium
CN112463921B (en) Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium
CN113849662A (en) Model-based information generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination