CN113407713B - Corpus mining method and device based on active learning and electronic equipment


Info

Publication number: CN113407713B
Application number: CN202011141662.2A
Authority: CN (China)
Prior art keywords: corpus, classification, unlabeled, gram, cold
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN113407713A
Inventors: 习自 (Xi Zi), 赵学敏 (Zhao Xuemin)
Current and original assignee: Tencent Technology Shenzhen Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Tencent Technology Shenzhen Co Ltd
Publications: CN113407713A (application), CN113407713B (grant)

Classifications

    • G06F16/35 Information retrieval of unstructured textual data: clustering; classification
    • G06F16/374 Information retrieval of unstructured textual data: creation of semantic tools, e.g. ontology or thesauri; thesaurus
    • Y02D10/00 Climate change mitigation technologies in information and communication technologies: energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiments of this application provide a corpus mining method and apparatus based on active learning, and an electronic device, relating to the field of artificial intelligence. The method comprises the following steps: obtaining unlabeled corpora; classifying the unlabeled corpora with at least two pre-trained corpus classification models to obtain the first classification type and classification score that each model outputs for each unlabeled corpus; and selecting, as corpora to be labeled, the unlabeled corpora whose first classification types are inconsistent across the models and whose classification scores meet a preset condition, then performing secondary classification processing on the corpora to be labeled to obtain their second classification type. This technical scheme helps widen the coverage of corpus mining and improve its generalization.

Description

Corpus mining method and device based on active learning and electronic equipment
Technical Field
The application relates to the field of artificial intelligence, in particular to a corpus mining method and device based on active learning, electronic equipment and a computer readable storage medium.
Background
As people demand a higher quality of life, numerous intelligent assistants are gradually appearing in our lives, such as the Tencent Cloud Xiaowei intelligent assistant. A user may query the intelligent assistant for relevant information by voice input, text input, and other means. Accurately understanding user needs is the fundamental premise for an intelligent assistant to provide services; to raise the intelligence level of these assistants, it is sometimes necessary to perform corpus mining on the skills they involve, so as to meet the different requirements that different users place on them in different scenarios.
At present, corpus mining methods mainly include random selection, corpus mining by keywords, corpus mining with an active learning algorithm based on edge probability, and the like. Random selection means randomly sampling the unlabeled corpus set and handing the samples to annotators for labeling. Corpus mining by keywords requires designing a number of keywords for each skill, mining the corpora containing those keywords from the unlabeled corpus set, and delivering them to annotators for labeling. The active learning algorithm based on edge probability requires initializing a number of seed corpora, training a classification model on them, predicting all unlabeled corpora with the model to obtain their scores, and finally selecting the corpora whose scores lie near the threshold edge for annotators to label.
However, these corpus mining methods have the following problems. Random selection is time-consuming, labor-intensive, and extremely inefficient. Corpus mining by keywords improves mining efficiency to a certain extent, but it depends heavily on the choice of keywords, so the corpus distribution easily becomes skewed, and some niche corpora are missed. As for the corpus mining method using an active learning algorithm based on edge probability, it tends to mine corpora similar to the seed corpora, and it is difficult to expand the coverage of corpus mining.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical drawbacks, in particular the drawbacks of low corpus mining efficiency and the difficulty of expanding the coverage of corpus mining results.
In a first aspect, a corpus mining method based on active learning is provided, including:
obtaining unlabeled corpus;
classifying unlabeled corpus by using at least two corpus classification models trained in advance to obtain a first classification type and classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus;
selecting unlabeled corpus with inconsistent first classification types and classification scores meeting preset conditions as corpus to be labeled, and performing secondary classification processing on the corpus to be labeled to obtain a second classification type of the corpus to be labeled.
In one possible implementation manner, the corpus mining method based on active learning further includes:
training at least two classifiers based on a pre-configured cold-start corpus serving as a training sample to obtain at least two corpus classification models.
In one possible implementation, the step of training at least two classifiers based on a pre-configured cold-start corpus as a training sample to obtain at least two corpus classification models includes:
acquiring a pre-configured cold-start corpus serving as a training sample;
extracting N-gram text features of the cold-start corpus, and screening the N-gram text features to generate an N-gram dictionary of the cold-start corpus, where N is a positive integer greater than or equal to 1;
recording the corresponding positions of the N-gram text features in the N-gram dictionary as the feature expression of the cold-start corpus;
and training at least two classifiers based on the feature expression by adopting a scalable machine learning library to obtain at least two corpus classification models.
In one possible implementation, the step of filtering the N-gram text features to generate an N-gram dictionary of the cold-start corpus includes:
counting the occurrence frequency of N-gram text features of the cold start corpus;
and screening out the N-gram text characteristics with occurrence frequency within a preset frequency range, and obtaining an N-gram dictionary of the cold-start corpus.
In one possible implementation, the step of extracting the N-gram text feature of the cold-start corpus includes:
extracting, based on a start identifier and an end identifier added in advance at the beginning position and end position of the cold-start corpus, the N-gram text features of the cold-start corpus segment by segment according to a preset byte segment length N.
In one possible implementation manner, the step of classifying the unlabeled corpus by using at least two corpus classification models to obtain the first classification type and the classification score output by the at least two corpus classification models includes:
extracting N-gram text features of the unlabeled corpus, and carrying out feature vectorization on the N-gram text features of the unlabeled corpus to obtain feature vectors of the unlabeled corpus;
classifying the unlabeled corpus by utilizing at least two corpus classification models according to the feature vectors of the unlabeled corpus, and obtaining a first classification type and classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus.
In one possible implementation manner, the step of selecting, as the corpus to be annotated, an unlabeled corpus having a first inconsistent classification type and a classification score meeting a preset condition includes:
adding the classification scores of the unlabeled corpus with inconsistent first classification types, calculating to obtain the total score of the selected unlabeled corpus, and sorting the selected unlabeled corpus in a descending order according to the total score;
and according to the descending order sequencing result, acquiring a plurality of unlabeled corpora with the top sequencing as corpora to be labeled.
In one possible implementation manner, the step of performing secondary classification processing on the corpus to be annotated to obtain the second classification type of the corpus to be annotated includes:
performing secondary classification labeling according to the attribute of the corpus to be labeled to obtain a new labeling corpus;
and taking the result of the secondary classification labeling as a second classification type of the new labeling corpus.
In one possible implementation manner, after the step of determining the second classification type of the corpus to be annotated, the method further includes:
and taking the new labeling corpus and the cold starting corpus as new training samples, inputting the new training samples into at least two classifiers, and returning to execute the step of training the at least two classifiers to obtain at least two corpus classification models.
In a second aspect, there is provided a corpus mining device based on active learning, the device comprising:
the unlabeled corpus acquisition module is used for acquiring unlabeled corpus;
the first classification type obtaining module is used for classifying unlabeled linguistic data by utilizing at least two linguistic data classification models to obtain a first classification type and a classification score which are output by the at least two linguistic data classification models and are used for classifying the unlabeled linguistic data;
the second classification type determining module is used for selecting unlabeled corpus with inconsistent first classification types and classification scores meeting preset conditions as corpus to be labeled, and performing secondary classification processing on the corpus to be labeled to obtain the second classification type of the corpus to be labeled.
In a third aspect, there is provided an electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the active learning-based corpus mining method described above.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements an active learning-based corpus mining method.
The beneficial effects that this application provided technical scheme brought are:
The unlabeled corpora are classified with at least two corpus classification models to obtain the first classification type and classification score that each model outputs for each unlabeled corpus; the unlabeled corpora whose first classification types are inconsistent and whose classification scores meet a preset condition are selected as corpora to be labeled, and secondary classification processing is performed on them to obtain their second classification type. In this way, corpora that are related to, but not covered by, the cold-start corpus serving as the training sample are mined out, which helps widen the coverage of corpus mining and improve its generalization.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of an implementation environment related to an active learning-based corpus mining method according to an embodiment of the present application;
fig. 2 is a flowchart of a corpus mining method based on active learning according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for training a corpus classification model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an active learning-based corpus mining device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combinations of one or more of the associated listed items.
The application scenario according to the embodiment of the present application is described below.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and smart assistants. It is believed that, as technology develops, artificial intelligence will be applied in ever more fields and become increasingly important.
The corpus mining method and apparatus based on active learning and the electronic device of this application are applied to intelligent question-answering applications in artificial intelligence, such as intelligent assistants and intelligent customer service. In these applications, the user's query must first be understood correctly before it can be answered correctly.
In order to better illustrate the technical scheme of the application, a certain application environment to which the active learning-based corpus mining method of the present application can be applied is shown below. Fig. 1 is a schematic diagram of an implementation environment related to an active learning-based corpus mining method provided in an embodiment of the present application, and referring to fig. 1, the implementation environment may include: a terminal 101 and a server 102. The terminal 101 is communicatively connected to the server 102.
Applications capable of intelligent question answering may be installed on the terminal 101, for example a map navigation application, a social application, or a life service application. The embodiments of the present application do not specifically limit the type of application.
The terminal 101 may be one terminal or a plurality of terminals. The terminal 101 includes at least one of a vehicle-mounted terminal, a smart phone, a smart television, a smart speaker, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a palmtop computer, a notebook computer, and a desktop computer.
The user initiates a query, such as "play a song", to an application installed on the terminal 101. According to the query initiated by the user's voice or text input, the application recognizes the user's intention through the terminal 101 or the server 102, invokes the corresponding function from the server 102, performs data processing, and feeds the processing result back to the terminal 101, where the terminal 101 plays the song through a preset music playing program. Alternatively, the user initiates a query such as "What is the weather in Chengdu?"; the server 102 obtains the weather information for Chengdu according to the query and feeds it back to the user by text display or voice broadcast.
Of course, the technical solution provided in the embodiments of the present application may also be applied to other scenarios, which are not listed here.
Based on this application scenario, the corpora of intelligent question-answering applications such as intelligent assistants need to be expanded and learned so as to cope with the varied query demands of users.
At present, related technologies mostly perform corpus mining by random selection, corpus mining by keywords, corpus mining with an active learning algorithm based on edge probability, and the like. Random selection means randomly sampling the unlabeled corpus set and handing the samples to annotators for labeling. Corpus mining by keywords requires designing a number of keywords for each skill, mining the corpora containing those keywords from the unlabeled corpus set, and delivering them to annotators for labeling. For example, to mine corpora for the music skill, keywords such as "play", "listen to", "music", and "song" may be specified, and corpora containing any of these keywords, such as "play a song" and "I want to listen to some music", are then picked out of the unlabeled corpus set. The active learning algorithm based on edge probability requires initializing a number of seed corpora, training a classification model on them, predicting all unlabeled corpora with the model, and finally selecting the corpora whose scores lie near the threshold edge for annotators to label.
However, such corpus mining methods mine, from the unlabeled corpus set, corpora that are similar or homogeneous to the training corpus, and cannot meet the generalization requirement of corpus mining, namely mining corpora that belong to the same skill as the training corpus but are not covered by it. For example, the existing training corpora of the music skill may cover playing, downloading, and searching but not sharing; the goal is then to mine, from the unlabeled corpus set, corpora that belong to the music skill, and supplement the sharing-related corpora into the corresponding corpus.
The corpus mining method and device based on active learning and the electronic equipment provided by the application aim to solve the technical problems in the prior art.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The scheme provided by the embodiment of the application relates to a corpus mining method and device based on active learning, and electronic equipment, and also relates to a computer readable storage medium, which is specifically described by the following embodiments:
Fig. 2 is a flowchart of a corpus mining method based on active learning according to an embodiment of the present application; the method is executed by a server.
Specifically, as shown in fig. 2, the corpus mining method based on active learning may include the following steps:
s210, obtaining unlabeled corpus.
A corpus can be understood as a user's query sentence, including the user's speech, text, or picture input. Unlabeled corpora may come from the query history recorded on a user platform, from corpora obtained by a web crawler, and so on.
In this embodiment, unlabeled corpora may be classified: that is, each unlabeled corpus may be divided into a positive-sample corpus that meets the corpus requirement of a given application, i.e., of a set skill, or a negative-sample corpus that does not.
S220, classifying unlabeled corpus by using at least two pre-trained corpus classification models to obtain a first classification type and classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus.
The unlabeled corpora are classified using at least two corpus classification models obtained through pre-training. In this embodiment, at least two classifiers may be trained on a pre-configured cold-start corpus serving as a training sample, to obtain at least two corpus classification models.
Because corpus classification models trained from different classifiers follow different classification principles, their classification results for the same corpus may differ. The unlabeled corpora are arranged into a sequence, and each corpus is input one by one into the at least two corpus classification models, obtaining the first classification type and classification score that each model outputs for it.
In this embodiment, the first classification type is either positive-sample corpus or negative-sample corpus. The classification score represents the correlation between the unlabeled corpus and the labeled skill of the cold-start corpus: the higher the classification score, the stronger this correlation, and vice versa.
S230, selecting unlabeled corpus with inconsistent first classification types and classification scores meeting preset conditions as corpus to be labeled, and performing secondary classification treatment on the corpus to be labeled to obtain a second classification type of the corpus to be labeled.
For the same unlabeled corpus, if the first classification types output by the corpus classification models are consistent, the unlabeled corpus is similar in type to the cold-start corpus used as the training sample. Adding up the classification scores output by the models gives the total classification score of the unlabeled corpus; the higher the total score, the more similar the unlabeled corpus is in type to the cold-start corpus.
For a clearer explanation of the present scheme, the following detailed description is given in connection with table 1. Table 1 shows the classification results of unlabeled corpora provided by one embodiment.
Table 1: classification results of unlabeled corpora (table body not reproduced), where "+" denotes that the first classification type output by a corpus classification model is a positive sample, and "-" denotes that it is a negative sample.
As can be seen from Table 1, when three corpus classification models (SVM, LR, and Naive Bayes) classify the same corpus, three kinds of classification results arise:
For unlabeled corpora that all three corpus classification models consistently pass (i.e., consistently regard as positive-sample corpora), the higher the total classification score, the more similar the corpus is to the cold-start corpora used as training samples. For example, unlabeled corpora such as "play a piece of music" and "I want to listen to songs" are already covered by the cold-start corpora; they do not help expand the corpus and need not be mined.
For unlabeled corpora that all three corpus classification models consistently fail (i.e., consistently regard as negative-sample corpora), the lower the total classification score, the less relevant they are to the set skill. For example, "check tomorrow's weather" obviously belongs to the weather skill rather than the music skill; it does not help expand the corpus, and such unlabeled corpora need not be mined either.
For unlabeled corpora on which the three corpus classification models output inconsistent classification types, the corpus is likely either a positive-sample corpus to be mined that is not covered by the existing cold-start corpus, or an indistinguishable negative-sample corpus; further classification is needed to decide whether it is an expansion corpus.
Based on this, in an embodiment, in step S230, selecting, as the corpus to be annotated, an unlabeled corpus having inconsistent first classification types and classification scores meeting a preset condition may include the following steps:
s2301, adding the classification scores of the unlabeled corpus with inconsistent first classification types, calculating to obtain the total score of the selected unlabeled corpus, and sorting the selected unlabeled corpus in a descending order according to the total score.
In this embodiment, the unlabeled corpora whose first classification types output by the at least two corpus classification models are inconsistent are selected; for each selected corpus, the classification scores output by the models are added to obtain its total score, and the selected corpora are sorted in descending order of total score, as shown in Table 2.
Table 2: unlabeled corpora after selection and ranking (table body not reproduced)
As can be seen from Table 2, among the selected unlabeled corpora whose first classification types are inconsistent across the three models, some are positive-sample corpora belonging to the music skill, for example "listen to Jay Chou's new song Mojito", and some are negative-sample corpora highly relevant to the music skill, such as "download the song Qilixiang". It should be noted that positive and negative samples are divided according to a preset standard set by the user; for example, if a certain standard defines "download" as a negative sample, such a corpus still belongs to the negative samples even though it is related to the music skill.
S2302, according to the descending order sorting result, obtaining a plurality of unlabeled corpora with top sorting as corpora to be labeled.
After the unlabeled corpora with inconsistent first classification types output by the at least two corpus classification models are sorted in descending order of total classification score, a number of top-ranked unlabeled corpora are acquired as corpora to be labeled for secondary classification processing, to further verify whether each belongs to the positive-sample or negative-sample corpora. The number of top-ranked corpora to acquire can be set according to the actual situation, for example 2000.
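As an illustrative sketch of this selection step (not code from the patent: the function and parameter names are hypothetical, and each model is assumed to return a pair of first classification type and classification score):

```python
def select_for_labeling(predictions: dict, top_k: int = 2000) -> list:
    """predictions maps each unlabeled corpus to a list of
    (first_type, score) pairs, one per corpus classification model.
    Keep only corpora whose first types disagree across the models,
    rank them by total classification score, and return the top_k
    highest-scoring ones as the corpora to be labeled."""
    totals = {
        corpus: sum(score for _, score in outputs)
        for corpus, outputs in predictions.items()
        if len({first_type for first_type, _ in outputs}) > 1
    }
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:top_k]
```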
Further, performing secondary classification labeling according to the attribute of the corpus to be labeled to obtain new labeling corpus, and taking the result of the secondary classification labeling as the second classification type of the new labeling corpus.
In one embodiment, the secondary classification labeling can be performed by manual classification and labeling, re-labeling the corpora to be labeled; in another embodiment, it can be performed by other corpus classification models. The secondary classification labeling is carried out according to whether the attributes of the corpus to be labeled meet a preset support rule: if so, the corpus is a positive-sample corpus; otherwise, it is a negative-sample corpus.
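As a minimal illustration of such a rule check (an assumption for this text only: real support rules may be far richer than the substring matching shown here, and the rule list is hypothetical):

```python
def secondary_label(corpus: str, support_rules: list[str]) -> str:
    """Return "+" (positive sample) if any preset support rule matches
    the corpus, "-" (negative sample) otherwise; a rule is simplified
    here to a substring test."""
    return "+" if any(rule in corpus for rule in support_rules) else "-"

# hypothetical support rules for the music skill
rules = ["播放", "听", "音乐", "歌"]
print(secondary_label("放首音乐", rules))      # "+" (contains 音乐)
print(secondary_label("查明天的天气", rules))  # "-"
```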
In this embodiment, manual classification labeling is performed on the selected corpora to be labeled, i.e., those with inconsistent first classification types ranked in the top 2000 of the descending order, to obtain the second classification type of each corpus to be labeled; the second classification type may be the same as or different from the first classification type, as shown in Table 3.
Table 3: classification results of the corpora to be labeled after secondary classification labeling (table body not reproduced)
After secondary classification labeling, M positive-sample corpora and N negative-sample corpora are obtained from the corpora to be labeled; the positive-sample corpora are imported into the positive-sample corpus set and the negative-sample corpora into the negative-sample corpus set for storage, completing one round of corpus mining.
In this embodiment, the second classification type is used as a new classification label of the corpus to be labeled, and the second classification type is labeled on the corpus to be labeled, so as to obtain a new labeled corpus.
According to the corpus mining method based on active learning, unlabeled corpora are obtained and classified with at least two corpus classification models, obtaining the first classification type and classification score that each model outputs for each unlabeled corpus. The unlabeled corpora whose first classification types are inconsistent and whose classification scores meet a preset condition are selected as corpora to be labeled, and secondary classification processing is performed on them to obtain their second classification type. Because corpora on which the models disagree are likely not yet covered, selecting them by classification score helps expand the positive-sample corpora related to the set skill, supplement them, and widen the coverage of corpus mining. Further, only the corpora to be labeled whose classification scores meet the preset condition undergo secondary classification processing, which reduces the workload of that processing and improves corpus mining efficiency.
In order to more clearly illustrate the technical solution of the present application, the following is further described with respect to a plurality of steps of the corpus mining method based on active learning.
Fig. 3 is a flowchart of a method for training to obtain a corpus classification model according to an embodiment of the present application, as shown in fig. 3, in an embodiment, the active learning-based corpus mining method further includes the following steps:
S200, training at least two classifiers based on a pre-configured cold-start corpus serving as a training sample to obtain at least two corpus classification models.
In one embodiment, the corpus classification model may be trained by:
s2001, acquiring a pre-configured cold start corpus serving as a training sample.
The cold-start corpus refers to the corpus available when the classifiers are initially trained: a number of corpora initialized for a set skill, such as the music skill. The cold-start corpus can be written manually or generated by machine. A classifier here means a classifier, learned with a deep learning algorithm, that predicts which skill, intention, or field a corpus belongs to; in this embodiment it may be a corpus classifier, a semantic classifier, and the like.
For example, 2000 cold-start corpora are initialized for the music skill by manual writing, such as "play music", "give me a song", "I want to listen to something", and requests naming a particular singer or song. To ensure the corpus mining effect, the cold-start corpora should cover the intents a user may involve as fully as possible, such as "play", "search singer", "search lyrics", "add to favorites", "pause", etc.
In this embodiment, a plurality of corpora related to the set skills are used as positive sample corpora, and corpora of the rest skills are used as negative sample corpora. For example, 2000 initialized musical skills are used as positive sample corpus, and the rest skills such as weather, news, film and television are used as negative sample corpus. And inputting cold-starting corpus comprising positive sample corpus and negative sample corpus into at least two classifiers, training the at least two classifiers, and learning the positive sample and negative sample classification types of the corpus to obtain corpus classification models corresponding to the at least two classifiers.
In this embodiment, the classifiers include at least two of a Support Vector Machine (SVM) classifier, a Logistic Regression (LR) classifier, a Naive Bayes (NB) classifier, and the like. Each classifier can be trained separately using the Application Program Interface (API) provided by Spark's scalable machine learning library, MLlib, with the default parameters provided by MLlib.
The SVM classifier, the LR classifier and the NB classifier are all common classifiers, and the classification principle thereof is not described in detail. Of course, in other embodiments, other classifiers may be used to train and learn the cold-start samples as training samples.
S2002, extracting N-gram text features of the cold starting corpus, and screening the N-gram text features to generate an N-gram dictionary of the cold starting corpus.
The basic idea of the N-gram algorithm is to segment the text content with a sliding window of size N, forming a sequence of byte segments of length N; each byte segment is called a gram. The frequency of every gram is then counted and filtered against a preset threshold to form a key gram list, which constitutes the feature vector space of the text content, each gram in the list being one feature vector dimension. N is a positive integer; the N-gram may be a 1-gram, a 2-gram, a 3-gram, and so on.
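A minimal sketch of this idea (illustrative only; the threshold names and default values are assumptions, not parameters from the patent):

```python
from collections import Counter

def extract_ngrams(text: str, n: int) -> list[str]:
    """Slide a window of size n over the text; each segment is one gram."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_key_gram_list(corpora: list[str], n: int,
                        min_freq: int = 2, max_freq: int = 10000) -> list[str]:
    """Count every gram over the corpus set and keep those whose
    occurrence frequency falls inside the preset range."""
    counts = Counter(g for text in corpora for g in extract_ngrams(text, n))
    return [gram for gram, c in counts.items() if min_freq <= c <= max_freq]
```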
The same N-gram text feature may appear at different positions in different cold-start corpora; for example, the 1-gram text feature 放 ("play") may appear at the beginning of a corpus such as 放一首歌 ("play a song") or at a non-beginning position, as in the corpus 播放音乐 ("play music"). In this embodiment, a start identifier "B" and an end identifier "E" are added at the beginning position and end position of the cold-start corpus respectively, so that the extracted text features carry positional information.
In one embodiment, extracting N-gram text features of a cold-start corpus includes the following implementations:
based on a start identifier and an end identifier added in advance at the beginning position and end position of the cold-start corpus, extracting segment by segment according to a preset byte segment length N to obtain the N-gram text features of the cold-start corpus, where N is a positive integer greater than or equal to 2.
In this embodiment, a start identifier is added in advance at the beginning position of each cold-start corpus, and an end identifier at its end position: a start identifier "B" at the beginning and an end identifier "E" at the end. For example, adding the identifiers to the cold-start corpus 放一首歌 ("play a song") yields "B放一首歌E".
When the start identifier is detected, its position is taken as the beginning of the corpus, or the position of the character following it is taken as the beginning. When the end identifier is detected, its position is taken as the end of the corpus, or the position of the character preceding it is taken as the end. For example, when the character before 放 is detected to be "B", the character 放 can be determined to be at the beginning of the corpus; when the character after 歌 is detected to be "E", the character 歌 can be determined to be at the end.
The 1-gram, 2-gram, and 3-gram text features of the cold-start corpus were extracted, respectively, as shown in Table 4.
Table 4: N-gram text features of the corpus 放一首歌, "play a song" (table body not reproduced)
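Since the table body is not reproduced here, the following sketch regenerates the same kind of features (the helper name is hypothetical; the printed output lists the 1-, 2-, and 3-grams of "B放一首歌E"):

```python
def extract_with_boundaries(corpus: str, max_n: int = 3) -> dict[int, list[str]]:
    """Add the start/end identifiers, then extract the 1- to max_n-gram
    text features segment by segment."""
    marked = "B" + corpus + "E"
    features = {}
    for n in range(1, max_n + 1):
        grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
        if n == 1:
            # the bare identifiers themselves carry no content
            grams = [g for g in grams if g not in ("B", "E")]
        features[n] = grams
    return features

print(extract_with_boundaries("放一首歌"))
# {1: ['放', '一', '首', '歌'],
#  2: ['B放', '放一', '一首', '首歌', '歌E'],
#  3: ['B放一', '放一首', '一首歌', '首歌E']}
```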
Of course, in other embodiments, the text features of the cold-start corpus may also be extracted by a frequency method, tf-idf (term frequency-inverse document frequency), a mutual information method, N-gram, Word2Vec, and the like.
In one embodiment, the filtering of the N-gram text features in step S2002 to generate an N-gram dictionary of the cold-start corpus may include the following steps:
s2002-a, counting occurrence frequency of N-gram text features of the cold start corpus.
And extracting the corresponding N-gram text characteristics of each piece of cold starting corpus serving as a training sample to obtain an N-gram text characteristic set of the cold starting corpus.
The occurrence frequency of each N-gram text feature is counted over this set; for example, the 1-gram feature 放 ("play") occurs 100 times, the 1-gram feature 歌 ("song") occurs 500 times, the 2-gram feature 一首 ("a") occurs 100 times, the 3-gram feature 放一首 ("play a") occurs 50 times, and so on.
S2002-b, screening out N-gram text features with occurrence frequencies within a preset frequency range, and obtaining an N-gram dictionary corresponding to the cold start corpus.
In this embodiment, the N-gram text features whose occurrence frequency is lower than a first preset threshold or higher than a second preset threshold are filtered out, and the features whose occurrence frequency lies within the preset frequency range are retained, obtaining the N-gram dictionary of the cold-start corpus. The N-gram dictionary, also known as the core dictionary of the N-gram model, is the set of core N-gram text features whose occurrence frequencies fall within the preset frequency range.
And S2003, recording the corresponding position of the N-gram text feature in the N-gram dictionary as the feature expression of the cold start corpus.
The N-gram dictionary may be represented in the form of an array, with each element in the array representing each N-gram text feature in the N-gram dictionary. The position of each element in the N-gram dictionary may be represented by an index value, e.g., the N-gram dictionary is a 3×3 array, and the position of each N-gram text feature of the N-gram dictionary is represented by index values 1, 2, 3, 4, 5, 6, 7, 8, 9, respectively.
In this embodiment, each N-gram text feature of each cold-start corpus is obtained, and its position in the N-gram dictionary is determined. Note that if an N-gram text feature of the cold-start corpus has no corresponding text feature in the N-gram dictionary, that feature is discarded and need not be recorded.
For example, the 1-gram text feature 放 of a cold-start corpus corresponds to the 1st position of the N-gram dictionary and is represented by the number "1"; the 1-gram feature 歌 corresponds to the 3rd position and is represented by "3"; the 2-gram feature 放一 corresponds to the 57th position and is represented by "57". By analogy, the position in the N-gram dictionary of each N-gram text feature of each cold-start corpus is recorded, obtaining the feature expression of that corpus. The feature expression can be represented by an ordered array, such as a one-dimensional array, in which each numeric element is the index of the corresponding N-gram text feature of the cold-start corpus in the N-gram dictionary.
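A sketch of this recording step (the dictionary fragment is hypothetical, reusing only the positions 1, 3, and 57 mentioned above):

```python
def feature_expression(corpus: str, ngram_dict: dict[str, int],
                       max_n: int = 3) -> list[int]:
    """Record the dictionary position of each N-gram text feature of
    the corpus; features absent from the dictionary are discarded."""
    marked = "B" + corpus + "E"
    positions = []
    for n in range(1, max_n + 1):
        for i in range(len(marked) - n + 1):
            gram = marked[i:i + n]
            if gram in ngram_dict:          # unseen grams are dropped
                positions.append(ngram_dict[gram])
    return positions

# hypothetical fragment of an N-gram dictionary
ngram_dict = {"放": 1, "歌": 3, "放一": 57}
print(feature_expression("放一首歌", ngram_dict))   # [1, 3, 57]
```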
And S2004, training at least two classifiers by adopting an expandable machine learning library based on the feature expression to obtain at least two corpus classification models.
At least two classifiers are respectively connected to the Application Program Interface (API) provided by Spark's scalable machine learning library (MLlib), the corresponding parameters and functions are invoked, and the feature expressions of the cold-start corpus are input into the at least two classifiers for training to obtain at least two corpus classification models. If the trained classifiers are a Support Vector Machine (SVM) classifier, a Logistic Regression (LR) classifier, and a Naive Bayes (NB) classifier, the resulting corpus classification models are an SVM corpus classification model, an LR corpus classification model, and an NB corpus classification model, respectively.
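A minimal PySpark sketch of this training step, under stated assumptions: the dictionary size, the two training rows, and the 0/1 labels are placeholders, the feature expressions are encoded as binary sparse vectors, and MLlib's default parameters are used as described above.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LinearSVC, LogisticRegression, NaiveBayes
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("corpus-cold-start").getOrCreate()

DICT_SIZE = 400  # hypothetical size of the N-gram dictionary

def to_row(positions, label):
    # binary sparse vector: 1.0 at each recorded dictionary position
    idx = sorted(set(p - 1 for p in positions))   # 0-based indices
    return (Vectors.sparse(DICT_SIZE, idx, [1.0] * len(idx)), float(label))

# hypothetical feature expressions of two cold-start corpora
train = spark.createDataFrame(
    [to_row([1, 3, 57], 1), to_row([2, 9, 120], 0)],
    ["features", "label"])

# one corpus classification model per classifier, default parameters
models = [cls().fit(train) for cls in (LinearSVC, LogisticRegression, NaiveBayes)]
```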
In an embodiment, in step S220, the step of classifying the unlabeled corpus by using at least two corpus classification models to obtain the first classification type and the classification score output by the at least two corpus classification models may include the following steps:
s2201, extracting N-gram text features of unlabeled corpus, and carrying out feature vectorization on the N-gram text features of unlabeled corpus to obtain feature vectors of unlabeled corpus.
In this embodiment, N in the N-gram is a positive integer; the 1-gram, 2-gram, and 3-gram text features of the unlabeled corpus are extracted respectively. A start identifier "B" and an end identifier "E" are added at the beginning position and end position of the unlabeled corpus respectively, so that the extracted text features carry positional information.
The N-gram text features of the unlabeled corpus are extracted, their positions in the N-gram dictionary are determined, and the features are vectorized to obtain the feature vector of the unlabeled corpus, i.e., its N-gram feature vector.
For example, the unlabeled corpus is 放首音乐 ("play some music"), and the extracted N-gram text features are: 1-gram features 放, 首, 音, 乐; 2-gram features B放, 放首, 首音, 音乐, 乐E; and 3-gram features B放首, 放首音, 首音乐, 音乐E.
The positions of these N-gram text features in the N-gram dictionary, i.e., the corresponding index values, are determined as shown in Table 5:
Table 5: N-gram text features of the unlabeled corpus 放首音乐 and their dictionary positions (table body not reproduced)
The positions of the N-gram text features in the N-gram dictionary are vectorized to obtain the feature vector of the unlabeled corpus, namely [1, 3, 25, 26, 49, 57, 98, 109, 125, 198, 247, 305, 313].
S2202, classifying the unlabeled corpus by utilizing at least two corpus classification models according to the feature vectors of the unlabeled corpus, and obtaining a first classification type and a classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus.
The feature vectors of the unlabeled corpora are input into the at least two corpus classification models for classification, obtaining the first classification type and classification score that each model outputs for each unlabeled corpus. For example, the unlabeled corpus 放首音乐 ("play some music") is input into the SVM, LR, and NB corpus classification models; the output classification scores are 0.98, 0.95, and 0.96 respectively, and the output first classification types are +, +, and +, where "+" indicates that the first classification type is a positive sample and "-" indicates a negative sample.
In an embodiment, after determining the second classification type of the corpus to be annotated in step S230, the method further includes the following steps:
s240, inputting the new labeling corpus and the cold starting corpus into at least two classifiers as new training samples, and returning to execute the step of training the at least two classifiers to obtain at least two corpus classification models.
In this embodiment, the new labeled corpora obtained from the previous round of corpus mining, together with the corpora previously used as training samples such as the cold-start corpus, are input into the at least two classifiers as new training samples, and the classifiers are retrained; that is, execution returns to step S200 of training the at least two classifiers, yielding at least two updated corpus classification models. Steps S210 to S230 are then repeated with the updated models on the next batch of unlabeled corpora to complete the next round of corpus mining. By iterating in this way multiple times, the efficiency and accuracy of corpus mining are improved.
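Putting the rounds together, the overall loop can be sketched as follows (purely illustrative: train_classifiers, classify, and secondary_labeling are hypothetical helpers standing in for steps S200, S220, and the secondary labeling described above, and select_for_labeling is the selection sketch shown earlier):

```python
def mine_corpora(cold_start: list, unlabeled_pool: list,
                 rounds: int = 3, top_k: int = 2000) -> list:
    """Iterative active-learning loop: train, predict, select the
    corpora the models disagree on, label them, fold them back into
    the training set, and repeat."""
    training_set = list(cold_start)
    for _ in range(rounds):
        models = train_classifiers(training_set)            # step S200
        predictions = {c: [classify(m, c) for m in models]  # step S220
                       for c in unlabeled_pool}
        to_label = select_for_labeling(predictions, top_k)  # step S230
        newly_labeled = secondary_labeling(to_label)        # second type
        training_set.extend(newly_labeled)                  # step S240
        unlabeled_pool = [c for c in unlabeled_pool
                          if c not in set(to_label)]
    return training_set
```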
The above examples are only used to assist in explaining the technical solutions of the present disclosure, and the illustrations and specific procedures related thereto do not constitute limitations on the usage scenarios of the technical solutions of the present disclosure.
Related embodiments of active learning-based corpus mining apparatus are described in detail below.
Fig. 4 is a schematic structural diagram of a corpus mining device based on active learning according to an embodiment of the present application. As shown in fig. 4, the active learning-based corpus mining device 200 may include: an unlabeled corpus acquisition module 210, a first classification type obtaining module 220, and a second classification type determining module 230, wherein:
an unlabeled corpus acquisition module 210, configured to acquire unlabeled corpus;
the first classification type obtaining module 220 is configured to classify the unlabeled corpus by using at least two pre-trained corpus classification models, so as to obtain a first classification type and a classification score, which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus;
the second classification type determining module 230 is configured to select, as the corpus to be annotated, an unlabeled corpus that has inconsistent first classification types and classification scores that meet a preset condition, and perform a secondary classification process on the corpus to be annotated, to obtain a second classification type of the corpus to be annotated.
According to the corpus mining device based on active learning, from the first classification types output for unlabeled corpora by at least two corpus classification models trained on the pre-configured cold-start corpus, a number of unlabeled corpora with inconsistent first classification types are selected as corpora to be labeled according to their classification scores. These corpora are likely not yet covered, which helps expand and supplement the positive-sample corpora related to the set skill and widen the coverage of corpus mining. Further, only the corpora to be labeled whose classification scores meet the preset condition undergo secondary classification processing, which reduces the workload of that processing and improves corpus mining efficiency.
In one possible implementation, the active learning-based corpus mining apparatus 200 may further include: the corpus classification model training module is used for training at least two classifiers based on a pre-configured cold-start corpus serving as a training sample to obtain at least two corpus classification models.
The corpus classification model training module comprises: the device comprises a cold start corpus acquisition unit, an N-gram dictionary generation unit, a feature expression obtaining unit and a corpus classification model obtaining unit;
the cold start corpus acquisition unit is used for acquiring a pre-configured cold start corpus serving as a training sample; the N-gram dictionary generating unit is used for extracting N-gram text features of the cold starting corpus and screening the N-gram text features to generate an N-gram dictionary of the cold starting corpus; the feature expression obtaining unit is used for recording the corresponding position of the N-gram text feature in the N-gram dictionary as the feature expression of the cold start corpus; the corpus classification model obtaining unit is used for training at least two classifiers respectively by adopting an extensible machine learning base based on the feature expression to obtain at least two corpus classification models.
In one possible implementation, the N-gram dictionary generating unit includes: the occurrence frequency statistics subunit and the N-gram dictionary obtaining subunit;
The occurrence frequency statistics subunit is used for counting the occurrence frequency of the N-gram text characteristics of the cold start corpus; the N-gram dictionary obtaining subunit is used for screening out N-gram text characteristics with occurrence frequency in a preset frequency range to obtain an N-gram dictionary of cold start corpus.
In one possible implementation, the N-gram dictionary generating unit includes: the text feature extraction unit is used for extracting N-gram text features of the cold start corpus segment by segment according to the preset byte segment length N based on a start identifier and an end identifier which are added to the beginning position and the end position of the cold start corpus in advance.
In one possible implementation, the first classification type obtaining module 220 includes: the device comprises a feature vector obtaining unit and a first classification type obtaining unit;
the feature vector obtaining unit is used for extracting N-gram text features of the unlabeled corpus and carrying out feature vectorization on the N-gram text features of the unlabeled corpus to obtain feature vectors of the unlabeled corpus; the first classification type obtaining unit is used for classifying the unlabeled corpus by utilizing at least two corpus classification models according to the feature vectors of the unlabeled corpus, and obtaining a first classification type and a classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus.
In one possible implementation, the second classification type determination module 230 includes an unlabeled corpus sorting unit and a corpus-to-be-labeled obtaining unit.
The unlabeled corpus sorting unit is used for adding up the classification scores of the unlabeled corpora whose first classification types are inconsistent, calculating the total score of each selected unlabeled corpus, and sorting the selected unlabeled corpora in descending order of total score; the corpus-to-be-labeled obtaining unit is used for acquiring, according to the descending-order sorting result, a plurality of top-ranked unlabeled corpora as the corpora to be labeled.
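Continuing the sketches above, the sorting and selection step might look as follows; the cutoff `top_k` is an illustrative stand-in for "a plurality of top-ranked corpora".

```python
def select_corpus_to_label(unlabeled, models, ngram_dict, n, top_k=100):
    # Keep only sentences on which the models' first classification types
    # disagree, rank them by the sum of their classification scores, and
    # return the top-ranked ones as the corpus to be labeled.
    candidates = []
    for text in unlabeled:
        results = classify(text, models, ngram_dict, n)
        types = {cls for cls, _ in results}
        if len(types) > 1:  # first classification types are inconsistent
            total = sum(score for _, score in results)
            candidates.append((total, text))
    candidates.sort(key=lambda pair: pair[0], reverse=True)  # descending
    return [text for _, text in candidates[:top_k]]
```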
In one possible implementation, the second classification type determination module 230 includes a newly labeled corpus obtaining unit and a second classification type obtaining unit.
The newly labeled corpus obtaining unit is used for carrying out secondary classification labeling according to the attribute of the corpus to be labeled, to obtain a newly labeled corpus; and the second classification type obtaining unit is used for taking the result of the secondary classification labeling as the second classification type of the newly labeled corpus.
In one possible implementation, the corpus mining device 200 further includes a return module, which is used for inputting the newly labeled corpus together with the cold-start corpus into the at least two classifiers as new training samples, and returning to the step of training the at least two classifiers to obtain the at least two corpus classification models.
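Putting the previous sketches together, the return module corresponds to an active-learning loop of roughly the following shape; `label_fn` is a hypothetical stand-in for the manual secondary classification labeling.

```python
import numpy as np

def active_learning_loop(cold_start, cold_labels, unlabeled, label_fn, n=2):
    texts, labels = list(cold_start), list(cold_labels)
    pool = list(unlabeled)
    while True:
        # Retrain on the cold-start corpus plus everything labeled so far.
        ngram_dict = build_ngram_dictionary(texts, n)
        X = np.array([vectorize(t, ngram_dict, n) for t in texts])
        models = train_corpus_classifiers(X, labels)
        picked = select_corpus_to_label(pool, models, ngram_dict, n) if pool else []
        if not picked:
            return models, ngram_dict  # the models agree on what remains
        for text in picked:
            texts.append(text)
            labels.append(label_fn(text))  # second classification type
            pool.remove(text)
```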
The active learning-based corpus mining device in this embodiment may execute the active learning-based corpus mining method shown in the foregoing embodiments of the present application; its implementation principle is similar and will not be described here again.
An embodiment of the present application provides an electronic device, including a memory and a processor, the memory storing at least one program which, when executed by the processor, improves the generalization of corpus mining, widens the coverage of the corpus, and improves corpus mining efficiency.
In an alternative embodiment, an electronic device is provided. As shown in fig. 5, the electronic device 4000 includes a processor 4001 and a memory 4003, wherein the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path for transferring information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 5, but this does not mean there is only one bus or only one type of bus.
Memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 is used for storing the application program code for executing the solution of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the application program code stored in the memory 4003 to implement what is shown in the foregoing method embodiments.
Such electronic devices include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers. The electronic device shown in fig. 5 is merely an example and should not impose any limitation on the functionality and scope of use of the disclosed embodiments.
The present application provides a computer-readable storage medium having a computer program stored thereon which, when run on a computer, causes the computer to perform the corresponding method embodiments described above. Compared with the prior art, this can improve the generalization of corpus mining, widen the coverage of the corpus, and improve corpus mining efficiency.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device, such as an electronic device, reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the following:
Obtaining unlabeled corpus;
classifying unlabeled corpus by using at least two corpus classification models trained in advance to obtain a first classification type and classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus;
selecting an unlabeled corpus whose first classification types are inconsistent and whose classification score meets a preset condition as the corpus to be labeled, and performing secondary classification processing on the corpus to be labeled to obtain a second classification type of the corpus to be labeled.
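For illustration only, a hypothetical end-to-end invocation of the sketches above, mirroring the three steps just listed; all corpus contents and labels are invented.

```python
cold = ["play some music", "turn on the light", "check the weather"]
cold_y = [1, 0, 0]  # 1 = related to the set skill (here: music playback)
pool = ["play a song", "dim the lamp", "sing me something"]

models, ngram_dict = active_learning_loop(
    cold, cold_y, pool,
    label_fn=lambda t: 1 if ("play" in t or "sing" in t) else 0,
)
```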
It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical fiber cables, RF (radio frequency), or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above-described embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation of the module itself; for example, the unlabeled corpus obtaining module may also be described as "a module for obtaining unlabeled corpus".
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations shall also fall within the protection scope of the present invention.

Claims (10)

1. A corpus mining method based on active learning, characterized by comprising the following steps:
obtaining unlabeled corpus;
classifying the unlabeled corpus by using at least two corpus classification models trained in advance, to obtain a first classification type and a classification score output by the at least two corpus classification models for classifying the unlabeled corpus, wherein the first classification type includes a positive sample type and a negative sample type, and the classification score is used for representing the correlation between the corresponding unlabeled corpus and a set skill labeled in a cold-start corpus;
selecting, as a corpus to be labeled, an unlabeled corpus whose first classification types are inconsistent and whose classification score meets a preset condition, which comprises: for an unlabeled corpus for which the first classification types output by the at least two corpus classification models are inconsistent, obtaining a total score based on the classification scores output by each corpus classification model, and, if the total score meets the preset condition, determining the unlabeled corpus as the corpus to be labeled;
determining, based on an attribute of the corpus to be labeled, whether the corpus to be labeled corresponds to the set skill, so as to perform secondary classification processing on the corpus to be labeled, obtain a second classification type of the corpus to be labeled, and label the corpus to be labeled based on the second classification type, wherein the second classification type includes a positive sample type and a negative sample type.
2. The active learning-based corpus mining method according to claim 1, further comprising:
training at least two classifiers based on a pre-configured cold-start corpus serving as a training sample to obtain at least two corpus classification models.
3. The active learning-based corpus mining method according to claim 2, wherein the step of training at least two classifiers based on a pre-configured cold-start corpus as a training sample to obtain at least two corpus classification models comprises:
acquiring the pre-configured cold-start corpus serving as the training sample;
extracting N-gram text features of the cold-start corpus, and screening the N-gram text features to generate an N-gram dictionary of the cold-start corpus, wherein N is a positive integer greater than or equal to 1;
recording the corresponding positions of the N-gram text features in the N-gram dictionary as the feature expression of the cold-start corpus;
and training the at least two classifiers based on the feature expression by adopting an extensible machine learning library, to obtain the at least two corpus classification models.
4. The active learning-based corpus mining method according to claim 3, wherein the step of screening the N-gram text features to generate the N-gram dictionary of the cold-start corpus comprises:
counting the occurrence frequency of the N-gram text features of the cold-start corpus;
and selecting the N-gram text features whose occurrence frequency is within a preset frequency range, to obtain the N-gram dictionary of the cold-start corpus.
5. The active learning-based corpus mining method according to claim 3, wherein the step of extracting N-gram text features of the cold-start corpus comprises:
and extracting the N-gram text features of the cold-start corpus segment by segment, according to the preset byte segment length N, based on a start identifier and an end identifier added in advance at the beginning position and the end position of the cold-start corpus.
6. The active learning-based corpus mining method according to claim 1, wherein the step of classifying the unlabeled corpus by using at least two pre-trained corpus classification models to obtain a first classification type and a classification score output by at least two corpus classification models comprises:
extracting N-gram text features of the unlabeled corpus, and carrying out feature vectorization on the N-gram text features of the unlabeled corpus to obtain feature vectors of the unlabeled corpus;
and classifying the unlabeled corpus by utilizing at least two corpus classification models according to the feature vectors of the unlabeled corpus, so as to obtain a first classification type and classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus.
7. The active learning-based corpus mining method according to claim 1, wherein the step of selecting, as the corpus to be labeled, an unlabeled corpus whose first classification types are inconsistent and whose classification score meets the preset condition comprises:
adding up the classification scores of the unlabeled corpora whose first classification types are inconsistent to obtain the total score of each selected unlabeled corpus, and sorting the selected unlabeled corpora in descending order of total score;
and acquiring, according to the descending-order sorting result, a plurality of top-ranked unlabeled corpora as the corpora to be labeled.
8. The active learning-based corpus mining method according to claim 1, wherein the step of performing secondary classification processing on the corpus to be labeled to obtain the second classification type of the corpus to be labeled comprises:
performing secondary classification labeling according to the attribute of the corpus to be labeled, to obtain a newly labeled corpus;
and taking the result of the secondary classification labeling as the second classification type of the newly labeled corpus.
9. The active learning-based corpus mining method according to claim 8, wherein, after the step of obtaining the second classification type of the corpus to be labeled, the method further comprises:
taking the newly labeled corpus and the cold-start corpus as new training samples, inputting them into the at least two classifiers, and returning to the step of training the at least two classifiers to obtain the at least two corpus classification models.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the active learning-based corpus mining method of any one of claims 1-9.
CN202011141662.2A 2020-10-22 2020-10-22 Corpus mining method and device based on active learning and electronic equipment Active CN113407713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011141662.2A CN113407713B (en) 2020-10-22 2020-10-22 Corpus mining method and device based on active learning and electronic equipment


Publications (2)

Publication Number Publication Date
CN113407713A CN113407713A (en) 2021-09-17
CN113407713B true CN113407713B (en) 2024-04-05

Family

ID=77677366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011141662.2A Active CN113407713B (en) 2020-10-22 2020-10-22 Corpus mining method and device based on active learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN113407713B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611375A (en) * 2015-10-22 2017-05-03 北京大学 Text analysis-based credit risk assessment method and apparatus

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015003143A2 (en) * 2013-07-03 2015-01-08 Thomson Reuters Global Resources Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
CN103559181A (en) * 2013-11-14 2014-02-05 苏州大学 Establishment method and system for bilingual semantic relation classification model
CN106202177A (en) * 2016-06-27 2016-12-07 腾讯科技(深圳)有限公司 A kind of file classification method and device
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107644235A (en) * 2017-10-24 2018-01-30 广西师范大学 Image automatic annotation method based on semi-supervised learning
CN109992763A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Language marks processing method, system, electronic equipment and computer-readable medium
CN110209809A (en) * 2018-08-27 2019-09-06 腾讯科技(深圳)有限公司 Text Clustering Method and device, storage medium and electronic device
CN110209764A (en) * 2018-09-10 2019-09-06 腾讯科技(北京)有限公司 The generation method and device of corpus labeling collection, electronic equipment, storage medium
CN110008990A (en) * 2019-02-22 2019-07-12 上海拉扎斯信息科技有限公司 More classification methods and device, electronic equipment and storage medium
CN110298032A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Text classification corpus labeling training system
CN111177374A (en) * 2019-12-13 2020-05-19 航天信息股份有限公司 Active learning-based question and answer corpus emotion classification method and system
CN110751953A (en) * 2019-12-24 2020-02-04 北京中鼎高科自动化技术有限公司 Intelligent voice interaction system for die-cutting machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
The LODIE team (University of Sheffield) Participation at the TAC2015 Entity Discovery Task of the Cold Start KBP Track; Ziqi Zhang et al.; https://tac.nist.gov/publications/2015/participant.papers/TAC2015.lodie.proceedings.pdf; 2015-11-30; 1-11 *
Research and Application of a Hybrid Recommendation Algorithm Based on User Behavior and Item Content; Ai Changqing; China Master's Theses Full-text Database, Information Science & Technology; 2018-02-15; I138-2666 *

Also Published As

Publication number Publication date
CN113407713A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN108804532B (en) Query intention mining method and device and query intention identification method and device
CN109165302B (en) Multimedia file recommendation method and device
CN111695345B (en) Method and device for identifying entity in text
US9875301B2 (en) Learning multimedia semantics from large-scale unstructured data
CN109635103B (en) Abstract generation method and device
CN110321537B (en) Method and device for generating file
CN112104919B (en) Content title generation method, device, equipment and computer readable storage medium based on neural network
CN111386686B (en) Machine reading understanding system for answering queries related to documents
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
WO2008124368A1 (en) Method and apparatus for distributed voice searching
CN108369806B (en) Configurable generic language understanding model
CN111324700A (en) Resource recall method and device, electronic equipment and computer-readable storage medium
CN107527619A (en) The localization method and device of Voice command business
CN111341308A (en) Method and apparatus for outputting information
WO2019173085A1 (en) Intelligent knowledge-learning and question-answering
CN111078849B (en) Method and device for outputting information
CN112364235A (en) Search processing method, model training method, device, medium and equipment
CN114756677B (en) Sample generation method, training method of text classification model and text classification method
CN114298007A (en) Text similarity determination method, device, equipment and medium
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
CN111428011A (en) Word recommendation method, device, equipment and storage medium
CN112685534B (en) Method and apparatus for generating context information of authored content during authoring process
CN113407713B (en) Corpus mining method and device based on active learning and electronic equipment
CN111753126A (en) Method and device for video dubbing
CN103870476A (en) Retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051736

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant