CN114328917A - Method and apparatus for determining label of text data

Info

Publication number
CN114328917A
Authority
CN
China
Prior art keywords
text data
label
data
prediction model
text
Prior art date
Legal status
Pending
Application number
CN202111584829.7A
Other languages
Chinese (zh)
Inventor
赵新歌
凌悦
付宇
Current Assignee
Shengdoushi Shanghai Science and Technology Development Co Ltd
Original Assignee
Shengdoushi Shanghai Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Shengdoushi Shanghai Technology Development Co Ltd filed Critical Shengdoushi Shanghai Technology Development Co Ltd
Priority to CN202111584829.7A priority Critical patent/CN114328917A/en
Publication of CN114328917A publication Critical patent/CN114328917A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and an apparatus for determining labels of text data. The method comprises: acquiring text data to be predicted; and determining at least one label of the text data to be predicted using a label prediction model, wherein the label prediction model satisfies a constraint associated with the source of the text data. Addressing the differing sample sources and credibility of text data, as well as possible imbalance among data samples, the method determines and adjusts the parameters of the label prediction model using corresponding constraint conditions, thereby improving the accuracy of label determination and classification for the text data.

Description

Method and apparatus for determining label of text data
Technical Field
The present application relates to data processing, and more particularly, to a method, an apparatus, and a computer storage medium for determining labels of text data for multi-label classification.
Background
In service industries such as catering, timely acquisition of user feedback on products and services helps improve the product quality and service level of stores. User feedback information can not only improve products and services, but also guide the future business of stores.
User feedback information is generally obtained by a customer service department while handling user evaluations and complaints; key features in the feedback are extracted manually or automatically and classified by business category. Schemes for classifying feedback information have evolved from traditional binary classification (e.g., good/bad ratings) to multiclass evaluation methods. Compared with binary and multiclass methods, multi-label classification of feedback information uses more sophisticated classification algorithms and models, can target more feature dimensions of the feedback, and achieves better feature extraction. How to classify text data with multiple labels more accurately has therefore become a problem to be solved.
Disclosure of Invention
To at least partially overcome the above drawbacks, embodiments of the present application propose a method and an apparatus for determining labels of text data. In the process of multi-label determination of text data, in particular feedback text data, for multi-label classification, the method and apparatus account for the differing sources and credibility of text data and for sample distribution imbalance, determining and adjusting the parameters of a label prediction model using corresponding constraints so as to improve the accuracy of label determination and classification.
According to an aspect of the application, a method for determining a label of text data is proposed, comprising: acquiring text data to be predicted; and determining at least one label of the text data to be predicted using a label prediction model, wherein the label prediction model satisfies a constraint associated with a source of the text data.
According to another aspect of the present application, there is provided an apparatus for determining a label of text data, comprising: an acquisition unit configured to acquire text data to be predicted; a tag prediction model configured to determine at least one tag of the text data to be predicted, wherein the tag prediction model satisfies a constraint associated with a source of the text data.
According to yet another aspect of the application, a computer-readable storage medium is proposed, on which a computer program is stored, the computer program comprising executable instructions which, when executed by a processor, carry out a method according to the above.
According to yet another aspect of the present application, an electronic device is provided, including: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to execute the executable instructions to implement the method according to the above.
With the above scheme for determining labels of text data, on the basis of extracting the language habits and characteristics of evaluations expressed in user feedback text data, a loss function is used as the model constraint. The differing value of text data from different sources (e.g., feedback from member and non-member users) to the labeling of feedback text data can thus be fully considered; at the same time, the credibility of the text data is considered and weighted, and the negative influence of training-data sample imbalance during model training is reduced. Compared with traditional multi-label determination and classification schemes for text data and their model training processes, the scheme of the application optimizes the parameters of the label prediction model, significantly improves the labeling precision of the multi-label determination algorithm and model, and improves the accuracy of multi-label classification results.
Drawings
The above and other features and advantages of the present application will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
Fig. 1 is a schematic logic block diagram of a system for determining a label for text data according to one embodiment of the present application.
Fig. 2 is a schematic flow chart diagram of a method for determining a label for text data according to one embodiment of the present application.
Fig. 3 is a schematic structural block diagram of an apparatus for determining a tag of text data according to an embodiment of the present application.
FIG. 4 is a block diagram of a schematic structure of an electronic device according to one embodiment of the present application.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. In the drawings, the size of some of the elements may be exaggerated or distorted for clarity. The same reference numerals denote the same or similar structures in the drawings, and thus detailed descriptions thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures, methods, or operations are not shown or described in detail to avoid obscuring aspects of the application.
In this document, a label determination method and apparatus for text data according to embodiments of the present application are introduced using, as an example, user evaluations in feedback text related to food and services, collected in the feedback scenario of restaurants and stores in the catering industry. This scenario is merely an example and not a limitation; one skilled in the art may apply the label determination scheme to a variety of scenarios and industries requiring multi-label classification of data. Text data generally refers to a collection of natural-language words in the form of sentences or paragraphs. Depending on the application scenario, the text data may come from user feedback text, such as in a customer service scenario, or from text data containing specific intentions and topics obtained from other sources; key feature information is extracted from the text data, which is then labeled and multi-label classified based on the multi-dimensional attributes of its intentions and topics. For example, in a customer service scenario the acquired user feedback may be recorded in voice or video format, in which case it must first be converted into textual form by speech recognition or other technology before multi-label determination and classification.
Compared with traditional binary and multiclass classification, multi-label determination and classification of text data against a label system can extract characteristics and differences across more dimensions or attributes of the text data. The results of binary classification (e.g., the positive and negative types "good" and "bad") are mutually exclusive, and the probabilities of the text data falling into the two types sum to 1. Multiclass classification (e.g., the four seasonal types spring, summer, autumn, and winter) expands the number of types; correspondingly, the probabilities of the text data falling into the several types still sum to 1, and mutual exclusivity between types is preserved. The multi-label classification process, by contrast, evaluates differences along distinct attribute and feature dimensions and determines the probability that the text data possesses the attribute or feature dimension corresponding to each label. In general, the probabilities with which text data is determined or labeled with different labels are independent of one another. Depending on the label hierarchy, the text data may have one or more corresponding labels with their probabilities. Generally, determining the labels of text data for multi-label classification amounts to deciding, for each label, whether the text data satisfies a predetermined threshold condition on the attribute or feature dimension that the label represents (e.g., whether the probability of possessing the corresponding attribute exceeds a certain threshold). The models and algorithms for multi-label determination and classification of text data thus generally differ from those for binary and multiclass classification.
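To make the contrast concrete, a minimal PyTorch sketch follows; the logit values and the 0.5 threshold are arbitrary illustrative choices, not values fixed by the application:

```python
import torch

logits = torch.tensor([1.2, -0.4, 2.0, 0.3])  # arbitrary scores for 4 types/labels

# Binary/multiclass classification: softmax makes the types mutually
# exclusive, so the probabilities sum to 1.
p_multiclass = torch.softmax(logits, dim=-1)
print(p_multiclass.sum())  # tensor(1.)

# Multi-label determination: an independent sigmoid per label, so the
# probabilities are unrelated and need not sum to 1.
p_multilabel = torch.sigmoid(logits)

# The text data receives every label whose probability satisfies the
# predetermined threshold condition.
threshold = 0.5
labels = (p_multilabel > threshold).nonzero(as_tuple=True)[0]
print(p_multilabel, labels)
```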
In the multi-label determination and classification of text data, at least one label of the text data is generally determined and predicted based on the match between attributes or features of the text data (those characterizing its subject or intention) and the attributes or dimensions of the labels in the label system. When text data comes from different sources (for example, feedback text provided by user groups of different levels and types), the influence of the source cannot be reflected by text features extracted from the data alone. For example, member and non-member customers of a restaurant store may enjoy different treatment owing to their different consumption frequency and amounts, and a customer service system may prefer to focus on feedback provided by member customers and to preferentially meet their future requirements for food and services. In the label determination and classification process, the feedback text data of a user group such as members should accordingly carry a higher, more important influence than that of ordinary users, so that the models and algorithms determining the labels have better feedback sensitivity and classification accuracy for user groups of higher rank and authority. If only the intentions and subjects in the text data are extracted, it is impossible to distinguish whether feedback comes from member or non-member customers; that is, text features cannot represent the source information of the text data, nor the feedback value that the source confers on it. A multi-label classification method based only on text features thus considers the language features of the feedback while ignoring the differing value of feedback from different user groups, so its classification and labeling performance is not good enough. Furthermore, when the distribution of sample counts in the model training data used for multi-label determination and classification is imbalanced, labeling and classification cannot be optimized.
Beyond the data differences caused by the source of the text data, the reliability and accuracy of the text data (hereinafter, "credibility") also affect the accuracy of label determination, prediction, and classification. For example, when models and algorithms are trained on labeled text data samples, the higher the credibility of the labels (e.g., the probability that a label is correctly assigned), the higher the quality of the labeled text data, and the better the resulting training and tuning. Conversely, the lower the credibility of a label (e.g., incorrectly labeled text data, or text data that is consistently mislabeled), the more negatively it affects training and tuning. For multi-label determination and classification tasks, higher credibility weights or ratings should therefore be assigned to text data labeled with high accuracy, and to text data from sources that reliably provide accurately labeled data.
Distinguishing the source and the credibility of text data thus makes it possible to exploit high-quality text data samples effectively.
Furthermore, during the training of models and algorithms, the distribution of individual labels in the labeled text data samples used for training may not be sufficiently uniform relative to the label system. The number of text data samples carrying certain labels may be so small that the parameters associated with those labels are insufficiently optimized during training and tuning, so that high accuracy cannot be achieved for them. There may therefore also be a need to improve the classification performance of models and algorithms under sample imbalance.
Fig. 1 illustrates a logical structure of a system for determining tags of text data for multi-tag classification according to an embodiment of the present application.
A label prediction model or algorithm 110, the core structure for multi-label determination and classification, predicts at least one label 103 associated with text data 101 to be predicted, based on the obtained text data 101 whose labels are to be predicted. The multi-label determination or prediction result is not merely a label 103; it also includes the probability (e.g., a numerical value or percentage between 0 and 1) with which the text data is determined to have the label 103.
The text data 101 may come from user feedback information collected by the customer service department of a restaurant. The user feedback text data 101 may be collected and stored by each store, or collected and provided centrally by the customer service department of a superior store or the headquarters of a restaurant chain. The text data 101 may be stored in a local server or database of a store or business and obtained in a wired or wireless manner through a dedicated data interface or user interface, or stored in a server or database provided remotely or over a network (e.g., a cloud server) and obtained through an interface similar to or different from the local case.
The text data 101, which may also be referred to as a text corpus, is a symbolic natural-language sentence or combination of sentences, typically comprising an ordered combination of word symbols. Before the text data 101 is used, it may be preprocessed to obtain processed text data containing the keywords and other elements that make up the sentences. When the text data contains multiple sentences (e.g., sentence pairs), the processed text data may further include sentence markers, such as the first-sentence marker CLS, for segmenting and marking sentences. Preprocessing may include deleting unnecessary punctuation, conjunctions, adverbs, and inflectional words, as well as other symbols irrelevant to recognizing the text data 101 (e.g., emoticons and pictures).
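As an illustration of such preprocessing, a short Python sketch is given below; the specific symbol ranges and collapsing rules are assumptions, since the application does not fix them:

```python
import re

def preprocess(text: str) -> str:
    """Illustrative cleanup before feature extraction: strip emoji and
    pictographs, and collapse repeated punctuation. These rules are
    examples only, not the application's fixed preprocessing."""
    # Remove emoji and miscellaneous pictographic symbols.
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    # Collapse runs of repeated punctuation such as "!!!" to one mark.
    text = re.sub(r"([!?,.;:~！？，。；：])\1+", r"\1", text)
    return text.strip()

print(preprocess("外卖太慢了！！！😡 汉堡都凉了。。。"))  # -> "外卖太慢了！ 汉堡都凉了。"
```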
The label prediction model 110 may further include a feature extraction submodel 111 and a classification submodel 112. The feature extraction submodel 111 is used to extract the text features 102 of the text data 101 to be predicted. The classification submodel 112 then computes at least one label 103 associated with the text data 101 and its probability as a multi-label determination and classification result based on the text features 102.
The text feature 102 of the text data 101 is feature data extracted from the text data 101 that characterizes the intention or subject contained in it, carrying attribute or dimension information related to each label of the label system. The feature extraction submodel 111 may be generated by an algorithm or model based on natural language processing (NLP) or the like, and may employ, for example, a BERT model. Those skilled in the art will appreciate that the feature extraction submodel 111 may also be implemented using conventional algorithms or models, or using machine learning or neural network model structures; the neural network model may be a deep neural network (DNN) or a convolutional neural network (CNN) structure. In particular, a model suited to large-scale language processing, such as RoBERTa (e.g., RoBERTa-wwm), may also be employed. RoBERTa is a network model pre-trained on data such as Wikipedia and news corpora; further pre-training it with calibrated, unlabeled user feedback from restaurant stores as training data allows it to better learn text features reflecting the language habits of user evaluations. The RoBERTa-wwm model extracts word vectors carrying context information as text features.
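For illustration only, text features of this kind could be extracted with the Hugging Face transformers API as sketched below; the checkpoint hfl/chinese-roberta-wwm-ext is an assumed public stand-in, since the application names "RoBERTa-wwm" without fixing a concrete checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

def extract_features(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    # The hidden state of the [CLS] token serves as a context-aware
    # feature vector for the whole text.
    return out.last_hidden_state[:, 0, :]

features = extract_features(["汉堡很好吃", "送餐太慢了"])
print(features.shape)  # (2, 768)
```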
The text data acquired, e.g., from a customer service system includes not only the ordinary text data 101 to be predicted, but also text data 104 without labels (i.e., text data whose text features may have been extracted but which carries no label, or which carries neither text-feature annotation nor label) and labeled text data 105 that carries labels. The text data 101 to be predicted may include text data 104, but generally does not include the labeled text data 105. The labeled text data 105 may be text data whose text features are not annotated. The unlabeled text data 104 and the labeled text data 105 may each come from the customer service system, be manually labeled and calibrated by business personnel, or come from specialized or standardized training data sets provided by other associated systems, servers, databases, etc.
Since the process of extracting features from the text data 101 (or the processed text data) does not require label information, the feature extraction submodel 111 may be pre-trained in advance (pre-training 121) on training data composed of the text data 104 to obtain a pre-trained model. According to embodiments of the application, the feature extraction submodel 111 may also be trained directly on a database or data platform containing massive text data, such as a user feedback database provided and maintained by restaurants and enterprises or by a third-party service provider.
The classification submodel 112, which predicts and determines the labels 103 of the text data 101 based on the text features 102, may be constructed using any model or algorithm capable of multi-label classification. It may be based on conventional algorithms or model structures, and may be implemented with machine learning or neural network model structures; the neural network model may be a deep neural network (DNN) or a convolutional neural network (CNN) structure. Further, a graph neural network (GNN) model structure may be used in the multi-label determination and classification process to assist the task, for instance by introducing, through graph data features, other data or information related or unrelated to the text data so as to improve the performance of the multi-label determination and classification model. GNNs include, for example, the graph convolutional network (GCN) and the graph attention network (GAT).
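A minimal sketch of a classification submodel of this kind is a single linear layer with an independent sigmoid per label; the feature dimension of 768, the label count of 20, and the 0.5 threshold below are assumptions:

```python
import torch
from torch import nn

class MultiLabelHead(nn.Module):
    """Maps a text-feature vector to independent per-label probabilities."""
    def __init__(self, feature_dim: int = 768, num_labels: int = 20):
        super().__init__()
        self.linear = nn.Linear(feature_dim, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # One sigmoid per label: the outputs are not mutually exclusive
        # and need not sum to 1.
        return torch.sigmoid(self.linear(features))

head = MultiLabelHead()
probs = head(torch.randn(2, 768))  # features from the extraction submodel
predicted = probs > 0.5            # labels whose probability clears the threshold
```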
A model is characterized by its architecture and parameters: the architecture is the precondition for realizing the model's basic functions, while the parameters determine its functional details and performance. Depending on the function implemented, the architecture and parameters, in particular the parameters, must satisfy constraints that define and limit the model. Under the control of these constraints, the architecture and parameters meet the functional and design requirements, and the model can be further optimized. The generated classification submodel 112 therefore needs to satisfy its constraint so that its model parameters, in particular optimal ones, can be determined and the label 103 of the text data 101 to be predicted can be determined or predicted accurately. The constraint satisfied by the classification submodel 112 is also an important part of the constraint on the multi-label determination and classification function satisfied by the label prediction model 110. Because the feature extraction submodel 111 performs text feature extraction and can be pre-trained on unlabeled text data, setting the constraint of the classification submodel 112 and solving it to obtain the optimal architecture and parameters are key steps in creating the classification submodel 112.
The constraint may be chosen in various ways and may include, for example, a polynomial or a function with a preset condition. The loss function is a typical model constraint: the value it computes for the represented physical, logical, or process quantity can be optimized by solving it under a preset condition (e.g., a threshold or threshold range), so that the cost or loss value falls into a local or global optimum. According to embodiments of the application, the constraint of the classification submodel 112 may take the form of a loss function, whose value must satisfy a preset condition for the model parameters to be in an optimal state. The preset condition may be that the loss value reaches a maximum or minimum target threshold, or falls within a predetermined range of values, for example, that the loss attains a minimum or falls within an ε confidence interval of the target threshold.
As described above, it is desirable to perform multi-label determination and classification using algorithms or models that can distinguish the source and credibility of text data and eliminate or mitigate sample imbalance; the constraint of the classification submodel 112 should therefore contain influence factors related to such information. A loss function that takes the source and credibility of text data into account should include different influence factors (for example, weights) set for text data of different sources and of different credibility, which ultimately act in the computation of the loss value.
The selection and setting of the loss function constraining the label prediction model 110, and in particular the classification submodel 112, according to embodiments of the present application are detailed below.
Typical methods for the sample imbalance problem include interpolation, weight adjustment, and the like. Interpolation is a traditional way of expanding sample data, but it is too complex for the multi-label determination and classification of text data, and its effect becomes difficult to control as the number of text data samples grows. Weight adjustment therefore yields better results when balancing text data samples.
A binary cross-entropy (BCE) loss function with fixed weights is typically employed as the loss function of a binary classification model. Facing sample imbalance, the BCE loss can increase the weight of samples of under-represented classes to emphasize their importance and stress the model's classification performance on those classes during training; however, it can only use fixed weights and cannot adjust them over the course of training.
Equation (1) shows the general expression of the BCE loss function constraining a classification model usable for multi-label classification:

    L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C}\left[ y_{i,c}\log\hat{y}_{i,c} + (1-y_{i,c})\log(1-\hat{y}_{i,c}) \right]    (1)

where:

n is the number of text data samples input to the model;

C is the number of labels in the label system of the multi-label determination and classification process;

L is the total loss over the n text data samples;

y_{i,c} is the actual probability that the i-th text data sample has label type c (i.e., that the sample belongs to the category characterized by label c), where i and c are positive integers with i = 1, 2, …, n and c = 1, 2, …, C. For example, y_{i,c} is 0 or 1, where 1 indicates the label is present and 0 indicates it is absent; y_{i,c} may also be set to other values to represent the true probability distribution;

ŷ_{i,c} is the predicted probability that the i-th text data sample has label type c (i.e., that the sample is classified into the category characterized by label c). Corresponding to the values of y_{i,c}, ŷ_{i,c} may take a value between 0 and 1 or be selected from another range of values greater than 0.
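Equation (1) can be sketched directly in PyTorch; the tensor shapes and the ε guard against log(0) are implementation choices, not details fixed by the application:

```python
import torch

def bce_loss(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """Equation (1): binary cross-entropy over n samples and C labels.
    y_true and y_pred have shape (n, C), with y_pred in (0, 1)."""
    eps = 1e-7  # guard against log(0)
    y_pred = y_pred.clamp(eps, 1 - eps)
    per_label = -(y_true * y_pred.log() + (1 - y_true) * (1 - y_pred).log())
    return per_label.sum(dim=1).mean()  # (1/n) * sum_i sum_c [...]

y_true = torch.tensor([[1., 0., 1.], [0., 0., 1.]])
y_pred = torch.tensor([[0.9, 0.2, 0.7], [0.1, 0.3, 0.8]])
print(bce_loss(y_true, y_pred))
```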
According to embodiments of the application, a Focal loss function may be used to introduce influence-factor terms carrying the source and credibility of the text data and mitigating the sample distribution imbalance.
The Focal loss function can be expressed by equation (2):

    L_{fl} = -\alpha\, y\, (1-\hat{y})^{\gamma} \log\hat{y} - (1-\alpha)(1-y)\, \hat{y}^{\gamma} \log(1-\hat{y})    (2)

where L_{fl} is the loss for the event y, y is the true probability that the event occurs, ŷ is the predicted probability that the event occurs, α is a balance factor for balancing positive and negative samples, and γ is a focusing factor.

When the event y is the multi-label determination and classification process determining that the i-th text data sample has label c, equation (2) becomes equation (3):

    L_{i,c} = -\alpha\, y_{i,c}\, (1-\hat{y}_{i,c})^{\gamma} \log\hat{y}_{i,c} - (1-\alpha)(1-y_{i,c})\, \hat{y}_{i,c}^{\gamma} \log(1-\hat{y}_{i,c})    (3)

where L_{i,c} is the loss incurred when the i-th text data sample is determined to have label type c (i.e., when the sample is classified into the category characterized by label c).
Therefore, introducing the balance factor and the focusing factor in the Focal loss function effectively adjusts for the uneven distribution of text data samples across labels and improves the degree to which the model's parameters are optimized for labels with few text data samples, thereby improving determination and classification accuracy both for those specific labels and for all labels.
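A sketch of equation (3) follows; α = 0.25 and γ = 2.0 are values commonly used in the focal-loss literature, not values fixed by the application:

```python
import torch

def focal_loss(y_true: torch.Tensor, y_pred: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Equation (3): per-sample, per-label focal loss L_{i,c}.
    alpha balances positive and negative samples; gamma down-weights
    well-classified examples so rare labels are not drowned out."""
    eps = 1e-7
    y_pred = y_pred.clamp(eps, 1 - eps)
    pos = -alpha * y_true * (1 - y_pred) ** gamma * y_pred.log()
    neg = -(1 - alpha) * (1 - y_true) * y_pred ** gamma * (1 - y_pred).log()
    return pos + neg  # shape (n, C)
```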
For all n text data samples, the total loss L is given by equation (4):

    L = \frac{1}{C}\sum_{i,c} L_{i,c}    (4)

i.e., the losses L_{i,c} are summed over all i and c.
Considering the influence factors of the source and credibility of the text data, the loss L_{i,c} of each text data sample i can be weighted when computing the total loss, giving the weighted total loss L of equation (5):

    L = \frac{1}{C}\sum_{i,c} \omega_i L_{i,c}    (5)

where ω_i is the weight corresponding to the i-th text data sample, characterizing the influence of the source and/or the credibility of the text data (sample) on the loss function.
The weight ω_i is related to the source of the text data. For example, with respect to the user group to which the author of the feedback text data belongs, member users may be weighted higher than non-member users, so that the loss function pays more attention to the intentions and subjects contained in feedback given by member users. Denoting the portion of the weight ω_i associated with the source as ω_a, the ω_a corresponding to a member user is greater than the ω_a corresponding to a non-member user; for example, the member user's ω_a may be a multiple (e.g., 3, 5, or 10 times) of the non-member value. When the source of the text data is divided into a plurality of sources with different levels (e.g., user groups with multiple levels, or memberships of different grades), a different value of ω_a may be set for each level. Each level may correspond to a weight value, with higher levels receiving larger weight values, or larger multiples relative to the lowest level.
The weight ω_i is also related to the credibility of the text data. Credibility generally refers to the confidence that the label of labeled text data is correct; it may also be understood as the rate or probability with which the labeled label is the correct label of the text data. Such labeled text data is typically used as training data for the classification submodel 112 to determine, optimize, and adjust the parameters of the model. The credibility of the labels of labeled text data provided by different sources typically differs, and may also differ between labeled text data provided by sources of the same type, or between different labeled text data provided by the same source. A weight portion ω_b associated with credibility can be set for text data of different credibility; for example, the text data may be grouped by credibility, with the credibility of the text data in each group falling into the corresponding range of credibility values, and a corresponding ω_b set for each group. The higher the credibility (i.e., the higher the labeling accuracy) of a group of labeled text data, the larger its ω_b. According to an embodiment of the application, when the customers of a restaurant are divided into members and non-members, source weight portions of 5ω_1 and ω_1 are set for member users and non-member users, respectively. Then, the labeled feedback text data from the member and non-member users are sampled and checked for label quality and grouped by accuracy; if the ratio of the accuracy of each group of text data to the accuracy of the most accurate group (R_0) is R_k (where k is the index of the accuracy group), the credibility weight portion ω_b corresponding to each group can be calculated from R_0 and R_k (e.g., as the ratio R_k itself).
The weight ω_i may be calculated by a weight determination unit, model, or algorithm based on at least one of the source and the credibility of the text data, and then provided to the loss function, e.g., in the form of an influence factor; alternatively, it may be obtained from outside the system, e.g., via a data interface, as part of the input associated with the obtained text data.
The weight ω_i, which introduces the source and credibility information of the text data, can be expressed as the product of the two weight portions ω_a and ω_b, i.e., ω_i = ω_a × ω_b, which is equivalent to double-weighting each text data sample based on both source and credibility.
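Combining the pieces, a sketch of the double weighting and of equation (5) is given below, reusing the focal_loss sketch above. The 5x member multiplier follows the example given earlier, the group accuracies are invented for illustration, and reading the R_0/R_k calculation as a simple ratio is an assumption:

```python
import torch

def sample_weights(is_member: torch.Tensor, group_accuracy: torch.Tensor,
                   best_accuracy: float) -> torch.Tensor:
    """omega_i = omega_a * omega_b for each text data sample."""
    # Source portion omega_a: member users weighted 5x non-members,
    # as in the example above.
    omega_a = torch.where(is_member, torch.tensor(5.0), torch.tensor(1.0))
    # Credibility portion omega_b: ratio of each sample's accuracy group
    # to the best group's accuracy R0 (one plausible reading).
    omega_b = group_accuracy / best_accuracy
    return omega_a * omega_b

def weighted_total_loss(loss_ic: torch.Tensor, omega_i: torch.Tensor,
                        num_labels: int) -> torch.Tensor:
    """Equation (5): L = (1/C) * sum_{i,c} omega_i * L_{i,c}."""
    return (omega_i.unsqueeze(1) * loss_ic).sum() / num_labels

loss_ic = focal_loss(torch.tensor([[1., 0.], [0., 1.]]),
                     torch.tensor([[0.8, 0.3], [0.4, 0.9]]))
omega = sample_weights(torch.tensor([True, False]),
                       torch.tensor([0.95, 0.80]), best_accuracy=0.95)
print(weighted_total_loss(loss_ic, omega, num_labels=2))
```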
According to an embodiment of the application, the weight ω_i can be applied not only to the Focal loss function but also to other loss functions, such as the BCE loss function described above.
If labeled text data 105 is obtained, it may be used as training data to form a training data set for training the classification submodel 112. Before the classification submodel 112 is used, its parameters may be pre-trained 122 on the training data set.
During use of the system, the labels 103 determined by the label prediction model 110 may also be calibrated, yielding label-calibrated text data. Such data is used as incrementally updated training data with which the parameters of the overall model, or of individual units or submodels (e.g., at least one of the feature extraction submodel 111 and the classification submodel 112), are again fine-tuned 122. Fine-tuning and updating can bring the parameters of the label prediction model 110 closer to optimal, and can adapt the model parameters and its determination and classification performance to changes in the text data over time.
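One incremental update step might then look as follows, reusing the head, focal_loss, and weighted_total_loss sketches above; the AdamW optimizer and the learning rate are assumptions:

```python
import torch

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-5)

def fine_tune_step(features: torch.Tensor, y_true: torch.Tensor,
                   omega_i: torch.Tensor, num_labels: int) -> float:
    """One fine-tuning step of the classification submodel on
    label-calibrated text data."""
    probs = head(features)                                    # classification submodel
    loss_ic = focal_loss(y_true, probs)                       # equation (3)
    loss = weighted_total_loss(loss_ic, omega_i, num_labels)  # equation (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```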
As described above, the feature extraction submodel 111 may be pre-trained in pre-training 121 and the classification submodel 112 pre-trained or fine-tuned in pre-training 122, respectively; alternatively, the two submodels may be pre-trained or fine-tuned together as the whole label prediction model 110, with operations 121 and 122 merged into a single pre-training or fine-tuning operation. When pre-training and fine-tuning the whole model, a training data set of labeled training data, such as the text data 105, is used. While the classification submodel 112 or the whole label prediction model 110 is pre-trained, the labeled training data set may also be used to fine-tune the feature extraction submodel 111.
The multi-label classification system may also statistically analyze the generated labels 103, obtaining analysis data 106 such as the label distribution of the text data, and present it to the customer service and management staff of catering stores in the form of text, reports, diagrams, and the like, to guide customer service and to inform policy-making for stores or enterprises.
Fig. 2 illustrates an example flow of a method for determining labels of text data for multi-label classification in accordance with an embodiment of the present application. Wherein portions that are the same as or similar to the system flow presented in fig. 1 will not be described in detail.
The method mainly comprises a step S210 for obtaining data to be predicted and a step S240 for determining at least one label of text data to be predicted by using a trained label prediction model. The label prediction model 110 may be generated through a modeling and training process, and specifically may include a step S220 of generating the label prediction model and a step S230 of determining parameters of the label prediction model based on training data.
The data to be predicted acquired in step S210 is text data without a label.
In step S220, the label prediction model is limited by a constraint condition that is related at least to the source of the text data. Further, the constraint may be related to the credibility of the (labeled) text data, and it should be able to eliminate or reduce the effect of imbalance in the text data samples corresponding to one or more labels. Step S220 further includes sub-steps S221 and S222, which respectively generate the two submodels included in the label prediction model 110: the feature extraction submodel 111 and the classification submodel 112.
Step S230 may further include determining the parameters of the two submodels. The parameters of the feature extraction submodel 111 are determined in sub-step S231 of step S230 by pre-training the submodel generated in sub-step S221 with training data composed of text data annotated with text features but without labels. The parameters of the classification submodel 112 are determined in sub-step S232 of step S230 by pre-training the submodel generated in sub-step S222 with training data composed of acquired labeled text data. In determining the classification submodel 112, the training process is supervised using the constraint constituted by the loss function 201.
In step S240, the labels of the text data to be predicted are determined using the label prediction model 110, where the label prediction model satisfies the constraint associated with the source of the text data. The constraint is characterized at least by the preset condition satisfied by the values of the loss function 201 with which the classification submodel 112 was pre-trained and adjusted in sub-step S232. Step S240 may comprise two sub-steps, S241 and S242: the trained feature extraction submodel 111 extracts the text features of the text data to be predicted in sub-step S241, and the trained (or fine-tuned, or updated with new training data) classification submodel 112 determines a plurality of labels of the text data based on the extracted text features in sub-step S242.
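Sub-steps S241 and S242 can be chained end to end, reusing the extract_features and head sketches above; the 0.5 threshold remains an assumption:

```python
import torch

def predict_labels(texts, threshold: float = 0.5):
    """S241: extract text features; S242: determine the labels of the
    text data to be predicted from those features."""
    with torch.no_grad():
        features = extract_features(texts)  # feature extraction submodel 111
        probs = head(features)              # classification submodel 112
    return [row.gt(threshold).nonzero(as_tuple=True)[0].tolist() for row in probs]

print(predict_labels(["汉堡很好吃，但是送餐太慢了"]))
```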
The generation of the label prediction model 110 and the determination of its parameters, i.e., steps S220 and S230, may be completed before the model is used to determine labels. Thus, when multi-label classification is performed with a label prediction model 110 whose parameters are already determined, the method may include only steps S210 and S240, while steps S220 and S230 belong to the modeling and training phase.
The method may further comprise a step S250 of generating label analysis data based on the label results obtained by the model, as indicated by the dashed box in fig. 2.
Fig. 3 presents an exemplary logical structure of an apparatus for determining multiple labels of text data for multi-label classification according to an embodiment of the present application.
The apparatus 300 may include an acquisition unit 310, a label prediction model 320, and a training unit 330.
The obtaining unit 310 is configured to obtain labeled text data and unlabeled text data, where the labeled text data constitutes the training data of the label prediction model 320 (in particular of its classification submodel), and the unlabeled text data serves as the text data to be predicted whose labels are to be determined.
The label prediction model 320 determines at least one label of the text data to be predicted and is limited by constraints associated with at least one of a source, a confidence level, and an unbalanced distribution of the text data samples. In addition to including a classification submodel, the label prediction model 320 also includes a feature extraction submodel. The feature extraction submodel is used for completing the function of extracting text features from the text data, and the classification submodel is used for completing the function of determining the labels of the text data based on the text features.
Before the label prediction model 320 is used, the model must be built and the parameters of the model and its submodels determined, i.e., the model must be pre-trained. The building and training of the model are done by the training unit 330, as indicated by the dashed box in Fig. 3. The training unit 330 may also use label-calibrated text data to form updated or incrementally updated training data with which it retrains, updates, or fine-tunes the parameters of the label prediction model 320 (including, e.g., its classification submodel) to optimize the model.
The apparatus 300 may further include an analyzing unit (not shown) for analyzing and counting tag data of the resulting text data, and providing related distribution information to management and business personnel. The device 300 may further comprise a weight determination unit (not shown) for determining weight information of the acquired text data associated with its origin and/or trustworthiness, or receiving the weight information from outside the device 300.
With the above scheme for determining labels of text data, on the basis of extracting the language habits and characteristics of evaluations expressed in user feedback text data, a loss function is used as the model constraint. The differing value of text data from different sources (e.g., feedback from member and non-member users) to the labeling of feedback text data can thus be fully considered; at the same time, the credibility of the text data is considered and weighted, and the negative influence of training-data sample imbalance during model training is reduced. Compared with traditional multi-label determination and classification schemes for text data and their model training processes, the scheme of the application optimizes the parameters of the label prediction model, significantly improves the labeling precision of the multi-label determination algorithm and model, and improves the accuracy of multi-label classification results.
It should be noted that although several modules or units of the system for determining labels of text data are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the application, the features and functionality of two or more of the modules or units described above may be embodied in a single module or unit; conversely, the features and functions of one module or unit described above may be further divided among a plurality of modules or units. Components shown as modules or units may or may not be physical units: they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of the application, which one of ordinary skill in the art can understand and implement without inventive effort.
In an exemplary embodiment of the present application, there is further provided a computer-readable storage medium, on which a computer program is stored, the program comprising executable instructions, which when executed by, for example, a processor, may implement the steps of the method for determining a tag of text data described in any one of the above embodiments. In some possible implementations, various aspects of the present application may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present application described in the method for determining a label of text data of the present specification, when the program product is run on the terminal device.
A program product for implementing the above method according to an embodiment of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic or optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
In an exemplary embodiment of the present application, there is also provided an electronic device that may include a processor, and a memory for storing executable instructions of the processor. Wherein the processor is configured to perform the steps of the method for determining a label of text data in any of the above embodiments via execution of the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method, or program product. Accordingly, aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
An electronic device 400 according to this embodiment of the present application is described below with reference to fig. 4. The electronic device 400 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one memory unit 420, a bus 430 that connects the various system components (including the memory unit 420 and the processing unit 410), a display unit 440, and the like.
Wherein the storage unit stores program code executable by the processing unit 410 to cause the processing unit 410 to perform steps according to various exemplary embodiments of the present application described in the method for determining a tag of text data of the present specification. For example, the processing unit 410 may perform the steps as shown in fig. 2.
The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 4201 and/or a cache memory unit 4202, and may further include a read-only memory unit (ROM) 4203.
The storage unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 430 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 400 may also communicate with one or more external devices 500 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 400 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 450. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 460. The network adapter 460 may communicate with other modules of the electronic device 400 via the bus 430. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the method for determining the tag of the text data according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

Claims (19)

1. A method for determining a label of text data, comprising:
acquiring text data to be predicted; and
determining at least one label of text data to be predicted using a label prediction model, wherein the label prediction model satisfies a constraint associated with a source of the text data.
2. The method of claim 1, wherein the constraint is further associated with a trustworthiness of the textual data.
3. The method of claim 1 or 2, wherein the label prediction model is trained using training data comprised of text data with labels.
4. The method of claim 3, wherein the label prediction model is trained using a loss function, and wherein the constraint condition comprises that a value of the loss function satisfies a preset condition.
5. The method of claim 4, wherein the loss function determines the value of the loss function by weighting the loss of each of the text data based on the source of the text data.
6. The method of claim 5, wherein the loss function determines the value of the loss function by weighting the loss of each of the text data based on the confidence level of the text data.
7. A method according to claim 5 or 6, wherein the loss function comprises a Focal loss function.
8. The method of claim 1, wherein determining at least one label of text data to be predicted using the label prediction model further comprises:
extracting text features of text data to be predicted;
at least one label of the text data to be predicted is determined based on the text features.
9. The method of claim 8, wherein the label prediction model comprises a feature extraction submodel for extracting the text features.
10. The method of claim 9, wherein the training data further comprises unlabeled text data, and wherein the unlabeled text data is used to determine parameters of the feature extraction submodel.
11. The method of claim 8, wherein the label prediction model comprises a classification submodel for determining labels of text data to be predicted.
12. The method of claim 11, further comprising adjusting parameters of at least one of the classification submodel and the tag prediction model using text data with calibrated tags.
13. The method of claim 1, wherein the label prediction model comprises a machine learning model structure or a neural network model structure.
14. The method of claim 1, further comprising generating analysis data for the determined label of the text data.
15. The method of claim 1, wherein the text data comprises user feedback data.
16. The method of claim 15, wherein the user feedback data comprises catering user feedback data.
17. An apparatus for determining a label of text data, comprising:
an acquisition unit configured to acquire text data to be predicted;
a tag prediction model configured to determine at least one tag of text data to be predicted, wherein the tag prediction model satisfies a constraint associated with a source of the text data.
18. A computer-readable storage medium, having stored thereon a computer program comprising executable instructions that, when executed by a processor, carry out the method according to any one of claims 1 to 16.
19. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the method of any of claims 1 to 16.
CN202111584829.7A 2021-12-16 2021-12-16 Method and apparatus for determining label of text data Pending CN114328917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111584829.7A CN114328917A (en) 2021-12-16 2021-12-16 Method and apparatus for determining label of text data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111584829.7A CN114328917A (en) 2021-12-16 2021-12-16 Method and apparatus for determining label of text data

Publications (1)

Publication Number Publication Date
CN114328917A true CN114328917A (en) 2022-04-12

Family

ID=81054434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111584829.7A Pending CN114328917A (en) 2021-12-16 2021-12-16 Method and apparatus for determining label of text data

Country Status (1)

Country Link
CN (1) CN114328917A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341946A (en) * 2023-01-18 2023-06-27 东风本田发动机有限公司 Automobile quality monitoring method and device based on neural network and network public opinion
CN117151053A (en) * 2023-11-01 2023-12-01 东莞信宝电子产品检测有限公司 Report automation realization method, system and medium
CN117151053B (en) * 2023-11-01 2024-02-13 东莞信宝电子产品检测有限公司 Report automation realization method, system and medium

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220412

Assignee: Baisheng Consultation (Shanghai) Co.,Ltd.

Assignor: Shengdoushi (Shanghai) Technology Development Co.,Ltd.

Contract record no.: X2023310000138

Denomination of invention: Method and device for determining labels for text data

License type: Common License

Record date: 20230714