CN117236384A - Training and predicting method and device for terminal machine change prediction model and storage medium - Google Patents

Training and predicting method and device for terminal machine change prediction model and storage medium

Info

Publication number
CN117236384A
Authority
CN
China
Prior art keywords
data
prediction model
user
terminal
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311056708.4A
Other languages
Chinese (zh)
Inventor
邱婉
赵学峰
纪春芳
郭曦煜
刘遥遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN202311056708.4A priority Critical patent/CN117236384A/en
Publication of CN117236384A publication Critical patent/CN117236384A/en
Pending legal-status Critical Current


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a training and predicting method and device for a terminal machine change prediction model and a storage medium. The method comprises the following steps: performing data preprocessing on a sample data set, wherein the sample data set comprises a preset number of sample data, and each sample data comprises: user data related to mobile communication and label information of a user, the label information being used for representing whether the user is a user who changes terminals within a set time; generating a global prior attention matrix based on the user data after the data preprocessing; and training a terminal machine-changing prediction model based on the sample data set after the data preprocessing to obtain a trained terminal machine-changing prediction model. The terminal machine-changing prediction model adopts a Transformer model that incorporates the global prior attention matrix. The standard dot-product attention distribution can be corrected and enhanced based on the global prior attention matrix, thereby effectively improving the prediction performance of the Transformer-based terminal machine-changing prediction model.

Description

Training and predicting method and device for terminal machine change prediction model and storage medium
Technical Field
The present application relates to the field of terminal services, and in particular, to a method, an apparatus, and a storage medium for training and predicting a terminal change prediction model.
Background
The sales of mobile terminals and matched products have become a strategic core of telecom operators, and accurately predicting user terminal replacement helps operators promote mobile terminal sales and expand market scale. However, user machine-changing behavior is affected by many factors, such as user age, consumption habits, mobile phone dependence, call behavior, and traffic usage behavior, which brings great challenges to terminal machine-changing prediction.
In the related art, in order to realize accurate marketing of an operator terminal and a matched product thereof, the influence of the above factors on the user machine changing behavior needs to be deeply mined, so that the user machine changing is accurately predicted. For example, through big data prediction technologies such as machine learning and deep learning, potential customers of terminal exchange are accurately identified, support is provided for terminal business marketing of operators, and marketing conversion rate is improved.
In view of its simplicity, effectiveness, and strong interpretability, the tree model has long played an important role in the field of structured data. However, with the application of the Transformer in fields such as natural language processing and computer vision, variants of the Transformer network structure have also appeared in many structured-data tasks, such as potential customer mining in the marketing field and bad-debt identification in the financial field.
The Transformer is a sequence-to-sequence model composed of an encoder and a decoder, and the core components of the encoder module comprise a multi-head attention module and a position-wise feed-forward network. The Transformer obtains the attention score between sample features through a dot-product operation on the intermediate-layer feature embeddings. Such a local attention distribution easily causes model overfitting and affects the robustness of the model. In addition, the training of the embeddings must serve the optimization target (for example, whether a user is a potential customer for 5G terminal replacement), so the attention distribution obtained in this way cannot very accurately represent the association relationships between features.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a method, an apparatus, and a storage medium for training and predicting a terminal machine-changing prediction model, which aim to effectively improve the prediction performance of the Transformer-based terminal machine-changing prediction model.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a training method for a terminal change prediction model, including:
Performing data preprocessing on a sample data set, wherein the sample data set comprises a preset number of sample data, and each sample data comprises: user data related to mobile communication and label information of a user, wherein the label information is used for representing whether the user is a user for terminal machine changing in a set time;
generating a global prior attention matrix based on the user data after the data preprocessing;
training a terminal machine-changing prediction model based on the sample data set after the data preprocessing to obtain a trained terminal machine-changing prediction model;
the terminal machine change prediction model adopts a model which is fused into the global priori attention matrix in a transducer.
In the above solution, the user data includes a plurality of feature data associated with a terminal service, and the generating a global prior attention matrix based on the user data after the data preprocessing includes:
constructing a feature prediction model corresponding to each feature data in the plurality of feature data, wherein the feature prediction model is used for predicting the target feature based on the rest feature data except the target feature data in the plurality of feature data, and the target feature is any feature data in the plurality of feature data;
Training each feature prediction model based on the preprocessed user data; based on the trained feature prediction models, attention values between the target feature data and the rest feature data corresponding to the feature prediction models are obtained;
a global prior attention matrix is generated based on the attention values between each of the target feature data and the remaining feature data.
In the above scheme, the method further comprises:
and correcting the global prior attention matrix based on the characteristic association attribute between any two characteristic data in the plurality of characteristic data.
In the above scheme, training a terminal machine-changing prediction model based on the sample data set after the data preprocessing to obtain a trained terminal machine-changing prediction model, including:
training a terminal machine-changing prediction model based on the sample data after the data preprocessing, and solving a predicted value of the terminal machine-changing prediction model;
calculating a loss value of the predicted value and a true value corresponding to the tag information based on a loss function;
and determining that the loss value converges or the iteration number of training reaches the set number of times, and obtaining the trained terminal machine-changing prediction model.
In the above scheme, the training the terminal machine-changing prediction model based on the sample data after the data preprocessing, and obtaining the predicted value of the terminal machine-changing prediction model includes:
inputting the sample data after the data pretreatment into an embedding layer in a transducer to obtain a first vector;
processing the first vector based on a first normalization layer to obtain a second vector;
obtaining an intermediate value representing a dot product attention distribution for the second vector based on a transducer's attention mechanism;
carrying out weighted summation on the intermediate value and the global priori attention matrix, and multiplying the weighted summation value and a value vector matrix to obtain a third vector;
adding the first vector and the third vector to obtain a fourth vector, and processing the fourth vector based on a second normalization layer to obtain a fifth vector;
calculating the fifth vector based on a feedforward neural network to obtain a sixth vector;
adding the sixth vector and the fourth vector, and obtaining a seventh vector based on dimension transformation;
and inputting the seventh vector into a multi-layer perceptron to obtain a predicted value of the sample data.
In a second aspect, an embodiment of the present application provides a method for predicting a terminal change, including: inputting user data of a user to be predicted into a terminal machine-changing prediction model trained by the method according to the first aspect of the embodiment of the application to obtain a prediction probability value of terminal machine-changing of the user to be predicted.
In the above scheme, the method further comprises:
sequencing the prediction probability values of a plurality of users to be predicted;
selecting the top N users in descending order of prediction probability value, and determining the N users as the target group for terminal machine-changing service;
wherein N is a natural number greater than 1.
In a third aspect, an embodiment of the present application provides a training device for a terminal change prediction model, including:
the preprocessing module is used for preprocessing data of a sample data set, the sample data set comprises a preset number of sample data, and each sample data comprises: user data related to mobile communication and label information of a user, wherein the label information is used for representing whether the user is a user for terminal machine changing in a set time;
the global prior attention generation module is used for generating a global prior attention matrix based on the user data after the data preprocessing;
the training module is used for training a terminal machine-changing prediction model based on the sample data set after the data preprocessing to obtain a trained terminal machine-changing prediction model;
the terminal machine change prediction model adopts a model which is fused into the global priori attention matrix in a transducer.
In a fourth aspect, an embodiment of the present application provides a device for predicting a terminal change, including:
the prediction module is used for inputting the user data of the user to be predicted into the terminal machine-changing prediction model trained by the training device according to the third aspect of the embodiment of the application to obtain the prediction probability value of the terminal machine-changing of the user to be predicted.
In a fifth aspect, an embodiment of the present application provides a training device for a terminal change prediction model, including: a processor and a memory for storing a computer program capable of running on the processor, wherein the processor is adapted to perform the steps of the method according to the first aspect of the embodiment of the application when the computer program is run.
In a sixth aspect, an embodiment of the present application provides a prediction apparatus for a terminal exchange, including: a processor and a memory for storing a computer program capable of running on the processor, wherein the processor is adapted to perform the steps of the method according to the second aspect of the embodiments of the application when the computer program is run.
In a seventh aspect, embodiments of the present application provide a computer storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the method according to any of the embodiments of the present application.
According to the technical scheme provided by the embodiment of the application, data preprocessing is performed on a sample data set, wherein the sample data set comprises a preset number of sample data, and each sample data comprises: user data related to mobile communication and label information of a user, the label information being used for representing whether the user is a user who changes terminals within a set time; a global prior attention matrix is generated based on the user data after the data preprocessing; a terminal machine-changing prediction model is trained based on the sample data set after the data preprocessing to obtain a trained terminal machine-changing prediction model; and the terminal machine-changing prediction model adopts a Transformer model that incorporates the global prior attention matrix. In this way, the global prior attention matrix can be generated based on the user data after the data preprocessing, so that the global prior attention matrix can effectively capture the association relationships among the user data features; and by integrating the global prior attention matrix into the Transformer, the standard dot-product attention distribution can be corrected and enhanced based on the global prior attention matrix, thereby effectively improving the prediction performance of the Transformer-based terminal machine-changing prediction model.
Drawings
FIG. 1 is a flow chart of a training method of a terminal change prediction model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a global prior attention matrix in an example application of the present application;
fig. 3 is a schematic architecture diagram of a 5G terminal change prediction model with a global prior attention mechanism integrated in an application example of the present application;
fig. 4 is a schematic structural diagram of a training device for a terminal change prediction model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a prediction device for terminal exchange according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training device of a terminal change prediction model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a prediction device for terminal exchange according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings and examples.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In the related art, the Transformer obtains the attention score between sample features through a dot-product operation on the intermediate-layer feature embeddings. This dot-product attention distribution does not consider any prior information and depends entirely on the similarity measurement of a single sample's intermediate-layer embeddings, which easily causes model overfitting; moreover, the training of the embeddings must serve the optimization target, so the resulting attention distribution cannot very accurately represent the association relationships between features. Based on this, some global prior attention distribution may be added to compensate for the deficiency of the local attention distribution. Common prior attention distributions include the Gaussian distribution, the discrete uniform distribution, and the like; this prior assumes that the feature association relationships of the terminal machine-changing service conform to a certain statistical distribution.
However, prior attention distributions such as the Gaussian distribution and the discrete uniform distribution are too simple, and the association relationships between features are not fully learned from the data; in addition, such priors do not consider the association relationships between features from a practical service perspective. Thus, this simple assumption clearly does not conform to the actual business meaning, and it is difficult to achieve good results.
Based on the above, in various embodiments of the present application, a global prior attention matrix that accurately characterizes the association relationships between features and accords with the meaning of the terminal machine-changing service is learned from the input data, and the standard dot-product attention distribution is corrected and enhanced based on the global prior attention matrix, so as to effectively improve the prediction performance of the Transformer-based terminal machine-changing prediction model.
The embodiment of the application provides a training method of a terminal machine change prediction model, as shown in fig. 1, comprising the following steps:
step 101, performing data preprocessing on a sample data set, wherein the sample data set comprises a preset number of sample data, and each sample data comprises: the mobile communication terminal comprises user data related to mobile communication and label information of a user, wherein the label information is used for representing whether the user is a user for terminal machine changing within a set time.
Here, the sample data set may include a plurality of positive sample data including mobile communication-related user data and tag information indicating a user who terminal is switched for a set time, and a plurality of negative sample data including mobile communication-related user data and tag information indicating a user who terminal is not switched for a set time.
Here, the user data includes a plurality of feature data associated with the terminal exchange service. For example, the user data may include: the basic information, communication behavior information, telecommunication service handling condition and other relevant characteristic data of the user can be further used for carrying out modeling analysis based on the user data to predict the probability of terminal machine changing of the user in a set time. The set time may be set based on demand, for example, one week, one month, etc. It will be appreciated that the user data for each user in the sample data set is historical data prior to terminal change.
Here, the data preprocessing may include:
converting the characteristic data based on the type of the characteristic data;
the tag information is converted into a binary value.
For example, the feature data may be divided into continuous variables and category variables, and the continuous variables may be converted by using normalization operation to obtain feature data with consistent dimensions; for the category type variable, different values of all category characteristics can be coded uniformly, and a uniform category representation mode is obtained. For the tag information, the value of "whether the terminal is replaced in the set time" may be converted to 0 or 1.
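As an illustrative sketch of this preprocessing (the feature names and values below are hypothetical, not taken from the patent's data), continuous variables can be z-score normalized, category variables mapped to shared integer codes, and the label converted to 0/1:

```python
# Sketch of the described preprocessing: normalization for continuous
# features, uniform integer coding for category features, and binary
# conversion of the "replaced terminal within set time" label.
import math

def zscore(values):
    """Normalize a continuous feature to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def encode_categories(values):
    """Map each distinct category value to a shared integer code."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

ages = [28, 35, 42, 28]                                    # continuous
terminal_types = ["smart", "non-smart", "smart", "smart"]  # categorical
labels = ["yes", "no", "yes", "no"]                        # replaced?

age_norm = zscore(ages)
type_codes = encode_categories(terminal_types)
label_bin = [1 if v == "yes" else 0 for v in labels]
```

This keeps all continuous features on a consistent scale and gives every category feature a uniform representation, as the text describes.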
Step 102, generating a global prior attention matrix based on the user data after the data preprocessing.
Here, the user data of each user in the sample data set can be adaptively learned to obtain a global prior attention matrix capable of reflecting the association relationships between the features of the user data, so that the standard dot-product attention distribution in the Transformer can be corrected and enhanced, network overfitting caused by individual differences among users is prevented, and the prediction effect of the terminal machine-changing prediction model is further improved.
Step 103, training a terminal machine-changing prediction model based on the sample data set after the data preprocessing to obtain a trained terminal machine-changing prediction model; the terminal machine-changing prediction model adopts a Transformer model that incorporates the global prior attention matrix.
It can be understood that the embodiment of the application can generate the global prior attention matrix based on the user data after the data preprocessing, so that the global prior attention matrix can effectively capture the association relationships between the user data features; and by integrating the global prior attention matrix into the Transformer, the standard dot-product attention distribution can be corrected and enhanced based on the global prior attention matrix, thereby effectively improving the prediction performance of the Transformer-based terminal machine-changing prediction model.
Illustratively, the generating a global prior attention matrix based on the data-preprocessed user data includes:
constructing a feature prediction model corresponding to each feature data in the plurality of feature data, wherein the feature prediction model is used for predicting the target feature based on the rest feature data except the target feature data in the plurality of feature data, and the target feature is any feature data in the plurality of feature data;
training each feature prediction model based on the preprocessed user data; based on the trained feature prediction models, attention values between the target feature data and the rest feature data corresponding to the feature prediction models are obtained;
a global prior attention matrix is generated based on the attention values between each of the target feature data and the remaining feature data.
It should be noted that, in order to obtain the global prior attention matrix, that is, the global correlation between every two feature data, a prediction task needs to be constructed for each feature data, that is, a feature prediction model needs to be constructed. For example, a feature prediction model that predicts the target feature data from the remaining feature data can be constructed based on a tree model; after model training is completed, the prior attention values of the target feature data with respect to the remaining feature data are obtained from the SHAP values; and each feature data is traversed to generate the global prior attention matrix.
In an application example, assuming that the number of feature data included in the user data is k, a corresponding feature prediction model may be constructed for each feature data, i.e., k feature prediction models may be constructed. The feature prediction model may be understood as learning a mapping relationship from the remaining feature data to the virtual tag, wherein the target feature data is regarded as the virtual tag. For example, the prediction task of the mth feature data can be expressed as the following formula (1):
X_m = F_m(X_1, X_2, ..., X_(m-1), X_(m+1), ..., X_k)    (1)
wherein the virtual tag X_m is the dependent variable; the remaining feature data X_1, X_2, ..., X_(m-1), X_(m+1), ..., X_k are the independent variables; and F_m is the mapping function from the remaining feature data to the virtual tag. If the virtual tag is a continuous variable, the prediction task is a regression task with respect to the virtual tag, and the loss function may be the mean square error; if the virtual tag is a category variable, the prediction task is a classification task with respect to the virtual tag, and the loss function may be the cross-entropy loss.
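The per-feature prediction tasks of equation (1) can be sketched as follows. The patent fits a tree model per task and reads prior attention values from SHAP values; as a lightweight stand-in for that pipeline, this sketch uses the absolute least-squares coefficients of each target feature regressed on the standardized remaining features as that feature's row of the prior attention matrix. The synthetic data and the coefficient-based attention proxy are both assumptions made for illustration.

```python
# Sketch: one prediction task per feature (the feature itself is the
# virtual label), with coefficient magnitudes standing in for SHAP-based
# prior attention values. Diagonal entries (self-attention) stay zero.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # 200 users, k = 4 features
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]  # make features 0 and 1 correlated

def prior_attention_row(X, m):
    """Attention of target feature m toward the remaining features."""
    rest = np.delete(X, m, axis=1)
    rest = (rest - rest.mean(0)) / rest.std(0)   # comparable magnitudes
    target = X[:, m]
    coef, *_ = np.linalg.lstsq(rest, target - target.mean(), rcond=None)
    row = np.zeros(X.shape[1])
    row[np.arange(X.shape[1]) != m] = np.abs(coef)
    return row

# Traverse every feature to assemble the global prior attention matrix.
A = np.stack([prior_attention_row(X, m) for m in range(X.shape[1])])
```

As expected, the strongly correlated pair (features 0 and 1) receives a larger mutual attention value than unrelated pairs.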
Illustratively, the method further comprises:
and correcting the global prior attention matrix based on the characteristic association attribute between any two characteristic data in the plurality of characteristic data.
It should be noted that, for the terminal service, a strong correlation may exist between some feature combinations and a weak correlation between others. Based on this, feature association attributes may be classified based on expert experience, for example, into a first class representing strong correlation, a second class representing moderate correlation, and a third class representing weak correlation. For feature combinations of the third class, the attention value between the two features may be set to minus infinity, as shown in fig. 2, while the attention values between the remaining features are kept unchanged, so as to obtain the final global attention matrix. In this way, from the service perspective, feature combinations with extremely weak relevance at the service level can be further filtered out, and the concentration of the attention map is enhanced.
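A minimal sketch of this business-rule correction (the prior matrix values and the expert-labeled weak-pair list are illustrative, not from the patent):

```python
# Sketch: entries for expert-labeled weakest ("third class") feature
# pairs are set to -inf so a downstream softmax drives them to zero
# attention; all other learned entries are kept unchanged.
import numpy as np

prior = np.array([[0.0, 0.6, 0.2],
                  [0.5, 0.0, 0.3],
                  [0.1, 0.4, 0.0]])
weak_pairs = [(0, 2)]  # hypothetical "third class" combination

masked = prior.copy()
for i, j in weak_pairs:
    masked[i, j] = masked[j, i] = -np.inf
```

Because softmax(-inf) is 0, masked pairs contribute nothing to the final attention distribution while every other entry survives intact.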
The training the terminal machine change prediction model based on the sample data set after the data preprocessing to obtain a trained terminal machine change prediction model includes:
training a terminal machine-changing prediction model based on the sample data after the data preprocessing, and solving a predicted value of the terminal machine-changing prediction model;
calculating a loss value of the predicted value and a true value corresponding to the tag information based on a loss function;
and determining that the loss value converges or the iteration number of training reaches the set number of times, and obtaining the trained terminal machine-changing prediction model.
The loss function may be a cross-entropy loss function. Sample data with label information is used to train the terminal machine-changing prediction model, thereby achieving iterative optimization of the model parameters: a loss value between the predicted value and the true value corresponding to the label information is obtained based on the loss function, the iteration termination condition is then checked, and the trained terminal machine-changing prediction model is obtained.
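A minimal sketch of the binary cross-entropy computation described above, using hypothetical predicted probabilities and 0/1 labels:

```python
# Sketch: mean binary cross-entropy between predicted replacement
# probabilities and the binary labels; training would stop when this
# value converges or a set iteration count is reached.
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

loss = binary_cross_entropy([0.9, 0.2, 0.8], [1, 0, 1])
```

The loss approaches zero as predictions match the labels, which is the convergence signal the training step monitors.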
The training a terminal change prediction model based on the sample data after the data preprocessing, and solving a predicted value of the terminal change prediction model, includes:
Inputting the sample data after the data preprocessing into an embedding layer in the Transformer to obtain a first vector;
processing the first vector based on a first normalization layer to obtain a second vector;
obtaining an intermediate value representing a dot-product attention distribution for the second vector based on the Transformer's attention mechanism;
carrying out weighted summation on the intermediate value and the global priori attention matrix, and multiplying the weighted summation value and a value vector matrix to obtain a third vector;
adding the first vector and the third vector to obtain a fourth vector, and processing the fourth vector based on a second normalization layer to obtain a fifth vector;
calculating the fifth vector based on a feedforward neural network to obtain a sixth vector;
adding the sixth vector and the fourth vector, and obtaining a seventh vector based on dimension transformation;
and inputting the seventh vector into a multi-layer perceptron to obtain a predicted value of the sample data.
It should be noted that the terminal machine-changing prediction model in the embodiment of the present application differs from the existing Transformer model in that a mechanism for incorporating the global prior attention matrix is introduced: after the intermediate value representing the dot-product attention distribution is obtained, the intermediate value and the global prior attention matrix are subjected to weighted summation, and the weighted summation value is multiplied by the value vector matrix to obtain a third vector, which serves as the output of the fused attention mechanism. In this way, an attention distribution that incorporates the prior attention, that is, a more concentrated attention distribution, is obtained, which can effectively improve the generalization capability of the model, prevent network overfitting caused by individual differences among users, and further improve the prediction performance of the Transformer-based terminal machine-changing prediction model.
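The fusion step can be sketched in NumPy as follows. The mixing weight alpha and the softmax normalization of the prior matrix are assumptions made for the sketch; the text only specifies a weighted summation of the dot-product attention intermediate value and the global prior attention matrix, followed by multiplication with the value vector matrix:

```python
# Sketch: scaled dot-product attention scores are mixed with a global
# prior attention matrix by a weighted sum, then multiplied by V
# (the "third vector" in the text). alpha is a hypothetical hyperparameter.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fused_attention(Q, K, V, prior, alpha=0.5):
    """Weighted sum of dot-product attention and a global prior matrix."""
    d = Q.shape[-1]
    local = softmax(Q @ K.T / np.sqrt(d))  # standard attention distribution
    prior_dist = softmax(prior)            # normalized prior attention
    fused = alpha * local + (1 - alpha) * prior_dist
    return fused @ V                       # output after the fused mechanism

rng = np.random.default_rng(1)
k, d = 4, 8                                # 4 features, embedding dim 8
Q, K, V = (rng.normal(size=(k, d)) for _ in range(3))
prior = rng.normal(size=(k, k))
out = fused_attention(Q, K, V, prior)
```

Setting alpha to 1 recovers the standard Transformer attention, so the prior acts purely as a correction term whose strength the mixing weight controls.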
In an exemplary embodiment, the embodiment of the application further provides a method for predicting terminal change, which includes: and inputting the user data of the user to be predicted into the terminal machine-changing prediction model trained by the method to obtain the prediction probability value of the terminal machine-changing of the user to be predicted.
Illustratively, the method further comprises:
sequencing the prediction probability values of a plurality of users to be predicted;
selecting the top N users in descending order of prediction probability value, and determining the N users as the target group for terminal machine-changing service; wherein N is a natural number greater than 1.
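The top-N selection step can be sketched as follows (user ids and probability values are illustrative):

```python
# Sketch: sort users by predicted replacement probability in descending
# order and keep the top N as the marketing target group.
def top_n_users(probabilities, n):
    """Return the n user ids with the highest predicted probability."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

probs = {"u1": 0.91, "u2": 0.35, "u3": 0.77, "u4": 0.60}
targets = top_n_users(probs, 2)
```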
The application is described in further detail below in connection with application examples.
In an application example, taking 5G terminal change service prediction as an example, the user data of each sample may include 20 feature data as shown in table 1.
TABLE 1
In this case, monthly DOU (Dataflow of Usage) is the average monthly internet traffic per user account, monthly ARPU (Average Revenue Per User) is an index used by operators to measure the revenue obtained from each end user, and K12 (kindergarten through twelfth grade) is an education term, an abbreviation covering preschool education through high-school education, commonly used to refer to basic education.
Fig. 3 shows the architecture of the 5G terminal change prediction model with a fused global prior attention mechanism in this application example. It mainly includes a first model for generating the global prior attention matrix and a second model, a Transformer structure variant module with a fused attention mechanism (i.e., the aforementioned terminal change prediction model).
Specific implementations of the present application example are exemplarily described below in conjunction with the first model and the second model described above.
A first model: generating the global prior attention matrix
In order to acquire the global prior attention matrix, i.e. the global pairwise correlation between features, a prediction task is first constructed for each feature datum. A feature prediction model that predicts one feature datum from the other feature data can be built with a tree model; after model training is completed, the SHAP values give the prior attention of that feature datum toward each of the other feature data. Traversing every feature datum in this way generates the global prior attention matrix. From the service point of view, feature combinations whose relevance at the business level is extremely weak are further filtered out, enhancing the concentration of the attention map. The input of this module is only the feature data of the 5G terminal change service; the output is the global prior attention matrix between features. The method specifically comprises the following steps:
1) Constructing prediction tasks
A prediction task may be constructed in turn for each feature datum in Table 1 above. For example, for the feature datum "currently used terminal type", since its value is "smart phone / non-smart phone", it is a categorical variable, so its prediction task is a classification task. If there is a piece of sample data: { "currently used terminal type": "smart phone", "current terminal usage duration": 585, "age": 28, "terminal price": 5899, "number of called calls in the month": …, "whether a 5G terminal is newly added": "yes" }, the training sample of the prediction task for the feature "currently used terminal type" derived from this sample data is shown in Table 2.
TABLE 2
Similarly, the same number of prediction-task samples can be derived from the other raw samples. For the continuous feature datum "current terminal usage duration", a regression task can likewise be constructed, and the other feature data are processed similarly; they are not illustrated here one by one.
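As a minimal sketch of step 1), the construction of per-feature prediction tasks can be illustrated as follows (the field names and values are illustrative stand-ins, not the actual feature set of Table 1):

```python
# Sketch of step 1): each feature becomes the "virtual label" of its own
# prediction task, with the remaining features as inputs.
def build_prediction_task(samples, target_feature):
    """Return (X, y): inputs are all features except target_feature."""
    X = [{k: v for k, v in s.items() if k != target_feature} for s in samples]
    y = [s[target_feature] for s in samples]
    return X, y

samples = [
    {"terminal_type": "smart", "usage_months": 19, "age": 28, "terminal_price": 5899},
    {"terminal_type": "feature", "usage_months": 40, "age": 61, "terminal_price": 399},
]

# Classification task for the categorical feature "terminal_type"
X_cls, y_cls = build_prediction_task(samples, "terminal_type")
# Regression task for the continuous feature "usage_months"
X_reg, y_reg = build_prediction_task(samples, "usage_months")
```

Whether the task is classification or regression is decided by the variable type of the target feature, exactly as described for "currently used terminal type" and "current terminal usage duration" above.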
2) Obtaining the global prior attention matrix from SHAP values
For each prediction task, a tree model such as an XGBoost model is selected to fit the training data, and a label encoder is needed to encode each categorical feature in the training data to meet XGBoost's requirements on the training data. After each prediction task has been trained, the shap library in Python can be used to instantiate a TreeExplainer object, and the SHAP values of the feature data under that task are obtained by calling the shap_values() function; these SHAP values reflect the influence of each feature datum on the virtual label. Since each feature datum has its own SHAP value in every sample, and a SHAP value may be positive or negative, the expectation of the absolute SHAP value of each feature datum over all samples represents the overall impact of that feature datum on the virtual label, which is equivalent to the attention value of the virtual label toward each of the remaining feature data. The attention value of virtual label X_m toward feature datum X_i can be expressed as the following formula (2):

A_{m,i} = (1/n) Σ_{j=1..n} |SHAPV_{i,j}|   (2)

where SHAPV_{i,j} denotes, in sample j of the prediction task whose virtual label is X_m, the influence of feature datum X_i on the virtual label X_m, and n is the number of samples of that task.
By modeling the prediction task of each feature datum, the attention value of that feature datum toward the remaining feature data can be obtained, but a feature datum's SHAP value with respect to itself cannot; therefore the SHAP values of all prediction tasks yield a global prior attention matrix whose diagonal elements are empty. The k empty elements on the diagonal of the matrix can be set as k learnable parameters, which are scored by the loss function in the model training stage and updated by back propagation.
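Under the assumption that the SHAP values of each prediction task are already available (here random numbers stand in for the output of shap.TreeExplainer(model).shap_values(X)), the assembly of the global prior attention matrix with an empty diagonal can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_samples = 4, 100  # k features, one prediction task per feature

# Stand-in for shap.TreeExplainer(model).shap_values(X) of each task:
# shap_per_task[m] has shape (n_samples, k-1) -- SHAP values of the k-1
# remaining features for virtual label m.
shap_per_task = [rng.normal(size=(n_samples, k - 1)) for _ in range(k)]

# Formula (2): attention of virtual label m toward feature i is the mean
# absolute SHAP value over all samples. The diagonal stays empty (NaN) and
# is replaced by k learnable parameters during model training.
gpa = np.full((k, k), np.nan)
for m in range(k):
    others = [i for i in range(k) if i != m]
    gpa[m, others] = np.abs(shap_per_task[m]).mean(axis=0)
```

In the real pipeline the per-task SHAP matrices come from trained XGBoost models; the stand-in keeps the sketch self-contained.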
3) Filtering out feature combinations with extremely weak relevance at the business level
In the 5G terminal service, some feature combinations are naturally strongly correlated in terms of business meaning, such as the various traffic features "monthly DOU", "video SID traffic", "music-class SID traffic" and "short-video SID traffic", while some feature combinations have extremely weak correlations, as shown in Table 3.
TABLE 3

| Feature data 1 | Feature data 2 | Correlation strength |
| Currently used terminal type | Terminal price | Strong association |
| Current terminal brand | Currently used terminal type | Strong association |
| Duration on the network | Age | Strong association |
| Telephone charge balance | Interval since last recharge | Relatively strong association |
| Days of communication silence | Age | Relatively strong association |
| Whether broadband is activated | Whether a mobile purchasing user | Relatively strong association |
| Current terminal brand | ARPU of the month | Extremely weak correlation |
| Duration on the network | Whether a mobile purchasing user | Extremely weak correlation |
| Telephone charge balance | Age | Extremely weak correlation |
| Main package voice | Age | Extremely weak correlation |
| Whether broadband is activated | Current terminal usage duration | Extremely weak correlation |
This business prior information makes attention more concentrated, so that strongly correlated features attend to each other and more effective feature representations are learned. At the mathematical-operation level, the attention score between features with extremely weak relevance is set to negative infinity, while the attention scores between other features are kept unchanged; the final global prior attention matrix is denoted GPA (global prior attention).
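The business-level masking described above can be sketched as follows; the index pairs are hypothetical placeholders for the extremely weak combinations of Table 3:

```python
import numpy as np

# Step 3): set the attention score of "extremely weak" feature pairs to
# -inf so that softmax drives their final attention to exactly 0.
def mask_weak_pairs(gpa, weak_pairs):
    gpa = gpa.copy()
    for i, j in weak_pairs:
        gpa[i, j] = gpa[j, i] = -np.inf  # mask both directions
    return gpa

gpa = np.ones((4, 4))
masked = mask_weak_pairs(gpa, [(0, 3)])  # e.g. (brand, ARPU) -- hypothetical

# After softmax, the masked entry contributes zero attention.
row = np.exp(masked[0]) / np.exp(masked[0]).sum()
```

Because exp(-inf) = 0, the masked pair drops out of the distribution while the remaining scores are renormalized among themselves.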
A second model: Transformer structure variant module with a fused attention mechanism
The Transformer structure variant module fused with the prior attention mechanism designs a complete Transformer network architecture suitable for prediction on the structured data of the 5G terminal change task. In the Embedding Layer of the network, all one-dimensional feature data are converted to multiple dimensions to fit the computation mechanism of the Transformer network, and the global prior attention matrix output by the first model is effectively fused with the standard dot-product attention distribution, improving the concentration of the attention map and enhancing the generalization capability of the model. In the model training stage, the input of this module is the users' feature data and label data of the 5G terminal change prediction task; in the model application stage, the input is the users' feature data and the output is the predicted probability value that a user is a potential 5G terminal change customer. The specific process is as follows:
1) The input data passes through the Embedding Layer to obtain the embedding representation of the features
In structured data, feature data can generally be divided into continuous and categorical variables by type, and each feature datum is one-dimensional. In order to allow the feature data to interact with each other like tokens in text and image Transformer models, the one-dimensional feature data in the structured data need to be converted to multiple dimensions; let this dimension be e. In this application example the feature data are converted to 30 dimensions.
Suppose the k feature data of the 5G terminal change prediction task contain c continuous feature variables and d categorical feature variables. In the Embedding Layer, for each continuous feature X^cont_i a learnable weight vector w_i ∈ R^e is defined; when X^cont_i of a sample takes the value x_i, its embedding representation is x_i · w_i. The weight vectors of the c continuous features are concatenated to form a learnable weight matrix Matrix^cont of size c×e. In PyTorch, torch.einsum('jk,ij->ijk', A, B) obtains the embedding representations of the c features for a whole batch at once, where A is the learnable weight matrix Matrix^cont and B is a matrix of size batch_size×c representing the input data containing only the continuous features. Meanwhile, for the categorical features, suppose each categorical feature X^cat_i has n_i different values and all the categorical features together have N = Σ n_i different values; the original categorical data are converted to integers from 0 to N−1, and a learnable mapping table of size N×e is defined, from which each different value obtains its corresponding embedding representation by index lookup.
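A sketch of the continuous-feature embedding, using numpy.einsum with the same subscript expression as the torch.einsum call mentioned above (torch.einsum('jk,ij->ijk', A, B) behaves identically; sizes are illustrative except e = 30):

```python
import numpy as np

# Step 1) of the second model: each continuous feature i has a learnable
# weight vector w_i in R^e; the embedding of value x_i is x_i * w_i.
batch, c, e = 2, 3, 30               # batch size, continuous features, embed dim
rng = np.random.default_rng(1)
W_cont = rng.normal(size=(c, e))     # learnable Matrix^cont, size c x e
B = rng.normal(size=(batch, c))      # input batch of continuous values

# out[i, j, :] = B[i, j] * W_cont[j, :] -- one e-dim vector per feature
emb = np.einsum('jk,ij->ijk', W_cont, B)   # shape (batch, c, e)
```

The einsum expression performs no summation; it is a broadcasted elementwise product that yields the batch of per-feature embedding vectors in one call.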
2) LayerNorm normalization of data
Through the above Embedding Layer, each feature datum x_i has been changed from a one-dimensional scalar into an e-dimensional vector (i.e., the aforementioned first vector). LayerNorm normalizes each feature vector of a single sample over its e dimensions; for a certain feature vector of a single sample, a second vector is obtained after the LayerNorm normalization operation, which can be expressed as:

LN(x_i) = g_i · (x_i − μ) / σ   (3)
where g_i is a scaling factor, a learnable scalar parameter, and each feature vector x_i corresponds to its own learnable scalar parameter g_i; μ and σ are respectively the mean and standard deviation of the elements over the e dimensions of the embedding, calculated as follows:

μ = (1/e) Σ_{t=1..e} x_{i,t},   σ = sqrt( (1/e) Σ_{t=1..e} (x_{i,t} − μ)² )   (4)
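The per-feature LayerNorm with a learnable scalar scale g_i can be sketched as follows (the small epsilon for numerical stability is an assumption not stated in the text, and the bias term is omitted as in the text):

```python
import numpy as np

# Step 2): LayerNorm over the e dimensions of one feature vector, scaled by
# the learnable per-feature scalar g_i (here passed in as a plain float).
def layer_norm(x, g, eps=1e-5):
    mu = x.mean()
    sigma = x.std()
    return g * (x - mu) / (sigma + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, g=1.0)
```

After normalization the vector has (approximately) zero mean and unit standard deviation, which stabilizes the subsequent attention computation.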
3) Calculating a standard dot product attention profile
The single-head attention mechanism module of the standard Transformer operates as follows:

In the attention mechanism module, each head contains three learnable weight matrices W_Q, W_K and W_V. The vector obtained from the above LayerNorm normalization is multiplied in turn by these weight matrices to obtain the projected query vector q_i, key vector k_i and value vector v_i. When calculating the attention value of feature x_i toward the other features, the query vector q_i of feature x_i is first dot-multiplied with the key vector k_j of each other feature x_j, divided by the square root of the query vector dimension, and normalized by softmax to obtain the attention values a_i = (a_{i,1}, a_{i,2}, ..., a_{i,k}). The value vectors are then weighted and summed with these attention values to obtain the final output. In formula (5), the matrix Q is formed by concatenating the query vectors of all features projected through W_Q, i.e. the query vector matrix; the matrix K is formed by concatenating the key vectors projected through W_K, i.e. the key vector matrix; and the matrix V is formed by concatenating the value vectors projected through W_V, i.e. the value vector matrix. In actual operation, efficient computation is achieved through matrix multiplication of the feature matrix and the weight matrices. Since the attention distribution still needs to be fused in a subsequent step, only the intermediate value of the attention distribution is calculated in this step:

S = Q K^T / sqrt(d_k)   (5)

In this application example, the size of each weight matrix is 30×30, i.e. the above d_k takes the value 30. A single-head attention mechanism is adopted in this application example; it can be freely extended to multiple heads as needed.
4) Attention mechanism fusion
The global prior attention matrix GPA was obtained in step 3) of the first model, and the dot-product-based attention distribution in step 3) of the second model. The fused attention distribution of the global prior attention matrix and the local dot-product attention distribution can be expressed as:

A_fused = softmax( α · S + (1 − α) · GPA )   (6)
where α is a learnable parameter that adjusts the influence weights of the local dot-product attention distribution and the global prior attention matrix. The value range of α is (0, 1); the smaller α is, the greater the effect of the global prior attention. It is recommended to take α near 0.5, and a suitable α can be selected according to the model effect in practical applications. Through the softmax operation, the attention scores set to negative infinity in step 3) of the first model become 0 in the final attention scores between features, so that a more concentrated attention distribution is obtained and the generalization capability of the model can be effectively improved. The final output of the fused attention mechanism is the fused attention distribution above multiplied by the value vector matrix:

Z = A_fused · V   (7)
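A sketch of the fused attention computation as read from the surrounding text: the dot-product intermediate value is weighted against the GPA matrix, and the softmax-normalized result multiplies the value vector matrix. The weighting convention (α on the local scores) follows the statement that a smaller α gives the global prior more effect:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=-1, keepdims=True)

# Step 4): fuse local dot-product scores with the global prior matrix GPA,
# then weight the value vectors -- the "third vector" output.
def fused_attention(Q, K, V, gpa, alpha=0.5):
    d_k = Q.shape[-1]
    local = Q @ K.T / np.sqrt(d_k)          # intermediate value of formula (5)
    A = softmax(alpha * local + (1 - alpha) * gpa)
    return A @ V

k, e = 4, 8
rng = np.random.default_rng(2)
Q = rng.normal(size=(k, e))
K = rng.normal(size=(k, e))
V = rng.normal(size=(k, e))
gpa = rng.normal(size=(k, k))
out = fused_attention(Q, K, V, gpa)
```

With α = 1 and a zero GPA the computation reduces to standard scaled dot-product attention, which makes the fusion easy to sanity-check.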
5) Residual ligation and LayerNorm normalization
Each feature x_i's embedding vector after the Embedding Layer operation in step 1) of the second model (i.e., the aforementioned first vector) is added through a residual connection to the output calculated by the fused attention mechanism (i.e., the aforementioned third vector) to obtain the fourth vector, and LayerNorm normalization is performed again to obtain the fifth vector; the specific normalization method is similar to that of step 2) of the second model and will not be described again here.
6) Feedforward neural network operation
The feedforward neural network FFN consists of two fully connected layers with a ReLU activation function between them; the first fully connected layer projects the features to a higher dimension h, and the second layer projects them back to the original dimension e. The operation of the feedforward neural network can be expressed as:

FFN(x) = ReLU(x W_1 + b_1) W_2 + b_2   (8)
the feedforward neural network respectively transforms each characteristic and is obtained by LayerNorm normalizationTransformation by feedforward neural network to obtain +.>(i.e., the sixth vector described above). In the application example, the number of neurons of the first layer of the full-connection layer is 64, the number of neurons of the second layer of the full-connection layer is 30, and the weight matrix W can be learned 1 And W is 2 The size of the feed-forward neural network is 30 multiplied by 64 and 64 multiplied by 30 respectively, the input of the feed-forward neural network is 30-dimensional, the dimension is converted into 64 after passing through the first full-connection layer, and then the dimension is reduced into 30 after passing through the second full-connection layer. The transform architecture of the application example only uses one block, does not need stacking operation, and can be freely expanded to a plurality of blocks according to the requirement.
7) Residual connection and dimension transformation
The fourth vector of step 5) of the second model (before LayerNorm normalization) and the output of step 6) (i.e., the aforementioned sixth vector) are added through a residual connection. In the above steps each feature is transformed separately; since the prediction of the target task depends on every feature, all the transformed features finally need to be concatenated end to end, obtaining a vector x′ of dimension k×e (i.e., the aforementioned seventh vector). In PyTorch, this can be implemented with a reshape operation that combines all the feature vectors into one matrix.
8) MLP network operation
The MLP (Multi-Layer Perceptron) network consists of three fully connected layers; the number of neurons in the input layer is k×e, in the first hidden layer h1, in the second hidden layer h2, and in the output layer 1. The operation of the MLP network can be expressed as:
MLP(x′) = Sigmoid(SELU(SELU(x′W′_1 + b′_1)W′_2 + b′_2)W′_3 + b′_3)   (9)
where W′_1, W′_2 and W′_3 are weight matrices and b′_1, b′_2 and b′_3 are learnable bias coefficients. In this application example, h1 is 512 and h2 is 32; the weight matrix W′_1 has size (k×30)×512, the weight matrix W′_2 has size 512×32, and the weight matrix W′_3 has size 32×1.
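A sketch of the three-layer MLP head of formula (9), with SELU hidden activations and a Sigmoid output neuron (the small initialization scale is an illustrative choice to keep the toy output away from saturation):

```python
import numpy as np

# Step 8): MLP head mapping the flattened k*e vector to one probability.
def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    return scale * np.where(z > 0, z, alpha * (np.exp(z) - 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp(x, W1, b1, W2, b2, W3, b3):
    h1 = selu(x @ W1 + b1)          # k*e -> 512
    h2 = selu(h1 @ W2 + b2)         # 512 -> 32
    return sigmoid(h2 @ W3 + b3)    # 32 -> 1, probability in (0, 1)

k, e = 20, 30
rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(k * e, 512)) * 0.01, np.zeros(512)
W2, b2 = rng.normal(size=(512, 32)) * 0.01, np.zeros(32)
W3, b3 = rng.normal(size=(32, 1)) * 0.01, np.zeros(1)

x = rng.normal(size=(1, k * e))    # the flattened seventh vector
p = mlp(x, W1, b1, W2, b2, W3, b3) # predicted probability of terminal change
```

The Sigmoid on the single output neuron makes the network emit a probability directly, which pairs naturally with the binary cross-entropy loss used during training.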
It will be appreciated that, based on the first model and the second model described above, when model training is performed, a global prior attention matrix can first be obtained from the input data based on the tree models, waiting to be effectively fused with the standard dot-product attention distribution in a subsequent step.
Before training the neural network model, the input data are preprocessed: a min-max normalization operation is applied to each continuous feature to eliminate the influence of inconsistent dimensions on the neural network, the different values of all categorical features are uniformly encoded, and the value of the label "whether a 5G terminal is newly added" is converted to 0 or 1. Following the above steps, the data are input into the neural network and propagated forward to obtain the predicted value y_p of whether a 5G terminal is newly added; the loss with respect to the real label y is calculated and the loss function is minimized by gradient descent. Since the label is binary, cross entropy is used as the loss function. After iterative training, the neural network learns the optimal parameters.
It can be understood that before 5G terminal change marketing is performed, the latest feature data of the current month can be input into the model to obtain each user's probability value of being a potential 5G terminal change customer; the probability values are arranged in descending order, and the topN population with the highest probability values is selected for precision marketing.
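The topN selection for precision marketing can be sketched as follows (user identifiers and probability values are illustrative):

```python
# Sort users by predicted probability of 5G terminal change in descending
# order and take the first N as the marketing target group.
def top_n_users(user_probs, n):
    ranked = sorted(user_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

probs = {"u1": 0.91, "u2": 0.15, "u3": 0.78, "u4": 0.66}
target_group = top_n_users(probs, 2)
```

This mirrors the sorting and top-N selection steps of the prediction method described earlier, with N chosen according to the marketing budget.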
In order to implement the method of the embodiment of the present application, the embodiment of the present application further provides a training device for a terminal change prediction model, where the training device for a terminal change prediction model corresponds to the training method for a terminal change prediction model, and each step in the embodiment of the training method for a terminal change prediction model is also completely applicable to the embodiment of the training device for a terminal change prediction model.
As shown in fig. 4, the training device for the terminal change prediction model includes: a preprocessing module 401, a global prior attention generation module 402, and a training module 403. The preprocessing module 401 is configured to perform data preprocessing on a sample data set, where the sample data set includes a preset number of sample data, and each sample data includes user data related to mobile communication and label information of a user, the label information indicating whether the user is a user who changes terminal within a set time; the global prior attention generation module 402 is configured to generate a global prior attention matrix based on the user data after the data preprocessing; the training module 403 is configured to train a terminal change prediction model based on the sample data set after the data preprocessing to obtain a trained terminal change prediction model, where the terminal change prediction model adopts a Transformer into which the global prior attention matrix is fused.
Illustratively, the global a priori attention generation module 402 is specifically configured to:
constructing a feature prediction model corresponding to each of the plurality of feature data, wherein the feature prediction model is used for predicting the target feature data based on the remaining feature data other than the target feature data among the plurality of feature data, and the target feature data is any one of the plurality of feature data;
training each feature prediction model based on the preprocessed user data; based on the trained feature prediction models, attention values between the target feature data and the rest feature data corresponding to the feature prediction models are obtained;
a global prior attention matrix is generated based on the attention values between each of the target feature data and the remaining feature data.
Illustratively, the global a priori attention generation module 402 is further configured to:
and correcting the global prior attention matrix based on the characteristic association attribute between any two characteristic data in the plurality of characteristic data.
Illustratively, training module 403 is specifically configured to:
training a terminal change prediction model based on the sample data after the data preprocessing, and obtaining a predicted value of the terminal change prediction model;
Calculating a loss value of the predicted value and a true value corresponding to the tag information based on a loss function;
and determining that the loss value converges or that the number of training iterations reaches a set number, thereby obtaining the trained terminal change prediction model.
Illustratively, the training module 403 trains a terminal change prediction model based on the sample data after the data preprocessing, and obtains a predicted value of the terminal change prediction model, including:
inputting the sample data after the data preprocessing into an embedding layer in a Transformer to obtain a first vector;
processing the first vector based on a first normalization layer to obtain a second vector;
obtaining an intermediate value representing a dot-product attention distribution for the second vector based on a Transformer's attention mechanism;
carrying out weighted summation on the intermediate value and the global prior attention matrix, and multiplying the weighted summation value by a value vector matrix to obtain a third vector;
adding the first vector and the third vector to obtain a fourth vector, and processing the fourth vector based on a second normalization layer to obtain a fifth vector;
calculating the fifth vector based on a feedforward neural network to obtain a sixth vector;
Adding the sixth vector and the fourth vector, and obtaining a seventh vector based on dimension transformation;
and inputting the seventh vector into a multi-layer perceptron to obtain a predicted value of the sample data.
In practical application, the preprocessing module 401, the global prior attention generation module 402 and the training module 403 may be implemented by a processor in the training device of the terminal change prediction model. Of course, the processor needs to run a computer program in memory to implement its functions.
It should be noted that: the training device for the terminal change prediction model provided in the above embodiment is illustrated only by the above division of program modules when training the terminal change prediction model; in practical application, the processing may be allocated to different program modules as needed, that is, the internal structure of the device is divided into different program modules to complete all or part of the processing described above. In addition, the training device of the terminal change prediction model provided in the above embodiment and the training method embodiment of the terminal change prediction model belong to the same concept; the specific implementation process of the training device is detailed in the method embodiment and is not described here again.
In order to implement the method of the embodiment of the present application, the embodiment of the present application further provides a terminal change prediction device, where the terminal change prediction device corresponds to the terminal change prediction method, and each step in the terminal change prediction method embodiment is also completely applicable to the terminal change prediction device embodiment.
As shown in fig. 5, the terminal change prediction apparatus includes: the prediction module 501 is configured to input user data of a user to be predicted into a terminal change prediction model obtained by training by the training device according to the embodiment of the present application, so as to obtain a prediction probability value of terminal change of the user to be predicted.
The terminal change prediction device further includes: a determining module 502, configured to sort the prediction probability values of a plurality of the users to be predicted, select the top N users in descending order of prediction probability, and determine them as the target group for the terminal change service; wherein N is a natural number greater than 1.
In practical applications, the prediction module 501 and the determination module 502 may be implemented by a processor in the terminal change prediction device. Of course, the processor needs to run a computer program in memory to implement its functions.
Based on the hardware implementation of the program modules, and in order to implement the method of the embodiment of the application, the embodiment of the application also provides a training device for the terminal change prediction model. Fig. 6 shows only an exemplary structure of the training device, not the whole structure; a part or all of the structure shown in fig. 6 may be implemented as needed.
As shown in fig. 6, a training device 600 for a terminal change prediction model according to an embodiment of the present application includes: at least one processor 601, a memory 602, a user interface 603 and at least one network interface 604. The various components in the training device 600 of the terminal change prediction model are coupled together by a bus system 605. It is understood that the bus system 605 is used to enable connected communications between these components. In addition to a data bus, the bus system 605 includes a power bus, a control bus, and a status signal bus; but for clarity of illustration the various buses are labeled as bus system 605 in fig. 6.
The user interface 603 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.
The memory 602 in embodiments of the present application is used to store various types of data to support the operation of the training device of the terminal change prediction model. Examples of such data include: any computer program for operating on a training device of a terminal swap predictive model.
The training method of the terminal change prediction model disclosed in the embodiment of the application can be applied to the processor 601 or implemented by the processor 601. The processor 601 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the training method of the terminal change prediction model may be completed by an integrated logic circuit of hardware in the processor 601 or by instructions in software form. The processor 601 may be a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like. The processor 601 may implement or perform the methods, steps and logic blocks disclosed in embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor, or the like. The steps of the method disclosed in the embodiment of the application can be directly embodied in the hardware of a decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, where the storage medium is located in the memory 602; the processor 601 reads information in the memory 602 and, in combination with the hardware, completes the steps of the training method of the terminal change prediction model provided by the embodiment of the present application.
In an exemplary embodiment, the training device of the terminal change prediction model may be implemented by one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field programmable gate arrays (FPGA, Field Programmable Gate Array), general-purpose processors, controllers, microcontrollers (MCU, Micro Controller Unit), microprocessors, or other electronic components for performing the aforementioned methods.
Based on the hardware implementation of the program modules, and in order to implement the method of the embodiment of the present application, the embodiment of the present application further provides a terminal change prediction device. Fig. 7 shows only an exemplary structure of the terminal change prediction device, not the whole structure; a part or all of the structure shown in fig. 7 may be implemented as needed.
As shown in fig. 7, a terminal change prediction device 700 according to an embodiment of the present application includes: at least one processor 701, a memory 702, a user interface 703, and at least one network interface 704. The various components in the terminal change prediction device 700 are coupled together by a bus system 705. It is to be appreciated that the bus system 705 is employed to facilitate connection communications between these components. In addition to a data bus, the bus system 705 includes a power bus, a control bus, and a status signal bus; but for clarity of illustration, the various buses are labeled as bus system 705 in fig. 7.
The user interface 703 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.
The memory 702 in embodiments of the present application is used to store various types of data to support the operation of the terminal change prediction device. Examples of such data include: any computer program for operating on the terminal change prediction device.
The terminal change prediction method disclosed in the embodiment of the application can be applied to the processor 701 or implemented by the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In the implementation process, the steps of the terminal change prediction method may be completed by an integrated logic circuit of hardware in the processor 701 or by instructions in software form. The processor 701 may be a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like. The processor 701 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor, or the like. The steps of the method disclosed in the embodiment of the application can be directly embodied in the hardware of a decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, where the storage medium is located in the memory 702; the processor 701 reads information in the memory 702 and, in combination with the hardware, completes the steps of the terminal change prediction method provided in the embodiment of the present application.
In an exemplary embodiment, the terminal change prediction device 700 may be implemented by one or more ASICs, DSPs, PLDs, CPLDs, FPGAs, general-purpose processors, controllers, MCUs, microprocessors, or other electronic elements, for performing the foregoing methods.
It is to be appreciated that the memory 602, 702 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferroelectric random access memory (FRAM, Ferroelectric Random Access Memory), a flash memory (Flash Memory), a magnetic surface memory, an optical disk, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a random access memory (RAM, Random Access Memory), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), SyncLink dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct Rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The memory described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
In an exemplary embodiment, the present application further provides a computer storage medium, specifically a computer-readable storage medium, for example including the memory 602 storing a computer program, where the computer program is executable by the processor 601 of the training device of the terminal change prediction model to complete the steps described in the methods of the embodiments of the present application; or, as another example, including the memory 702 storing a computer program executable by the processor 701 of the terminal change prediction device to perform the steps described in the methods of the embodiments of the present application. The computer-readable storage medium may be a ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM.
It should be noted that the terms "first", "second", and the like are used to distinguish similar objects and do not necessarily describe a particular order or sequence.
In addition, the technical solutions described in the embodiments of the present application may be arbitrarily combined without conflict.
The foregoing is merely a specific embodiment of the present application, and the protection scope of the present application is not limited thereto. Any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application is subject to the protection scope of the claims.

Claims (12)

1. A training method for a terminal change prediction model, characterized by comprising:
performing data preprocessing on a sample data set, wherein the sample data set comprises a preset number of sample data, and each sample data comprises: user data of a user related to mobile communication and label information of the user, wherein the label information is used to represent whether the user is a user who changes terminals within a set time;
generating a global prior attention matrix based on the user data after the data preprocessing;
training a terminal change prediction model based on the sample data set after the data preprocessing to obtain a trained terminal change prediction model;
wherein the terminal change prediction model is a Transformer-based model into which the global prior attention matrix is fused.
2. The method of claim 1, wherein the user data comprises a plurality of feature data associated with a terminal change service, and wherein generating the global prior attention matrix based on the data-preprocessed user data comprises:
constructing a feature prediction model corresponding to each feature data in the plurality of feature data, wherein the feature prediction model is used to predict target feature data based on the remaining feature data in the plurality of feature data other than the target feature data, and the target feature data is any feature data in the plurality of feature data;
training each feature prediction model based on the preprocessed user data; obtaining, based on each trained feature prediction model, attention values between the target feature data corresponding to that feature prediction model and the remaining feature data;
generating the global prior attention matrix based on the attention values between each target feature data and the remaining feature data.
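The claim leaves the assembly of the matrix unspecified; the following minimal NumPy sketch shows one plausible way to build it, assuming each trained feature prediction model yields one row of pairwise attention values. The function name, the zeroed diagonal, and the row-wise softmax normalization are illustrative assumptions, not part of the claim.

```python
import numpy as np

def build_global_prior_attention(pairwise_attention: np.ndarray) -> np.ndarray:
    """Assemble a global prior attention matrix from per-feature attention values.

    pairwise_attention[i, j] is the attention value that the feature
    prediction model for target feature i assigns to feature j.
    """
    a = pairwise_attention.astype(float).copy()
    # Assumption: a feature contributes no prior attention to itself,
    # so the diagonal is masked out before normalization.
    np.fill_diagonal(a, -np.inf)
    # Row-wise softmax so each target feature's prior is a distribution.
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Tiny example with three feature data.
pairwise = np.array([[0.0, 2.0, 1.0],
                     [1.0, 0.0, 3.0],
                     [2.0, 1.0, 0.0]])
prior = build_global_prior_attention(pairwise)
```

Each row of `prior` then sums to 1, so the matrix can be combined with a Transformer's own attention distribution on equal footing.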
3. The method according to claim 2, wherein the method further comprises:
correcting the global prior attention matrix based on a feature association attribute between any two feature data in the plurality of feature data.
4. The method according to claim 1, wherein training a terminal change prediction model based on the data-preprocessed sample data set to obtain a trained terminal change prediction model comprises:
training the terminal change prediction model based on the sample data after the data preprocessing, and obtaining a predicted value of the terminal change prediction model;
calculating, based on a loss function, a loss value between the predicted value and a true value corresponding to the label information;
determining that the loss value converges, or that the number of training iterations reaches a set number, and obtaining the trained terminal change prediction model.
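The stopping rule in claim 4 (loss convergence or an iteration cap) can be sketched as follows. This toy loop uses a logistic-regression stand-in rather than the actual Transformer model, and the learning rate, tolerance, and iteration cap are illustrative assumptions.

```python
import numpy as np

def train_until_converged(X, y, lr=0.1, max_iters=500, tol=1e-6):
    """Toy training loop illustrating claim 4's stopping rule:
    stop when the loss value converges or the iteration count hits a cap."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    prev_loss = np.inf
    for it in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))                  # predicted value
        # Binary cross-entropy between predicted value and true label.
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if abs(prev_loss - loss) < tol:                     # loss value converged
            break
        prev_loss = loss
        w -= lr * X.T @ (p - y) / len(y)                    # gradient step
    return w, loss, it + 1
```

In the claimed method, the same control flow would wrap the Transformer forward pass of claim 5; only the model and loss change.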
5. The method of claim 4, wherein training the terminal change prediction model based on the data-preprocessed sample data and obtaining the predicted value of the terminal change prediction model comprises:
inputting the sample data after the data preprocessing into an embedding layer in a Transformer to obtain a first vector;
processing the first vector based on a first normalization layer to obtain a second vector;
obtaining, based on the attention mechanism of the Transformer, an intermediate value representing a dot-product attention distribution for the second vector;
performing weighted summation on the intermediate value and the global prior attention matrix, and multiplying the weighted sum by a value vector matrix to obtain a third vector;
adding the first vector and the third vector to obtain a fourth vector, and processing the fourth vector based on a second normalization layer to obtain a fifth vector;
calculating the fifth vector based on a feedforward neural network to obtain a sixth vector;
adding the sixth vector and the fourth vector, and obtaining a seventh vector based on dimension transformation;
and inputting the seventh vector into a multi-layer perceptron to obtain a predicted value of the sample data.
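The attention fusion at the heart of claim 5 can be sketched in a few lines. This single-head NumPy sketch covers only the steps producing the "intermediate value" and the "third vector"; the fusion weight `alpha` and the function names are illustrative assumptions not fixed by the claim.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prior_fused_attention(Q, K, V, prior, alpha=0.5):
    """Single-head attention with a global prior attention matrix fused in:
    the dot-product attention distribution (the 'intermediate value') is
    combined with the prior by weighted summation, and the result is
    multiplied by the value matrix V to give the 'third vector'."""
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))        # intermediate value
    fused = alpha * scores + (1.0 - alpha) * prior  # weighted summation with prior
    return fused @ V                                # third vector
```

With `alpha = 1.0` this reduces to standard scaled dot-product attention, which makes the prior term's contribution easy to ablate.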
6. A terminal change prediction method, characterized by comprising:
inputting user data of a user to be predicted into a terminal change prediction model trained by the method according to any one of claims 1 to 5, to obtain a prediction probability value of terminal change for the user to be predicted.
7. The method of claim 6, wherein the method further comprises:
sorting the prediction probability values of a plurality of users to be predicted;
selecting, in descending order of prediction probability value, the top N users, and determining the top N users as the target group of the terminal change service;
wherein N is a natural number greater than 1.
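The top-N selection of claim 7 is a plain sort-and-slice; a minimal sketch (the function name and the dict-based input are assumptions for illustration):

```python
def select_target_group(probabilities, n):
    """Sort users by predicted terminal-change probability in descending
    order and take the top N as the target group (claim 7).

    probabilities: mapping of user id -> prediction probability value.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

# Example: pick the 2 users most likely to change terminals.
probs = {"user_a": 0.2, "user_b": 0.9, "user_c": 0.5}
target_group = select_target_group(probs, 2)
```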
8. A training device for a terminal change prediction model, characterized by comprising:
a preprocessing module, configured to perform data preprocessing on a sample data set, wherein the sample data set comprises a preset number of sample data, and each sample data comprises: user data of a user related to mobile communication and label information of the user, wherein the label information is used to represent whether the user is a user who changes terminals within a set time;
a global prior attention generation module, configured to generate a global prior attention matrix based on the user data after the data preprocessing;
a training module, configured to train a terminal change prediction model based on the sample data set after the data preprocessing to obtain a trained terminal change prediction model;
wherein the terminal change prediction model is a Transformer-based model into which the global prior attention matrix is fused.
9. A terminal change prediction apparatus, comprising:
a prediction module, configured to input user data of a user to be predicted into the terminal change prediction model trained by the training device according to claim 8, to obtain a prediction probability value of terminal change for the user to be predicted.
10. A training device for a terminal change prediction model, comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor being adapted to perform the steps of the method of any of claims 1 to 5 when the computer program is run.
11. A terminal change prediction apparatus, comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor being adapted to perform the steps of the method of any of claims 6 to 7 when the computer program is run.
12. A computer storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method according to any of claims 1 to 7.
CN202311056708.4A 2023-08-21 2023-08-21 Training and predicting method and device for terminal machine change prediction model and storage medium Pending CN117236384A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311056708.4A CN117236384A (en) 2023-08-21 2023-08-21 Training and predicting method and device for terminal machine change prediction model and storage medium


Publications (1)

Publication Number Publication Date
CN117236384A true CN117236384A (en) 2023-12-15

Family

ID=89083447


Country Status (1)

Country Link
CN (1) CN117236384A (en)

Similar Documents

Publication Publication Date Title
Sharma et al. Era of deep neural networks: A review
CN113905391B (en) Integrated learning network traffic prediction method, system, equipment, terminal and medium
US11381651B2 (en) Interpretable user modeling from unstructured user data
CN111695415A (en) Construction method and identification method of image identification model and related equipment
CN108460679B (en) Data analysis method of deep network intelligent investment system integrating attention mechanism
CN111353033B (en) Method and system for training text similarity model
CN110781686B (en) Statement similarity calculation method and device and computer equipment
CN116128461B (en) Bidirectional recommendation system and method for online recruitment
CN112215604A (en) Method and device for identifying information of transaction relationship
CN112633690A (en) Service personnel information distribution method, service personnel information distribution device, computer equipment and storage medium
CN113807809A (en) Method for constructing audit user portrait based on machine learning technology
CN116821372A (en) Knowledge graph-based data processing method and device, electronic equipment and medium
CN117216227A (en) Tobacco enterprise intelligent information question-answering method based on knowledge graph and large language model
Jiang et al. An intelligent recommendation approach for online advertising based on hybrid deep neural network and parallel computing
CN106909560A (en) Point of interest sort method
CN110826315A (en) Method for identifying timeliness of short text by using neural network system
Dael et al. Stock Market Prediction Using Generative Adversarial Networks (GANs): Hybrid Intelligent Model.
CN111382232A (en) Question and answer information processing method and device and computer equipment
CN111353728A (en) Risk analysis method and system
CN117236384A (en) Training and predicting method and device for terminal machine change prediction model and storage medium
Richman et al. High-cardinality categorical covariates in network regressions
CN113420879A (en) Prediction method and device of multi-task learning model
CN111489203A (en) Financing product recommendation method and system
CN116911955B (en) Training method and device for target recommendation model
EP3889847A1 (en) A method, a device and a computer program for learning a first task in a first domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination