CN112036572B

CN112036572B - Text list-based user feature extraction method and device

Info

Publication number: CN112036572B
Application number: CN202010889481.1A
Authority: CN
Inventors: 顾凌云; 谢旻旗; 段湾; 陈尚伟; 张涛; 潘峻
Original assignee: Shanghai IceKredit Inc
Current assignee: Shanghai IceKredit Inc
Priority date: 2020-08-28
Filing date: 2020-08-28
Publication date: 2024-03-12
Anticipated expiration: 2040-08-28
Also published as: CN112036572A

Abstract

The embodiments of the application provide a text list-based user feature extraction method and device. Application installation list information of a user is converted into a text information list, the text information list is converted into a topic feature vector through an LDA topic model, and the extracted topic feature vector is then input into a target risk control model to obtain the decision output of the target risk control model. The topic feature vector output by the LDA topic model is more interpretable, because each preset feature dimension reflects the probability that the user belongs to the corresponding topic label, and information maintenance and update costs are reduced. In addition, the dimension of the topic feature vector is low and can be specified manually, which avoids the curse of dimensionality caused by excessively high feature dimensions and allows the subsequent risk control model to perform better.

Description

Text list-based user feature extraction method and device

Technical Field

The application relates to the technical field of computer risk control, and in particular to a text list-based user feature extraction method and device.

Background

In existing risk control business scenarios, enterprises use as much user data as possible to build machine learning models, while keeping the cost of user data storage and model deployment as low as possible on the premise of maximizing model performance. In this context, the application installation information of a user's mobile device plays an important role in improving the performance of an enterprise's risk control model. In conventional schemes, relevant feature vectors are generally extracted from the application installation information of the user's mobile device and then input into a subsequent risk control model to take part in the final calculation and decision. However, conventional schemes have high information maintenance and update costs and low feature accuracy, are unfavorable for data storage on computer equipment, and often cause the curse of dimensionality when their features are used as input to the subsequent risk control model.

Disclosure of Invention

In view of the shortcomings of existing designs, the present application provides a text list-based user feature extraction method and device. The topic feature vector output by the LDA topic model is more interpretable, because each preset feature dimension reflects the probability that the user belongs to the corresponding topic label, and information maintenance and update costs are reduced. In addition, the dimension of the topic feature vector is low and can be specified manually, which avoids the curse of dimensionality caused by excessively high feature dimensions and allows the subsequent risk control model to perform better.

According to a first aspect of embodiments of the present application, there is provided a text list-based user feature extraction method, applied to a computer device, the method including:

acquiring application installation list information of a user from a user terminal;

converting the user installation list information into a text information list, and converting the text information list into topic feature vectors through an LDA topic model, wherein the topic feature vectors are feature vectors with a plurality of preset dimensions, and the feature vector with each preset feature dimension is used for representing the probability that the user belongs to a topic label corresponding to the preset feature dimension;

and inputting the extracted topic feature vector into a target risk control model to obtain a decision output of the target risk control model.

In a possible implementation manner of the first aspect, the step of converting the user installation list information into a text information list and converting the text information list into a topic feature vector through an LDA topic model includes:

converting the user installation list information into a text information list, and acquiring application program identification information corresponding to each installation package name in the text information list;

and determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into a pre-trained LDA topic model to obtain a plurality of feature vectors with preset dimensions as the topic feature vectors.

In a possible implementation manner of the first aspect, the LDA topic model is trained by:

acquiring user installation list information of a plurality of users collected in advance, acquiring application program identification information corresponding to each installation package name from the user installation list information, and converting the user installation list information into an application program identification list;

segmenting the application program identification list to obtain a keyword list of each user;

traversing the keyword list of each user, and constructing a corresponding keyword dictionary by using all the keywords;

converting the keyword list of each user into keyword vectors according to the constructed keyword dictionary, forming training samples by the keyword vectors of all the users, and training according to the preset topic number and the training samples to obtain an LDA topic model;

In a possible implementation manner of the first aspect, the method further includes:

counting the occurrence frequency of each keyword in the constructed keyword dictionary in the keyword list of all users;

and recoding the constructed keyword dictionary according to the occurrence frequency of each keyword in the keyword list of all users to obtain a recoded keyword dictionary, and executing the step of converting the keyword list of each user into a keyword vector according to the constructed keyword dictionary based on the recoded keyword dictionary.

In a possible implementation manner of the first aspect, the step of inputting the extracted topic feature vector into a target risk control model to obtain a decision output of the target risk control model includes:

inputting the extracted topic feature vector into the target risk control model, and matching the topic feature vector against a risk control matching rule in the target risk control model to obtain a decision output result.

According to a second aspect of embodiments of the present application, there is provided a text list-based user feature extraction apparatus, applied to a computer device, the apparatus including:

the acquisition module is used for acquiring application installation list information of a user from the user terminal;

the conversion module is used for converting the user installation list information into a text information list and converting the text information list into topic feature vectors through an LDA topic model, wherein the topic feature vectors are feature vectors with a plurality of preset dimensions, and the feature vector with each preset feature dimension is used for representing the probability that the user belongs to a topic label corresponding to the preset feature dimension;

the decision output module is used for inputting the extracted topic feature vector into a target risk control model to obtain a decision output of the target risk control model.

Based on any one of the above aspects, the application installation list information of the user is converted into a text information list, the text information list is converted into a topic feature vector through the LDA topic model, and the extracted topic feature vector is then input into the target risk control model to obtain the decision output of the target risk control model. The topic feature vector output by the LDA topic model is more interpretable, because each preset feature dimension reflects the probability that the user belongs to the corresponding topic label, and information maintenance and update costs are reduced. In addition, the dimension of the topic feature vector is low and can be specified manually, which avoids the curse of dimensionality caused by excessively high feature dimensions and allows the subsequent risk control model to perform better.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of a text list-based user feature extraction method according to an embodiment of the present application;

Fig. 2 is a schematic functional block diagram of a text list-based user feature extraction device according to an embodiment of the present application;

Fig. 3 is a schematic block diagram of the structure of a computer device for executing the text list-based user feature extraction method according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the accompanying drawings in the present application are only for the purpose of illustration and description, and are not intended to limit the protection scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this application, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art.

As noted in the background above, conventional designs for extracting feature vectors for a risk control model decision mainly fall into two types: user feature extraction based on rules and templates, and user feature extraction based on TF-IDF (term frequency-inverse document frequency).

For example, rule and template based user feature extraction methods generally include the steps of:

first, application installation list information of a user recorded in a user terminal is acquired, for example [ 'com.sdu.di.gui', 'com.meituan.qcs.c.android', 'com.sankuai.meituan.takeoutnew' ];

then, according to the application name corresponding to each installation package name in the application installation list information, the application installation list of the user is converted into an application program identification list, for example [ 'DiDi Taxi', 'Meituan Taxi', 'Meituan Takeaway' ];

then, rules and templates are formulated manually, for example [ the number of the user's applications whose names contain the word "taxi", the ratio of the user's applications whose names contain the word "takeaway" to all of the user's applications ];

then, according to the manually formulated rules and templates, the application program identification list of the user is converted into a feature vector, which for this example is [2, 0.33];

finally, the extracted feature vector is input into a subsequent risk control model to take part in calculation and decision.
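By way of a non-limiting illustration, the two hand-written rules in the example above can be sketched in Python as follows. The function name `rule_features` and the English renderings of the application names are assumptions made for this sketch; they do not appear in the original disclosure.

```python
def rule_features(app_names):
    """Apply two hand-written rules to one user's application name list."""
    if not app_names:
        return [0, 0.0]
    # Rule 1: number of applications whose name contains the word "Taxi".
    taxi_count = sum(1 for name in app_names if "Taxi" in name)
    # Rule 2: share of applications whose name contains the word "Takeaway".
    takeaway_ratio = sum(1 for name in app_names if "Takeaway" in name) / len(app_names)
    return [taxi_count, round(takeaway_ratio, 2)]


# Assumed English renderings of the three application names in the example.
apps = ["DiDi Taxi", "Meituan Taxi", "Meituan Takeaway"]
features = rule_features(apps)
print(features)  # [2, 0.33]
```

Every new rule requires another hand-written line of this kind, which is the maintenance burden discussed below.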

In addition, the TF-IDF based user feature extraction method generally includes the steps of:

first, after the application installation list of each user has been converted into an application program identification list as described above, the application program identification list is segmented into words and high-frequency stop words are removed, giving the keyword list of each user, for example [ 'DiDi', 'taxi', 'Meituan', 'taxi', 'Meituan', 'takeaway' ] and [ 'King', 'Glory', 'Meituan', 'takeaway' ];

then, the keyword list of each user is traversed to construct a dictionary containing all words that appear: {1: 'DiDi', 2: 'taxi', 3: 'Meituan', 4: 'takeaway', 5: 'King', 6: 'Glory'};

then, statistics are computed over the keyword list of each user according to a preset TF-IDF formula and the constructed dictionary. The normalized term frequency vectors of the two users in the example are [0.167, 0.333, 0.333, 0.167, 0, 0] and [0, 0, 0.25, 0.25, 0.25, 0.25] respectively, and the inverse document frequency vector of the keywords in the dictionary is [0, 0, -0.176, -0.176, 0, 0]. Multiplying the normalized term frequency vector of each user by the inverse document frequency vector element by element gives the feature vectors of the two users, [0, 0, -0.059, -0.029, 0, 0] and [0, 0, -0.044, -0.044, 0, 0] respectively;

finally, the extracted feature vector is input into a subsequent risk control model to take part in calculation and decision.
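The TF-IDF figures in the example can be checked with a short Python sketch. The source only refers to "a preset TF-IDF formula"; the inverse document frequency used here, log10(N / (1 + df)), is an assumption inferred from the example values (it yields -0.176 for a keyword that appears in both of the two users' lists). The English keyword renderings are likewise assumed.

```python
import math

# Keyword lists of the two example users (assumed English renderings).
docs = [
    ["DiDi", "taxi", "Meituan", "taxi", "Meituan", "takeaway"],
    ["King", "Glory", "Meituan", "takeaway"],
]
vocab = ["DiDi", "taxi", "Meituan", "takeaway", "King", "Glory"]


def term_frequency(doc):
    """Normalized term frequency: occurrences divided by document length."""
    return [doc.count(word) / len(doc) for word in vocab]


def inverse_document_frequency():
    """Assumed formula: log10(N / (1 + number of documents containing the word))."""
    n = len(docs)
    return [
        math.log10(n / (1 + sum(1 for doc in docs if word in doc)))
        for word in vocab
    ]


idf = inverse_document_frequency()
tfidf = [[tf * w for tf, w in zip(term_frequency(doc), idf)] for doc in docs]

for row in tfidf:
    print([round(value, 3) for value in row])
```

With only six keywords the vectors are short, but the vector length equals the dictionary size, so a realistic vocabulary gives the very high-dimensional sparse vectors discussed below.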

According to the inventors' research, the rule- and template-based user feature extraction method requires a large number of rules and templates to be designed manually, and when user data change, these manually designed rules and templates often need to be maintained and modified, which consumes a great deal of effort. The method is also highly subjective, and the resulting features are not necessarily effective when actually applied to a subsequent machine learning-based risk control model, which affects the accuracy of the feature vector and of the subsequent risk control decision.

In addition, the feature vectors extracted by the TF-IDF-based user feature extraction method usually have a very high dimension and form a very large sparse matrix, which is unfavorable for data storage on computer equipment. Used as input to the subsequent risk control model, they often cause the curse of dimensionality, and most machine learning algorithms and models cannot cope with a sparse matrix of such high dimension.

To address the above technical problems, the inventors propose the following scheme. Referring to Fig. 1, which shows a schematic flow chart of the text list-based user feature extraction method provided in an embodiment of the present application, it should be understood that, in other embodiments, the order of some steps of the method may be interchanged according to actual needs, and some steps may be omitted or deleted. The detailed steps of the text list-based user feature extraction method are described below.

Step S110, acquiring application installation list information of a user from a user terminal.

Step S120, converting the user installation list information into a text information list, and converting the text information list into topic feature vectors through an LDA topic model.

Step S130, inputting the extracted topic feature vector into a target risk control model to obtain the decision output of the target risk control model.

In this embodiment, the topic feature vector is a feature vector with a plurality of preset feature dimensions, and the value in each preset feature dimension represents the probability that the user belongs to the topic label corresponding to that dimension. The number of preset feature dimensions may be chosen flexibly according to actual design requirements and is not specifically limited herein.

Based on the above design, the topic feature vector output by the LDA topic model is more interpretable, because each preset feature dimension reflects the probability that the user belongs to the corresponding topic label, and information maintenance and update costs are reduced. In addition, the dimension of the topic feature vector is low and can be specified manually, which avoids the curse of dimensionality caused by excessively high feature dimensions and allows the subsequent risk control model to perform better.

In one possible implementation, for step S120, this may be achieved by the following exemplary sub-steps, described in detail below.

Sub-step S121, converting the user installation list information into a text information list, and acquiring the application program identification information corresponding to each installation package name in the text information list.

Sub-step S122, determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into the pre-trained LDA topic model to obtain a feature vector of a plurality of preset dimensions as the topic feature vector.
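As a rough illustration of the last sub-step, the following Python sketch estimates a topic feature vector for one keyword vector when the topic-word probabilities of a trained model are held fixed. It uses a simple EM-style fixed-point update, which is only one possible way of performing this inference and is not stated in the source. The function name `infer_topic_vector`, the two-topic matrix `TOPIC_WORD`, its probability values, and the smoothing parameter `alpha` are all hypothetical.

```python
def infer_topic_vector(keyword_vector, topic_word, alpha=0.1, iters=50):
    """Estimate topic proportions for one user given fixed topic-word probabilities."""
    num_topics = len(topic_word)
    theta = [1.0 / num_topics] * num_topics  # start from a uniform guess
    for _ in range(iters):
        expected = [alpha] * num_topics  # smoothing term (assumed value)
        for word_id in keyword_vector:
            # Share of this keyword occurrence attributed to each topic.
            weights = [theta[k] * topic_word[k][word_id] for k in range(num_topics)]
            total = sum(weights)
            for k in range(num_topics):
                expected[k] += weights[k] / total
        norm = sum(expected)
        theta = [value / norm for value in expected]
    return theta


# Hypothetical topic-word probabilities of a 2-topic model over keyword ids 1..6.
TOPIC_WORD = [
    {1: 0.30, 2: 0.40, 3: 0.20, 4: 0.05, 5: 0.025, 6: 0.025},  # travel-like topic
    {1: 0.025, 2: 0.025, 3: 0.20, 4: 0.25, 5: 0.25, 6: 0.25},  # games-and-food-like topic
]

topic_vector = infer_topic_vector([1, 2, 3, 2, 3, 4], TOPIC_WORD)
print([round(p, 3) for p in topic_vector])
```

The returned list has one entry per topic and sums to 1, so each entry can be read as the probability that the user belongs to the corresponding topic label.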

As one possible example, the above LDA topic model can be trained in the following manner, as described in detail below.

First, user installation list information of a plurality of users collected in advance is acquired, the application program identification information corresponding to each installation package name is obtained from the user installation list information, and the user installation list information is converted into an application program identification list. For example, the user installation list information may be [ 'com.sdu.di.gui', 'com.meituan.qcs.c.android', 'com.sankuai.meituan.takeoutnew' ] and [ 'com.tencent.tmgp.sgame', 'com.sankuai.meituan.takeoutnew' ], and the corresponding application program identification lists may be [ 'DiDi Taxi', 'Meituan Taxi', 'Meituan Takeaway' ] and [ 'King Glory', 'Meituan Takeaway' ].

Then, the application program identification list is segmented into words to obtain the keyword list of each user.

In detail, English words are naturally separated by spaces, whereas Chinese words are not. To obtain the words in the application program identification list, the application program identification list of each user therefore needs to be segmented to obtain the keyword list of the user, for example [ 'DiDi', 'taxi', 'Meituan', 'taxi', 'Meituan', 'takeaway' ] and [ 'King', 'Glory', 'Meituan', 'takeaway' ].

On this basis, the keyword list of each user can be traversed, and a corresponding keyword dictionary is built by all the keywords which appear.

For example, the keyword dictionary constructed for the above example may be {1: 'DiDi', 2: 'taxi', 3: 'Meituan', 4: 'takeaway', 5: 'King', 6: 'Glory'}.
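The dictionary construction can be sketched in Python as follows. Numbering keywords from 1 in order of first appearance is an assumption that happens to reproduce the example dictionary; the function name `build_keyword_dictionary` and the English keyword renderings are hypothetical.

```python
def build_keyword_dictionary(keyword_lists):
    """Assign an integer id to every distinct keyword, in order of first appearance."""
    word_to_id = {}
    for keywords in keyword_lists:
        for word in keywords:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id) + 1  # ids start at 1
    # Return the id -> keyword view used in the example.
    return {idx: word for word, idx in word_to_id.items()}


keyword_lists = [
    ["DiDi", "taxi", "Meituan", "taxi", "Meituan", "takeaway"],
    ["King", "Glory", "Meituan", "takeaway"],
]
dictionary = build_keyword_dictionary(keyword_lists)
print(dictionary)
```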

And finally, converting the keyword list of each user into keyword vectors according to the constructed keyword dictionary, forming training samples by the keyword vectors of all the users, and training according to the preset topic number and the training samples to obtain the LDA topic model.

For example, the keyword lists in the above example are converted into the keyword vectors [1,2,3,2,3,4] and [5,6,3,4]; these two keyword vectors then form the training samples, and the LDA topic model is obtained by training according to the preset topic number and the training samples.
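The encoding and training steps can be sketched in pure Python as follows. The source does not say which algorithm is used to fit the LDA topic model; this sketch uses collapsed Gibbs sampling, one common choice, and the function names `encode` and `train_lda` as well as the hyperparameters `alpha`, `beta`, `iters` and `seed` are assumptions. In practice an existing library implementation would normally be used.

```python
import random


def encode(keywords, word_to_id):
    """Convert a keyword list into a keyword vector of dictionary ids."""
    return [word_to_id[w] for w in keywords if w in word_to_id]


def train_lda(docs, vocab_size, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic proportions."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * (vocab_size + 1) for _ in range(num_topics)]  # ids start at 1
    topic_total = [0] * num_topics
    assignments = []
    # Random initial topic assignment for every keyword occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions of each training document.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]


word_to_id = {"DiDi": 1, "taxi": 2, "Meituan": 3, "takeaway": 4, "King": 5, "Glory": 6}
samples = [
    encode(["DiDi", "taxi", "Meituan", "taxi", "Meituan", "takeaway"], word_to_id),
    encode(["King", "Glory", "Meituan", "takeaway"], word_to_id),
]
theta = train_lda(samples, vocab_size=6, num_topics=2)
print(samples)
print([[round(p, 3) for p in row] for row in theta])
```

The length of each row of `theta` equals the preset topic number, which is how the dimension of the topic feature vector is specified manually.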

In one possible implementation manner, after the keyword list of each user is traversed and the corresponding keyword dictionary is built by using all the keywords that appear, in order to improve accuracy and referenceability of the feature vector, the occurrence frequency of each keyword in the built keyword dictionary in the keyword list of all the users may be further counted. For example, corresponding to the above example, the frequency of occurrence of each keyword in the keyword list of all users is {1:1,2:2,3:3,4:2,5:1,6:1}, respectively.

Then, the constructed keyword dictionary may be recoded according to occurrence frequency of each keyword in the keyword list of all users, so as to obtain a recoded keyword dictionary, and the step of converting the keyword list of each user into a keyword vector according to the constructed keyword dictionary is performed based on the recoded keyword dictionary.

For example, according to the occurrence frequency of each keyword in the keyword lists of all users, ultra-low-frequency, meaningless words can be deleted from the keyword dictionary and the keyword dictionary recoded. For example, if words with an occurrence frequency of 1 are deleted, the recoded dictionary obtained from the above example is {2: 'taxi', 3: 'Meituan', 4: 'takeaway'}.
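The frequency counting and pruning can be sketched in Python as follows. The names `keyword_frequencies` and `prune_dictionary` and the `min_count` parameter are hypothetical, and the sketch keeps the original ids of the surviving keywords; whether the ids are renumbered after pruning is left open here.

```python
from collections import Counter


def keyword_frequencies(keyword_vectors):
    """Count how often each keyword id occurs across all users' keyword vectors."""
    return Counter(k for vector in keyword_vectors for k in vector)


def prune_dictionary(dictionary, freqs, min_count=2):
    """Drop keywords that occur fewer than min_count times (assumed cut-off)."""
    return {idx: word for idx, word in dictionary.items() if freqs[idx] >= min_count}


dictionary = {1: "DiDi", 2: "taxi", 3: "Meituan", 4: "takeaway", 5: "King", 6: "Glory"}
vectors = [[1, 2, 3, 2, 3, 4], [5, 6, 3, 4]]

freqs = keyword_frequencies(vectors)
pruned = prune_dictionary(dictionary, freqs)
# Re-encode each user's keyword vector against the pruned dictionary.
recoded_vectors = [[k for k in vector if k in pruned] for vector in vectors]

print(dict(freqs))
print(pruned)
print(recoded_vectors)
```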

Further, in one possible implementation manner, for step S130, the extracted topic feature vector may be input into the target risk control model, and the topic feature vector is matched against a risk control matching rule in the target risk control model to obtain a decision output result. Different target risk control models have different risk control matching rules, which can be selected according to actual service requirements and are not specifically limited herein.

For example, if the target risk control model is a pre-loan approval risk control model and the final decision output is the user's credit score, a threshold may be preset: loan approval may be granted to users whose credit score is above the threshold and refused for users whose credit score is below it.
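The threshold-based approval example above can be sketched in Python as follows. The source does not describe how the credit score is computed from the topic feature vector; the linear scoring function, its weights, the base score and the threshold value are all hypothetical, and treating a score exactly equal to the threshold as an approval is likewise an assumption.

```python
def credit_score(topic_vector, weights, base=500.0):
    """Hypothetical linear score: base plus a weighted sum of topic probabilities."""
    return base + sum(w * p for w, p in zip(weights, topic_vector))


def pre_loan_decision(topic_vector, weights, threshold=600.0):
    """Approve when the score reaches the preset threshold, otherwise reject."""
    score = credit_score(topic_vector, weights)
    return "approve" if score >= threshold else "reject"


# Hypothetical weights for a 2-topic feature vector.
weights = [200.0, -100.0]
print(pre_loan_decision([0.9, 0.1], weights))
print(pre_loan_decision([0.1, 0.9], weights))
```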

For another example, if the target risk control model is an in-loan monitoring risk control model, the final decision output indicates whether the user can repay on time or whether there is a risk of overdue repayment.

For another example, if the target risk control model is a post-loan collection risk control model and the users are overdue users who have not yet repaid, the model can determine which users are more likely to repay, so that collection efforts can be focused on those users.

Based on the same inventive concept, please refer to fig. 2, which shows a schematic diagram of functional modules of a text list-based user feature extraction device 110 according to an embodiment of the present application, where the functional modules of the text list-based user feature extraction device 110 may be divided according to the above-mentioned method embodiment. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation. For example, in the case of dividing the respective function modules with the respective functions, the text list-based user feature extraction apparatus 110 shown in fig. 2 is only one apparatus schematic diagram. The text-list-based user feature extraction device 110 may include an obtaining module 111, a converting module 112, and a decision output module 113, and the functions of each functional module of the text-list-based user feature extraction device 110 are described in detail below.

An obtaining module 111, configured to obtain application installation list information of a user from a user terminal. It is understood that the acquisition module 111 may be used to perform the step S110 described above, and reference may be made to the details of the implementation of the acquisition module 111 regarding the step S110 described above.

The conversion module 112 is configured to convert the user installation list information into a text information list, and convert the text information list into topic feature vectors through an LDA topic model, where the topic feature vectors are feature vectors with a plurality of preset dimensions, and the feature vector with each preset feature dimension is used to represent a probability that a user belongs to a topic label corresponding to the preset feature dimension. It is understood that the conversion module 112 may be used to perform the step S120 described above, and reference may be made to the details of the implementation of the conversion module 112 regarding the step S120 described above.

The decision output module 113 is configured to input the extracted topic feature vector into a target risk control model to obtain the decision output of the target risk control model. It is understood that the decision output module 113 may be used to perform step S130 described above, and for details of the implementation of the decision output module 113, reference may be made to the description of step S130 above.

In one possible implementation, the conversion module 112 is specifically configured to:

and determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into a pre-trained LDA topic model to obtain a plurality of feature vectors with preset dimensions as topic feature vectors.

In one possible implementation, the LDA topic model is trained by:

dividing words from the application program identification list to obtain a keyword list of each user;

In one possible implementation, the conversion module 112 is further specifically configured to:

after traversing the keyword list of each user, constructing a corresponding keyword dictionary by using all the keywords which appear, and counting the occurrence frequency of each keyword in the constructed keyword dictionary in the keyword list of all the users;

In one possible implementation, the decision output module 113 is specifically configured to:

inputting the extracted topic feature vector into a target risk control model, and matching the topic feature vector against a risk control matching rule in the target risk control model to obtain a decision output result.

Referring to fig. 3, which is a schematic block diagram illustrating a structure of a computer device 100 for performing the text list-based user feature extraction method according to an embodiment of the present application, the computer device 100 may include a text list-based user feature extraction apparatus 110, a machine-readable storage medium 120, and a processor 130.

In this embodiment, the machine-readable storage medium 120 and the processor 130 are both located in the computer device 100 and are separately provided. However, it should be understood that the machine-readable storage medium 120 may also be separate from the computer device 100 and accessible by the processor 130 through a bus interface. In the alternative, machine-readable storage medium 120 may be integrated into processor 130, and may be, for example, a cache and/or general purpose registers.

The text list-based user feature extraction device 110 may include software functional modules (e.g., the acquisition module 111, the conversion module 112, and the decision output module 113 shown in Fig. 2) stored on the machine-readable storage medium 120, which, when executed by the processor 130, implement the text list-based user feature extraction method provided by the foregoing method embodiments.

Since the computer device 100 provided in the embodiment of the present application is another implementation form of the method embodiment executed by the computer device 100, and the computer device 100 may be used to execute the text list-based user feature extraction method provided in the method embodiment, the technical effects obtained by the method embodiment may refer to the method embodiment and will not be described herein.

The embodiments described above are only some, but not all, of the embodiments of the present application. The components of the embodiments of the present application, as generally described and illustrated in the figures, may be arranged and designed in a wide variety of different configurations. Accordingly, the detailed description of the embodiments of the present application provided in the drawings is not intended to limit the scope of protection of the application, but is merely representative of selected embodiments of the application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims. Moreover, all other embodiments that can be made by a person skilled in the art, based on the embodiments of the present application, without making any inventive effort, shall fall within the scope of protection of the present application.

Claims

1. A text list-based user feature extraction method, for application to a computer device, the method comprising:

inputting the extracted topic feature vector into a target risk control model to obtain a decision output of the target risk control model;

the step of converting the user installation list information into a text information list and converting the text information list into a topic feature vector through an LDA topic model comprises the following steps:

determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into a pre-trained LDA topic model to obtain a plurality of feature vectors with preset dimensions as the topic feature vectors;

the LDA topic model is trained by the following modes:

and converting the keyword list of each user into keyword vectors according to the constructed keyword dictionary, forming training samples by the keyword vectors of all the users, and training according to the preset topic number and the training samples to obtain an LDA topic model.

2. The text list-based user feature extraction method of claim 1, further comprising:

3. The text list-based user feature extraction method according to claim 1, wherein the step of inputting the extracted topic feature vector into a target risk control model to obtain a decision output of the target risk control model comprises:

4. A text list-based user feature extraction apparatus for use with a computer device, the apparatus comprising:

the decision output module is used for inputting the extracted topic feature vector into a target risk control model to obtain a decision output of the target risk control model;

wherein, the conversion module is specifically used for:

the LDA topic model is trained by the following modes:

5. The text list-based user feature extraction apparatus of claim 4, wherein the conversion module is further specifically configured to:

6. The text-list-based user feature extraction apparatus of claim 4, wherein the decision output module is specifically configured to: