CN112036572B - Text list-based user feature extraction method and device - Google Patents

Text list-based user feature extraction method and device

Publication number
CN112036572B
Authority
CN
China
Prior art keywords
list
keyword
user
topic
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010889481.1A
Other languages
Chinese (zh)
Other versions
CN112036572A (en)
Inventor
顾凌云
谢旻旗
段湾
陈尚伟
张涛
潘峻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai IceKredit Inc
Original Assignee
Shanghai IceKredit Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai IceKredit Inc
Priority to CN202010889481.1A
Publication of CN112036572A
Application granted
Publication of CN112036572B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/0482 Interaction with lists of selectable items, e.g. menus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a text list-based user feature extraction method and device. Application installation list information of a user is converted into a text information list, the text information list is converted into a topic feature vector through an LDA (latent Dirichlet allocation) topic model, and the extracted topic feature vector is then input into a target risk control model, which outputs a decision. The topic feature vector output by the LDA topic model is more interpretable, because each preset feature dimension reflects the probability that the user belongs to the corresponding topic label, and information maintenance and update costs are reduced. In addition, the dimension of the topic feature vector is low and can be specified manually, which avoids the curse of dimensionality caused by an excessively high feature dimension and allows the subsequent risk control model to perform better.

Description

Text list-based user feature extraction method and device
Technical Field
The application relates to the technical field of computer-based risk control, and in particular to a text list-based user feature extraction method and device.
Background
In existing risk control business scenarios, enterprises build machine learning models from as much user data as possible, and reduce the cost of user data storage and model deployment as far as possible while maximizing model performance. In this context, the application installation information of a user's mobile device plays an important role in improving the performance of an enterprise's risk control model. Conventional schemes generally extract relevant feature vectors from the application installation information of the user's mobile device and input them into a subsequent risk control model to participate in the final calculation and decision. However, conventional schemes have high information maintenance and update costs and low feature accuracy, are unfavorable for data storage on computer equipment, and often cause the curse of dimensionality when their output is used as input to the subsequent risk control model.
Disclosure of Invention
In view of the shortcomings of existing designs, the application provides a text list-based user feature extraction method and device. The topic feature vector output by the LDA topic model is more interpretable, because each preset feature dimension reflects the probability that the user belongs to the corresponding topic label, and information maintenance and update costs are reduced. In addition, the dimension of the topic feature vector is low and can be specified manually, which avoids the curse of dimensionality caused by an excessively high feature dimension and allows the subsequent risk control model to perform better.
According to a first aspect of embodiments of the present application, there is provided a text list-based user feature extraction method, applied to a computer device, the method including:
acquiring application installation list information of a user from a user terminal;
converting the application installation list information into a text information list, and converting the text information list into a topic feature vector through an LDA topic model, wherein the topic feature vector has a plurality of preset feature dimensions, and the value in each preset feature dimension represents the probability that the user belongs to the topic label corresponding to that preset feature dimension;
and inputting the extracted topic feature vector into a target risk control model, and outputting the decision of the target risk control model.
In a possible implementation manner of the first aspect, the step of converting the user installation list information into a text information list and converting the text information list into a topic feature vector through an LDA topic model includes:
converting the user installation list information into a text information list, and acquiring application program identification information corresponding to each installation package name in the text information list;
and determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into a pre-trained LDA topic model to obtain a feature vector with a plurality of preset dimensions as the topic feature vector.
In a possible implementation manner of the first aspect, the LDA topic model is trained by:
acquiring user installation list information of a plurality of users collected in advance, acquiring application program identification information corresponding to each installation package name from the user installation list information, and converting the user installation list information into an application program identification list;
performing word segmentation on the application program identification list to obtain a keyword list of each user;
traversing the keyword list of each user, and constructing a corresponding keyword dictionary from all the keywords that appear;
converting the keyword list of each user into a keyword vector according to the constructed keyword dictionary, forming training samples from the keyword vectors of all the users, and training on the training samples with the preset number of topics to obtain the LDA topic model.
In a possible implementation manner of the first aspect, the method further includes:
counting the occurrence frequency of each keyword in the constructed keyword dictionary in the keyword list of all users;
and recoding the constructed keyword dictionary according to the occurrence frequency of each keyword in the keyword list of all users to obtain a recoded keyword dictionary, and executing the step of converting the keyword list of each user into a keyword vector according to the constructed keyword dictionary based on the recoded keyword dictionary.
In a possible implementation manner of the first aspect, the step of inputting the extracted topic feature vector into a target risk control model and outputting the decision of the target risk control model includes:
inputting the extracted topic feature vector into the target risk control model, and matching the topic feature vector against a risk control matching rule in the target risk control model to obtain a decision output result.
According to a second aspect of embodiments of the present application, there is provided a text list-based user feature extraction apparatus, applied to a computer device, the apparatus including:
the acquisition module is used for acquiring application installation list information of a user from the user terminal;
the conversion module is used for converting the application installation list information into a text information list and converting the text information list into a topic feature vector through an LDA topic model, wherein the topic feature vector has a plurality of preset feature dimensions, and the value in each preset feature dimension represents the probability that the user belongs to the topic label corresponding to that preset feature dimension;
the decision output module is used for inputting the extracted topic feature vector into a target risk control model and outputting the decision of the target risk control model.
Based on any one of the above aspects, the application installation list information of the user is converted into a text information list, the text information list is converted into a topic feature vector through the LDA topic model, and the extracted topic feature vector is then input into the target risk control model, which outputs a decision. The topic feature vector output by the LDA topic model is more interpretable, because each preset feature dimension reflects the probability that the user belongs to the corresponding topic label, and information maintenance and update costs are reduced. In addition, the dimension of the topic feature vector is low and can be specified manually, which avoids the curse of dimensionality caused by an excessively high feature dimension and allows the subsequent risk control model to perform better.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a text list-based user feature extraction method according to an embodiment of the present application;
Fig. 2 is a schematic functional block diagram of a text list-based user feature extraction device according to an embodiment of the present application;
Fig. 3 is a schematic component structure diagram of a test terminal for executing the text list-based user feature extraction method according to the embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the accompanying drawings in the present application are only for the purpose of illustration and description, and are not intended to limit the protection scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this application, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art.
As noted in the Background, conventional designs extract feature vectors for risk control model decisions mainly in two ways: a user feature extraction method based on rules and templates, and a user feature extraction method based on TF-IDF (term frequency-inverse document frequency).
For example, rule- and template-based user feature extraction methods generally include the following steps:
first, application installation list information of a user recorded in a user terminal is acquired, for example ['com.sdu.di.gui', 'com.meituan.qcs.c.android', 'com.sankuai.meituan.takeoutnew'];
then, according to the application name corresponding to each installation package name in the application installation list information, the application installation list of the user is converted into an application program identification list, for example ['Didi Taxi', 'Meituan Taxi', 'Meituan Takeaway'];
then, rules and templates are formulated manually, for example [the number of the user's applications whose names contain the word "taxi", the ratio of the user's applications whose names contain the word "takeaway" to all of the user's applications];
then, according to the manually formulated rules and templates, the application program identification list of the user is converted into a feature vector, which in this example is [2, 0.33];
finally, the extracted feature vector is input into a subsequent risk control model to participate in calculation and decision.
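The rule-based steps above can be sketched in a few lines of Python. This is only an illustration: the English application names are assumed renderings of the example, and the function name `rule_features` and the substring-matching approach are hypothetical, not taken from the patent.

```python
# Assumed English renderings of the example application names.
app_names = ["Didi Taxi", "Meituan Taxi", "Meituan Takeaway"]


def rule_features(names):
    """Apply the two hand-written example rules:
    [number of apps containing "taxi", share of apps containing "takeaway"]."""
    if not names:
        return [0, 0.0]
    taxi_count = sum("taxi" in name.lower() for name in names)
    takeaway_ratio = sum("takeaway" in name.lower() for name in names) / len(names)
    return [taxi_count, round(takeaway_ratio, 2)]


features = rule_features(app_names)
print(features)  # [2, 0.33]
```

Every new business question requires another hand-written rule of this kind, which is the maintenance burden the description goes on to criticize.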
In addition, the TF-IDF-based user feature extraction method generally includes the following steps:
first, application installation list information of a user recorded in a user terminal is acquired, for example ['com.sdu.di.gui', 'com.meituan.qcs.c.android', 'com.sankuai.meituan.takeoutnew'];
then, according to the application name corresponding to each installation package name in the application installation list information, the application installation list of the user is converted into an application program identification list, for example ['Didi Taxi', 'Meituan Taxi', 'Meituan Takeaway'];
then, word segmentation is performed on the application program identification list of the user and high-frequency stop words are removed, to obtain the keyword list of the user, for example ['Didi', 'taxi', 'Meituan', 'taxi', 'Meituan', 'takeaway'] for this user and ['king', 'glory', 'Meituan', 'takeaway'] for a second user;
then, the keyword list of each user is traversed to construct a dictionary containing all the words that appear: {1: 'Didi', 2: 'taxi', 3: 'Meituan', 4: 'takeaway', 5: 'king', 6: 'glory'};
then, statistics are computed over the keyword lists according to a preset TF-IDF formula and the constructed dictionary. In this example, the normalized term frequency vectors of the two users are [0.167, 0.333, 0.333, 0.167, 0, 0] and [0, 0, 0.25, 0.25, 0.25, 0.25], and the inverse document frequency vector of the keywords in the dictionary is [0, 0, -0.176, -0.176, 0, 0]. Multiplying each user's normalized term frequency vector element by element with the inverse document frequency vector gives the feature vectors of the two users: [0, 0, -0.059, -0.029, 0, 0] and [0, 0, -0.044, -0.044, 0, 0];
finally, the extracted feature vector is input into a subsequent risk control model to participate in calculation and decision.
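The text only refers to "a preset TF-IDF formula". The example numbers are consistent with a normalized term frequency and an inverse document frequency of the form log10(N / (1 + df)), where N is the number of users and df is the number of users whose keyword list contains the word. The sketch below assumes that formula and assumed English renderings of the keywords; it is not necessarily the formula the inventors used.

```python
import math

# Keyword lists of the two example users (assumed English renderings).
docs = [
    ["Didi", "taxi", "Meituan", "taxi", "Meituan", "takeaway"],
    ["king", "glory", "Meituan", "takeaway"],
]
# Dictionary order 1..6 from the example.
vocab = ["Didi", "taxi", "Meituan", "takeaway", "king", "glory"]


def term_frequency(doc):
    """Normalized term frequency: occurrences divided by list length."""
    return [doc.count(word) / len(doc) for word in vocab]


def inverse_document_frequency():
    """Assumed smoothing: log10(N / (1 + document frequency))."""
    n = len(docs)
    return [math.log10(n / (1 + sum(word in d for d in docs))) for word in vocab]


idf_vec = inverse_document_frequency()
tfidf = [
    [round(tf * idf, 3) for tf, idf in zip(term_frequency(doc), idf_vec)]
    for doc in docs
]
print(tfidf[0])  # [0.0, 0.0, -0.059, -0.029, 0.0, 0.0]
print(tfidf[1])  # [0.0, 0.0, -0.044, -0.044, 0.0, 0.0]
```

The vector length equals the dictionary size, so with a realistic vocabulary of many thousands of words each user's vector is long and mostly zero.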
According to the inventors' research, the rule- and template-based user feature extraction method requires a large number of manually designed rules and templates, and when user data changes, these rules and templates often need to be maintained and modified, which consumes considerable effort. The method is also highly subjective, and the resulting features are not necessarily effective when applied to a subsequent machine learning-based risk control model, which affects both the accuracy of the feature vector and the accuracy of the subsequent risk control decision.
In addition, the feature vectors extracted by the TF-IDF-based user feature extraction method usually have a very high dimension and produce a very large sparse matrix, which is unfavorable for data storage on computer equipment. When such vectors are used as input to a subsequent risk control model, they often cause the curse of dimensionality, and most machine learning algorithms and models cannot cope with a sparse matrix of such high dimension.
Based on the above technical problems, the inventor of the present application has creatively studied to propose the following scheme, and referring to fig. 1, fig. 1 shows a flow chart of a text list-based user feature extraction method provided in an embodiment of the present application, and it should be understood that, in other embodiments, part of steps in the text list-based user feature extraction method of the present embodiment may be interchanged according to actual needs, or part of steps may be omitted or deleted. The detailed steps of the text list-based user feature extraction method are described below.
Step S110, acquiring application installation list information of a user from a user terminal.
Step S120, converting the user installation list information into a text information list, and converting the text information list into topic feature vectors through an LDA topic model.
Step S130, inputting the extracted topic feature vector into a target risk control model, and outputting the decision of the target risk control model.
In this embodiment, the topic feature vector is a feature vector with a plurality of preset feature dimensions, and the value in each preset feature dimension represents the probability that the user belongs to the topic label corresponding to that dimension. The number of preset feature dimensions may be chosen flexibly according to actual design requirements and is not specifically limited herein.
Based on this design, the topic feature vector output by the LDA topic model is more interpretable, because each preset feature dimension reflects the probability that the user belongs to the corresponding topic label, and information maintenance and update costs are reduced. In addition, the dimension of the topic feature vector is low and can be specified manually, which avoids the curse of dimensionality caused by an excessively high feature dimension and allows the subsequent risk control model to perform better.
In one possible implementation, for step S120, this may be achieved by the following exemplary sub-steps, described in detail below.
Sub-step S121, converting the user installation list information into a text information list, and acquiring application program identification information corresponding to each installation package name in the text information list.
Sub-step S122, determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into the pre-trained LDA topic model to obtain a feature vector with a plurality of preset dimensions as the topic feature vector.
As one possible example, the above LDA topic model can be trained in the following manner, as described in detail below.
First, user installation list information of a plurality of users collected in advance is acquired, application program identification information corresponding to each installation package name is obtained from the user installation list information, and the user installation list information is converted into an application program identification list. For example, continuing the examples above with a second user whose installation list includes 'com.tencent.tmgp.sgame', the application program identification lists of the two users may be ['Didi Taxi', 'Meituan Taxi', 'Meituan Takeaway'] and ['King Glory', 'Meituan Takeaway'].
Then, the keyword list of each user is obtained by word segmentation on the application identification list.
In detail, English words are naturally separated by spaces, whereas Chinese words are not. To obtain the words in the application program identification list, word segmentation therefore needs to be performed on each user's application program identification list to obtain that user's keyword list. For the above examples, the keyword lists are ['Didi', 'taxi', 'Meituan', 'taxi', 'Meituan', 'takeaway'] and ['king', 'glory', 'Meituan', 'takeaway'].
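The patent does not name a segmentation algorithm. As a minimal sketch, the block below uses forward maximum matching against a small lexicon, which is one simple dictionary-based technique. The Chinese strings are assumed originals of the example application names, and the lexicon is hypothetical; a production system would more likely rely on a dedicated segmentation library.

```python
# Hypothetical lexicon covering only the words of the running example.
LEXICON = {"滴滴", "打车", "美团", "外卖", "王者", "荣耀"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Forward maximum matching: at each position take the longest
    lexicon word; fall back to a single character if nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in LEXICON:
                tokens.append(piece)
                i += size
                break
    return tokens


def keyword_list(app_names):
    """Concatenate the segmented words of every application name."""
    return [token for name in app_names for token in segment(name)]


# Assumed Chinese application names behind the translated example.
user_a = keyword_list(["滴滴打车", "美团打车", "美团外卖"])
user_b = keyword_list(["王者荣耀", "美团外卖"])
print(user_a)
print(user_b)
```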
On this basis, the keyword list of each user can be traversed, and a corresponding keyword dictionary is built by all the keywords which appear.
For example, the keyword dictionary constructed for the above example may be {1: 'Didi', 2: 'taxi', 3: 'Meituan', 4: 'takeaway', 5: 'king', 6: 'glory'}.
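Building the dictionary amounts to assigning consecutive integer IDs in order of first appearance. A minimal sketch follows, using assumed English renderings of the keywords; the function name `build_dictionary` is hypothetical.

```python
def build_dictionary(keyword_lists):
    """Traverse every user's keyword list and give each new word
    the next integer ID, starting from 1."""
    word_to_id = {}
    for keywords in keyword_lists:
        for word in keywords:
            word_to_id.setdefault(word, len(word_to_id) + 1)
    # Return the ID -> word form used in the example.
    return {idx: word for word, idx in word_to_id.items()}


keyword_lists = [
    ["Didi", "taxi", "Meituan", "taxi", "Meituan", "takeaway"],
    ["king", "glory", "Meituan", "takeaway"],
]
dictionary = build_dictionary(keyword_lists)
print(dictionary)
# {1: 'Didi', 2: 'taxi', 3: 'Meituan', 4: 'takeaway', 5: 'king', 6: 'glory'}
```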
Finally, the keyword list of each user is converted into a keyword vector according to the constructed keyword dictionary, the keyword vectors of all the users form the training samples, and the LDA topic model is obtained by training on the training samples with the preset number of topics.
For example, the keyword vectors for the above example may be [1, 2, 3, 2, 3, 4] and [5, 6, 3, 4]. These two vectors are combined into the training samples, and the LDA topic model is obtained by training on the training samples with the preset number of topics.
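The patent does not specify how the LDA topic model is fitted. The sketch below encodes the keyword lists with the example dictionary and then fits LDA by collapsed Gibbs sampling, which is one standard inference method chosen here for illustration. The hyperparameters `alpha` and `beta`, the iteration count, the seed, and the function names are all assumptions.

```python
import random

# Example dictionary and keyword lists (English renderings assumed).
WORD_TO_ID = {"Didi": 1, "taxi": 2, "Meituan": 3, "takeaway": 4, "king": 5, "glory": 6}
KEYWORD_LISTS = [
    ["Didi", "taxi", "Meituan", "taxi", "Meituan", "takeaway"],
    ["king", "glory", "Meituan", "takeaway"],
]


def encode(keywords):
    """Map a keyword list to its keyword vector of dictionary IDs."""
    return [WORD_TO_ID[w] for w in keywords if w in WORD_TO_ID]


def fit_lda(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of 1-based word IDs.
    Returns one topic-probability vector per document (user)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * (vocab_size + 1) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    for d, doc in enumerate(docs):  # random initial topic per word
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = assign[d][n]  # remove the current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][n] = k  # record the resampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


keyword_vectors = [encode(k) for k in KEYWORD_LISTS]  # [[1,2,3,2,3,4], [5,6,3,4]]
topic_vectors = fit_lda(keyword_vectors, num_topics=2, vocab_size=len(WORD_TO_ID))
print(topic_vectors)  # one 2-dimensional probability vector per user
```

The length of each output vector equals the preset number of topics rather than the dictionary size, which is why the resulting feature is low-dimensional and dense.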
In one possible implementation manner, after the keyword list of each user is traversed and the corresponding keyword dictionary is built by using all the keywords that appear, in order to improve accuracy and referenceability of the feature vector, the occurrence frequency of each keyword in the built keyword dictionary in the keyword list of all the users may be further counted. For example, corresponding to the above example, the frequency of occurrence of each keyword in the keyword list of all users is {1:1,2:2,3:3,4:2,5:1,6:1}, respectively.
Then, the constructed keyword dictionary may be recoded according to occurrence frequency of each keyword in the keyword list of all users, so as to obtain a recoded keyword dictionary, and the step of converting the keyword list of each user into a keyword vector according to the constructed keyword dictionary is performed based on the recoded keyword dictionary.
For example, according to the occurrence frequency of each keyword in the keyword lists of all users, meaningless ultra-low-frequency words may be deleted from the keyword dictionary and the keyword dictionary recoded. For example, if words with a frequency of 1 are deleted, the recoded dictionary obtained for the above example is {2: 'taxi', 3: 'Meituan', 4: 'takeaway'}.
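The frequency count and pruning can be sketched as follows. The keywords are assumed English renderings, and the function names and the `min_count` parameter are hypothetical. This sketch keeps the original IDs of the surviving words; whether recoding should also renumber them is not settled by the text.

```python
from collections import Counter

dictionary = {1: "Didi", 2: "taxi", 3: "Meituan", 4: "takeaway", 5: "king", 6: "glory"}
keyword_lists = [
    ["Didi", "taxi", "Meituan", "taxi", "Meituan", "takeaway"],
    ["king", "glory", "Meituan", "takeaway"],
]


def keyword_frequencies(dictionary, keyword_lists):
    """Count how often each dictionary word occurs across all users."""
    counts = Counter(word for keywords in keyword_lists for word in keywords)
    return {idx: counts[word] for idx, word in dictionary.items()}


def prune_dictionary(dictionary, frequencies, min_count=2):
    """Drop words seen fewer than min_count times (hypothetical cutoff)."""
    return {idx: word for idx, word in dictionary.items() if frequencies[idx] >= min_count}


frequencies = keyword_frequencies(dictionary, keyword_lists)
recoded = prune_dictionary(dictionary, frequencies)
print(frequencies)  # {1: 1, 2: 2, 3: 3, 4: 2, 5: 1, 6: 1}
print(recoded)      # {2: 'taxi', 3: 'Meituan', 4: 'takeaway'}
```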
Further, in one possible implementation manner, for step S130, the extracted topic feature vector may be input into the target risk control model, and the topic feature vector is matched against a risk control matching rule in the target risk control model to obtain a decision output result. Different target risk control models have different risk control matching rules, which can be selected according to actual service requirements and are not limited in detail herein.
For example, if the target risk control model is a pre-loan approval risk control model and the final decision output result is the credit score of the user, a threshold may be preset: users with credit scores above the threshold may be approved for a loan, and users with credit scores below the threshold are refused.
For another example, if the target risk control model is an in-loan monitoring risk control model, the final decision output result indicates whether the user can repay on time or whether there is an overdue risk.
For another example, if the target risk control model is a post-loan collection risk control model and the users are overdue and have not yet repaid, it is possible to determine which users have a high probability of repayment, and then prioritize those users for collection.
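The threshold rule in the approval example above can be written as a one-line decision function. The threshold value of 600, the function name, and the treatment of a score exactly equal to the threshold are assumptions; the text only covers scores above and below the threshold.

```python
def pre_loan_decision(credit_score, threshold=600.0):
    """Approve when the score reaches the threshold, otherwise reject.
    The default threshold is a hypothetical value for illustration."""
    return "approve" if credit_score >= threshold else "reject"


print(pre_loan_decision(712.0))  # approve
print(pre_loan_decision(455.5))  # reject
```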
Based on the same inventive concept, please refer to fig. 2, which shows a schematic diagram of functional modules of a text list-based user feature extraction device 110 according to an embodiment of the present application, where the functional modules of the text list-based user feature extraction device 110 may be divided according to the above-mentioned method embodiment. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation. For example, in the case of dividing the respective function modules with the respective functions, the text list-based user feature extraction apparatus 110 shown in fig. 2 is only one apparatus schematic diagram. The text-list-based user feature extraction device 110 may include an obtaining module 111, a converting module 112, and a decision output module 113, and the functions of each functional module of the text-list-based user feature extraction device 110 are described in detail below.
An obtaining module 111, configured to obtain application installation list information of a user from a user terminal. It is understood that the acquisition module 111 may be used to perform the step S110 described above, and reference may be made to the details of the implementation of the acquisition module 111 regarding the step S110 described above.
The conversion module 112 is configured to convert the user installation list information into a text information list, and convert the text information list into topic feature vectors through an LDA topic model, where the topic feature vectors are feature vectors with a plurality of preset dimensions, and the feature vector with each preset feature dimension is used to represent a probability that a user belongs to a topic label corresponding to the preset feature dimension. It is understood that the conversion module 112 may be used to perform the step S120 described above, and reference may be made to the details of the implementation of the conversion module 112 regarding the step S120 described above.
The decision output module 113 is configured to input the extracted topic feature vector into a target risk control model and output the decision of the target risk control model. It is understood that the decision output module 113 may be used to perform the step S130, and reference may be made to the above description of the step S130 for implementation details of the decision output module 113.
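The division into three modules can be pictured as three pluggable callables chained in the order S110, S120, S130. The sketch below is purely illustrative: the class name, the field names, and the placeholder lambdas (including the fixed topic probabilities and the 0.5 cutoff) are hypothetical and stand in for the real acquisition, LDA conversion, and risk control logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class UserFeatureExtractionDevice:
    acquire: Callable[[str], List[str]]              # acquisition module 111 (S110)
    convert: Callable[[List[str]], Sequence[float]]  # conversion module 112 (S120)
    decide: Callable[[Sequence[float]], str]         # decision output module 113 (S130)

    def run(self, user_id: str) -> str:
        """Chain the three modules for one user."""
        packages = self.acquire(user_id)
        topic_vector = self.convert(packages)
        return self.decide(topic_vector)


# Placeholder implementations, for illustration only.
device = UserFeatureExtractionDevice(
    acquire=lambda user_id: ["com.sankuai.meituan.takeoutnew"],
    convert=lambda packages: [0.8, 0.2],  # stand-in topic probabilities
    decide=lambda vec: "approve" if vec[0] >= 0.5 else "reject",
)
result = device.run("user-1")
print(result)  # approve
```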
In one possible implementation, the conversion module 112 is specifically configured to:
converting the user installation list information into a text information list, and acquiring application program identification information corresponding to each installation package name in the text information list;
and determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into a pre-trained LDA topic model to obtain a feature vector with a plurality of preset dimensions as the topic feature vector.
In one possible implementation, the LDA topic model is trained by:
acquiring user installation list information of a plurality of users collected in advance, acquiring application program identification information corresponding to each installation package name from the user installation list information, and converting the user installation list information into an application program identification list;
segmenting the application program identification list into words to obtain a keyword list for each user;
traversing the keyword list of each user, and constructing a corresponding keyword dictionary from all the keywords that appear;
converting the keyword list of each user into a keyword vector according to the constructed keyword dictionary, forming training samples from the keyword vectors of all the users, and training on the training samples with a preset number of topics to obtain the LDA topic model.
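The training steps above can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: whitespace splitting stands in for word segmentation (Chinese text would need a real segmenter), the function names and hyperparameters `alpha` and `beta` are hypothetical, and the model is fitted with a plain collapsed Gibbs sampler, which is one common way to train LDA and is not stated in the application.

```python
import random
from typing import Dict, List, Tuple


def segment(app_identifications: List[str]) -> List[str]:
    """Split each application identification into lowercase keywords."""
    return [word.lower() for text in app_identifications for word in text.split()]


def build_dictionary(keyword_lists: List[List[str]]) -> Dict[str, int]:
    """Assign an index to every keyword in order of first appearance."""
    dictionary: Dict[str, int] = {}
    for keywords in keyword_lists:
        for keyword in keywords:
            if keyword not in dictionary:
                dictionary[keyword] = len(dictionary)
    return dictionary


def to_keyword_vector(keywords: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Count how often each dictionary keyword occurs for one user."""
    vector = [0] * len(dictionary)
    for keyword in keywords:
        vector[dictionary[keyword]] += 1
    return vector


def train_lda(vectors: List[List[int]], num_topics: int, iterations: int = 200,
              alpha: float = 0.1, beta: float = 0.01,
              seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Collapsed Gibbs sampling; returns (user-topic, topic-keyword) probabilities."""
    rng = random.Random(seed)
    vocab = len(vectors[0])
    # Expand each count vector back into a list of keyword indices (tokens).
    docs = [[w for w, n in enumerate(vec) for _ in range(n)] for vec in vectors]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments: List[List[int]] = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        current = []
        for w in doc:
            z = rng.randrange(num_topics)
            current.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(current)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]  # remove the token's current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [(doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                           / (topic_total[t] + vocab * beta) for t in range(num_topics)]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z  # record the resampled topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [[(c + beta) / (topic_total[t] + vocab * beta) for c in topic_word[t]]
           for t in range(num_topics)]
    return theta, phi


users = [["quick cash loan", "mobile bank"], ["puzzle game", "racing game"], ["cash loan", "bank"]]
keyword_lists = [segment(user) for user in users]
dictionary = build_dictionary(keyword_lists)
samples = [to_keyword_vector(keywords, dictionary) for keywords in keyword_lists]
theta, phi = train_lda(samples, num_topics=2, iterations=50)
```

Here `theta` holds one topic probability vector per training user and `phi` holds the per-topic keyword probabilities; in practice an established LDA library would normally replace the hand-written sampler.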
In one possible implementation, the conversion module 112 is further specifically configured to:
after traversing the keyword list of each user and constructing the corresponding keyword dictionary from all the keywords that appear, counting, across the keyword lists of all users, the occurrence frequency of each keyword in the constructed keyword dictionary;
and recoding the constructed keyword dictionary according to the occurrence frequency of each keyword to obtain a recoded keyword dictionary, and executing, based on the recoded keyword dictionary, the step of converting the keyword list of each user into a keyword vector.
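One possible reading of the recoding step is sketched below. The application does not specify the recoding scheme, so two assumptions are made: frequency means the total number of occurrences across all users' keyword lists, and recoding means reassigning indices so that more frequent keywords receive smaller indices (ties keep their original relative order). The function name `recode_dictionary` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def recode_dictionary(dictionary: Dict[str, int],
                      keyword_lists: List[List[str]]) -> Dict[str, int]:
    """Reassign keyword indices in descending order of total occurrence frequency."""
    frequency = Counter(keyword for keywords in keyword_lists for keyword in keywords)
    # Sort by frequency (high to low); break ties with the original index.
    ordered = sorted(dictionary, key=lambda kw: (-frequency[kw], dictionary[kw]))
    return {keyword: index for index, keyword in enumerate(ordered)}


original = {"game": 0, "loan": 1, "bank": 2}
lists = [["game", "loan"], ["loan", "bank"], ["loan", "bank"]]
recoded = recode_dictionary(original, lists)
# "loan" occurs 3 times, "bank" twice, "game" once:
# recoded == {"loan": 0, "bank": 1, "game": 2}
```

With indices ordered this way, rare keywords sit at the tail of the dictionary, which would make it straightforward to truncate them before building keyword vectors, should that be desired.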
In one possible implementation, the decision output module 113 is specifically configured to:
inputting the extracted topic feature vector into the target risk control model, and matching the topic feature vector against the risk control matching rules in the target risk control model to obtain a decision output result.
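The matching of a topic feature vector against rules in the target wind control (risk control) model could take many forms. The sketch below assumes the simplest one: each rule names a topic dimension, a probability threshold, and the decision to output when the threshold is reached, and the first matching rule wins. The `Rule` structure, the thresholds, and the decision labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Rule:
    topic_index: int   # which preset feature dimension the rule inspects
    threshold: float   # minimum topic probability that triggers the rule
    decision: str      # decision output when the rule matches


def decide(topic_vector: Sequence[float], rules: List[Rule], default: str = "pass") -> str:
    """Return the decision of the first rule whose threshold is met, else the default."""
    for rule in rules:
        if topic_vector[rule.topic_index] >= rule.threshold:
            return rule.decision
    return default


rules = [Rule(topic_index=0, threshold=0.6, decision="manual_review")]
high_risk = decide([0.7, 0.2, 0.1], rules)  # "manual_review"
low_risk = decide([0.1, 0.5, 0.4], rules)   # "pass"
```

Ordering the rules by priority keeps the behaviour predictable when several thresholds are met at once.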
Referring to fig. 3, which is a schematic block diagram illustrating a structure of a computer device 100 for performing the text list-based user feature extraction method according to an embodiment of the present application, the computer device 100 may include a text list-based user feature extraction apparatus 110, a machine-readable storage medium 120, and a processor 130.
In this embodiment, the machine-readable storage medium 120 and the processor 130 are both located in the computer device 100 and are provided separately. However, it should be understood that the machine-readable storage medium 120 may also be separate from the computer device 100 and accessible by the processor 130 through a bus interface. Alternatively, the machine-readable storage medium 120 may be integrated into the processor 130, for example as a cache and/or general-purpose registers.
The text list-based user feature extraction apparatus 110 may include software functional modules (e.g., the acquisition module 111, the conversion module 112, and the decision output module 113 shown in fig. 2) stored on the machine-readable storage medium 120, which, when executed by the processor 130, implement the text list-based user feature extraction method provided by the foregoing method embodiments.
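As a rough illustration of how the three software functional modules might be chained, the sketch below wires together three callables standing in for the acquisition module 111, the conversion module 112, and the decision output module 113. The class name, the callable signatures, and the toy lambdas are hypothetical and serve only to show the data flow.

```python
from typing import Callable, List, Sequence


class UserFeatureExtractionApparatus:
    """Chains three callables standing in for modules 111, 112 and 113."""

    def __init__(self,
                 acquire: Callable[[str], List[str]],
                 convert: Callable[[List[str]], Sequence[float]],
                 decide: Callable[[Sequence[float]], str]) -> None:
        self.acquire = acquire  # acquisition module: user id -> installation list
        self.convert = convert  # conversion module: installation list -> topic vector
        self.decide = decide    # decision output module: topic vector -> decision

    def run(self, user_id: str) -> str:
        install_list = self.acquire(user_id)
        topic_vector = self.convert(install_list)
        return self.decide(topic_vector)


apparatus = UserFeatureExtractionApparatus(
    acquire=lambda user_id: ["com.example.loan"],           # toy installation list
    convert=lambda packages: [0.8, 0.2] if packages else [0.5, 0.5],
    decide=lambda vector: "manual_review" if vector[0] >= 0.6 else "pass",
)
result = apparatus.run("user-1")  # "manual_review"
```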
Since the computer device 100 provided in this embodiment of the present application is another implementation form of the foregoing method embodiment and can be used to execute the text list-based user feature extraction method provided therein, the technical effects it achieves are the same as those of the method embodiment and are not repeated here.
The embodiments described above are only some, but not all, of the embodiments of the present application. The components of the embodiments of the present application, as generally described and illustrated in the figures, may be arranged and designed in a wide variety of different configurations. Accordingly, the detailed description of the embodiments of the present application provided in the drawings is not intended to limit the scope of protection of the application, but is merely representative of selected embodiments of the application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims. Moreover, all other embodiments that can be made by a person skilled in the art, based on the embodiments of the present application, without making any inventive effort, shall fall within the scope of protection of the present application.

Claims (6)

1. A text list-based user feature extraction method, for application to a computer device, the method comprising:
acquiring application installation list information of a user from a user terminal;
converting the user installation list information into a text information list, and converting the text information list into topic feature vectors through an LDA topic model, wherein the topic feature vectors are feature vectors with a plurality of preset dimensions, and the feature vector with each preset feature dimension is used for representing the probability that the user belongs to a topic label corresponding to the preset feature dimension;
inputting the extracted topic feature vector into a target risk control model, and obtaining a decision output from the target risk control model;
the step of converting the user installation list information into a text information list and converting the text information list into a topic feature vector through an LDA topic model comprises the following steps:
converting the user installation list information into a text information list, and acquiring application program identification information corresponding to each installation package name in the text information list;
determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into a pre-trained LDA topic model to obtain a plurality of feature vectors with preset dimensions as the topic feature vectors;
the LDA topic model is trained by the following modes:
acquiring user installation list information of a plurality of users collected in advance, acquiring application program identification information corresponding to each installation package name from the user installation list information, and converting the user installation list information into an application program identification list;
segmenting the application program identification list to obtain a keyword list of each user;
traversing the keyword list of each user, and constructing a corresponding keyword dictionary by using all the keywords;
and converting the keyword list of each user into keyword vectors according to the constructed keyword dictionary, forming training samples by the keyword vectors of all the users, and training according to the preset topic number and the training samples to obtain an LDA topic model.
2. The text list-based user feature extraction method of claim 1, further comprising:
counting the occurrence frequency of each keyword in the constructed keyword dictionary in the keyword list of all users;
and recoding the constructed keyword dictionary according to the occurrence frequency of each keyword in the keyword list of all users to obtain a recoded keyword dictionary, and executing the step of converting the keyword list of each user into a keyword vector according to the constructed keyword dictionary based on the recoded keyword dictionary.
3. The text list-based user feature extraction method according to claim 1, wherein the step of inputting the extracted topic feature vector into a target risk control model and obtaining a decision output from the target risk control model comprises:
inputting the extracted topic feature vector into the target risk control model, and matching the topic feature vector against the risk control matching rules in the target risk control model to obtain a decision output result.
4. A text list-based user feature extraction apparatus for use with a computer device, the apparatus comprising:
the acquisition module is used for acquiring application installation list information of a user from the user terminal;
the conversion module is used for converting the user installation list information into a text information list and converting the text information list into topic feature vectors through an LDA topic model, wherein the topic feature vectors are feature vectors with a plurality of preset dimensions, and the feature vector with each preset feature dimension is used for representing the probability that the user belongs to a topic label corresponding to the preset feature dimension;
the decision output module is used for inputting the extracted topic feature vector into a target risk control model and obtaining a decision output from the target risk control model;
wherein, the conversion module is specifically used for:
converting the user installation list information into a text information list, and acquiring application program identification information corresponding to each installation package name in the text information list;
determining a corresponding keyword vector according to the application program identification information corresponding to each installation package name, and inputting the keyword vector into a pre-trained LDA topic model to obtain a plurality of feature vectors with preset dimensions as the topic feature vectors;
the LDA topic model is trained by the following modes:
acquiring user installation list information of a plurality of users collected in advance, acquiring application program identification information corresponding to each installation package name from the user installation list information, and converting the user installation list information into an application program identification list;
segmenting the application program identification list to obtain a keyword list of each user;
traversing the keyword list of each user, and constructing a corresponding keyword dictionary by using all the keywords;
and converting the keyword list of each user into keyword vectors according to the constructed keyword dictionary, forming training samples by the keyword vectors of all the users, and training according to the preset topic number and the training samples to obtain an LDA topic model.
5. The text list-based user feature extraction apparatus of claim 4, wherein the conversion module is further specifically configured to:
after traversing the keyword list of each user, constructing a corresponding keyword dictionary by using all the keywords which appear, and counting the occurrence frequency of each keyword in the constructed keyword dictionary in the keyword list of all the users;
and recoding the constructed keyword dictionary according to the occurrence frequency of each keyword in the keyword list of all users to obtain a recoded keyword dictionary, and executing the step of converting the keyword list of each user into a keyword vector according to the constructed keyword dictionary based on the recoded keyword dictionary.
6. The text-list-based user feature extraction apparatus of claim 4, wherein the decision output module is specifically configured to:
inputting the extracted topic feature vector into the target risk control model, and matching the topic feature vector against the risk control matching rules in the target risk control model to obtain a decision output result.
CN202010889481.1A 2020-08-28 2020-08-28 Text list-based user feature extraction method and device Active CN112036572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010889481.1A CN112036572B (en) 2020-08-28 2020-08-28 Text list-based user feature extraction method and device

Publications (2)

Publication Number Publication Date
CN112036572A CN112036572A (en) 2020-12-04
CN112036572B true CN112036572B (en) 2024-03-12

Family

ID=73587707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010889481.1A Active CN112036572B (en) 2020-08-28 2020-08-28 Text list-based user feature extraction method and device

Country Status (1)

Country Link
CN (1) CN112036572B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant