CN111949867A - Cross-APP user behavior analysis model training method, analysis method and related equipment - Google Patents

Info

Publication number
CN111949867A
Authority
CN
China
Prior art keywords
buried point
user behavior
target
point data
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010798039.8A
Other languages
Chinese (zh)
Inventor
杜宇衡
萧梓健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010798039.8A
Publication of CN111949867A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0631 Item recommendations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • Computational Linguistics (AREA)
  • Finance (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to artificial intelligence and provides a cross-APP user behavior analysis model training method, a user behavior analysis method, computer equipment and a storage medium. The method comprises the following steps: acquiring a plurality of buried point data of a plurality of users in a plurality of APPs, and sorting the buried point data corresponding to each user to obtain a first buried point data sequence; encoding the first buried point data sequence of each user into a first JSON string; segmenting each first JSON string into a plurality of text segments; calculating a TF-IDF value of each text segment, and constructing a user behavior vector according to the TF-IDF values; and training a lightGBM network based on the user behavior vectors of the plurality of users to obtain a cross-APP user behavior analysis model. The application integrates buried point data across APPs for the first time and encodes the cross-APP buried point data sequence into a JSON string, which improves both the efficiency of model training and the performance of model analysis. In addition, the application also relates to blockchain technology: the cross-APP user behavior analysis model can be stored in a blockchain.

Description

Cross-APP user behavior analysis model training method, analysis method and related equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a cross-APP-based user behavior analysis model training method, a cross-APP-based user behavior analysis method, computer equipment and a storage medium.
Background
In common big data mining applications, a user's click behavior in an APP has become an important information source for user profiling, consumption prediction and commodity recommendation. A buried point is a statistical tool attached to a page or button of an APP: whenever the user clicks a button or stays on a page, the buried point is triggered and the APP automatically uploads the user's click or browsing information in that APP. By processing the buried point sequence of a user, the behavior habits of the user can be analyzed, and various commercial applications can be built on them.
In the process of implementing the invention, the inventors found that current approaches to processing buried point sequences consider only a single APP, and that processing buried point sequences across APPs has been little explored. The volume of buried point data in a single APP is small, so the behavior habits of a user cannot be analyzed accurately.
Therefore, it is necessary to provide a method for processing a buried point sequence across APPs.
Disclosure of Invention
In view of the above, it is necessary to provide a cross-APP user behavior analysis model training method, a cross-APP user behavior analysis method, a computer device, and a storage medium, which integrate cross-APP buried point data for the first time and encode the cross-APP buried point data sequence into a JSON string, thereby improving both the efficiency of model training and the performance of model analysis.
A first aspect of the invention provides a cross-APP user behavior analysis model training method, which comprises the following steps:
acquiring a plurality of buried point data of a plurality of users in a plurality of APPs, and sequencing the plurality of buried point data corresponding to each user to obtain a first buried point data sequence;
coding a first buried point data sequence of each user into a first JSON string;
segmenting each first JSON string into a plurality of text segments;
calculating a TF-IDF value of each text segment, and constructing a user behavior vector according to the TF-IDF value;
and training the lightGBM network based on the user behavior vectors of the plurality of users to obtain a cross-APP user behavior analysis model.
According to an alternative embodiment of the present invention, said encoding the first buried point data sequence for each user into a first JSON string comprises:
acquiring a plurality of buried point data in the first buried point data sequence, wherein the buried point data comprises: user identification, a buried point timestamp, a prefix and buried point content;
coding each buried point timestamp and the corresponding prefix and buried point content of each user into a sub JSON string;
and coding the user identification of each user and the corresponding plurality of sub JSON strings into a nested first JSON string.
According to an alternative embodiment of the present invention, said segmenting each first JSON string into a plurality of text segments comprises:
analyzing the first JSON string to obtain a plurality of buried point contents;
sliding from the first buried point content in the plurality of buried point contents to the last buried point content in the plurality of buried point contents by adopting a preset sliding window;
and taking the corresponding buried point content of each sliding window as a text fragment.
According to an alternative embodiment of the invention, said calculating the TF-IDF value of each text segment comprises:
calculating the term frequency of each text segment in all the text segments;
acquiring the inverse document frequency of each text segment;
and calculating the product of the term frequency and the inverse document frequency to obtain the TF-IDF value of the text segment.
According to an alternative embodiment of the present invention, said constructing a user behavior vector based on said TF-IDF values comprises:
screening a plurality of target text segments from the plurality of text segments;
and constructing a user behavior vector according to the TF-IDF values corresponding to the target text segments.
According to an alternative embodiment of the present invention, the constructing the user behavior vector according to the TF-IDF values corresponding to the target text segments includes:
initializing an empty vector with a preset length;
sorting the TF-IDF values corresponding to the target text segments in descending order;
and sequentially filling the sequenced TF-IDF values in the corresponding positions of the empty vector to obtain the user behavior vector.
A second aspect of the present invention provides a cross-APP user behavior analysis method, including:
acquiring a second buried point data sequence of a target user in a plurality of APPs;
encoding the second buried point data sequence into a second JSON string;
and inputting the second JSON string into a cross-APP user behavior analysis model for analysis to obtain an analysis result, wherein the user behavior analysis model is obtained by utilizing the cross-APP user behavior analysis model training method.
According to an alternative embodiment of the present invention, said encoding said second buried point data sequence into a second JSON string comprises:
acquiring a plurality of target buried point data in the second buried point data sequence, wherein the target buried point data comprises: target user identification, target buried point timestamp, target prefix and target buried point content;
coding each target buried point timestamp, the corresponding target prefix and the target buried point content into a target sub JSON string;
and coding the target user identification and the corresponding multiple target sub JSON strings into a nested second JSON string.
A third aspect of the present invention provides a computer apparatus comprising:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the cross-APP user behavior analysis model training method or the cross-APP user behavior analysis method.
A fourth aspect of the present invention provides a computer-readable storage medium, where at least one instruction is stored, and the at least one instruction is executed by a processor in a computer device to implement the cross-APP user behavior analysis model training method or the cross-APP user behavior analysis method.
The invention integrates buried point data across APPs for the first time and obtains the buried point data sequence of each user in a plurality of APPs. Secondly, the cross-APP buried point data sequence is encoded into a nested JSON string, which matches the data format required for building the model, is highly portable, can be read conveniently and quickly, and improves model training efficiency. In addition, the JSON string is segmented into a plurality of text segments, so that a natural language processing model can be used for learning and training. The trained cross-APP user behavior analysis model can be used for various applications: for example, when used for user performance assessment it improves assessment efficiency, and when used for commodity recommendation it improves recommendation quality.
Drawings
FIG. 1 is a flowchart of a cross-APP user behavior analysis model training method according to a preferred embodiment of the present invention.
FIG. 2 is a flowchart of a cross-APP user behavior analysis method according to a preferred embodiment of the present invention.
FIG. 3 is a functional block diagram of a cross-APP user behavior analysis model training apparatus according to a preferred embodiment of the present invention.
FIG. 4 is a functional block diagram of a cross-APP user behavior analysis apparatus according to a preferred embodiment of the present invention.
Fig. 5 is a schematic diagram of the structure of the computer device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart illustrating a method for training a cross-APP user behavior analysis model according to a preferred embodiment of the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
S11, obtaining a plurality of buried point data of a plurality of users in a plurality of APPs, and sequencing the plurality of buried point data corresponding to each user to obtain a first buried point data sequence.
A plurality of APPs are installed in the computer device, and a recorder is embedded in advance in any page or any button of the APPs; the embedded recorder is called a buried point. A buried point is thus a statistical tool attached to a page or button in an APP. When a user clicks a button or stays on a page, the buried point corresponding to that button or page is triggered, and the triggered buried point automatically reports buried point data.
The computer device acquires the plurality of first buried point data of each user reported by each APP within a preset time period, and sorts the first buried point data of the same user reported by all APPs in chronological order to obtain a first buried point data sequence. Each user corresponds to one first buried point data sequence.
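The grouping and sorting in step S11 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the field names `user_id`, `timestamp`, `prefix` and `content`, the function name `build_sequences`, and the sample records are all hypothetical.

```python
from collections import defaultdict

def build_sequences(events):
    """Group buried point records by user and sort each group by timestamp."""
    per_user = defaultdict(list)
    for event in events:
        per_user[event["user_id"]].append(event)
    # One chronologically ordered sequence per user, across all APPs.
    return {
        user_id: sorted(user_events, key=lambda e: e["timestamp"])
        for user_id, user_events in per_user.items()
    }

# Hypothetical records reported by two APPs (prefixes "A" and "B").
events = [
    {"user_id": "u1", "timestamp": 1596700020, "prefix": "B", "content": "login"},
    {"user_id": "u1", "timestamp": 1596700000, "prefix": "A", "content": "open APP"},
    {"user_id": "u2", "timestamp": 1596700010, "prefix": "A", "content": "open APP"},
    {"user_id": "u1", "timestamp": 1596700050, "prefix": "A", "content": "circle"},
]
sequences = build_sequences(events)
```

Here `sequences["u1"]` holds the three records of user `u1` from both APPs, ordered by timestamp.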
S12, the first buried point data sequence of each user is coded into a first JSON string.
JSON stands for JavaScript Object Notation, a lightweight text format for data exchange. JSON is language independent, self-describing, and easy to understand. Because models based on buried point data sequences are generally written in Python, encoding the first buried point data sequence into the first JSON string facilitates cross-platform processing and fast reading.
In an alternative embodiment, the encoding the first buried point data sequence of each user into the first JSON string comprises:
acquiring a plurality of buried point data in the first buried point data sequence, wherein the buried point data comprises: user identification, a buried point timestamp, a prefix and buried point content;
coding each buried point timestamp and the corresponding prefix and buried point content of each user into a sub JSON string;
and coding the user identification of each user and the corresponding plurality of sub JSON strings into a nested first JSON string.
The user identification is used for representing the identity information of the user, and different users have different identifications. The buried point timestamp is used for representing the time point when the buried point is triggered. The prefix is used to distinguish the source of the buried point content.
One operation behavior of a user corresponds to one buried point data, each buried point data is used as a sub JSON string, and each user has only one nested JSON string. Illustratively, the nested JSON string is { user ID, { click time, prefix-buried point content } }.
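One possible reading of this nested layout is sketched below. The exact JSON structure is not specified in the text, so the choice of a list of sub-objects keyed by click time, the field names `timestamp`, `prefix` and `content`, and the function name `encode_user_sequence` are assumptions made for illustration.

```python
import json

def encode_user_sequence(user_id, sequence):
    """Encode one user's time-ordered buried point data as a nested JSON string."""
    # One sub-object per buried point: {click time: "prefix-buried point content"}.
    subs = [
        {str(event["timestamp"]): f"{event['prefix']}-{event['content']}"}
        for event in sequence
    ]
    # The outer object maps the user identification to its sub-objects.
    return json.dumps({user_id: subs}, ensure_ascii=False)

# Hypothetical, already time-ordered sequence for one user.
sequence = [
    {"timestamp": 1596700000, "prefix": "A", "content": "open APP"},
    {"timestamp": 1596700020, "prefix": "B", "content": "login"},
]
first_json = encode_user_sequence("u1", sequence)
```

A list is used for the sub-objects in this sketch so that the chronological order is preserved and two buried points with the same timestamp do not overwrite each other.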
In the prior art, a large amount of buried point data must be cleaned before a model is trained on it, and the cleaning process is the most time-consuming step. In this method, the buried point data sequence of the user is encoded into a JSON string, which converts the buried point data into text data that natural language processing techniques can recognize and learn from. JSON strings are also more convenient to read, which improves reading efficiency and, in turn, model training efficiency.
In addition, the chronological order reflects the order of the user's click behaviors. Encoding the buried point data sequence of the user into JSON in chronological order therefore associates each operation with the operations before and after it, forming context information, providing more data for model training, and improving the performance of the model.
S13, segmenting each first JSON string into a plurality of text segments.
One JSON string of a user represents one text, and the computer device may employ an N-GRAM model to segment the first JSON string of each user into a plurality of text segments.
The N-GRAM model is a language model (LM), which is a probability-based discriminant model whose input is a sentence (an ordered sequence of words) and whose output is the probability of that sentence, i.e., the joint probability of its words. The term N-gram also refers to a sequence of N consecutive words, which need not differ from one another.
In an alternative embodiment, N = 2.
In an optional embodiment, the segmenting each of the first JSON strings into a plurality of text segments includes:
analyzing the first JSON string to obtain a plurality of buried point contents;
sliding from the first buried point content in the plurality of buried point contents to the last buried point content in the plurality of buried point contents by adopting a preset sliding window;
and taking the corresponding buried point content of each sliding window as a text fragment.
Wherein the size of the sliding window is N.
The computer device reads the first JSON string and parses it according to a preset format to obtain a plurality of buried point contents.
Illustratively, assume that the plurality of buried point contents are: open APP, login, circle, health, finance. Sliding a window of size 2 over these buried point contents yields four text segments: "open APP login", "login circle", "circle health" and "health finance".
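The sliding-window segmentation of this example can be sketched as follows. The function name `sliding_window_segments` and the choice to join the contents of one window with a space are assumptions made for illustration.

```python
def sliding_window_segments(contents, n=2):
    """Slide a window of size n over the buried point contents.

    Each window position yields one text segment; a sequence shorter
    than n yields no segments.
    """
    return [" ".join(contents[i:i + n]) for i in range(len(contents) - n + 1)]

contents = ["open APP", "login", "circle", "health", "finance"]
segments = sliding_window_segments(contents, n=2)
# segments == ["open APP login", "login circle", "circle health", "health finance"]
```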
S14, calculating the TF-IDF value of each text segment, and constructing a user behavior vector according to the TF-IDF value.
The TF-IDF value of each text segment may be calculated using a TF-IDF model. The TF-IDF model is a statistical method for evaluating the importance of text segments in the entire corpus. Calculating the TF-IDF value of a text segment makes it possible to tell whether the segment is important, which facilitates filtering the text segments.
In an alternative embodiment, the calculating the TF-IDF value for each text segment includes:
calculating the term frequency of each text segment in all the text segments;
acquiring the inverse document frequency of each text segment;
and calculating the product of the term frequency and the inverse document frequency to obtain the TF-IDF value of the text segment.
TF refers to the term frequency, i.e., the frequency with which a given text segment appears in the corpus of all text segments. TF is a normalization of the term count that prevents a bias towards long texts (the same text segment is likely to have a higher count in a long text than in a short one, regardless of its importance). IDF refers to the inverse document frequency. The IDF of a given text segment can be obtained by dividing the total number of texts by the number of texts containing the text segment and taking the logarithm of the quotient.
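A minimal sketch of this calculation is given below. It assumes that the list of text segments of each user is treated as one text (document), that TF is the segment count divided by the number of segments of that user, and that IDF is the unsmoothed natural logarithm of the ratio of texts; the function name `tf_idf` and the sample data are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a user ID to that user's list of text segments.

    Returns a mapping user ID -> {text segment: TF-IDF value}.
    """
    n_docs = len(docs)
    # Number of texts (users) in which each segment occurs at least once.
    doc_freq = Counter()
    for segments in docs.values():
        doc_freq.update(set(segments))
    scores = {}
    for user_id, segments in docs.items():
        counts = Counter(segments)
        total = len(segments)
        scores[user_id] = {
            seg: (count / total) * math.log(n_docs / doc_freq[seg])
            for seg, count in counts.items()
        }
    return scores

docs = {
    "u1": ["open APP login", "login circle", "circle health"],
    "u2": ["open APP login", "login finance"],
}
scores = tf_idf(docs)
```

In this sample, "open APP login" occurs in both texts, so its IDF is log(2/2) = 0 and its TF-IDF value is 0, whereas "login circle" occurs only for `u1` and receives a positive value.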
In an alternative embodiment, said constructing a user behavior vector based on said TF-IDF values comprises:
screening a plurality of target text segments from the plurality of text segments;
and constructing a user behavior vector according to the TF-IDF values corresponding to the target text segments.
The computer device presets a threshold. When the TF-IDF value of a text segment is smaller than the threshold, the text segment is a common one that carries no practical significance for model training and can be eliminated. When the TF-IDF value of a text segment is greater than or equal to the threshold, the text segment has a specific meaning, plays a critical role in model training, and needs to be retained.
The computer device screens out, from the plurality of text segments, the text segments whose TF-IDF values are greater than or equal to the preset threshold as the target text segments.
In an optional embodiment, the constructing the user behavior vector according to the TF-IDF values corresponding to the target text segments includes:
initializing an empty vector with a preset length;
sorting the TF-IDF values corresponding to the target text segments in descending order;
and sequentially filling the sequenced TF-IDF values in the corresponding positions of the empty vector to obtain the user behavior vector.
In a specific implementation, a new vector of length x is created. The 1st dimension of the vector corresponds to the text segment with the highest TF-IDF value and is filled with that segment's TF-IDF value; the 2nd dimension corresponds to the text segment with the second-highest TF-IDF value and is filled with that segment's TF-IDF value; and so on for all dimensions.
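The screening and filling steps can be sketched as follows. The text does not say how the vector is completed when there are fewer or more target text segments than dimensions, so padding with zeros and truncating to the preset length are assumptions of this sketch, as are the function name `build_behavior_vector` and the sample values.

```python
def build_behavior_vector(tfidf_scores, threshold, length):
    """Build a fixed-length user behavior vector from TF-IDF values.

    tfidf_scores maps each text segment of one user to its TF-IDF value.
    """
    # Keep only target text segments: TF-IDF value >= threshold.
    kept = [value for value in tfidf_scores.values() if value >= threshold]
    # Sort the retained values in descending order.
    kept.sort(reverse=True)
    # Fill the sorted values into the vector; unused positions stay 0.0.
    vector = [0.0] * length
    for i, value in enumerate(kept[:length]):
        vector[i] = value
    return vector

# Hypothetical TF-IDF values for one user.
scores = {
    "login circle": 0.23,
    "circle health": 0.23,
    "open APP login": 0.0,
    "health finance": 0.41,
}
vector = build_behavior_vector(scores, threshold=0.1, length=5)
# vector == [0.41, 0.23, 0.23, 0.0, 0.0]
```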
By using the N-GRAM model and the TF-IDF model, a buried point data sequence of any user can be decomposed into text segments with a plurality of lengths, and a user behavior vector of the user is obtained.
S15, training the lightGBM network based on the user behavior vectors of the users to obtain a cross-APP user behavior analysis model.
In this embodiment, each piece of buried point data can be regarded as a sentence, and the buried point sequence is equivalent to a text. The embedded point data sequence of the user crossing the APP is coded into a JSON string and is segmented into a plurality of text segments, and the text segments can be trained by adopting a natural language processing model, so that the behavior habit of the user can be analyzed based on the embedded point data sequence of the user crossing the APP.
The computer device initializes the network parameters of the lightGBM network in advance and trains the lightGBM network using the user behavior vectors of the plurality of users as the data set. When training is finished, the network parameters of the lightGBM network have been updated, and the lightGBM network with the updated parameters is used as the cross-APP user behavior analysis model. The training process itself belongs to the prior art and is not elaborated here.
In summary, the invention integrates buried point data across APPs for the first time and obtains the buried point data sequence of each user in a plurality of APPs. Secondly, the cross-APP buried point data sequence is encoded into a nested JSON string, which matches the data format required for building the model, is highly portable, can be read conveniently and quickly, and improves model training efficiency. In addition, the JSON string is segmented into a plurality of text segments, so that a natural language processing model can be used for learning and training. The trained cross-APP user behavior analysis model can be used for various applications: for example, when used for user performance assessment it improves assessment efficiency, and when used for commodity recommendation it improves recommendation quality.
It should be noted that the present invention can also be applied to intelligent government affairs to promote the construction of intelligent cities. In addition, in order to ensure the privacy and security of the cross-APP user behavior analysis model, the cross-APP user behavior analysis model may be stored in one node of the blockchain.
Fig. 2 is a flowchart of a cross-APP user behavior analysis method according to a preferred embodiment of the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
S21, acquiring a second buried point data sequence of the target user in the plurality of APPs.
The computer device obtains the target user identification of the target user, acquires a plurality of target buried point data of the target user in the plurality of APPs according to the target user identification, and sorts the target buried point data in chronological order to obtain the second buried point data sequence of the target user.
S22, encoding the second buried point data sequence into a second JSON string.
In an optional embodiment, the encoding the second buried point data sequence into a second JSON string comprises:
acquiring a plurality of target buried point data in the second buried point data sequence, wherein the target buried point data comprises: target user identification, target buried point timestamp, target prefix and target buried point content;
coding each target buried point timestamp, the corresponding target prefix and the target buried point content into a target sub JSON string;
and coding the target user identification and the corresponding multiple target sub JSON strings into a nested second JSON string.
Since the training data of the cross-APP user behavior analysis model takes the form of the nested first JSON string, the second buried point data sequence of the target user needs to be encoded into a nested second JSON string in the same way, so that the behavior habits of the target user can be analyzed properly.
S23, inputting the second JSON string into the cross-APP user behavior analysis model for analysis to obtain an analysis result.
After reading the second JSON string, the computer device inputs it into the cross-APP user behavior analysis model, and the behavior habits of the user can be determined from the analysis result. The training process of the cross-APP user behavior analysis model is as described in the first embodiment and its related description.
Illustratively, the analysis result includes a first identifier and a second identifier. The first identifier indicates that the target user is a low performance user, and the second identifier indicates that the target user is a non-low performance user.
FIG. 3 is a functional block diagram of a cross-APP user behavior analysis model training apparatus according to a preferred embodiment of the present invention.
The cross-APP user behavior analysis model training device 30 includes: the device comprises a first acquisition module 301, a first coding module 302, a JSON string segmentation module 303, a TF-IDF calculation module 304, a vector construction module 305 and a model training module 306. The module referred to herein is a series of computer program segments stored in a memory that can be executed by a processor and that can perform a fixed function. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The first obtaining module 301 is configured to obtain multiple buried point data of multiple users in multiple APPs, and sort the multiple buried point data corresponding to each user to obtain a first buried point data sequence.
A plurality of APPs are installed in the computer device, and a recorder is embedded in advance in any page or any button of the APPs; the embedded recorder is called a buried point. A buried point is thus a statistical tool attached to a page or button in an APP. When a user clicks a button or stays on a page, the buried point corresponding to that button or page is triggered, and the triggered buried point automatically reports buried point data.
The computer device acquires the plurality of first buried point data of each user reported by each APP within a preset time period, and sorts the first buried point data of the same user reported by all APPs in chronological order to obtain a first buried point data sequence. Each user corresponds to one first buried point data sequence.
The first encoding module 302 is configured to encode the first buried point data sequence of each user into a first JSON string.
JSON stands for JavaScript Object Notation, a lightweight text format for data exchange. JSON is language independent, self-describing, and easy to understand. Because models based on buried point data sequences are generally written in Python, encoding the first buried point data sequence into the first JSON string facilitates cross-platform processing and fast reading.
In an alternative embodiment, the first encoding module 302 encoding the first buried point data sequence for each user into a first JSON string includes:
acquiring a plurality of buried point data in the first buried point data sequence, wherein the buried point data comprises: user identification, a buried point timestamp, a prefix and buried point content;
coding each buried point timestamp and the corresponding prefix and buried point content of each user into a sub JSON string;
and coding the user identification of each user and the corresponding plurality of sub JSON strings into a nested first JSON string.
The user identification is used for representing the identity information of the user, and different users have different identifications. The buried point timestamp is used for representing the time point when the buried point is triggered. The prefix is used to distinguish the source of the buried point content.
Each operation behavior of a user corresponds to one piece of buried point data, each piece of buried point data is encoded as one sub JSON string, and each user has exactly one nested JSON string. Illustratively, the nested JSON string has the form { user ID, { click time, prefix-buried point content } }.
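One possible encoding of the nested JSON string is sketched below. The exact layout is not specified beyond the illustrative form above, so the choice of a list of single-key objects (timestamp mapped to "prefix-content") is an assumption, as are the field names and the helper `encode_user_sequence`.

```python
import json


def encode_user_sequence(user_id, sequence):
    """Encode one user's ordered buried point data as a nested JSON string.

    Each item becomes a sub JSON object {timestamp: "prefix-content"};
    a list keeps the chronological order and tolerates equal timestamps.
    """
    subs = [
        {str(item["timestamp"]): f'{item["prefix"]}-{item["content"]}'}
        for item in sequence
    ]
    return json.dumps({user_id: subs}, ensure_ascii=False)


sequence = [
    {"timestamp": 1597020000, "prefix": "appB", "content": "open APP"},
    {"timestamp": 1597020005, "prefix": "appA", "content": "login"},
]
first_json = encode_user_sequence("u1", sequence)
```

Decoding `first_json` with `json.loads` recovers the user identification as the outer key and the ordered sub JSON objects as its value.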
In the prior art, a large amount of buried point data needs to be cleaned before model training is performed based on the buried point data, and the cleaning process is the most time-consuming step. In the present method, the buried point data sequence of the user is encoded into a JSON string, which converts the buried point data into text data that can be recognized and learned by natural language processing models. The JSON string is also more convenient to read, which improves reading efficiency and, in turn, the efficiency of model training.
In addition, the chronological order represents the order of the user's click behaviors. Encoding the buried point data sequence of the user into JSON in chronological order therefore associates the user's preceding and following operation behaviors and forms context information, which provides more data for training the model and improves the performance of the model.
The JSON string segmenting module 303 is configured to segment each first JSON string into a plurality of text segments.
One JSON string of a user represents one text, and the computer device may employ an N-GRAM model to segment the first JSON string of each user into a plurality of text segments.
The N-GRAM model is a language model (LM), a probability-based model whose input is a sentence (an ordered sequence of words) and whose output is the probability of that sentence, i.e., the joint probability of its words. An N-gram also refers to a sequence of N consecutive words in order; the words need not differ from one another.
In an alternative embodiment, N = 2.
In an alternative embodiment, the JSON string segmenting module 303 segmenting each first JSON string into a plurality of text segments includes:
analyzing the first JSON string to obtain a plurality of buried point contents;
sliding from the first buried point content in the plurality of buried point contents to the last buried point content in the plurality of buried point contents by adopting a preset sliding window;
and taking the corresponding buried point content of each sliding window as a text fragment.
Wherein the size of the sliding window is N.
The computer equipment reads the first JSON string and analyzes the first JSON string according to a preset format to obtain a plurality of buried point contents.
Illustratively, assume that the plurality of buried point contents includes: open APP, login, circle, health, finance. Sliding a window of size 2 over these buried point contents yields the text segments "open APP login", "login circle", "circle health", and "health finance".
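The sliding-window segmentation in this example can be sketched as follows. The helper name `sliding_window_segments` and the choice of joining the contents in a window with a single space are assumptions made for illustration.

```python
def sliding_window_segments(contents, n=2):
    """Slide a window of size n over the buried point contents.

    Returns one text segment per window position, from the first
    content to the last; an input shorter than n yields no segments.
    """
    if n <= 0:
        raise ValueError("window size must be positive")
    return [" ".join(contents[i:i + n]) for i in range(len(contents) - n + 1)]


contents = ["open APP", "login", "circle", "health", "finance"]
segments = sliding_window_segments(contents, n=2)
# segments == ['open APP login', 'login circle', 'circle health', 'health finance']
```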
The TF-IDF calculation module 304 is configured to calculate a TF-IDF value for each text segment.
The TF-IDF value for each text segment may be calculated using a TF-IDF model. The TF-IDF model is a statistical method for evaluating the importance of a text segment in the entire corpus. The TF-IDF value of a text segment indicates whether it is an important text segment, which facilitates filtering the text segments.
In an alternative embodiment, the TF-IDF calculation module 304 calculating the TF-IDF value for each text segment includes:
calculating the word frequency of each text segment in all the text segments;
acquiring the inverse document frequency of each text fragment;
and calculating the product of the word frequency and the inverse document frequency to obtain the TF-IDF value of the text segment.
TF refers to term frequency, the frequency with which a given text segment appears in the corpus of all text segments. TF normalizes the raw term count to prevent a bias towards long texts (the same segment may have a higher count in a long text than in a short one, regardless of its importance). IDF refers to inverse document frequency; the IDF of a given text segment can be obtained by dividing the total number of texts by the number of texts that contain the text segment and taking the logarithm of the quotient.
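The TF-IDF calculation can be sketched as follows. The sketch assumes that each user's list of text segments is treated as one text (document), that TF is the segment count divided by the number of segments of that user, and that IDF is the natural logarithm of the number of users divided by the number of users whose segments contain the segment. These choices, and the helper name `tf_idf`, are assumptions; smoothing variants are deliberately omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF per user and text segment.

    `docs` maps a user ID to that user's list of text segments.
    """
    n_docs = len(docs)
    # Number of users whose segment list contains each segment.
    doc_freq = Counter()
    for segments in docs.values():
        doc_freq.update(set(segments))

    scores = {}
    for user, segments in docs.items():
        counts = Counter(segments)
        total = len(segments)
        scores[user] = {
            seg: (cnt / total) * math.log(n_docs / doc_freq[seg])
            for seg, cnt in counts.items()
        }
    return scores


docs = {
    "u1": ["open APP login", "login circle", "circle health"],
    "u2": ["open APP login", "login finance"],
}
scores = tf_idf(docs)
```

With this definition, a segment that every user produces (here "open APP login") receives a TF-IDF value of 0, while a segment specific to one user receives a positive value.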
The vector construction module 305 is configured to construct a user behavior vector according to the TF-IDF value.
In an alternative embodiment, the vector construction module 305 constructing the user behavior vector according to the TF-IDF value comprises:
screening a plurality of target text segments from the plurality of text segments;
and constructing a user behavior vector according to the TF-IDF values corresponding to the target text segments.
The computer device presets a threshold value, when the TF-IDF value of a certain text segment is smaller than the threshold value, the text segment is indicated to belong to common words, no practical significance is brought to model training, and the text segment can be eliminated. When the TF-IDF value of a certain text segment is greater than or equal to the threshold value, the text segment is indicated to have a specific meaning, which plays a critical role in the training of the model and needs to be preserved.
The computer device selects, from the plurality of text segments, the text segments whose TF-IDF values are greater than or equal to the preset threshold as the target text segments.
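The threshold rule stated above, under which segments below the threshold are eliminated and segments at or above it are preserved, can be sketched as follows. The helper name `select_target_segments` and the example threshold are hypothetical.

```python
def select_target_segments(tfidf_by_segment, threshold):
    """Keep only segments whose TF-IDF value is at least the threshold."""
    return {
        seg: value
        for seg, value in tfidf_by_segment.items()
        if value >= threshold
    }


tfidf_by_segment = {
    "open APP login": 0.0,
    "login circle": 0.23,
    "circle health": 0.41,
}
targets = select_target_segments(tfidf_by_segment, threshold=0.2)
```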
In an optional embodiment, the constructing the user behavior vector according to the TF-IDF values corresponding to the target text segments includes:
initializing a null vector with a preset length;
sequencing TF-IDF values corresponding to the target text segments in a descending order;
and sequentially filling the sequenced TF-IDF values in the corresponding positions of the empty vector to obtain the user behavior vector.
In a specific operation, a new vector of length x is created. The 1st dimension of the vector corresponds to the text segment with the highest TF-IDF value and is filled with that value; the 2nd dimension corresponds to the text segment with the second-highest TF-IDF value and is filled with that value; and so on for all dimensions.
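The vector construction can be sketched as follows. The text does not say what happens when there are fewer or more target segments than dimensions, so padding with zeros and truncating to the preset length are assumptions, as is the helper name `build_behavior_vector`.

```python
def build_behavior_vector(target_scores, length):
    """Fill a fixed-length vector with TF-IDF values in descending order.

    `target_scores` maps target text segments to TF-IDF values.
    Unused positions stay 0.0; surplus values are dropped (assumptions).
    """
    vector = [0.0] * length  # the initialized empty vector
    ranked = sorted(target_scores.values(), reverse=True)
    for i, value in enumerate(ranked[:length]):
        vector[i] = value
    return vector


target_scores = {"login circle": 0.1, "circle health": 0.4, "health finance": 0.2}
vector = build_behavior_vector(target_scores, length=5)
# vector == [0.4, 0.2, 0.1, 0.0, 0.0]
```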
By using the N-GRAM model and the TF-IDF model, a buried point data sequence of any user can be decomposed into text segments with a plurality of lengths, and a user behavior vector of the user is obtained.
The model training module 306 is configured to train the lightGBM network based on the user behavior vectors of the multiple users to obtain an APP-crossing user behavior analysis model.
In this embodiment, each piece of buried point data can be regarded as a sentence, and the buried point sequence is equivalent to a text. The embedded point data sequence of the user crossing the APP is coded into a JSON string and is segmented into a plurality of text segments, and the text segments can be trained by adopting a natural language processing model, so that the behavior habit of the user can be analyzed based on the embedded point data sequence of the user crossing the APP.
The computer device initializes network parameters in the lightGBM network in advance, and trains the lightGBM network by taking the user behavior vectors of the plurality of users as a data set. When the training is finished, the network parameters in the lightGBM network have been updated, and the lightGBM network with the updated parameters is taken as the cross-APP user behavior analysis model. The training process itself belongs to the prior art and is not elaborated here.
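A training step of this kind can be sketched with the LightGBM Python package, which implements gradient-boosted decision trees. The sketch assumes a supervised setting with one binary label per user (for example, low performance versus non-low performance); the labels, the hyperparameter values and the helper names `build_dataset` and `train_model` are assumptions not given in the text. The `lightgbm` and `numpy` imports are placed inside `train_model` so that the data preparation can be run without those packages installed.

```python
def build_dataset(vectors_by_user, labels_by_user):
    """Align user behavior vectors and labels in a fixed user order."""
    users = sorted(vectors_by_user)
    features = [vectors_by_user[u] for u in users]
    labels = [labels_by_user[u] for u in users]
    return users, features, labels


def train_model(features, labels):
    """Fit a LightGBM classifier; requires the lightgbm and numpy packages."""
    import numpy as np
    from lightgbm import LGBMClassifier

    # Hyperparameter values are illustrative only.
    model = LGBMClassifier(n_estimators=50, learning_rate=0.1)
    model.fit(np.asarray(features), np.asarray(labels))
    return model


vectors_by_user = {"u2": [0.3, 0.0], "u1": [0.5, 0.2]}
labels_by_user = {"u1": 1, "u2": 0}
users, X, y = build_dataset(vectors_by_user, labels_by_user)
# With lightgbm installed and a realistically sized data set:
# model = train_model(X, y)
```

A real data set would need far more users than this toy example for the trees to find any useful split.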
In summary, the invention first integrates buried point data across APPs and obtains the buried point data sequence of each user in a plurality of APPs. Second, the cross-APP buried point data sequence is encoded into a nested JSON string, which matches the data format used for model construction, is highly portable, can be read conveniently and quickly, and improves model training efficiency. In addition, the JSON string is segmented into a plurality of text segments, so that a natural language processing model can be used for learning and training. The trained cross-APP user behavior analysis model can be used in various applications, for example user performance assessment, where it improves assessment efficiency, and commodity recommendation, where it improves recommendation quality.
It should be noted that the present invention can also be applied to intelligent government affairs to promote the construction of intelligent cities. In addition, in order to ensure the privacy and security of the cross-APP user behavior analysis model, the cross-APP user behavior analysis model may be stored in one node of the blockchain.
Fig. 4 is a functional block diagram of a cross-APP user behavior analysis apparatus according to a preferred embodiment of the present invention.
The cross-APP user behavior analysis device 40 includes: a second obtaining module 401, a second encoding module 402 and a behavior analyzing module 403. The module referred to herein is a series of computer program segments stored in a memory that can be executed by a processor and that can perform a fixed function. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The second obtaining module 401 is configured to obtain a second buried point data sequence of the target user in multiple APPs.
The computer device obtains a target user identification of the target user, obtains a plurality of target buried point data of the target user in a plurality of APPs according to the target user identification, and sorts the target buried point data in chronological order to obtain a second buried point data sequence of the target user.
The second encoding module 402 is configured to encode the second buried point data sequence into a second JSON string.
In an alternative embodiment, the second encoding module 402 encoding the second buried point data sequence into a second JSON string includes:
acquiring a plurality of target buried point data in the second buried point data sequence, wherein the target buried point data comprises: target user identification, target buried point timestamp, target prefix and target buried point content;
coding each target buried point timestamp, the corresponding target prefix and the target buried point content into a target sub JSON string;
and coding the target user identification and the corresponding multiple target sub JSON strings into a nested second JSON string.
Since the cross-APP user behavior analysis model is trained on data in the form of the nested first JSON string, the second buried point data sequence of the target user needs to be encoded into the nested second JSON string in order to better analyze the behavior habit of the target user.
The behavior analysis module 403 is configured to input the second JSON string into the cross-APP user behavior analysis model for analysis to obtain an analysis result.
After reading the second JSON string, the computer device inputs it into the cross-APP user behavior analysis model, and the behavior habit of the user can be determined according to the analysis result. The training process of the cross-APP user behavior analysis model is as described in embodiment one and the related description.
Illustratively, the analysis result includes a first identifier and a second identifier. The first identifier is used to indicate that the target user is a low performance user, and the second identifier is used to indicate that the target user is a non-low performance user.
Fig. 5 is a schematic structural diagram of the computer device according to the present invention. In the preferred embodiment of the present invention, the computer device 5 includes a memory 51, at least one processor 52, at least one communication bus 53, and a transceiver 54.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 5 is not limiting to the embodiments of the present invention, and may be a bus-type configuration or a star-type configuration, and that the computer device 5 may include more or less hardware or software than those shown, or a different arrangement of components.
In some embodiments, the computer device 5 is a computer device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 5 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 5 is only an example; other existing or future electronic products that can be adapted to the present invention should also fall within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 51 stores a computer program that, when executed by the at least one processor 52, implements all or part of the steps of the cross-APP user behavior analysis model training method, or all or part of the steps of the cross-APP user behavior analysis method. The memory 51 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some embodiments, the at least one processor 52 is a Control Unit (Control Unit) of the computer device 5, connects various components of the entire computer device 5 by using various interfaces and lines, and executes various functions and processes data of the computer device 5 by running or executing programs or modules stored in the memory 51 and calling data stored in the memory 51. For example, the at least one processor 52, when executing the computer program stored in the memory, implements all or part of the steps of the method for training a user behavior analysis model across APPs according to the embodiment of the present invention; or all or part of the steps in the cross-APP user behavior analysis method are realized. The at least one processor 52 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 53 is arranged to enable connection communication between the memory 51 and the at least one processor 52, etc.
Although not shown, the computer device 5 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 52 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 5 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for training a cross-APP user behavior analysis model is characterized by comprising the following steps:
acquiring a plurality of buried point data of a plurality of users in a plurality of APPs, and sequencing the plurality of buried point data corresponding to each user to obtain a first buried point data sequence;
coding a first buried point data sequence of each user into a first JSON string;
segmenting each first JSON string into a plurality of text segments;
calculating a TF-IDF value of each text segment, and constructing a user behavior vector according to the TF-IDF value;
and training the lightGBM network based on the user behavior vectors of the plurality of users to obtain a cross-APP user behavior analysis model.
2. The method of claim 1, wherein encoding the first buried point data sequence of each user as a first JSON string comprises:
acquiring a plurality of buried point data in the first buried point data sequence, wherein the buried point data comprises: user identification, a buried point timestamp, a prefix and buried point content;
coding each buried point timestamp and the corresponding prefix and buried point content of each user into a sub JSON string;
and coding the user identification of each user and the corresponding plurality of sub JSON strings into a nested first JSON string.
3. The method for training the cross-APP user behavior analysis model according to claim 1, wherein the splitting each first JSON string into a plurality of text segments comprises:
analyzing the first JSON string to obtain a plurality of buried point contents;
sliding from the first buried point content in the plurality of buried point contents to the last buried point content in the plurality of buried point contents by adopting a preset sliding window;
and taking the corresponding buried point content of each sliding window as a text fragment.
4. The method of claim 1, wherein calculating the TF-IDF value for each text segment comprises:
calculating the word frequency of each text segment in all the text segments;
acquiring the inverse document frequency of each text fragment;
and calculating the product of the word frequency and the inverse document frequency to obtain the TF-IDF value of the text segment.
5. The method of claim 1, wherein the constructing a user behavior vector from the TF-IDF values comprises:
screening a plurality of target text segments from the plurality of text segments;
and constructing a user behavior vector according to the TF-IDF values corresponding to the target text segments.
6. The method of claim 5, wherein the constructing the user behavior vector according to the TF-IDF values corresponding to the target text segments comprises:
initializing a null vector with a preset length;
sequencing TF-IDF values corresponding to the target text segments in a descending order;
and sequentially filling the sequenced TF-IDF values in the corresponding positions of the empty vector to obtain the user behavior vector.
7. A cross-APP user behavior analysis method is characterized by comprising the following steps:
acquiring a second buried point data sequence of a target user in a plurality of APPs;
encoding the second buried point data sequence into a second JSON string;
inputting the second JSON string into a cross-APP user behavior analysis model for analysis to obtain an analysis result, wherein the cross-APP user behavior analysis model is obtained by training through the cross-APP user behavior analysis model training method according to any one of claims 1-6.
8. The cross-APP user behavior analysis method of claim 7, wherein the encoding the second buried point data sequence into a second JSON string comprises:
acquiring a plurality of target buried point data in the second buried point data sequence, wherein the target buried point data comprises: target user identification, target buried point timestamp, target prefix and target buried point content;
coding each target buried point timestamp, the corresponding target prefix and the target buried point content into a target sub JSON string;
and coding the target user identification and the corresponding multiple target sub JSON strings into a nested second JSON string.
9. A computer device, characterized in that the computer device comprises:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement a cross-APP user behavior analysis model training method as claimed in any one of claims 1 to 6, or to implement a cross-APP user behavior analysis method as claimed in claim 7 or 8.
10. A computer-readable storage medium having stored therein at least one instruction for execution by a processor in a computer device to implement a cross-APP user behavior analysis model training method as claimed in any one of claims 1 to 6, or to implement a cross-APP user behavior analysis method as claimed in claim 7 or 8.
CN202010798039.8A 2020-08-10 2020-08-10 Cross-APP user behavior analysis model training method, analysis method and related equipment Pending CN111949867A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010798039.8A CN111949867A (en) 2020-08-10 2020-08-10 Cross-APP user behavior analysis model training method, analysis method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010798039.8A CN111949867A (en) 2020-08-10 2020-08-10 Cross-APP user behavior analysis model training method, analysis method and related equipment

Publications (1)

Publication Number Publication Date
CN111949867A true CN111949867A (en) 2020-11-17

Family

ID=73332045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010798039.8A Pending CN111949867A (en) 2020-08-10 2020-08-10 Cross-APP user behavior analysis model training method, analysis method and related equipment

Country Status (1)

Country Link
CN (1) CN111949867A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114064440A (en) * 2022-01-18 2022-02-18 恒生电子股份有限公司 Training method of credibility analysis model, credibility analysis method and related device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316198A (en) * 2016-04-26 2017-11-03 阿里巴巴集团控股有限公司 Account risk identification method and device
CN107944059A (en) * 2017-12-29 2018-04-20 深圳市中润四方信息技术有限公司西安分公司 A kind of user behavior analysis method and system based on stream calculation
US20180191759A1 (en) * 2017-01-04 2018-07-05 American Express Travel Related Services Company, Inc. Systems and methods for modeling and monitoring data access behavior
CN109271420A (en) * 2018-09-03 2019-01-25 平安医疗健康管理股份有限公司 Information-pushing method, device, computer equipment and storage medium
CN109492772A (en) * 2018-11-28 2019-03-19 北京百度网讯科技有限公司 The method and apparatus for generating information
CN110569906A (en) * 2019-09-10 2019-12-13 京东数字科技控股有限公司 Data processing method, data processing apparatus, and computer-readable storage medium
CN110675263A (en) * 2019-09-27 2020-01-10 支付宝(杭州)信息技术有限公司 Risk identification method and device for transaction data
CN110795570A (en) * 2019-10-11 2020-02-14 上海上湖信息技术有限公司 Method and device for extracting user time sequence behavior characteristics
CN110825969A (en) * 2019-11-07 2020-02-21 腾讯科技(深圳)有限公司 Data processing method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination