CN110737700A

CN110737700A - Purchase, sales and inventory user classification method and system based on a Bayesian algorithm

Info

Publication number: CN110737700A
Application number: CN201910983525.4A
Authority: CN
Inventors: 刘天水; 王正宇
Original assignee: Zhuo Zhuo Network Technology Co Ltd
Current assignee: Zhuo Zhuo Network Technology Co Ltd
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2020-01-31

Abstract

The invention discloses a purchase, sales and inventory user classification method based on a Bayesian algorithm, characterized by comprising the following steps: step 1, confirming the characteristic attributes; step 2, training a naive Bayes classifier; step 3, counting the classification probabilities and calculating the probability of occurrence of each characteristic attribute under each user type; step 4, determining the class to which a given user actually belongs; and step 5, persistently storing training, classification and similar work that cannot be completed in one session. The method realizes a naive Bayes classifier that performs probability calculation for individual users, removes the need for the operations department to re-analyze user behavior from charts, and reduces labor cost. From P(A | B), formed from the probability of the category A to which a user belongs and the probability of the user document B, P(B | A) is also obtained by calculation, namely the probability that a given class of users uses the corresponding function points. This shows at a macro level which system functions users focus on and provides decision support for subsequent system development.

Description

Purchase, sales and inventory user classification method and system based on a Bayesian algorithm

Technical Field

The invention relates to the field of data mining, and in particular to a purchase, sales and inventory user classification method and system based on a Bayesian algorithm.

Background

As the number of users of a website grows, the service data generated by those users grows exponentially, and so does the cost of collecting useful data. With so much information, website operators must spend a great deal of manpower digging useful information out of the mass of data. By collecting the document records of users of a purchase, sales and inventory website as data, the users' operation paths can be analyzed, their identity type, industry and the like can be confirmed, their points of interest among the existing system functions can be understood, and user stickiness toward the system can be improved.

Data mining is a hot research direction and can meet this collection requirement efficiently and accurately. By studying the relevant literature and becoming familiar with the concepts and evaluation methods for classifying user behavior, user data are converted into texts for analysis. The advantages and disadvantages of various classification algorithms are also compared, and the Bayesian algorithm is selected as the main method for classifying user behavior types.

In the traditional solution, production database data is synchronized, SQL scripts are used to clean and filter the data, and data of the same subject type are aggregated. The development cycle is long, the cost is high, and data redundancy is high. The analysis results are mostly presented as tables and charts. Although results such as the period-over-period ratio and the average of data within a specified range are easy to calculate, the accuracy of predicting changes in user behavior is unsatisfactory, manual analysis is still needed, a large amount of work is taken up, and the maintenance cost is high.

User behavior data is converted into texts for analysis, and a part of them is extracted as a training set. The probabilities of each user appearing in each class are combined to obtain the probability of the user as a whole for each class, and a suitable classifier is designed so that the computer learns independently and completes the classification on its own initiative without manual intervention, improving classification efficiency and accuracy.

The existing K-nearest-neighbor algorithm can realize similar functions. Although its precision is high, its computational complexity and space complexity are also too high, and the data ranges of the samples cannot be guaranteed to be of the same order of magnitude, so the K-nearest-neighbor algorithm is not suitable here.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a purchase, sales and inventory user classification method and system based on a Bayesian algorithm.

By converting service data into user document information and preprocessing the user documents, special data and meaningless information are filtered out, improving the accuracy of the sample data. The data analysis and development process is simplified, a more intuitive user classification result is formed, and the personnel cost that operations spends on user classification is reduced.

Before the Bayesian algorithm is used, the user documents need to be preprocessed: meaningless information in the user documents is filtered out to obtain a training-sample text information sequence, and further preprocessing then forms a set text representation model, which supports feature selection and feature weight calculation.

To solve the above technical problem, the invention provides a purchase, sales and inventory user classification method based on a Bayesian algorithm, characterized by comprising the following steps:

step 1: confirming the characteristic attributes; the characteristic attributes are the data characteristics corresponding to different types of purchase, sales and inventory users; data in the relational database are extracted and integrated, the characteristic attributes of the different types of purchase, sales and inventory users are analyzed, and user documents in key-value form, namely the user documents used for analysis, are formed;

step 2: training a classifier; a class representing the classifier is written according to Bayes' theorem P(A ∩ B) = P(A) · P(B | A) = P(B) · P(A | B), information is encapsulated to form a plurality of classifier instances, and the classifier instances are trained to respond to the needs of different types of groups, wherein the classifier is a naive Bayes classifier;

step 3: counting the classification probabilities and calculating the probability of occurrence of each characteristic attribute under each user type; probability statistics are compiled from the counts of classification results, and the Bayesian algorithm is used to solve for the conditional probability by exchanging the condition and the event; let the sample be x, a single characteristic attribute be a, and all characteristic attributes be a₁ to a_m; then x = {a₁, a₂, …, a_m} is the set of characteristic attributes of the item x to be classified, and C = {y₁, y₂, …, y_m} is the type set; P(A | B₁)P(B₁) + P(A | B₂)P(B₂) + … + P(A | Bₙ)P(Bₙ) is calculated for each type, from which the probability of each type is obtained;

step 4: determining the class to which a given user actually belongs, namely calculating the probability of the user in each of the different classes and selecting the class whose term P(A | Bₙ)P(Bₙ) is largest as the class to which the user belongs;

step 5: persistently storing training and classification work that cannot be completed in one session.
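The formula that step 3 leads up to is not reproduced in this text. As a hedged reading, assuming A denotes the observed user document and B_1, …, B_n the candidate user types (an assumption, since the text uses these symbols loosely), the sum in step 3 is the total probability of A, and the quantity compared in step 4 corresponds to the numerator of the standard Bayes posterior:

```latex
P(A) = \sum_{k=1}^{n} P(A \mid B_k)\,P(B_k), \qquad
P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{k=1}^{n} P(A \mid B_k)\,P(B_k)}
```

Because the denominator is the same for every type, selecting the largest P(A | B_i)P(B_i) picks the same type as selecting the largest posterior P(B_i | A).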

In step 1, the characteristic attributes specifically include:

the characteristic attributes of a lost user are: document records exist within the past year but none within the last 30 days, and the number of unread messages on the user's home page exceeds 3;

the characteristic attributes of a worthless registered user are: the ratio of the number of logins in the last 30 days to the valid documents actually generated is below 10%, and the documents created do not form any one complete business flow in the system;

the characteristic attributes of a buyer user of the system are: the sum of the number of purchase orders, the number of payment orders and the number of invoicing orders accounts for more than 80% of the total number of recorded documents, namely (number of purchase orders + number of payment orders + number of invoicing orders) / total number of recorded documents > 80%;
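The three rules above can be sketched as a small function that turns per-user counters into a key-value user document. This is a minimal illustration, not the patented implementation: the `UserStats` fields and the output keys are hypothetical names, and the "below 10%" rule is read here as valid documents per login, which is an interpretation of the wording rather than something the text states.

```python
from dataclasses import dataclass


@dataclass
class UserStats:
    """Hypothetical per-user counters; field names are illustrative only."""
    docs_last_year: int          # documents recorded in the past year
    docs_last_30_days: int       # documents recorded in the last 30 days
    unread_home_messages: int    # unread messages on the home page
    logins_last_30_days: int
    valid_docs_last_30_days: int
    completed_any_flow: bool     # documents formed one complete business flow
    purchase_orders: int
    payment_orders: int
    invoicing_orders: int
    total_docs: int


def feature_attributes(u: UserStats) -> dict:
    """Return the three rule-based attributes as a key-value user document."""
    churn = (
        u.docs_last_year > 0
        and u.docs_last_30_days == 0
        and u.unread_home_messages > 3
    )
    # Assumption: "below 10%" is interpreted as valid documents per login.
    ratio = (
        u.valid_docs_last_30_days / u.logins_last_30_days
        if u.logins_last_30_days
        else 0.0
    )
    worthless = ratio < 0.10 and not u.completed_any_flow
    buyer_share = (
        (u.purchase_orders + u.payment_orders + u.invoicing_orders) / u.total_docs
        if u.total_docs
        else 0.0
    )
    return {
        "churn_signal": churn,
        "worthless_signal": worthless,
        "buyer_signal": buyer_share > 0.80,
    }
```

Keeping the output as a plain dictionary matches the key-value user document form described in step 1, so the same structure can be passed on to a classifier.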

In step 1, analyzing the characteristic attributes of the different types of purchase, sales and inventory users includes excluding redundant characteristic attributes by calculating the variance of each feature. Specifically, the i-th feature is denoted x_i, the total number of features is n, and the variance result is s; then

A threshold d is set according to business conditions as the criterion for judging a feature; if s ≥ d, the characteristic attribute is kept.
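The variance formula referred to here is not reproduced in this text. A minimal reading, assuming the population variance over n observed values x_1, …, x_n (the text does not say whether the divisor is n or n − 1), would be:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}
```

Under this reading the attribute is kept when s ≥ d.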

In step 4, the TF-IDF algorithm is used to assign weights to the characteristic attributes under the same user type, and the weight of each characteristic attribute is denoted k_j. If the feature probabilities under one user type cannot meet the calculation accuracy requirement, the weight of each of the different characteristic attributes is further marked. The weight of a characteristic attribute decreases as its frequency of occurrence across all user types increases, but increases as its number of occurrences within a single user type increases. The TF-IDF algorithm is specifically:

TF-IDF = TF × IDF;

then there is

In step 4, the marked characteristic attributes include: "when determining whether a user is lost, the number of the user's unread messages exceeds 20".
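The text gives TF-IDF only as the product TF × IDF. For orientation, the usual textbook definitions are shown below, with user types playing the role of documents and characteristic attributes the role of terms. That mapping, and the identification of the weight k_j with the product, are assumptions made for illustration.

```latex
\mathrm{TF}(a_j, y) = \frac{n_{j,y}}{\sum_{l} n_{l,y}}, \qquad
\mathrm{IDF}(a_j) = \log\frac{|C|}{\left|\{\, y \in C : n_{j,y} > 0 \,\}\right|}, \qquad
k_j = \mathrm{TF}(a_j, y)\cdot\mathrm{IDF}(a_j)
```

Here n_{j,y} is the number of times attribute a_j occurs among users of type y and |C| is the number of user types. TF grows with occurrences inside a single type and IDF shrinks as the attribute appears in more types, which matches the behavior described for the weights.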

The persistent storage is specifically as follows: the data is compressed and stored using the joblib software package, and when training is performed again or the data is applied to other types of classifiers, the existing data is loaded directly into memory.
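A minimal sketch of this persistence step with joblib follows. The contents of `model_state`, the file name and the compression level are hypothetical choices for illustration; the text only states that joblib is used to compress and store the data.

```python
import os
import tempfile

import joblib  # third-party package named in the text

# Hypothetical trained state: priors and per-type attribute probabilities.
# The numbers are placeholders, not values from the text.
model_state = {
    "priors": {"valuable": 0.92, "worthless": 0.08},
    "likelihoods": {"worthless": {"low_valid_docs": 0.05}},
}

path = os.path.join(tempfile.mkdtemp(), "classifier_state.joblib")

# compress=3 is an illustrative compression level, not a value from the text.
joblib.dump(model_state, path, compress=3)

# In a later session, load the stored state straight back into memory.
restored = joblib.load(path)
```

Storing one object that bundles the model and its statistics keeps the reload to a single call when training resumes or when the data is reused by another classifier.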

A purchase, sales and inventory user classification system based on a Bayesian algorithm is characterized by comprising a relational database, a text preprocessing module, a model storage module, an algorithm execution module and a report module which are connected in sequence, wherein the model storage module is connected with the database,

the relational database is a database that organizes data using the relational model and stores data in rows and columns; specifically it is a MySQL database, which serves as the business relational database of the purchase, sales and inventory system and stores the various forms and operation logs;

the text preprocessing module processes the data tables in the relational database by means of written code and converts them into document-type data, which can be stored persistently or participate directly in subsequent operations in memory;

the model storage module persistently stores the training information of the classifier; for models, sample data and Bayesian model analysis results that cannot be completed in one session, persistent storage improves calculation efficiency;

the algorithm execution module selects a suitable classification method after obtaining the prior probability, posterior probability and likelihood estimate of the sample, forms an algorithm execution instance, and determines the probability that the sample users belong to each type;

the report module reads the result of the algorithm execution module after the probability calculation task has finished and displays the classification result visually in tables and graphs; the module uses incremental updating and allows the result set structure to be freely defined, so that the result is presented intuitively to operations analysts.
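The conversion performed by the text preprocessing module, from relational rows to document-type key-value data, can be sketched as follows. An in-memory SQLite database stands in for the MySQL business database, and the `documents` table with its columns is a hypothetical schema; both are assumptions made to keep the example self-contained.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL business database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (user_id INTEGER, doc_type TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(1, "purchase"), (1, "purchase"), (1, "payment"), (2, "warehousing")],
)


def build_user_documents(connection):
    """Aggregate document rows into one key-value document per user."""
    docs = {}
    rows = connection.execute(
        "SELECT user_id, doc_type, COUNT(*) FROM documents "
        "GROUP BY user_id, doc_type"
    )
    for user_id, doc_type, count in rows:
        docs.setdefault(user_id, {})[doc_type] = count
    return docs


user_documents = build_user_documents(conn)
```

The resulting dictionary can be kept in memory for the algorithm execution module or handed to the model storage module for persistence.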

The invention achieves the following beneficial effects:

1. Through the Bayesian algorithm, a naive Bayes classifier that performs probability calculation for individual users is realized; operations staff no longer need to re-analyze user behavior from charts, and labor cost is reduced;

2. From P(A | B), formed from the probability of the category A to which a user belongs and the probability of the user document B, P(B | A) is also obtained by calculation, namely the probability that a given class of users uses the corresponding function points; this shows at a macro level which system functions users focus on and provides decision support for the focus of subsequent system development.

Drawings

FIG. 1 is a schematic flow diagram of a method of an exemplary embodiment of the present invention;

FIG. 2 is a schematic diagram of a system architecture in an exemplary embodiment of the invention.

Detailed Description

The invention is further described with reference to the figures and the exemplary embodiments:

FIG. 1 shows a purchase, sales and inventory user classification method based on a Bayesian algorithm, which comprises the following steps:

Step 101: confirm the characteristic attributes and form user documents. The confirmed characteristic attributes include the following. A lost user: document records exist within the past year but none within the last 30 days, and the unread messages on the user's home page exceed 3. A worthless registered user: the ratio of the number of logins in the last 30 days to the valid documents actually generated is below 10%, and the documents created do not form any one complete business flow in the system. A buyer: (number of purchase orders + number of payment orders + number of invoicing orders) / total number of recorded documents > 80%. User documents that can be analyzed are formed from these characteristic attributes.

Step 101 analyzes the characteristic attributes of purchase, sales and inventory users, such as the proportions among the quantities of the various documents entered and the number of valid commodities. Selecting the feature set involves considerable trade-offs and an analysis of whether each feature can lead to a correct conclusion. Redundant characteristic attributes can generally be excluded by calculating feature variances. Let the i-th characteristic attribute be x_i, the total number of characteristic attributes be n, and the variance result be s; then

A threshold d is set according to business conditions as the criterion for judging a feature; if s ≥ d, the characteristic attribute is kept. For example, if the variance calculated for "unread messages on the home page exceed 3" is 2 and d is set to 1, the feature is a valid feature. Excluding redundant characteristic attributes in this way yields more valid samples; the more valid samples are obtained, the better the computational effect, and adjustment and optimization continue on that basis. The data in the relational database is then extracted and integrated to form user documents in key-value form for analysis.
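The variance-threshold selection described here can be sketched in a few lines. The sketch assumes the population variance of each feature across user samples, since the text does not reproduce its formula, and the feature names and sample values are hypothetical. The unread-message values are chosen so that their variance is 2, mirroring the example with d = 1.

```python
from statistics import pvariance


def select_features(samples, d):
    """Keep features whose population variance across samples is >= d.

    samples: list of dicts mapping feature name -> numeric value.
    d: business-defined threshold.
    """
    kept = []
    for name in samples[0].keys():
        s = pvariance([row[name] for row in samples])
        if s >= d:
            kept.append(name)
    return kept


# Hypothetical data: unread-message counts vary, the constant flag does not.
samples = [
    {"unread_messages": 1, "always_one": 1},
    {"unread_messages": 3, "always_one": 1},
    {"unread_messages": 3, "always_one": 1},
    {"unread_messages": 5, "always_one": 1},
]
kept = select_features(samples, d=1)
```

A feature that takes the same value for every user has zero variance and carries no information for separating user types, which is why it is dropped.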

Step 102: select and train an applicable classifier model. A class representing the classifier is written, encapsulating the information acquired so far; separate classifier instances can be created for different users, types or queries, and they are trained to respond to the needs of different groups. The naive Bayes classifier model is used. Similar classifier models include the K-nearest-neighbor algorithm and the decision tree model. The decision tree model is not selected because the proportions of user types in the system differ too much: warehouse managers and buyers may make up most of the system's users, so the information gain is biased toward the features of the types with too many samples and the deviation of the result is too large. The K-nearest-neighbor algorithm likewise suffers from the large imbalance in sample counts, and a K value that is too small leads to overfitting, so it is not suitable for this system either. The daily document entry volume, operation log data and the like of purchase, sales and inventory users are discrete data, and the system does not have a large data volume in its initial stage, so a naive Bayes classifier is the better fit.

Step 103: count the classification probabilities and calculate the probability of each feature under each type. Training samples of purchase, sales and inventory users are obtained. Let a sample be x, a single characteristic attribute be a, and all characteristic attributes be a₁ to a_m; then x = {a₁, a₂, …, a_m}. To calculate the probability that a given user in the sample is a worthless user, set the type set C = {y₁ (valuable), y₂ (worthless)}. Assuming 800 of 10000 user samples are worthless users, P(C = y₁) = 92% and P(C = y₂) = 8%. The feature probability P(number of valid documents < 10% of total entries | C = y₂) has 40 occurrences and a probability of 0.4%; P(number of valid documents < 20% of total entries | C = y₂) has 80 occurrences and a probability of 0.8%; and so on, to obtain

where x = {a₁, a₂, …, a_m} is the set of characteristic attributes of the item x to be classified, and C = {y₁, y₂, …, y_m} is the type set.
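Steps 102 and 103 can be illustrated with a small classifier class that counts type frequencies and per-type feature frequencies, then scores a user document by P(x | y) · P(y). This is a minimal sketch under stated assumptions: features are boolean, the class and method names are hypothetical, and add-one (Laplace) smoothing is an addition that does not appear in the text. Note also that a conditional probability estimated within a class would divide by the class size (40 of 800 gives 5%), whereas the 0.4% figure in the text corresponds to 40 of 10000.

```python
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Minimal naive Bayes over boolean feature attributes (a sketch)."""

    def __init__(self):
        self.type_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # type -> feature -> count
        self.features = set()

    def train(self, user_document, user_type):
        """user_document: dict mapping feature name -> bool."""
        self.type_counts[user_type] += 1
        for name, present in user_document.items():
            self.features.add(name)
            if present:
                self.feature_counts[user_type][name] += 1

    def prior(self, user_type):
        return self.type_counts[user_type] / sum(self.type_counts.values())

    def likelihood(self, name, present, user_type):
        # Add-one smoothing is not in the text; it avoids zero probabilities
        # for feature/type combinations never seen in training.
        hits = self.feature_counts[user_type][name]
        p = (hits + 1) / (self.type_counts[user_type] + 2)
        return p if present else 1.0 - p

    def scores(self, user_document):
        """Return P(x | y) * P(y) for each known type y."""
        result = {}
        for user_type in self.type_counts:
            score = self.prior(user_type)
            for name in self.features:
                present = user_document.get(name, False)
                score *= self.likelihood(name, present, user_type)
            result[user_type] = score
        return result

    def classify(self, user_document):
        s = self.scores(user_document)
        return max(s, key=s.get)


# Hypothetical training set with the 92% / 8% split used in the example.
clf = NaiveBayesClassifier()
for _ in range(92):
    clf.train({"low_valid_docs": False}, "valuable")
for _ in range(8):
    clf.train({"low_valid_docs": True}, "worthless")
label = clf.classify({"low_valid_docs": True})
```

Separate instances of the class can be trained for different groups or queries, in line with the multiple classifier instances described in step 102.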

Step 104: use the TF-IDF algorithm to assign a weight to each feature under the same type. To take full account, beyond feature selection, of how strongly the different characteristic attributes influence the classification result, the weight of each characteristic attribute needs to be further marked. The weight of a characteristic attribute decreases as its frequency of occurrence across all user types increases, but increases as its frequency of occurrence within a single user type increases. For example, "the number of unread messages per day exceeded 20 within one week", but the function was faulty during that week; in this case the "more than 20 unread messages" feature cannot be eliminated during feature selection, but its weight is reduced accordingly because of the system fault. Or, for the two user types "purchasing" and "warehouse", warehousing documents accounting for more than 80% of all documents is a characteristic attribute of both, but this attribute carries a higher weight when classifying warehouse users, whose daily documents are mainly warehousing documents. The weights used in the final classification result are generated by the TF-IDF algorithm:

TF-IDF = TF × IDF

Each characteristic attribute weight is denoted k_j; then there is

With this formula the influence of each characteristic attribute on the final classification result can be adjusted, so that the important characteristic attributes of each type tend to enter the calculation with higher weight.
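A hedged sketch of how such weights could be computed is shown below. It treats each user type as a "document" and each characteristic attribute as a "term", which is an assumption, and it uses the plain logarithmic IDF. The function name, feature names and counts are hypothetical. How the weights k_j are combined with the probabilities is not shown, because that formula is not reproduced in the text.

```python
import math


def tfidf_weights(type_feature_counts):
    """Compute a TF-IDF weight for each (user type, feature) pair.

    type_feature_counts: dict mapping user type -> {feature: occurrence count}.
    User types play the role of documents (an assumption).
    """
    n_types = len(type_feature_counts)

    # Number of user types in which each feature occurs at least once.
    type_freq = {}
    for counts in type_feature_counts.values():
        for feature, c in counts.items():
            if c > 0:
                type_freq[feature] = type_freq.get(feature, 0) + 1

    weights = {}
    for user_type, counts in type_feature_counts.items():
        total = sum(counts.values())
        weights[user_type] = {}
        for feature, c in counts.items():
            tf = c / total if total else 0.0
            idf = math.log(n_types / type_freq[feature]) if c > 0 else 0.0
            weights[user_type][feature] = tf * idf
    return weights


# Hypothetical counts: the warehousing attribute occurs in two of three types,
# but far more often among warehouse users than among purchasing users.
counts = {
    "purchasing": {"warehousing_over_80pct": 10, "purchase_over_80pct": 90},
    "warehouse": {"warehousing_over_80pct": 95, "stocktake_daily": 5},
    "sales": {"sales_over_80pct": 100},
}
w = tfidf_weights(counts)
```

With these counts the shared warehousing attribute receives a larger weight for the warehouse type than for the purchasing type. With the plain logarithmic IDF, an attribute that occurs in every user type gets weight zero, so a smoothed IDF variant may be preferable in practice.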

Step 105: persistently store training, classification and similar work that cannot be completed in one session. The data is compressed and stored using the joblib software package and related tools, and when training is performed again or the data is applied to other types of classifiers, the existing data is loaded directly into memory.

FIG. 2 shows the purchase, sales and inventory user classification system based on a Bayesian algorithm, which comprises a relational database, a text preprocessing module, a model storage module, an algorithm execution module and a report module connected in sequence, wherein the model storage module is connected with the database.

The relational database is a database that organizes data using the relational model and stores data in rows and columns for ease of understanding. A series of rows and columns in a relational database is called a table, and a group of tables forms the database. For example, a MySQL database can be used in the purchase, sales and inventory system as the business relational database, storing the various forms and operation logs for users.

The report module reads the result of the algorithm execution module after the execution task has finished and displays it in tables, graphs and similar forms. The module can use incremental updating and allows the result set structure to be freely defined, so that the result is presented intuitively to operations analysts, helping them determine user role types quickly and reducing the cost of analyzing user types manually.

The above embodiments do not limit the present invention in any way, and all other modifications and applications that can be made to the above embodiments in equivalent ways are within the scope of the present invention.

Claims

1. A purchase, sales and inventory user classification method based on a Bayesian algorithm, characterized by comprising the following steps:

2. The purchase, sales and inventory user classification method based on a Bayesian algorithm as claimed in claim 1, wherein in step 1 the characteristic attributes specifically include:

3. The purchase, sales and inventory user classification method based on a Bayesian algorithm as claimed in claim 2, wherein in step 4 the TF-IDF algorithm is used to assign a weight to each characteristic attribute under the same user type, and each characteristic attribute weight is denoted k_j; if the feature probabilities under one user type cannot satisfy the calculation accuracy requirement, the weight of each of the different characteristic attributes is further marked, where the weight of a characteristic attribute decreases as its frequency of occurrence across all user types increases but increases as its frequency of occurrence within a single user type increases, and the TF-IDF algorithm is specifically:

TF-IDF = TF × IDF;

then there is

4. The purchase, sales and inventory user classification method based on a Bayesian algorithm as claimed in claim 3, wherein in step 4 the marked characteristic attributes include: "when determining whether a user is lost, the number of the user's unread messages exceeds 20".

5. The purchase, sales and inventory user classification method based on a Bayesian algorithm as claimed in claim 4, wherein the persistent storage is specifically: the data is compressed and stored using the joblib software package, and when training is performed again or the data is applied to other types of classifiers, the existing data is loaded directly into memory.

6. A purchase, sales and inventory user classification system based on a Bayesian algorithm, operating according to the method of any one of claims 1-5, characterized by comprising a relational database, a text preprocessing module, a model storage module, an algorithm execution module and a report module connected in sequence, wherein the model storage module is connected with the database,

the algorithm execution module is used for selecting a suitable classification method after obtaining the prior probability, posterior probability and likelihood estimate of the sample, to form an algorithm execution instance, and for determining the probability that the sample users belong to each type;