CN110737700A - Purchase, sales and inventory user classification method and system based on Bayesian algorithm

Info

Publication number
CN110737700A
Authority
CN
China
Prior art keywords
user
probability
classification
characteristic
purchase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910983525.4A
Other languages
Chinese (zh)
Inventor
刘天水
王正宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuo Zhuo Network Technology Co Ltd
Original Assignee
Zhuo Zhuo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuo Zhuo Network Technology Co Ltd filed Critical Zhuo Zhuo Network Technology Co Ltd
Priority to CN201910983525.4A priority Critical patent/CN110737700A/en
Publication of CN110737700A publication Critical patent/CN110737700A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a purchase, sales and inventory user classification method based on a Bayesian algorithm, comprising the steps of: 1, confirming the characteristic attributes; 2, training a naive Bayes classifier; 3, counting the classification probabilities by calculating the probability that each characteristic attribute occurs under each user type; 4, determining the class to which a given user belongs; and 5, persistently storing training and classification work that cannot be completed in a single session. The method implements a naive Bayes classifier that performs the probability calculation for each user independently, so operations staff no longer need to re-analyze user behavior from charts, which reduces labor cost. In addition, P(B | A) is obtained by calculation from P(A | B), the probability of the category A to which the user belongs, and the probability of the user document B; that is, the probability that a given class of users uses the corresponding function points. This shows at a macroscopic level which system functions users focus on and provides decision support for subsequent system development.

Description

Purchase, sales and inventory user classification method and system based on Bayesian algorithm
Technical Field
The invention relates to the field of data mining, and in particular to a purchase, sales and inventory user classification method and system based on a Bayesian algorithm.
Background
As the number of users of a website grows, the business data they generate grows exponentially, and so does the cost of collecting useful data. With this much information, website operators must spend considerable human resources digging useful information out of the mass of data. By collecting the record files of the users of a purchase, sales and inventory website as data, the operation paths of users can be analyzed, the identity type, industry and similar attributes of each user can be confirmed, and the system functions that users focus on can be identified, which improves user stickiness.
Data mining is an active research direction and can meet this collection requirement efficiently and accurately. Drawing on the relevant literature and on established concepts and evaluation methods for classifying user behavior, the user data is converted into text for analysis. The advantages and disadvantages of various classification algorithms are also compared, and the Bayesian algorithm is selected as the main method for classifying user behavior types.
In the traditional solution, production database data is synchronized, SQL scripts are used to clean and filter it, and data of the same subject type is aggregated. The development cycle is long, the cost is high, and the data is highly redundant. The analysis results are mostly presented as tables and charts. Although results such as the period-over-period ratio and the average of the data in a specified range are easy to calculate, the accuracy of predicting changes in user behavior is unsatisfactory; manual analysis is needed, which takes a large amount of work and is costly to maintain.
User behavior data is converted into text for analysis, and part of it is extracted as the training set. The probabilities with which each user appears in each class are combined to obtain the overall probability of the user belonging to that class, and a suitable classifier is designed so that the computer learns independently. Classification is then completed automatically, without manual intervention, which improves classification efficiency and accuracy.
The existing K-nearest-neighbor algorithm can implement similar functions. Although its precision is high, its computational and space complexity are also too high, and the data ranges of the samples cannot be guaranteed to lie within the same order of magnitude, so the K-nearest-neighbor algorithm is not suitable here.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a purchase, sales and inventory user classification method and system based on a Bayesian algorithm.
The business data is converted into user document information, and the user documents are preprocessed to filter out special data and meaningless information, which improves the accuracy of the sample data. This simplifies the data analysis and development process, produces a more intuitive user classification result, and reduces the personnel cost that operations spend on user classification.
Before the Bayesian algorithm is used, the user documents need to be preprocessed: meaningless information is filtered out to obtain the training-sample text sequence, and further preprocessing forms the chosen text representation model, which supports feature selection and feature weight calculation.
In order to solve the technical problem, the invention provides a purchase, sales and inventory user classification method based on a Bayesian algorithm, characterized by comprising the following steps:
Step 1: confirming the characteristic attributes. The characteristic attributes are the data characteristics corresponding to different types of purchase, sales and inventory users. Data in the relational database is extracted and integrated, the characteristic attributes of the different user types are analyzed, and user documents in key-value form are produced for analysis;
Step 2: training a classifier. A class representing the classifier is written according to Bayes' theorem, P(A ∩ B) = P(A) · P(B | A) = P(B) · P(A | B), and the information is encapsulated to form a plurality of classifier instances; the classifier instances are trained to respond to the needs of different types of groups, the classifier being a naive Bayes classifier;
Step 3: counting the classification probabilities and calculating the probability that each characteristic attribute occurs under each user type. Probability statistics are computed from the counts of the classification results, and the Bayesian algorithm is used to invert the conditional probabilities. Let the sample be x, a single characteristic attribute be a, and all characteristic attributes be a_1 to a_m; then x = {a_1, a_2, …, a_m} is the set of characteristic attributes of the x to be classified, and C = {y_1, y_2, …, y_m} is the set of types. The sum P(A | B1) P(B1) + P(A | B2) P(B2) + … + P(A | Bn) P(Bn) is calculated over the types to obtain the probability of each type;
Step 4: determining the class to which a given user belongs, that is, calculating the probability of the user under each class and selecting the class whose term P(A | Bn) P(Bn) is largest as the user's class;
Step 5: persistently storing training and classification work that cannot be completed in a single session.
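Steps 2 to 4 can be sketched as a small classifier class. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `NaiveBayesClassifier`, the attribute keys, the labels and the Laplace smoothing parameter `alpha` are all hypothetical, and the training data is invented for the example.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayesClassifier:
    """Categorical naive Bayes over key-value user documents (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                           # Laplace smoothing (an assumption)
        self.class_counts = Counter()                # number of training users per type
        self.feature_counts = defaultdict(Counter)   # (type, key) -> counts of each value
        self.values = defaultdict(set)               # key -> values seen in training

    def train(self, documents, labels):
        """Step 3: count how often each attribute value occurs under each type."""
        for doc, label in zip(documents, labels):
            self.class_counts[label] += 1
            for key, value in doc.items():
                self.feature_counts[(label, key)][value] += 1
                self.values[key].add(value)

    def log_scores(self, doc):
        """Return log(P(A | Bn) * P(Bn)) for every type Bn."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_label in self.class_counts.items():
            score = log(n_label / total)             # log prior, log P(Bn)
            for key, value in doc.items():
                count = self.feature_counts[(label, key)][value]
                n_values = len(self.values[key]) or 1
                score += log((count + self.alpha) / (n_label + self.alpha * n_values))
            scores[label] = score
        return scores

    def classify(self, doc):
        """Step 4: pick the type with the largest P(A | Bn) * P(Bn)."""
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Hypothetical training documents in key-value form.
docs = [
    {"entered_last_30_days": False, "unread_over_3": True},
    {"entered_last_30_days": False, "unread_over_3": True},
    {"entered_last_30_days": True, "unread_over_3": False},
    {"entered_last_30_days": True, "unread_over_3": False},
    {"entered_last_30_days": True, "unread_over_3": True},
]
labels = ["lost", "lost", "active", "active", "active"]

clf = NaiveBayesClassifier()
clf.train(docs, labels)
prediction = clf.classify({"entered_last_30_days": False, "unread_over_3": True})
print(prediction)
```

Working in log space avoids underflow when many attribute probabilities are multiplied. The denominator of Bayes' formula is the same for every type, so it can be dropped when only the largest term is needed.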
In step 1, the characteristic attributes specifically include:
the characteristic attributes of a lost user: documents were entered within the past year but none within the last 30 days, and the number of unread messages on the user's home page exceeds 3;
the characteristic attributes of a worthless registered user: the ratio between the number of logins within 30 days and the number of valid documents actually generated is below 10%, and the documents created do not form any complete business flow in the system;
the characteristic attribute of a system buyer user: purchase orders, payment orders and invoices together account for more than 80% of all recorded documents, that is, (number of purchase orders + number of payment orders + number of invoices) / total number of recorded documents > 80%.
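The three rules above can be turned into a key-value user document along the following lines. This is a sketch only: the function name `build_user_document` and every field name in `stats` are hypothetical, and the 10% rule is read here as valid documents divided by logins, which is one possible interpretation of the translated text.

```python
def build_user_document(stats):
    """Turn raw per-user counts into a key-value document of boolean attributes."""
    total = stats["total_documents"]
    buyer_share = (
        (stats["purchase_orders"] + stats["payment_orders"] + stats["invoices"]) / total
        if total
        else 0.0
    )
    logins = stats["logins_30d"]
    # Assumed reading of the 10% rule: valid documents per login in the last 30 days.
    valid_ratio = stats["valid_documents_30d"] / logins if logins else 0.0
    return {
        "lost": (
            stats["documents_past_year"] > 0
            and stats["documents_30d"] == 0
            and stats["unread_home_messages"] > 3
        ),
        "worthless": valid_ratio < 0.10 and not stats["has_complete_flow"],
        "buyer": buyer_share > 0.80,
    }


# Hypothetical counts for one user.
stats = {
    "documents_past_year": 120,
    "documents_30d": 0,
    "unread_home_messages": 5,
    "logins_30d": 2,
    "valid_documents_30d": 0,
    "has_complete_flow": True,
    "purchase_orders": 60,
    "payment_orders": 30,
    "invoices": 15,
    "total_documents": 120,
}
document = build_user_document(stats)
print(document)
```

Here the buyer share is (60 + 30 + 15) / 120 = 87.5%, so the buyer attribute is set, while the existence of a complete business flow keeps the user out of the worthless group.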
In step 1, analyzing the characteristic attributes of the different types of purchase, sales and inventory users includes eliminating redundant characteristic attributes by calculating the variance of each feature. Specifically, let the i-th feature be x_i, the total number of features be n, and the variance be s; then
[Formula image not reproduced in this text: the variance s computed from x_1, …, x_n]
A threshold d is set according to the business situation as the criterion for judging a feature, and if s ≥ d the characteristic attribute is kept.
In step 4, the TF-IDF algorithm is used to assign weights to the characteristic attributes of the same user type, the weight of each characteristic attribute being denoted k_j. If the feature probabilities under a user type cannot meet the required calculation accuracy, the weight of each characteristic attribute is further marked. The weight of a characteristic attribute decreases as its frequency of occurrence across all user types increases, but increases as its number of occurrences within a single user type increases. The TF-IDF algorithm is specifically:
TF-IDF = TF × IDF;
then
[Formula image not reproduced in this text: the expression using the weights k_j]
In step 4, the marked characteristic attributes include: when judging whether a user is lost, the number of the user's unread messages exceeds 20.
The persistent storage is specifically as follows: the data is compressed and stored using the joblib software package, and the existing data is loaded directly into memory when training is performed again or when it is applied to other types of classifiers.
A purchase, sales and inventory user classification system based on a Bayesian algorithm, characterized by comprising a relational database, a text preprocessing module, a model storage module, an algorithm execution module and a report module connected in sequence, the model storage module being connected to the database, wherein:
the relational database is a database that organizes data using the relational model and stores it in rows and columns, specifically a MySQL database, which serves as the business relational database of the purchase, sales and inventory system and stores the various forms and operation logs;
the text preprocessing module processes the data tables in the relational database with purpose-written code and converts them into document-type data, which can be stored persistently or used directly in memory in subsequent operations;
the model storage module persistently stores the training information of the classifier; persisting the model, the sample data and the Bayesian model analysis results that cannot be completed in a single session improves calculation efficiency;
the algorithm execution module selects a suitable classification method after obtaining the prior probability, posterior probability and likelihood estimate of the sample, forms an algorithm execution instance, and determines the probability that each sample user belongs to each type;
the report module reads the result of the algorithm execution module after the probability calculation task has finished and displays the classification result visually as tables and graphs; the module uses incremental updating, allows the result set structure to be freely defined, and presents the results visually to operations analysts.
The invention achieves the following beneficial effects:
1. Using the Bayesian algorithm, a naive Bayes classifier is implemented that performs the probability calculation for each user independently, so operators no longer need to re-analyze user behavior from charts, and labor cost is reduced;
2. P(B | A) is also obtained by calculation from P(A | B), the probability of the category A to which the user belongs, and the probability of the user document B; that is, the probability that a given class of users uses the corresponding function points. This shows at a macroscopic level which system functions users focus on and provides decision support for where subsequent system development should focus.
Drawings
FIG. 1 is a schematic flow diagram of the method in an exemplary embodiment of the present invention;
FIG. 2 is a schematic diagram of the system architecture in an exemplary embodiment of the present invention.
Detailed Description
A purchase, sales and inventory user classification method based on a Bayesian algorithm comprises the following steps:
Step 1: confirming the characteristic attributes. The characteristic attributes are the data characteristics corresponding to different types of purchase, sales and inventory users. Data in the relational database is extracted and integrated, the characteristic attributes of the different user types are analyzed, and user documents in key-value form are produced for analysis;
Step 2: training a classifier. A class representing the classifier is written according to Bayes' theorem, P(A ∩ B) = P(A) · P(B | A) = P(B) · P(A | B), and the information is encapsulated to form a plurality of classifier instances; the classifier instances are trained to respond to the needs of different types of groups, the classifier being a naive Bayes classifier;
Step 3: counting the classification probabilities and calculating the probability that each characteristic attribute occurs under each user type. Probability statistics are computed from the counts of the classification results, and the Bayesian algorithm is used to invert the conditional probabilities. Let the sample be x, a single characteristic attribute be a, and all characteristic attributes be a_1 to a_m; then x = {a_1, a_2, …, a_m} is the set of characteristic attributes of the x to be classified, and C = {y_1, y_2, …, y_m} is the set of types. The sum P(A | B1) P(B1) + P(A | B2) P(B2) + … + P(A | Bn) P(Bn) is calculated over the types to obtain
[Formula image not reproduced in this text: the resulting probability of each type]
Step 4: determining the class to which a given user belongs, that is, calculating the probability of the user under each class and selecting the class whose term P(A | Bn) P(Bn) is largest as the user's class;
Step 5: persistently storing training and classification work that cannot be completed in a single session.
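The formula image that step 3 refers to is not available in this text. For orientation, the textbook naive Bayes posterior that matches the definitions in steps 3 and 4 is given below. It assumes that the characteristic attributes are conditionally independent given the type, and it is not necessarily the exact expression in the patent figure. In the A/B notation of the steps, A corresponds to the sample x and B_n to the type y_n.

```latex
P(y_i \mid x)
  = \frac{P(x \mid y_i)\,P(y_i)}{\sum_{k} P(x \mid y_k)\,P(y_k)}
  = \frac{P(y_i)\,\prod_{j=1}^{m} P(a_j \mid y_i)}
         {\sum_{k} P(y_k)\,\prod_{j=1}^{m} P(a_j \mid y_k)},
\qquad
\hat{y} = \arg\max_{i}\; P(y_i)\,\prod_{j=1}^{m} P(a_j \mid y_i)
```

The denominator is the total-probability sum over all types and is identical for every y_i, which is why step 4 only needs to compare the numerator terms P(A | Bn) P(Bn).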
In step 1, the characteristic attributes specifically include:
the characteristic attributes of a lost user: documents were entered within the past year but none within the last 30 days, and the number of unread messages on the user's home page exceeds 3;
the characteristic attributes of a worthless registered user: the ratio between the number of logins within 30 days and the number of valid documents actually generated is below 10%, and the documents created do not form any complete business flow in the system;
the characteristic attribute of a system buyer user: purchase orders, payment orders and invoices together account for more than 80% of all recorded documents, that is, (number of purchase orders + number of payment orders + number of invoices) / total number of recorded documents > 80%.
In step 1, analyzing the characteristic attributes of the different types of purchase, sales and inventory users includes eliminating redundant characteristic attributes by calculating the variance of each feature. Specifically, let the i-th feature be x_i, the total number of features be n, and the variance be s; then
[Formula image not reproduced in this text: the variance s computed from x_1, …, x_n]
A threshold d is set according to the business situation as the criterion for judging a feature, and if s ≥ d the characteristic attribute is kept.
In step 4, the TF-IDF algorithm is used to assign weights to the characteristic attributes of the same user type, the weight of each characteristic attribute being denoted k_j. If the feature probabilities under a user type cannot meet the required calculation accuracy, the weight of each characteristic attribute is further marked. The weight of a characteristic attribute decreases as its frequency of occurrence across all user types increases, but increases as its frequency of occurrence within a single user type increases. The TF-IDF algorithm is specifically:
[Formula images not reproduced in this text: the definitions of TF and IDF]
TF-IDF = TF × IDF;
from which the weight k_j of each characteristic attribute is obtained.
In step 4, the marked characteristic attributes include: when judging whether a user is lost, the number of the user's unread messages exceeds 20.
The persistent storage is specifically as follows: the data is compressed and stored using the joblib software package, and the existing data is loaded directly into memory when training is performed again or when it is applied to other types of classifiers.
The invention is further described with reference to the figures and the exemplary embodiments.
FIG. 1 shows a purchase, sales and inventory user classification method based on a Bayesian algorithm.
The method comprises the following steps:
Step 101: confirming the characteristic attributes and forming the user documents. The confirmed characteristic attributes include the following. A lost user: documents were entered within the past year but none within the last 30 days, and the unread messages on the user's home page exceed 3. A worthless registered user: the ratio between the number of logins within 30 days and the number of valid documents actually generated is below 10%, and the documents created do not form any complete business flow in the system. A buyer: (number of purchase orders + number of payment orders + number of invoices) / total number of recorded documents > 80%. User documents that can be analyzed are formed from these characteristic attributes.
In step 101, the characteristic attributes of purchase, sales and inventory users are analyzed, such as the proportions among the quantities of the various documents entered and the number of valid commodities. Selecting the feature set requires many trade-offs and an analysis of whether each feature can lead to a correct conclusion. Redundant characteristic attributes can generally be excluded by calculating the feature variance. Let the i-th characteristic attribute be x_i, the total number of characteristic attributes be n, and the variance be s; then
[Formula image not reproduced in this text: the variance s computed from x_1, …, x_n]
A threshold d is set according to the business situation as the criterion for judging a feature, and if s ≥ d the characteristic attribute is kept. For example, if the variance calculated for "home page unread messages exceed 3" is 2 and d is set to 1, the feature is a valid feature. After redundant characteristic attributes are excluded in this way, the more valid samples are obtained, the better the calculation works, and adjustment and optimization continue on that basis. The data in the relational database is extracted and integrated to form user documents in key-value form for analysis.
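The variance filter can be sketched as follows. This is an illustration under assumptions: the population variance (mean of squared deviations) stands in for the formula that is only available as an image, and the function name `select_features`, the feature names and the sample values are hypothetical. The values are chosen so that the first feature has variance 2, matching the worked example with d = 1.

```python
from statistics import pvariance


def select_features(columns, d):
    """Keep the features whose variance s satisfies s >= d."""
    kept = {}
    for name, values in columns.items():
        s = pvariance(values)  # population variance: mean of squared deviations
        if s >= d:
            kept[name] = s
    return kept


# Hypothetical per-user observations for two candidate features.
columns = {
    "unread_home_messages": [1, 3, 3, 5],  # mean 3, squared deviations 4, 0, 0, 4
    "has_account": [1, 1, 1, 1],           # constant, so it carries no information
}
kept = select_features(columns, d=1)
print(kept)
```

A constant feature has variance 0 and is dropped for any positive threshold, which is the sense in which the variance test removes redundant attributes.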
Step 102: selecting and training a suitable classifier model. A class representing the classifier is written and the acquired information is encapsulated; separate classifier instances can be formed for different users, types or queries, and they are trained to respond to the needs of different types of groups. A naive Bayes classifier model is used here. Similar classifier models include the K-nearest-neighbor algorithm and the decision tree model. The decision tree model is not selected because the proportions of user types in the system differ too much: warehouse managers and buyers may make up most of the users, and with so many samples of those types the information gain is biased toward their features, so the result deviates too much. The K-nearest-neighbor algorithm likewise tends to overfit, because the large imbalance in sample counts leads to a K value that is too small, so it is not suitable for the system either. The daily document-entry volume, operation log data and similar data of purchase, sales and inventory users are discrete, and the system does not have a large data volume in its initial stage, so a naive Bayes classifier is more suitable.
Step 103: counting the classification probabilities and calculating the probability of each feature under each type. Training samples of purchase, sales and inventory users are obtained. Let the sample be x, a single characteristic attribute be a, and all characteristic attributes be a_1 to a_m; then x = {a_1, a_2, …, a_m}. To calculate the probability that a user in the sample is a worthless user, the type set is C = {y_1 (valuable), y_2 (worthless)}. Assuming that 800 of 10000 user samples are worthless users, P(C = y_1) = 92% and P(C = y_2) = 8%. For the feature probability P(number of valid entries less than 10% of total entries | C = y_2), the number of occurrences is 40 and the probability is 0.4%; for P(number of valid entries less than 20% of total entries | C = y_2), the number of occurrences is 80 and the probability is 0.8%; and so on, to obtain
[Formula image not reproduced in this text: the resulting probability of each type]
where x = {a_1, a_2, …, a_m} is the set of characteristic attributes of the x to be classified and C = {y_1, y_2, …, y_m} is the set of types.
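The arithmetic in step 103 can be checked with a few lines. The variable names are hypothetical; the counts are the ones given in the text. Note that the quoted 0.4% and 0.8% equal the occurrence counts divided by all 10000 samples. A probability conditioned strictly on the worthless type would divide by the 800 worthless users instead, so both are computed for comparison.

```python
total_users = 10_000
worthless_users = 800

p_worthless = worthless_users / total_users                 # P(C = y2)
p_valuable = (total_users - worthless_users) / total_users  # P(C = y1)

# Occurrence counts quoted in the text for the two example features.
occurrences = {"valid_below_10pct": 40, "valid_below_20pct": 80}

# Share of all samples (matches the 0.4% and 0.8% figures in the text).
share_of_all = {name: n / total_users for name, n in occurrences.items()}

# Share within the worthless type (a strict conditional on C = y2).
share_within_type = {name: n / worthless_users for name, n in occurrences.items()}

print(p_valuable, p_worthless)
print(share_of_all)
print(share_within_type)
```

Which denominator applies depends on how the occurrence counts were collected, so a real implementation should fix that choice before training.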
Step 104: using the TF-IDF algorithm to assign a weight to each feature under the same type. To take full account, on top of feature selection, of how strongly different characteristic attributes influence the classification result, the weight of each characteristic attribute is further marked. The weight of a characteristic attribute decreases as its frequency of occurrence across all user types increases, but increases as its frequency of occurrence within a single user type increases. For example, suppose the number of unread messages per day exceeds 20 within a week, but the function failed during that week. In this case the attribute "more than 20 unread messages" cannot be eliminated during feature selection, but its weight should be reduced accordingly because of the system failure. Likewise, for the two user types "purchasing" and "warehouse", the attribute "warehouse documents account for more than 80% of the total" may appear under both types, so the weight it contributes to the final classification result is assigned by the TF-IDF algorithm, which is specifically:
[Formula image not reproduced in this text: the definitions of TF and IDF]
TF-IDF = TF × IDF
With the weight of each characteristic attribute denoted k_j, then
[Formula image not reproduced in this text: the expression using the weights k_j]
With this formula, the influence of each characteristic attribute on the final classification result can be adjusted, so that the characteristic attributes important to each type tend to be used as calculation parameters with higher weight.
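The weighting idea can be sketched with a common textbook form of TF-IDF, treating each user type as a "document" and each characteristic attribute as a "term". The exact TF and IDF definitions in the patent are only available as images, so the formulas below (TF as the share of occurrences within a type, IDF as log(N / df)) are assumptions, as are the function name `tf_idf_weights`, the attribute names and the counts.

```python
from math import log


def tf_idf_weights(counts_by_type):
    """Return {user_type: {attribute: TF * IDF}} from occurrence counts."""
    n_types = len(counts_by_type)

    # df: in how many user types each attribute occurs at least once.
    doc_freq = {}
    for counts in counts_by_type.values():
        for attr, count in counts.items():
            if count > 0:
                doc_freq[attr] = doc_freq.get(attr, 0) + 1

    weights = {}
    for user_type, counts in counts_by_type.items():
        total = sum(counts.values())
        weights[user_type] = {
            attr: (count / total) * log(n_types / doc_freq[attr])
            for attr, count in counts.items()
            if count > 0
        }
    return weights


# Hypothetical occurrence counts of attributes under three user types.
counts_by_type = {
    "purchasing": {"warehouse_docs_over_80pct": 10, "purchase_docs_over_80pct": 30},
    "warehouse": {"warehouse_docs_over_80pct": 40},
    "lost": {"unread_over_20": 12},
}
weights = tf_idf_weights(counts_by_type)
for user_type, attrs in weights.items():
    print(user_type, attrs)
```

In this toy data the attribute shared by the purchasing and warehouse types receives a lower IDF than the attribute unique to purchasing, which mirrors the statement that a weight falls as the attribute appears across more user types and rises with its frequency inside a single type.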
Step 105: persistently storing training and classification work that cannot be completed in a single session. The data is compressed and stored using the joblib software package and related tools, and the existing data is loaded directly into memory when training is performed again or when it is applied to other types of classifiers.
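A minimal sketch of step 105 follows. It uses `joblib.dump` with its `compress` option and `joblib.load`, which are part of the joblib package named in the text. The helper names `save_state` and `load_state`, the file name and the contents of `model_state` are hypothetical, and the gzip plus pickle fallback is an addition so that the sketch still runs where joblib is not installed.

```python
import gzip
import os
import pickle
import tempfile

try:
    import joblib  # the package named in the text
except ImportError:  # fallback so the sketch still runs without joblib
    joblib = None


def save_state(state, path):
    """Write the classifier state to disk in compressed form."""
    if joblib is not None:
        joblib.dump(state, path, compress=3)
    else:
        with gzip.open(path, "wb") as handle:
            pickle.dump(state, handle)


def load_state(path):
    """Read a previously saved classifier state back into memory."""
    if joblib is not None:
        return joblib.load(path)
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical training state that a later session would resume from.
model_state = {
    "class_counts": {"valuable": 9200, "worthless": 800},
    "feature_counts": {"valid_below_10pct": 40, "valid_below_20pct": 80},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier_state.joblib")
    save_state(model_state, path)
    restored = load_state(path)

print(restored == model_state)
```

Files written this way should only be loaded from trusted locations, because both joblib and pickle can execute code while deserializing.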
The purchase, sales and inventory user classification system based on a Bayesian algorithm shown in FIG. 2 comprises a relational database, a text preprocessing module, a model storage module, an algorithm execution module and a report module connected in sequence, the model storage module being connected to the database, wherein:
the relational database is a database that organizes data using the relational model and stores it in rows and columns so that users can understand it easily. A series of rows and columns in a relational database is called a table, and a group of tables forms the database. For example, a MySQL database can be used in the purchase, sales and inventory system as the business relational database, storing the various forms and operation logs for users;
the text preprocessing module processes the data tables in the relational database with purpose-written code and converts them into document-type data, which can be stored persistently or used directly in memory in subsequent operations;
the model storage module persistently stores the training information of the classifier; persisting the model, the sample data and the Bayesian model analysis results that cannot be completed in a single session improves calculation efficiency;
the algorithm execution module selects a suitable classification method after obtaining the prior probability, posterior probability and likelihood estimate of the sample, forms an algorithm execution instance, and determines the probability that each sample user belongs to each type;
the report module reads the result of the algorithm execution module after the execution task has finished and displays it as tables, graphs and similar forms. The module can use incremental updating and allows the result set structure to be freely defined, so that results are presented visually to operations analysts, helping them determine user role types quickly and reducing the cost of analyzing user types manually.
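The hand-off from the relational database to the text preprocessing module can be sketched as below. The text names MySQL; an in-memory SQLite database from the Python standard library is swapped in here purely so that the example is self-contained. The table name `documents`, its columns, the function name `load_user_documents` and the rows are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# In-memory stand-in for the business database (the text names MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (user_id INTEGER, doc_type TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        (1, "purchase"), (1, "purchase"), (1, "payment"), (1, "invoice"), (1, "warehouse"),
        (2, "warehouse"), (2, "warehouse"), (2, "purchase"),
    ],
)


def load_user_documents(connection):
    """Aggregate table rows into one key-value document per user."""
    docs = defaultdict(dict)
    rows = connection.execute(
        "SELECT user_id, doc_type, COUNT(*) FROM documents GROUP BY user_id, doc_type"
    )
    for user_id, doc_type, count in rows:
        docs[user_id][doc_type] = count
    return dict(docs)


user_documents = load_user_documents(conn)
conn.close()
print(user_documents)
```

The resulting dictionaries can be kept in memory for the algorithm execution module or written out by the model storage module, which corresponds to the two options the text describes for document-type data.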
The invention achieves the following beneficial effects:
1. Using the Bayesian algorithm, a naive Bayes classifier is implemented that performs the probability calculation for each user independently, so operators no longer need to re-analyze user behavior from charts, and labor cost is reduced;
2. P(B | A) is also obtained by calculation from P(A | B), the probability of the category A to which the user belongs, and the probability of the user document B; that is, the probability that a given class of users uses the corresponding function points. This shows at a macroscopic level which system functions users focus on and provides decision support for where subsequent system development should focus.
The above embodiments do not limit the present invention in any way, and all other modifications and applications that can be made to the above embodiments in equivalent ways are within the scope of the present invention.

Claims (6)

1. A purchase, sales and inventory user classification method based on a Bayesian algorithm, characterized by comprising the following steps:
step 1: confirming the characteristic attributes; the characteristic attributes are the data characteristics corresponding to different types of purchase, sales and inventory users; data in the relational database are extracted and integrated, the characteristic attributes of the different types of purchase, sales and inventory users are analyzed, and user documents in key-value-pair form, namely the user documents used for analysis, are formed;
step 2: training a classifier; a class representing the classifier is written according to Bayes' theorem $P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$, and information is encapsulated to form a plurality of classifier instances; the classifier instances are trained to meet the requirements of different types of groups, wherein the classifier is a naive Bayes classifier;
step 3: counting the classification probabilities and calculating the probability of occurrence of each characteristic attribute under each user type; probability statistics are computed from the counts of classification results, and the Bayesian algorithm is used to invert the conditional probabilities; let the sample be $x$, a single characteristic attribute be $a$, and all characteristic attributes be $a_1$ to $a_m$; then $x = \{a_1, a_2, \dots, a_m\}$ is the set of characteristic attributes of the $x$ to be classified, and $C = \{y_1, y_2, \dots, y_m\}$ is the type set; $P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \dots + P(A \mid B_n)P(B_n)$ is calculated over the types to obtain
[Formula image FDA0002235988360000011, not reproduced in this text]
step 4: actually judging the classification to which a given user belongs, namely calculating the probability of the user under each different classification and selecting the classification whose term $P(A \mid B_n)P(B_n)$ is largest as the classification of the user;
step 5: persisting training and classification that cannot be completed in one session.
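Steps 2 to 4 can be illustrated with a minimal sketch, assuming user documents are Python dictionaries of categorical key-value pairs. The class name `NaiveBayesUserClassifier`, the feature keys, the type labels and the Laplace smoothing parameter `alpha` are hypothetical choices made for illustration; none of them is specified by the claim.

```python
from collections import Counter, defaultdict


class NaiveBayesUserClassifier:
    """Toy naive Bayes classifier over key-value user documents (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing (assumption, not in the claim)
        self.type_counts = Counter()                # training users seen per user type
        self.feature_counts = defaultdict(Counter)  # per type: count of each (key, value) pair
        self.values_per_key = defaultdict(set)      # distinct values observed for each key

    def train(self, document, user_type):
        """Step 2: update the counts with one labeled user document."""
        self.type_counts[user_type] += 1
        for key, value in document.items():
            self.feature_counts[user_type][(key, value)] += 1
            self.values_per_key[key].add(value)

    def scores(self, document):
        """Step 3: return P(A|B_n) * P(B_n) for every user type B_n."""
        total = sum(self.type_counts.values())
        result = {}
        for user_type, n_type in self.type_counts.items():
            score = n_type / total  # prior P(B_n)
            for key, value in document.items():
                count = self.feature_counts[user_type][(key, value)]
                n_values = max(len(self.values_per_key.get(key, ())), 1)
                # Smoothed estimate of P(attribute | type); attributes are
                # treated as independent, which is the "naive" assumption.
                score *= (count + self.alpha) / (n_type + self.alpha * n_values)
            result[user_type] = score
        return result

    def posteriors(self, document):
        """Normalize the scores by their sum (the total probability of A)."""
        scores = self.scores(document)
        evidence = sum(scores.values())
        return {t: s / evidence for t, s in scores.items()}

    def classify(self, document):
        """Step 4: pick the type whose term P(A|B_n) * P(B_n) is largest."""
        scores = self.scores(document)
        return max(scores, key=scores.get)


clf = NaiveBayesUserClassifier()
clf.train({"recent_docs": "none", "unread": "many"}, "lost")
clf.train({"recent_docs": "none", "unread": "many"}, "lost")
clf.train({"recent_docs": "some", "unread": "few"}, "buyer")
clf.train({"recent_docs": "some", "unread": "few"}, "buyer")
clf.train({"recent_docs": "some", "unread": "many"}, "buyer")

query = {"recent_docs": "none", "unread": "many"}
predicted = clf.classify(query)
probabilities = clf.posteriors(query)
```

Because the denominator is the same for every type, comparing the unnormalized terms in `classify` gives the same decision as comparing the normalized posteriors.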
2. The purchase, sales and inventory user classification method based on a Bayesian algorithm as claimed in claim 1, wherein in step 1 the characteristic attributes specifically include:
the characteristic attributes of a lost user: documents were recorded within the year but none within the last 30 days, and the number of unread messages on the user's home page exceeds 3;
the characteristic attributes of a worthless registered user: the ratio of the number of logins within 30 days to the valid documents actually generated is below 10%, and the documents created do not form any complete business flow in the system;
the characteristic attributes of a system buyer user: the sum of the numbers of purchase orders, payment orders and invoicing orders accounts for more than 80% of the total number of recorded orders, namely (purchase orders + payment orders + invoicing orders) / total recorded orders > 80%;
in step 1, analyzing the characteristic attributes of the different types of purchase, sales and inventory users includes: eliminating redundant characteristic attributes by calculating the variance of each feature; specifically, the i-th feature is denoted $x_i$, the total number of features is $n$, and the variance result is $s$, then
[Formula image FDA0002235988360000012, not reproduced in this text]
and a threshold $d$ is set according to the business situation as the criterion for judging a feature; if $s \ge d$, the characteristic attribute is kept.
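The variance screening can be sketched as follows. Since the formula image is not available here, the sketch assumes the population variance (division by the number of observations) computed per feature column across users; the helper name `filter_by_variance`, the column names and the threshold value are hypothetical.

```python
from statistics import pvariance


def filter_by_variance(columns, d):
    """Keep the feature columns whose variance s satisfies s >= d.

    `columns` maps a feature name to its observed values across users.
    Population variance is an assumption; the claim's formula is not shown.
    """
    kept = {}
    for name, values in columns.items():
        s = pvariance(values)
        if s >= d:
            kept[name] = s
    return kept


columns = {
    "unread_messages": [0, 5, 12, 1, 30],  # varies between users, so it is informative
    "has_account": [1, 1, 1, 1, 1],        # constant, so it carries no information
}
kept = filter_by_variance(columns, d=0.5)
```

A constant column has variance 0 and is dropped for any positive threshold, which is the sense in which the screening removes redundant attributes.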
3. The purchase, sales and inventory user classification method based on a Bayesian algorithm as claimed in claim 2, wherein in step 4 the TF-IDF algorithm is used to assign a weight to each characteristic attribute under the same user type, each weight being denoted $k_j$; if the feature probabilities of a user type cannot satisfy the calculation accuracy requirement, the different characteristic attributes are further marked with their respective weights, wherein the weight of each characteristic attribute decreases as its frequency of occurrence across all user types increases, but increases as its frequency of occurrence within a single user type increases; the TF-IDF algorithm is specifically:
[Formula image FDA0002235988360000021, not reproduced in this text]
[Formula image FDA0002235988360000022, not reproduced in this text]
TF-IDF = TF × IDF;
then
[Formula image FDA0002235988360000023, not reproduced in this text]
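The weighting behavior described in this claim can be sketched with the common textbook form of TF-IDF, treating each user type as a "document" and each characteristic attribute as a "term". The exact formulas of the claim are in images not available here, so the definitions below (TF as a relative frequency within a type, IDF as the logarithm of the number of types divided by the number of types containing the attribute) are assumptions, as are the function name `tf_idf_weights` and the sample counts. How the weights $k_j$ enter the final probability is not reproduced.

```python
import math


def tf_idf_weights(type_feature_counts):
    """Return {user_type: {feature: TF * IDF}} from per-type attribute counts."""
    n_types = len(type_feature_counts)

    # Number of user types in which each attribute occurs at least once.
    doc_freq = {}
    for counts in type_feature_counts.values():
        for feature, c in counts.items():
            if c > 0:
                doc_freq[feature] = doc_freq.get(feature, 0) + 1

    weights = {}
    for user_type, counts in type_feature_counts.items():
        total = sum(counts.values())
        weights[user_type] = {}
        for feature, c in counts.items():
            if c <= 0:
                continue
            tf = c / total                                # frequent within this type -> larger
            idf = math.log(n_types / doc_freq[feature])   # present in many types -> smaller
            weights[user_type][feature] = tf * idf
    return weights


counts = {
    "lost": {"unread_over_20": 8, "logged_in": 2},
    "buyer": {"purchase_order": 9, "logged_in": 1},
}
weights = tf_idf_weights(counts)
```

In this sample, `logged_in` occurs in every user type, so its IDF and therefore its weight is zero, while `unread_over_20` occurs only for lost users and receives a positive weight.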
4. The purchase, sales and inventory user classification method based on a Bayesian algorithm as claimed in claim 3, wherein the characteristic attributes marked with weights in step 4 include, when determining whether a user is lost, "the number of unread messages of the user exceeds 20".
5. The purchase, sales and inventory user classification method based on a Bayesian algorithm as claimed in claim 4, wherein the persistent storage is implemented by compressing and storing data with the joblib software package; when training is performed again, or when the existing data are applied to other types of classifiers, the existing data are loaded directly into memory.
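A minimal sketch of this kind of persistence is shown below, using `joblib.dump` with its `compress` option and `joblib.load`. The helper names `save_state` and `load_state`, the file name and the shape of the stored state are hypothetical; the fallback to `gzip` and `pickle` is an addition so that the sketch still runs where joblib is not installed, and is not part of the claim.

```python
import gzip
import os
import pickle
import tempfile

try:
    import joblib  # third-party package named in the claim
except ImportError:  # assumption: standard-library fallback when joblib is absent
    joblib = None


def save_state(state, path):
    """Compress and store classifier state on disk."""
    if joblib is not None:
        joblib.dump(state, path, compress=3)
    else:
        with gzip.open(path, "wb") as fh:
            pickle.dump(state, fh)


def load_state(path):
    """Load previously stored state back into memory."""
    if joblib is not None:
        return joblib.load(path)
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical training state: counts gathered so far in an unfinished session.
state = {"type_counts": {"lost": 2, "buyer": 3}, "alpha": 1.0}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classifier_state.joblib")
    save_state(state, path)
    restored = load_state(path)
```

Training can then resume from `restored` instead of recomputing the counts from the relational database.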
6. A purchase, sales and inventory user classification system based on a Bayesian algorithm, operated according to the method of any one of claims 1-5, characterized by comprising a relational database, a text preprocessing module, a model storage module, an algorithm execution module and a report module connected in sequence, wherein the model storage module is connected with the database;
the relational database is a database that organizes data using a relational model and stores data in rows and columns, in particular a MySQL database; it is the business relational database of the purchase, sales and inventory system and stores its various forms and operation logs;
the text preprocessing module processes the data tables in the relational database through written code and converts them into document-type data, which can be stored persistently or can take part directly in subsequent in-memory operations;
the model storage module persistently stores the training information of the classifier; for models, sample data and Bayesian model analysis results that cannot be completed in one session, persistent storage improves calculation efficiency;
the algorithm execution module selects a suitable classification method after the prior probability, posterior probability and likelihood estimate of the sample are obtained, forming an algorithm execution instance, and determines the probability that a sample user belongs to each type;
and the report module reads the result of the algorithm execution module after the probability calculation has finished and the execution task is complete, and displays the classification result visually in the form of tables and graphs; the module uses incremental updating, allows the result-set structure to be freely defined, and presents the result visually to operations analysts.
CN201910983525.4A 2019-10-16 2019-10-16 purchase, sales and inventory user classification method and system based on Bayesian algorithm Pending CN110737700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910983525.4A CN110737700A (en) 2019-10-16 2019-10-16 purchase, sales and inventory user classification method and system based on Bayesian algorithm

Publications (1)

Publication Number Publication Date
CN110737700A (en) 2020-01-31

Family

ID=69270098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910983525.4A Pending CN110737700A (en) 2019-10-16 2019-10-16 purchase, sales and inventory user classification method and system based on Bayesian algorithm

Country Status (1)

Country Link
CN (1) CN110737700A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081655A (en) * 2011-01-11 2011-06-01 华北电力大学 Information retrieval method based on Bayesian classification algorithm
CN102956023A (en) * 2012-08-30 2013-03-06 南京信息工程大学 Bayes classification-based method for fusing traditional meteorological data with perception data
CN107391772A (en) * 2017-09-15 2017-11-24 国网四川省电力公司眉山供电公司 A kind of file classification method based on naive Bayesian

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘磊 等: "基于特征加权朴素贝叶斯分类算法的网络用户识别" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966071A (en) * 2021-02-03 2021-06-15 北京奥鹏远程教育中心有限公司 User feedback information analysis method, device, equipment and readable storage medium
CN112966071B (en) * 2021-02-03 2023-09-08 北京奥鹏远程教育中心有限公司 User feedback information analysis method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination